-
Publication No.: US20140067368A1
Publication date: 2014-03-06
Application No.: US13597277
Filing date: 2012-08-29
Applicant: Wen-tau Yih, Geoffrey G. Zweig, John C. Platt
Inventor: Wen-tau Yih, Geoffrey G. Zweig, John C. Platt
IPC classification: G06F17/27
CPC classification: G06F17/2795, G06F16/3338, G06F17/2785
Abstract: A document-term matrix may be generated based on a corpus. A term representation matrix may be generated based on modifying a plurality of elements of the document-term matrix based on antonym information included in the corpus. Similarities may be determined based on a plurality of elements of the term representation matrix.
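The abstract above does not say how the matrix elements are modified. As a minimal sketch, assuming the modification is a sign flip on antonym entries and that similarity is measured with cosine, the idea might look like the following. The toy thesaurus, the leading "-" antonym marker, and the function names are all hypothetical.

```python
import math

# Hypothetical toy corpus: each "document" is a group of related words.
# A leading "-" marks an antonym of the group (an assumed convention).
groups = [
    ["happy", "joyful", "-sad"],
    ["sad", "unhappy", "-happy", "-joyful"],
    ["hot", "warm", "-cold"],
]

def build_term_vectors(groups):
    """Return {term: vector over documents}, with antonym entries negated."""
    vectors = {}
    for doc_index, group in enumerate(groups):
        for entry in group:
            sign = -1.0 if entry.startswith("-") else 1.0
            term = entry.lstrip("-")
            vectors.setdefault(term, [0.0] * len(groups))[doc_index] = sign
    return vectors

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vectors = build_term_vectors(groups)
print(cosine(vectors["happy"], vectors["joyful"]))  # close to +1: synonyms
print(cosine(vectors["happy"], vectors["sad"]))     # close to -1: antonyms
print(cosine(vectors["happy"], vectors["hot"]))     # 0: unrelated
```

With signed entries, synonyms point in the same direction and antonyms in opposite directions, so a single similarity score can separate the two cases.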
-
Publication No.: US20120323968A1
Publication date: 2012-12-20
Application No.: US13160485
Filing date: 2011-06-14
IPC classification: G06F17/30
CPC classification: G06F16/31
Abstract: A model for mapping the raw text representation of a text object to a vector space is disclosed. A function is defined for computing a similarity score given two output vectors. A loss function is defined for computing an error based on the similarity scores and the labels of pairs of vectors. The parameters of the model are tuned to minimize the loss function. The label of two vectors indicates a degree of similarity of the objects. The label may be a binary number or a real-valued number. The function for computing similarity scores may be a cosine, Jaccard, or differentiable function. The loss function may compare pairs of vectors to their labels. Each element of the output vector is a linear or non-linear function of the terms of an input vector. The text objects may be different types of documents and two different models may be trained concurrently.
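The pipeline in this abstract (project, score, compare with labels, tune parameters) can be illustrated with a small sketch. It assumes a linear projection, cosine similarity, a mean squared error loss, and a finite-difference gradient with step halving. None of these choices is stated as the patented method; the toy pairs and all names are hypothetical.

```python
import math
import random

def project(W, x):
    """Map a raw term vector to the output space; each output is linear in x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def loss(W, pairs):
    """Mean squared error between similarity scores and pair labels."""
    return sum((cosine(project(W, a), project(W, b)) - y) ** 2
               for a, b, y in pairs) / len(pairs)

def numerical_gradient(W, pairs, eps=1e-6):
    """Finite-difference gradient of the loss (chosen for brevity, not speed)."""
    base = loss(W, pairs)
    grad = [[0.0] * len(row) for row in W]
    for i, row in enumerate(W):
        for j in range(len(row)):
            old = row[j]
            row[j] = old + eps
            grad[i][j] = (loss(W, pairs) - base) / eps
            row[j] = old
    return grad

def train(W, pairs, lr=0.5, steps=100):
    """Gradient descent that only accepts steps which lower the loss."""
    W = [row[:] for row in W]          # leave the caller's matrix untouched
    current = loss(W, pairs)
    for _ in range(steps):
        grad = numerical_gradient(W, pairs)
        candidate = [[w - lr * g for w, g in zip(row, grow)]
                     for row, grow in zip(W, grad)]
        candidate_loss = loss(candidate, pairs)
        if candidate_loss < current:
            W, current = candidate, candidate_loss
        else:
            lr /= 2                    # step too large: shrink and retry
    return W

# Hypothetical labeled pairs: (term vector, term vector, similarity label).
pairs = [
    ([1.0, 1.0, 0.0], [1.0, 0.0, 0.0], 1.0),
    ([0.0, 0.0, 1.0], [1.0, 0.0, 0.0], 0.0),
    ([0.0, 1.0, 1.0], [0.0, 0.0, 1.0], 1.0),
]
rng = random.Random(0)
W0 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
W1 = train(W0, pairs)
print(loss(W0, pairs), loss(W1, pairs))
```

Because a step is kept only when it lowers the loss, the trained matrix never scores worse on the training pairs than the initial one.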
-
Publication No.: US07693806B2
Publication date: 2010-04-06
Application No.: US11766434
Filing date: 2007-06-21
CPC classification: H04L51/12, G06K9/6256, G06Q10/06, G06Q10/10
Abstract: A system and method that facilitates and effectuates optimizing a classifier for greater performance in a specific region of classification that is of interest, such as a low false positive rate or a low false negative rate. A two-stage classification model can be trained and employed, where the first stage classification is optimized over the entire classification region and the second stage classifier is optimized for the specific region of interest. During training the entire set of training data is employed by a first stage classifier. Only data that is classified by the first stage classifier or by cross validation to fall within a region of interest is used to train the second stage classifier. During classification, data that is classified within the region of interest by the first classification is given the first stage classifier's classification value, otherwise the classification value for the instance of data from the second stage classifier is used.
-
Publication No.: US08290946B2
Publication date: 2012-10-16
Application No.: US12144647
Filing date: 2008-06-24
Applicant: Wen-tau Yih, Christopher A. Meek
Inventor: Wen-tau Yih, Christopher A. Meek
CPC classification: G06F17/30687, G06Q30/02
Abstract: Two methods for measuring keyword-document relevance are described. The methods receive a keyword and a document as input and output a probability value for the keyword. The first method is a similarity-based approach which uses techniques for measuring similarity between two short-text segments to measure relevance between the keyword and the document. The second method is a regression-based approach based on an assumption that if an out-of-document phrase (the keyword) is semantically similar to an in-document phrase, then relevance scores of the in-document and out-of-document phrases should be close to each other.
-
Publication No.: US20110219012A1
Publication date: 2011-09-08
Application No.: US12715417
Filing date: 2010-03-02
Abstract: Described is a technology for measuring the similarity between two objects (e.g., documents), via a framework that learns the term-weighting function from training data, e.g., labeled pairs of objects, to develop a learned model. A learning procedure tunes the model parameters by minimizing a defined loss function of the similarity score. Also described is using the learning procedure and learned model to detect near duplicate documents.
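To make the notion of a parameterized term-weighting function concrete, here is a minimal sketch of the scoring side only. It assumes the weight of a term is a linear function of a few features (a bias, log term frequency, and inverse document frequency), that documents are compared with cosine similarity, and that near duplicates are flagged by a threshold. The feature set, the parameter values, and the threshold are hypothetical; in the described framework the parameters would be learned from labeled pairs rather than fixed by hand.

```python
import math

def term_features(term, tokens, doc_freq, num_docs):
    """Hypothetical feature vector for a term in a document."""
    tf = tokens.count(term)
    idf = math.log(num_docs / doc_freq.get(term, 1))
    return [1.0, math.log(1 + tf), idf]   # bias, log term frequency, idf

def weighted_vector(tokens, params, doc_freq, num_docs):
    """Term weight = dot product of model parameters and term features."""
    return {
        term: sum(p * f for p, f in
                  zip(params, term_features(term, tokens, doc_freq, num_docs)))
        for term in set(tokens)
    }

def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def similarity(a, b, params, doc_freq, num_docs):
    return cosine(weighted_vector(a, params, doc_freq, num_docs),
                  weighted_vector(b, params, doc_freq, num_docs))

def near_duplicate(a, b, params, doc_freq, num_docs, threshold=0.9):
    return similarity(a, b, params, doc_freq, num_docs) >= threshold

# Toy corpus and hand-picked parameters, for illustration only.
docs = [
    "the quick brown fox".split(),
    "the quick brown fox jumps".split(),
    "stock prices fell sharply".split(),
]
doc_freq = {}
for tokens in docs:
    for term in set(tokens):
        doc_freq[term] = doc_freq.get(term, 0) + 1
params = [0.1, 1.0, 0.5]

print(similarity(docs[0], docs[1], params, doc_freq, len(docs)))
print(near_duplicate(docs[0], docs[2], params, doc_freq, len(docs)))
```

Changing `params` changes every term weight at once, which is what lets a learning procedure tune the similarity score through a small number of parameters.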
-
Publication No.: US20090319508A1
Publication date: 2009-12-24
Application No.: US12144647
Filing date: 2008-06-24
Applicant: Wen-tau Yih, Christopher A. Meek
Inventor: Wen-tau Yih, Christopher A. Meek
CPC classification: G06F17/30687, G06Q30/02
Abstract: Two methods for measuring keyword-document relevance are described. The methods receive a keyword and a document as input and output a probability value for the keyword. The first method is a similarity-based approach which uses techniques for measuring similarity between two short-text segments to measure relevance between the keyword and the document. The second method is a regression-based approach based on an assumption that if an out-of-document phrase (the keyword) is semantically similar to an in-document phrase, then relevance scores of the in-document and out-of-document phrases should be close to each other.
-
Publication No.: US20070083357A1
Publication date: 2007-04-12
Application No.: US11485015
Filing date: 2006-07-12
Applicant: Robert Moore, Wen-tau Yih, Galen Andrew, Kristina Toutanova
Inventor: Robert Moore, Wen-tau Yih, Galen Andrew, Kristina Toutanova
IPC classification: G06F17/28
CPC classification: G06F17/2827, G06F17/2836
Abstract: A weighted linear word alignment model linearly combines weighted features to score a word alignment for a bilingual, aligned pair of text fragments. The features are each weighted by a feature weight. One of the features is a word association metric, which may be generated from surface statistics.
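The scoring rule in this abstract is a weighted sum of feature values. The sketch below illustrates it with two features: a word association total and a count of unaligned words. The second feature, the association table, and the weights are hypothetical and are not taken from the patent.

```python
def association_feature(alignment, src, tgt, assoc):
    """Sum of word association scores over all aligned word pairs."""
    return sum(assoc.get((src[i], tgt[j]), 0.0) for i, j in alignment)

def unaligned_feature(alignment, src, tgt):
    """Number of source and target words left without a link (assumed feature)."""
    aligned_src = {i for i, _ in alignment}
    aligned_tgt = {j for _, j in alignment}
    return (len(src) - len(aligned_src)) + (len(tgt) - len(aligned_tgt))

def score_alignment(alignment, src, tgt, assoc, weights):
    """Weighted linear combination of the feature values."""
    features = {
        "association": association_feature(alignment, src, tgt, assoc),
        "unaligned": unaligned_feature(alignment, src, tgt),
    }
    return sum(weights[name] * value for name, value in features.items())

# Toy sentence pair; association scores and weights are invented for illustration.
src = ["the", "house"]
tgt = ["la", "maison"]
assoc = {
    ("the", "la"): 2.0,
    ("house", "maison"): 3.0,
    ("the", "maison"): 0.1,
    ("house", "la"): 0.2,
}
weights = {"association": 1.0, "unaligned": -0.5}

good = [(0, 0), (1, 1)]   # the-la, house-maison
bad = [(0, 1)]            # the-maison only
print(score_alignment(good, src, tgt, assoc, weights))
print(score_alignment(bad, src, tgt, assoc, weights))
```

An alignment search would then pick the candidate alignment with the highest score; the feature weights are what a training procedure would adjust.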
-
Publication No.: US20130159320A1
Publication date: 2013-06-20
Application No.: US13329345
Filing date: 2011-12-19
Applicant: Jianfeng Gao, Kristina Toutanova, Wen-tau Yih
Inventor: Jianfeng Gao, Kristina Toutanova, Wen-tau Yih
IPC classification: G06F17/30
CPC classification: G06F17/30867
Abstract: There is provided a computer-implemented method and system for ranking documents. The method includes identifying a number of query-document pairs based on clickthrough data for a number of documents. The method also includes building a latent semantic model based on the query-document pairs and ranking the documents for a search based on the latent semantic model.
-
Publication No.: US08135728B2
Publication date: 2012-03-13
Application No.: US11619230
Filing date: 2007-01-03
CPC classification: G06F17/241, G06F17/27, G06F17/30, G06F17/30616
Abstract: Extraction analysis techniques biased, in part, by query frequency information from a query log file and/or search engine cache are employed along with machine learning processes to determine candidate keywords and/or phrases of web documents. Web oriented features associated with the candidate keywords and/or phrases are also utilized to analyze the web documents. A keyword and/or phrase extraction mechanism can be utilized to score keywords and/or phrases in a web document and estimate a likelihood that the keywords and/or phrases are relevant, for example, in an advertising system and the like.
-
Publication No.: US20080109425A1
Publication date: 2008-05-08
Application No.: US11591937
Filing date: 2006-11-02
CPC classification: G06F17/30719
Abstract: Document summarization is performed by scoring individual words in sentences in a document or document cluster. Sentences from the document or document cluster are selected to form a summary based on the scores of the words contained in those sentences.
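The abstract does not specify how words are scored or how sentence scores are derived from them. As a minimal sketch, assuming a word's score is its relative frequency in the document and a sentence's score is the average score of its words, sentence selection might look like this. The function name and the regular expressions are hypothetical.

```python
import re
from collections import Counter

def summarize(text, max_sentences=1):
    """Pick the highest-scoring sentences and return them in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words_per_sentence = [re.findall(r"[a-z']+", s.lower()) for s in sentences]

    # Assumed word score: relative frequency of the word in the whole text.
    counts = Counter(w for words in words_per_sentence for w in words)
    total = sum(counts.values())
    word_score = {w: c / total for w, c in counts.items()}

    def sentence_score(words):
        # Assumed sentence score: average score of the words it contains.
        return sum(word_score[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: sentence_score(words_per_sentence[i]),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]

text = "Cats sleep. Cats eat fish. Dogs bark loudly at night."
print(summarize(text, max_sentences=2))
```

Averaging rather than summing keeps long sentences from winning on length alone; other aggregation rules would fit the same outline.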
-