-
Publication No.: US20100179933A1
Publication date: 2010-07-15
Application No.: US12562802
Filing date: 2009-09-18
Applicant: Bing Bai, Jason Weston, Ronan Collobert, David Grangier
Inventor: Bing Bai, Jason Weston, Ronan Collobert, David Grangier
CPC classification: G06F17/30663, G06F17/30616
Abstract: A system and method for determining a similarity between a document and a query includes building a weight vector for each of a plurality of documents in a corpus of documents stored in memory and building a weight vector for a query input into a document retrieval system. A weight matrix is generated which distinguishes between relevant documents and lower ranked documents by comparing document/query tuples using a gradient step approach. A similarity score is determined between weight vectors of the query and documents in a corpus by determining a product of a document weight vector, a query weight vector and the weight matrix.
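The scoring and training steps named in this abstract can be illustrated with a minimal sketch. It assumes the score is the bilinear form f(q, d) = qᵀ W d and that the "gradient step approach" is a margin ranking update on (query, relevant document, lower-ranked document) tuples; the margin loss, the learning rate, and the function names are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def similarity(W, q, d):
    """Score f(q, d) = q^T W d for a query vector q and a document vector d."""
    return float(q @ W @ d)

def gradient_step(W, q, d_pos, d_neg, lr=0.1, margin=1.0):
    """One update on a (query, relevant doc, lower-ranked doc) tuple.

    Assumed loss: max(0, margin - f(q, d_pos) + f(q, d_neg)).
    """
    loss = margin - similarity(W, q, d_pos) + similarity(W, q, d_neg)
    if loss > 0:
        # The loss gradient w.r.t. W is -q (d_pos - d_neg)^T,
        # so descending adds the outer product.
        W = W + lr * np.outer(q, d_pos - d_neg)
    return W

# Toy example: start from the identity matrix (plain dot-product similarity).
W = np.eye(3)
q = np.array([1.0, 0.0, 0.0])
d_pos = np.array([0.0, 1.0, 0.0])   # labeled relevant, shares no term with q
d_neg = np.array([1.0, 0.0, 0.0])   # lower ranked, shares a term with q
for _ in range(20):
    W = gradient_step(W, q, d_pos, d_neg)
print(similarity(W, q, d_pos), similarity(W, q, d_neg))
```

Starting from the identity matrix makes the initial score an ordinary dot product; the off-diagonal entries learned by the updates let a query term match a different document term.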
-
Publication No.: US20100185659A1
Publication date: 2010-07-22
Application No.: US12562840
Filing date: 2009-09-18
Applicant: Bing Bai, Jason Weston, Ronan Collobert, David Grangier
Inventor: Bing Bai, Jason Weston, Ronan Collobert, David Grangier
IPC classification: G06F17/30
CPC classification: G06F17/30663, G06F17/30616
Abstract: A system and method for determining a similarity between a document and a query includes providing a frequently used dictionary and an infrequently used dictionary in storage memory. For each word or gram in the infrequently used dictionary, n words or grams are correlated from the frequently used dictionary based on a first score. Features for a vector of the infrequently used words or grams are replaced with features from a vector of the correlated words or grams from the frequently used dictionary when the features from a vector of the correlated words or grams meet a threshold value. A similarity score is determined between weight vectors of a query and one or more documents in a corpus by employing the features from the vector of the correlated words or grams that met the threshold value.
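One possible reading of the dictionary-replacement step is sketched below. The co-occurrence table `cooc` used as the "first score", the choice to apply the threshold to that score, and the names `top_correlated` and `expand` are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter

def top_correlated(rare_word, frequent, cooc, n=2):
    """Return the n frequent words with the highest score for rare_word.

    cooc[(a, b)] is a hypothetical co-occurrence count used as the score.
    """
    scored = [(cooc.get((rare_word, f), 0), f) for f in frequent]
    scored.sort(key=lambda s: (-s[0], s[1]))  # highest score first, ties by word
    return scored[:n]

def expand(tokens, frequent, cooc, n=2, threshold=1):
    """Bag-of-words vector over the frequent dictionary only.

    A word outside the frequent dictionary is replaced by its correlated
    frequent words, keeping only those whose score meets the threshold.
    """
    vec = Counter()
    for t in tokens:
        if t in frequent:
            vec[t] += 1
        else:
            for score, f in top_correlated(t, frequent, cooc, n):
                if score >= threshold:
                    vec[f] += 1
    return vec

frequent = {"car", "engine", "road"}
cooc = {("carburetor", "engine"): 5, ("carburetor", "car"): 3}
print(expand(["carburetor", "car"], frequent, cooc))
```

The resulting vectors contain only frequent-dictionary features, so any similarity score computed over them never has to index the much larger infrequent dictionary.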
-
Publication No.: US20120310627A1
Publication date: 2012-12-06
Application No.: US13483868
Filing date: 2012-05-30
IPC classification: G06F17/27
CPC classification: G06F17/2785
Abstract: Methods and systems for document classification include embedding n-grams from an input text in a latent space, embedding the input text in the latent space based on the embedded n-grams and weighting said n-grams according to spatial evidence of the respective n-grams in the input text, classifying the document along one or more axes, and adjusting weights used to weight the n-grams based on the output of the classifying step.
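The pipeline in this abstract can be sketched as follows, under several stated assumptions: "spatial evidence" is approximated by the position bin of each n-gram in the text, the text embedding is a weighted sum of n-gram embeddings, classification is a single logistic axis, and only the position weights are adjusted. The embedding table `E`, classifier vector `c`, and all function names are hypothetical.

```python
import numpy as np

def position_bins(length, n_bins):
    """Map each n-gram position to one of n_bins equal-width bins."""
    return np.minimum(np.arange(length) * n_bins // max(length, 1), n_bins - 1)

def embed_text(ngram_ids, E, w):
    """Weighted sum of n-gram embeddings; the weight depends on position bin."""
    bins = position_bins(len(ngram_ids), len(w))
    doc = (w[bins][:, None] * E[ngram_ids]).sum(axis=0)
    return doc, bins

def update_position_weights(ngram_ids, y, E, w, c, lr=0.5):
    """Classify with a logistic unit, then take one gradient step on w.

    Returns the updated weights and the probability predicted before the step.
    """
    doc, bins = embed_text(ngram_ids, E, w)
    p = 1.0 / (1.0 + np.exp(-(c @ doc)))
    per_ngram = E[ngram_ids] @ c          # d(score)/d(weight) for each n-gram
    grad = np.zeros_like(w)
    np.add.at(grad, bins, (p - y) * per_ngram)  # sum contributions per bin
    return w - lr * grad, p

E = np.array([[1.0, 0.0], [0.0, 1.0]])   # two n-grams in a 2-d latent space
c = np.array([1.0, -1.0])                 # classifier for a single axis
w = np.ones(2)                            # one weight per position bin
ids = np.array([0, 0, 1, 1])
w, p_before = update_position_weights(ids, 1.0, E, w, c)
print(w, p_before)
```

In this toy input the n-grams that support the positive label sit in the first half of the text, so the update raises the weight of the first position bin and lowers the second.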
-
-