Publication number: US08312021B2
Publication date: 2012-11-13
Application number: US11228924
Filing date: 2005-09-16
Applicant: Irina Matveeva, Ayman Farahart
Inventor: Irina Matveeva, Ayman Farahart
IPC classification: G06F7/00
CPC classification: G06F17/30675, G06F17/2715
Abstract: One embodiment of the present invention provides a system that builds an association tensor (such as a matrix) to facilitate document and word-level processing operations. During operation, the system uses terms from a collection of documents to build an association tensor, which contains values representing pair-wise similarities between terms in the collection of documents. During this process, if a given value in the association tensor is calculated based on an insufficient number of samples, the system determines a corresponding value from a reference document collection, and then substitutes the corresponding value for the given value in the association tensor. After the association tensor is obtained, a dimensionality reduction method is applied to compute a low-dimensional vector space representation for the vocabulary terms. Document vectors are computed as linear combinations of term vectors.
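Illustrative sketch: The pipeline described in the abstract can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the patented implementation: the abstract does not name the similarity measure, so pointwise mutual information (PMI) over document-level co-occurrence is assumed here; the dimensionality reduction is assumed to be a truncated eigendecomposition of the symmetric association matrix; and the `min_samples` threshold and all function names are hypothetical.

```python
import numpy as np


def cooccurrence_counts(docs, vocab):
    """Count, for each pair of vocabulary terms, the documents containing both."""
    index = {term: i for i, term in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for doc in docs:
        present = sorted({index[w] for w in doc if w in index})
        for i in present:
            for j in present:
                counts[i, j] += 1
    return counts


def pmi(counts, n_docs):
    """PMI from co-occurrence counts (assumed measure; the abstract names none)."""
    df = np.diag(counts)  # the diagonal holds each term's document frequency
    with np.errstate(divide="ignore", invalid="ignore"):
        values = np.log(counts * n_docs / np.outer(df, df))
    values[~np.isfinite(values)] = 0.0  # unseen pairs or unseen terms
    return values


def build_association(docs, ref_docs, vocab, min_samples=2):
    """Build the association matrix, substituting reference-collection values
    for entries backed by fewer than min_samples co-occurrences."""
    counts = cooccurrence_counts(docs, vocab)
    assoc = pmi(counts, len(docs))
    ref_assoc = pmi(cooccurrence_counts(ref_docs, vocab), len(ref_docs))
    sparse = counts < min_samples  # hypothetical "insufficient samples" test
    assoc[sparse] = ref_assoc[sparse]
    return assoc, sparse


def term_vectors(assoc, k):
    """Low-dimensional term vectors from the top-k eigenpairs (assumed method)."""
    eigvals, eigvecs = np.linalg.eigh(assoc)
    top = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))


def document_vector(doc, vocab, vectors):
    """Document vector as a term-frequency-weighted sum of term vectors."""
    index = {term: i for i, term in enumerate(vocab)}
    weights = np.zeros(len(vocab))
    for w in doc:
        if w in index:
            weights[index[w]] += 1
    return weights @ vectors


# Toy data, invented purely for illustration.
vocab = ["a", "b", "c"]
docs = [["a", "b"], ["a", "b"], ["a", "c"]]
ref_docs = [["a", "c"], ["a", "c"], ["b", "c"]]

assoc, sparse = build_association(docs, ref_docs, vocab)
vectors = term_vectors(assoc, k=2)
doc_vec = document_vector(["a", "b", "a"], vocab, vectors)
```

In this toy run, the pair ("a", "c") co-occurs only once in `docs`, so its entry is taken from `ref_docs` instead. Clipping negative eigenvalues to zero is a design choice of the sketch: a PMI matrix is not guaranteed to be positive semidefinite, and the clip keeps the square root real.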