Patent application
US20060106767A1 System and method for identifying query-relevant keywords in documents with latent semantic analysis
Granted
- Patent title: System and method for identifying query-relevant keywords in documents with latent semantic analysis
- Application number: US10987377
- Filing date: 2004-11-12
- Publication number: US20060106767A1
- Publication date: 2006-05-18
- Inventors: John Adcock, Matthew Cooper, Andreas Girgensohn, Lynn Wilcox
- Applicants: John Adcock, Matthew Cooper, Andreas Girgensohn, Lynn Wilcox
- Applicant address: Tokyo, JP
- Assignee: Fuji Xerox Co., Ltd.
- Current assignee: Fuji Xerox Co., Ltd.
- Current assignee address: Tokyo, JP
- Main classification: G06F17/30
- IPC classification: G06F17/30
Abstract:
A system and method for identifying query-related keywords in documents found in a search using latent semantic analysis. The documents are represented as a document term matrix M containing one or more document term-weight vectors d, which may be term-frequency (tf) vectors or term-frequency inverse-document-frequency (tf-idf) vectors. This matrix is subjected to a truncated singular value decomposition. The resulting transform matrix U can be used to project a query term-weight vector q into the reduced N-dimensional space, followed by its expansion back into the full vector space using the inverse of U, which yields the expanded query vector q_expanded. To perform a search, the similarity of q_expanded is measured relative to each candidate document vector in this space. Exemplary similarity functions are dot product and cosine similarity. The keywords selected are the terms with the highest values in q_expanded that also occur in at least one document. Matching keywords from the query may be highlighted in the search results.
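The pipeline described in the abstract can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the implementation claimed in the patent: M is assumed to be laid out as terms by documents, the "inverse of U" is taken to be the transpose of the truncated U (valid because its columns are orthonormal), keywords are picked per document, and the function names, the toy vocabulary, and the choice N = 2 are all hypothetical.

```python
import numpy as np

def expand_query(M, q, n_dims):
    """Project q into the reduced space and expand it back (assumed layout: terms x documents)."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    U_n = U[:, :n_dims]          # truncated transform matrix, terms x N
    q_reduced = U_n.T @ q        # projection into the N-dimensional space
    return U_n @ q_reduced       # expansion back into the full term space

def cosine_scores(M, q_expanded):
    """Cosine similarity between q_expanded and every document column of M."""
    norms = np.linalg.norm(M, axis=0) * np.linalg.norm(q_expanded)
    return (M.T @ q_expanded) / np.where(norms == 0, 1.0, norms)

def select_keywords(terms, doc_vec, q_expanded, k):
    """Top-k terms by q_expanded weight, restricted to terms present in the document."""
    order = np.argsort(-q_expanded)
    return [terms[i] for i in order if doc_vec[i] > 0][:k]

# Hypothetical toy corpus: raw term-frequency vectors, one column per document.
terms = ["car", "automobile", "engine", "flower", "petal"]
M = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # flower
    [0, 0, 1, 0],   # petal
], dtype=float)

q = np.array([1, 0, 0, 0, 0], dtype=float)   # the query contains only "car"
q_expanded = expand_query(M, q, n_dims=2)
scores = cosine_scores(M, q_expanded)
keywords = select_keywords(terms, M[:, 1], q_expanded, k=2)
```

In this toy corpus the second document never mentions "car", yet q_expanded assigns weight to "automobile" and "engine" because they co-occur with "car" elsewhere in the corpus, so the document still scores well and both terms become candidates for highlighting.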