- Patent title: Efficient indexing of documents with similar content
- Application number: US11419423
- Filing date: 2006-05-19
- Publication number: US08175875B1
- Publication date: 2012-05-08
- Inventors: Jeffrey A. Dean, Sanjay Ghemawat, Gautham Thambidorai
- Applicants: Jeffrey A. Dean, Sanjay Ghemawat, Gautham Thambidorai
- Applicant address: Mountain View, CA, US
- Assignee: Google Inc.
- Current assignee: Google Inc.
- Current assignee address: Mountain View, CA, US
- Agent: Morgan, Lewis & Bockius LLP
- Main classification: G10L15/06
- IPC classification: G10L15/06
Abstract:
A set of documents may be stored and indexed as a compressed sequence of tokens. The set of documents is grouped into clusters. Sequences of tokens representing the clusters of documents are encoded to elide some repeating instances of tokens. A compressed sequence of tokens is generated from the compressed cluster sequences of tokens. Queries on the compressed sequence are performed by identifying cluster sequences within the compressed sequence that are likely to have documents that satisfy the query and then identifying, within these identified clusters, the documents that actually satisfy the query.
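The two-phase query described in the abstract can be illustrated with a small Python sketch. It is not the patented encoding. It assumes documents have already been assigned to clusters by hand, and it uses shared-prefix elision (each document stores only the tokens that differ from the leading tokens of the previous document in its cluster) as a stand-in for the abstract's "elide some repeating instances of tokens". All names (`encode_cluster`, `decode_cluster`, `cluster_vocabulary`, `search`) are hypothetical.

```python
from typing import Dict, List, Tuple

Doc = List[str]
# Hypothetical encoding: (count of leading tokens shared with the previous
# document in the cluster, remaining tokens stored explicitly).
EncodedDoc = Tuple[int, List[str]]


def shared_prefix_len(a: Doc, b: Doc) -> int:
    """Number of leading tokens that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def encode_cluster(docs: List[Doc]) -> List[EncodedDoc]:
    """Elide the prefix each document repeats from its predecessor."""
    encoded: List[EncodedDoc] = []
    prev: Doc = []
    for doc in docs:
        k = shared_prefix_len(prev, doc)
        encoded.append((k, doc[k:]))
        prev = doc
    return encoded


def decode_cluster(encoded: List[EncodedDoc]) -> List[Doc]:
    """Rebuild every document of a cluster from its encoded form."""
    docs: List[Doc] = []
    prev: Doc = []
    for k, tail in encoded:
        doc = prev[:k] + tail
        docs.append(doc)
        prev = doc
    return docs


def cluster_vocabulary(encoded: List[EncodedDoc]) -> set:
    """Distinct tokens of a cluster; each one occurs in some stored tail."""
    return {tok for _, tail in encoded for tok in tail}


def search(index: Dict[str, List[EncodedDoc]], terms: List[str]) -> List[Tuple[str, int]]:
    """Return (cluster id, document position) pairs containing all terms."""
    wanted = set(terms)
    hits: List[Tuple[str, int]] = []
    for cid, encoded in index.items():
        # Phase 1: keep only clusters that could contain a matching document.
        if not wanted <= cluster_vocabulary(encoded):
            continue
        # Phase 2: check the individual documents of the candidate cluster.
        for pos, doc in enumerate(decode_cluster(encoded)):
            if wanted <= set(doc):
                hits.append((cid, pos))
    return hits


# Hand-made clusters of near-duplicate documents (illustrative data only).
raw_clusters = {
    "news": [
        "the cat sat on the mat".split(),
        "the cat sat on the rug".split(),
        "the cat ran".split(),
    ],
    "code": [
        "def f return x".split(),
        "def f return y".split(),
    ],
}
index = {cid: encode_cluster(docs) for cid, docs in raw_clusters.items()}

print(index["news"][1])               # (5, ['rug'])
print(search(index, ["cat", "rug"]))  # [('news', 1)]
print(search(index, ["sat", "ran"]))  # []
```

The last query shows why the second phase is needed: the "news" cluster contains both `sat` and `ran`, so it passes the cluster-level filter, but no single document contains both terms.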