Efficient indexing of documents with similar content

Granted invention patent

US08175875B1 Efficient indexing of documents with similar content (in force)

Patent title: Efficient indexing of documents with similar content
Application number: US11419423

Filing date: 2006-05-19
Publication number: US08175875B1

Publication date: 2012-05-08
Inventors: Jeffrey A. Dean, Sanjay Ghemawat, Gautham Thambidorai
Applicants: Jeffrey A. Dean, Sanjay Ghemawat, Gautham Thambidorai
Applicant address: US CA Mountain View
Assignee: Google Inc.
Current assignee: Google Inc.
Current assignee address: US CA Mountain View
Agent: Morgan, Lewis & Bockius LLP
Main classification: G10L15/06
IPC classification: G10L15/06

Abstract:

A set of documents may be stored and indexed as a compressed sequence of tokens. The documents are grouped into clusters. The sequence of tokens representing each cluster of documents is encoded to elide some repeated instances of tokens. A compressed sequence of tokens is then generated from the compressed cluster sequences. Queries on the compressed sequence are performed by first identifying cluster sequences within it that are likely to contain documents satisfying the query, and then identifying, within those clusters, the documents that actually satisfy the query.
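
The two-stage lookup described in the abstract can be illustrated with a minimal Python sketch. It is not the patented encoding. As assumptions made only for illustration, clusters are supplied by the caller, repeated tokens are elided by storing only each document's difference from the shared prefix of the previous document, and a per-cluster token set serves as the "likely to match" filter. The names `EncodedCluster`, `encode_cluster`, `decode_cluster`, and `query` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EncodedCluster:
    # One (shared_prefix_len, remaining_tokens) pair per document,
    # relative to the previous document in the cluster.
    entries: list
    doc_ids: list
    # Every distinct token in the cluster, stored once (assumed filter).
    vocabulary: frozenset


def common_prefix_len(a, b):
    """Number of leading tokens that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def encode_cluster(docs):
    """docs: list of (doc_id, tokens) pairs already grouped as similar."""
    entries, doc_ids, vocab, prev = [], [], set(), []
    for doc_id, tokens in docs:
        k = common_prefix_len(prev, tokens)
        entries.append((k, tokens[k:]))  # the repeated prefix is elided
        doc_ids.append(doc_id)
        vocab.update(tokens)
        prev = tokens
    return EncodedCluster(entries, doc_ids, frozenset(vocab))


def decode_cluster(cluster):
    """Rebuild each document's full token list from the elided form."""
    prev = []
    for doc_id, (k, rest) in zip(cluster.doc_ids, cluster.entries):
        prev = prev[:k] + rest
        yield doc_id, prev


def query(clusters, terms):
    """Return ids of documents that contain every query term."""
    terms = set(terms)
    hits = []
    for cluster in clusters:
        # Stage 1: skip clusters that cannot contain a match.
        if not terms <= cluster.vocabulary:
            continue
        # Stage 2: check each document in a candidate cluster.
        for doc_id, tokens in decode_cluster(cluster):
            if terms <= set(tokens):
                hits.append(doc_id)
    return hits
```

In this sketch a cluster can pass the first stage without any single document matching, for example when two query terms appear in different documents of the same cluster. That is why the second stage verifies each document individually.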

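IPC classification:

G	Physics
G10	Musical instruments; acoustics
G10L	Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
G10L15/00	Speech recognition (G10L17/00 takes precedence)
G10L15/06	. Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice (G10L15/14 takes precedence)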