Granted invention patent
- Patent title: Optimized mapping of documents to candidate duplicate documents in a document corpus
- Application number: US14573849
- Filing date: 2014-12-17
- Publication number: US09607029B1
- Publication date: 2017-03-28
- Inventors: Sivaranjini Dharmalingam, Nathan Thomas Close, Shantanu Shailendrakumar Fauji, Sean Gwizdak, Jiahui Jiang, Yohan Mammen, Roshan Rammohan
- Applicant: Amazon Technologies, Inc.
- Applicant address: Seattle, WA, US
- Assignee: Amazon Technologies, Inc.
- Current assignee: Amazon Technologies, Inc.
- Current assignee address: Seattle, WA, US
- Agent: Lee & Hayes, PLLC
- Main classification: G06F17/30
- IPC classification: G06F17/30
Abstract:
Technologies are disclosed for mapping documents to candidate duplicate documents in a document corpus. A bitset optimized inverted index is created for a document corpus. A document is received for which candidate duplicate documents in the document corpus are to be identified. The document is tokenized using adaptive tokenization. A determination is made as to whether tokens in the document are represented in the bitset optimized inverted index. A list of candidate duplicate documents is created for tokens represented in the optimized inverted index utilizing in-memory bitsets that map tokens to documents that contain the tokens in the document corpus.
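
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not define "adaptive tokenization", so a plain lowercase word split stands in for it, and the names `tokenize`, `build_index`, `candidate_duplicates`, and the `min_shared` threshold are hypothetical. Each token maps to a Python integer used as an in-memory bitset, where bit i is set when document i contains the token.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a simple lowercase word split stands in for the
    # "adaptive tokenization" named in the abstract, which is not defined there.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(corpus):
    """Map each token to an int bitset: bit i is set if document i has the token."""
    index = {}
    for doc_id, text in enumerate(corpus):
        for token in tokenize(text):
            index[token] = index.get(token, 0) | (1 << doc_id)
    return index


def candidate_duplicates(index, document, min_shared=2):
    """Return (doc_id, shared_token_count) pairs, most shared tokens first.

    Ranking by shared-token count and the min_shared cutoff are illustrative
    assumptions, not details taken from the abstract.
    """
    shared = Counter()
    for token in tokenize(document):
        bits = index.get(token)
        if bits is None:
            continue  # token is not represented in the index
        doc_id = 0
        while bits:  # walk the set bits of this token's bitset
            if bits & 1:
                shared[doc_id] += 1
            bits >>= 1
            doc_id += 1
    hits = [(d, n) for d, n in shared.items() if n >= min_shared]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


corpus = [
    "red cotton shirt size large",
    "blue denim jeans size medium",
    "large red shirt made of cotton",
]
index = build_index(corpus)
candidates = candidate_duplicates(index, "red cotton shirt large")
print(candidates)  # [(0, 4), (2, 4)]
```

In this toy corpus, documents 0 and 2 each share four tokens with the incoming document and are returned as candidates, while document 1 shares none and is skipped. A production system would likely use a compressed bitset structure rather than arbitrary-precision integers; that choice is outside what the abstract states.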