Publication No.: US09607029B1
Publication date: 2017-03-28
Application No.: US14573849
Filing date: 2014-12-17
Inventors: Sivaranjini Dharmalingam, Nathan Thomas Close, Shantanu Shailendrakumar Fauji, Sean Gwizdak, Jiahui Jiang, Yohan Mammen, Roshan Rammohan
IPC classification: G06F17/30
CPC classification: G06F17/30324
Abstract: Technologies are disclosed for mapping documents to candidate duplicate documents in a document corpus. A bitset-optimized inverted index is created for a document corpus. A document is received for which candidate duplicate documents in the document corpus are to be identified. The document is tokenized using adaptive tokenization. A determination is made as to whether tokens in the document are represented in the bitset-optimized inverted index. For the tokens that are represented in the index, a list of candidate duplicate documents is created using in-memory bitsets that map each token to the documents in the corpus that contain it.
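
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: plain whitespace splitting stands in for the adaptive tokenization, Python integers stand in for the in-memory bitsets, and the names `tokenize`, `build_index`, `candidate_duplicates`, and the `min_shared` threshold are hypothetical choices made only for this example.

```python
from collections import Counter


def tokenize(text):
    # Stand-in for the adaptive tokenization named in the abstract:
    # plain lowercase whitespace splitting, chosen only for illustration.
    return set(text.lower().split())


def build_index(corpus):
    # Map each token to an integer used as a bitset: bit i is set
    # when document i in the corpus contains the token.
    index = {}
    for doc_id, text in enumerate(corpus):
        for token in tokenize(text):
            index[token] = index.get(token, 0) | (1 << doc_id)
    return index


def candidate_duplicates(index, text, min_shared=2):
    # Count, per document, how many of the query's tokens it shares.
    shared = Counter()
    for token in tokenize(text):
        bits = index.get(token)
        if bits is None:
            continue  # token is not represented in the index
        doc_id = 0
        while bits:
            if bits & 1:
                shared[doc_id] += 1
            bits >>= 1
            doc_id += 1
    # Keep documents that meet the (hypothetical) overlap threshold,
    # ordered by overlap descending, then by document id.
    hits = [(d, n) for d, n in shared.items() if n >= min_shared]
    return [d for d, _ in sorted(hits, key=lambda p: (-p[1], p[0]))]


corpus = [
    "red cotton shirt large",
    "blue denim jeans",
    "large red shirt cotton blend",
]
index = build_index(corpus)
result = candidate_duplicates(index, "red shirt cotton")
```

Here `result` holds the ids of documents 0 and 2, which each share three tokens with the query, while document 1 shares none and is left out. Ranking by shared-token count is an assumption of this sketch; the abstract only states that the bitsets are used to build the candidate list.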