Patent application
- Patent title: DETECTING DUPLICATE DOCUMENTS USING CLASSIFICATION
- Patent title (Chinese): 使用分类检测重复文件
- Application number: US12472758
- Filing date: 2009-05-27
- Publication number: US20100306204A1
- Publication date: 2010-12-02
- Inventors: Srinivas V. Chitiveli, Barton W. Emanuel, Alexander W. Holt, Michael E. Moran
- Applicants: Srinivas V. Chitiveli, Barton W. Emanuel, Alexander W. Holt, Michael E. Moran
- Applicant address: US NY Armonk
- Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
- Current assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
- Current assignee address: US NY Armonk
- Main classification: G06F17/30
- IPC classification: G06F17/30; G06F7/00; G06F12/00
Abstract:
Systems, methods and articles of manufacture are disclosed for detecting a duplicate document. A plurality of documents may be assigned to categories, each category corresponding to a collection of duplicate or near-duplicate documents. A new document may be received. The new document may be evaluated against each category to determine a similarity score between the new document and each category. The new document may be identified as a duplicate based on the similarity scores and thresholds for each category. An action may then be performed on the duplicate based on duplication rules. The thresholds and duplication rules may be customized by a user.
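The flow described in the abstract (score a new document against each category, compare with per-category thresholds, then apply a duplication rule) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the method claimed in the patent: the abstract does not say how similarity is computed, so this sketch uses Jaccard similarity over word shingles, and the names `Category`, `detect_duplicate`, `threshold`, and `action` are hypothetical.

```python
from dataclasses import dataclass, field


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass
class Category:
    """A collection of duplicate or near-duplicate documents (hypothetical model)."""
    name: str
    threshold: float   # user-customizable similarity cut-off (assumed 0..1)
    action: str        # user-customizable duplication rule, e.g. "discard"
    members: list = field(default_factory=list)

    def score(self, doc):
        """Assumed category score: best Jaccard match among member documents."""
        s = shingles(doc)
        return max((jaccard(s, shingles(m)) for m in self.members), default=0.0)


def detect_duplicate(doc, categories):
    """Score doc against every category.

    Returns (category, scores): category is the highest-scoring one whose
    threshold is met, or None if the document is not a duplicate.
    """
    scores = {c.name: c.score(doc) for c in categories}
    hits = [c for c in categories if scores[c.name] >= c.threshold]
    if not hits:
        return None, scores
    return max(hits, key=lambda c: scores[c.name]), scores
```

A caller would look up `category.action` on the returned category to decide what to do with the duplicate. Taking the best match among members as the category score is one design choice among several; an average over members or a trained classifier would fit the abstract equally well.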
Published/granted documents
- US08180773B2 Detecting duplicate documents using classification (publication/grant date: 2012-05-15)