- Patent title: Method for clustering closely resembling data objects
- Application number: US48653
- Filing date: 1998-03-26
- Publication number: US6119124A
- Publication date: 2000-09-12
- Inventors: Andrei Z. Broder, Steven C. Glassman, Charles G. Nelson, Mark S. Manasse, Geoffrey G. Zweig
- Applicants: Andrei Z. Broder, Steven C. Glassman, Charles G. Nelson, Mark S. Manasse, Geoffrey G. Zweig
- Applicant address: Maynard, MA
- Assignee: Digital Equipment Corporation
- Current assignee: Digital Equipment Corporation
- Current assignee address: Maynard, MA
- Main classification: G06F17/30
- IPC classification: G06F17/30
Abstract:
A computer-implemented method determines the resemblance of data objects such as Web pages. Each data object is partitioned into a sequence of tokens. The tokens are grouped into overlapping sets to form shingles. Each shingle is represented by a unique identification element encoded as a fingerprint. For each of a plurality of pseudo-random permutations of the set of all fingerprints, the minimum element of the image of the document's set of fingerprints under that permutation is selected, and the selected minima form a sketch of each data object. The sketches characterize the resemblance of the data objects. The sketches can be further partitioned into a plurality of groups. Each group is fingerprinted to form a feature. Data objects that share more than a certain number of features are estimated to be nearly identical.
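The pipeline in the abstract (tokens, shingles, fingerprints, minima under permutations, grouped features) can be illustrated with a minimal Python sketch. It is not the patented implementation. Everything specific in it is an assumption made for illustration: whitespace tokenization, a shingle width of 4, BLAKE2b truncated to 64 bits as the fingerprint, linear maps `(a*x + b) mod p` standing in for the pseudo-random permutations, 84 permutations, and groups of 14. All function names are hypothetical.

```python
import hashlib
import random

# Assumption: permutations act on integers modulo this Mersenne prime.
PRIME = (1 << 61) - 1


def fingerprint(data: bytes) -> int:
    """Stand-in fingerprint: 64-bit BLAKE2b digest reduced modulo PRIME."""
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big") % PRIME


def shingles(text: str, w: int = 4) -> set:
    """Split text into tokens and return the set of overlapping w-token shingles."""
    tokens = text.lower().split()
    if len(tokens) < w:
        raise ValueError("document is shorter than one shingle")
    return {" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def make_permutations(k: int, seed: int = 0) -> list:
    """Return k (a, b) pairs; x -> (a*x + b) % PRIME is a bijection when a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]


def sketch(text: str, perms: list, w: int = 4) -> list:
    """For each permutation, keep the minimum permuted shingle fingerprint."""
    fps = [fingerprint(s.encode("utf-8")) for s in shingles(text, w)]
    return [min((a * f + b) % PRIME for f in fps) for a, b in perms]


def resemblance(s1: list, s2: list) -> float:
    """Estimate resemblance as the fraction of sketch positions that agree."""
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)


def features(sk: list, group_size: int) -> list:
    """Partition the sketch into groups and fingerprint each group."""
    return [fingerprint(repr((i, sk[i:i + group_size])).encode("utf-8"))
            for i in range(0, len(sk), group_size)]


def shared_features(f1: list, f2: list) -> int:
    """Count the groups whose features match in both documents."""
    return sum(x == y for x, y in zip(f1, f2))


perms = make_permutations(84)
doc_a = ("the quick brown fox jumps over the lazy dog while the cat "
         "sleeps on the warm windowsill every afternoon")
doc_b = doc_a.replace("lazy", "sleepy")  # near-duplicate of doc_a
doc_c = ("stock markets closed higher today after central banks signaled "
         "a pause in interest rate increases worldwide")  # unrelated text

sk_a, sk_b, sk_c = (sketch(d, perms) for d in (doc_a, doc_b, doc_c))
print(resemblance(sk_a, sk_b), resemblance(sk_a, sk_c))
print(shared_features(features(sk_a, 14), features(sk_b, 14)))
```

Under these assumptions, two documents would be flagged as nearly identical when `shared_features` exceeds a chosen threshold. The threshold, like the other parameters here, is a tuning choice that the abstract does not specify.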
Published/granted documents
- USD378744S Restriction indicator (publication/grant date: 1997-04-08)