Revealing content reuse using coarse analysis

Granted invention patent

US11710330B2 Revealing content reuse using coarse analysis (in force)

Patent title: Revealing content reuse using coarse analysis
Application number: US16460980
Filing date: 2019-07-02
Publication number: US11710330B2
Publication date: 2023-07-25
Inventors: Nathan Roy Evans, Christopher Miles White, Jonathan Karl Larson, Darren Keith Edge
Applicant: Microsoft Technology Licensing, LLC
Applicant address: US WA Redmond
Patentee: Microsoft Technology Licensing, LLC
Current patentee: Microsoft Technology Licensing, LLC
Current patentee address: US WA Redmond
Agency: Schwegman Lundberg & Woessner, P.A.
Main classification: G06F16/906
IPC classifications: G06F16/906; G06F16/901; G06F40/216; G06V30/414; G06V30/416

Abstract:

Systems and methods for managing content provenance are provided. A network system accesses a plurality of documents. The plurality of documents is then hashed to identify one or more content features within each of the documents. In one embodiment, the hash is a MinHash. The network system compares the content features of each of the plurality of documents to determine a similarity score between each of the plurality of documents. In one embodiment, the similarity score is a Jaccard score. The network system then clusters the plurality of documents into one or more clusters based on the similarity score of each of the plurality of documents. In one embodiment, the clustering is performed using DBSCAN. DBSCAN can be iteratively performed with decreasing epsilon values to derive clusters of related but relatively dissimilar documents. The clustering information associated with the clusters is stored for use during runtime.
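The pipeline named in the abstract (MinHash features, a Jaccard similarity score, DBSCAN clustering repeated over decreasing epsilon values) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the patent: the word-shingle features, the salted BLAKE2b hash family, the function names, the epsilon schedule, min_samples=2, and the choice to simply record one set of labels per epsilon value.

```python
import hashlib
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles in text (assumed feature choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(features, num_perm=128):
    """MinHash signature: for each salted hash, keep the minimum over all features."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f.encode(), digest_size=8, salt=salt).digest(), "big")
            for f in features))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of the Jaccard score."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dbscan(dist, eps, min_samples=2):
    """Minimal DBSCAN over a precomputed distance matrix; -1 marks noise."""
    n = len(dist)
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbors) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = [m for m in range(n) if dist[j][m] <= eps]
            if len(reachable) >= min_samples:  # j is a core point, so expand from it
                queue.extend(reachable)
    return labels


def cluster_documents(docs, eps_schedule=(0.8, 0.5, 0.2), num_perm=128):
    """Run DBSCAN once per epsilon (decreasing) on distance = 1 - estimated Jaccard."""
    sigs = [minhash(shingles(d), num_perm) for d in docs]
    n = len(docs)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = 1.0 - estimated_jaccard(sigs[i], sigs[j])
    return {eps: dbscan(dist, eps) for eps in eps_schedule}


docs = [
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "the quick brown fox jumps over the lazy dog near the river bank yesterday",
    "completely unrelated text about stock market prices and quarterly earnings reports",
]
labels_by_eps = cluster_documents(docs)
```

In this sketch, labels_by_eps maps each epsilon value to one cluster label per document, so the first two near-duplicate documents can be compared against the third across the schedule. How the per-epsilon results are combined and stored for runtime use is left open here, since the abstract does not specify it.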

Published/granted documents

US20210004583A1 Revealing Content Reuse Using Coarse Analysis, publication/grant date: 2021-01-07

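IPC classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: see G06N)
G06F16/00	Information retrieval; database structures therefor; file system structures therefor
G06F16/90	. Details of database functions independent of the retrieved data types
G06F16/906	.. Clustering; classification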