Multi-pass duplicate identification using sorted neighborhoods and aggregation techniques

Invention Grant

US10409788B2 Multi-pass duplicate identification using sorted neighborhoods and aggregation techniques (Status: in force)

Patent Title: Multi-pass duplicate identification using sorted neighborhoods and aggregation techniques
Application No.: US15413144

Application Date: 2017-01-23
Publication No.: US10409788B2

Publication Date: 2019-09-10
Inventor: Larissa Heissler, Andre Adam, Philipp Mail, Florian Hoffmann
Applicant: SAP SE
Applicant Address: DE Walldorf
Assignee: SAP SE
Current Assignee: SAP SE
Current Assignee Address: DE Walldorf
Agency: Jones Day
Main IPC: G06F16/215
IPC: G06F16/215

Abstract:

Systems and methods are provided herein for multi-pass duplicate identification using sorted neighborhoods. Data comprising a plurality of data records is received. Neighborhood records are generated by merging the plurality of data records with reference records stored in a remote data store. A resource identification field is assigned to each reference record. A pair distance, for each pair of neighborhood records having different resource identification fields, is determined by calculating a standard deviation of distances between each attribute of the pair scaled by a filled pairs quote value. Possible duplicate records are identified by evaluating each pair distance against a threshold, each possible duplicate having grouped attributes. Final duplicate records are identified by matching each group to a key.
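The abstract gives no formulas, so the following Python sketch shows only one possible reading of the comparison step, not the patented method. It sorts records on a key, compares each record with its neighbors inside a sliding window, skips pairs that share the same resource identification field, and scores the remaining pairs. Everything concrete here is an assumption made for illustration: the attribute names, the `source_id` field, the string distance from `difflib`, the reading of "standard deviation of distances" as a root-mean-square deviation from zero, the definition of the filled pairs quote as the fraction of attributes filled in both records, and the choice to divide by it.

```python
import math
from difflib import SequenceMatcher

# Hypothetical attribute names; the abstract does not list any.
ATTRIBUTES = ("name", "city", "phone")


def attribute_distance(a: str, b: str) -> float:
    """Return 0.0 for identical strings and 1.0 for strings with nothing in common."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_distance(rec_a: dict, rec_b: dict, attributes=ATTRIBUTES):
    """Score one pair of records; lower means more similar.

    Assumption: the deviation is measured from zero (the distance of a
    perfect match) and divided by the filled pairs quote, so that pairs
    with few comparable attributes are penalized.
    """
    filled = [(rec_a[k], rec_b[k]) for k in attributes
              if rec_a.get(k) and rec_b.get(k)]
    if not filled:
        return None  # nothing to compare
    quote = len(filled) / len(attributes)
    squared = [attribute_distance(x, y) ** 2 for x, y in filled]
    deviation = math.sqrt(sum(squared) / len(squared))
    return deviation / quote


def possible_duplicates(records, sort_key, window=3, threshold=0.2):
    """One sorted-neighborhood pass: compare each record with the next window-1 records."""
    ordered = sorted(records, key=lambda r: r.get(sort_key) or "")
    hits = []
    for i, rec_a in enumerate(ordered):
        for rec_b in ordered[i + 1:i + window]:
            # Only pairs with different resource identification fields are scored.
            if rec_a["source_id"] == rec_b["source_id"]:
                continue
            d = pair_distance(rec_a, rec_b)
            if d is not None and d <= threshold:
                hits.append((rec_a["id"], rec_b["id"], d))
    return hits


records = [
    {"id": 1, "source_id": "incoming", "name": "Anna Schmidt", "city": "Walldorf", "phone": "123"},
    {"id": 2, "source_id": "reference", "name": "Anna Schmidt", "city": "Walldorf", "phone": "123"},
    {"id": 3, "source_id": "reference", "name": "Zed Other", "city": "Berlin", "phone": "999"},
]
hits = possible_duplicates(records, sort_key="name")
```

A multi-pass variant would presumably call `possible_duplicates` once per sort key and combine the hits. The grouping of attributes and the final matching of each group to a key are not sketched here because the abstract does not describe them in enough detail.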

Public/Granted literature

US20180210903A1 Multi-Pass Duplicate Identification Using Sorted Neighborhoods and Aggregation Techniques (publication date: 2018-07-26)

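IPC Classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: see G06N)
G06F16/00	Information retrieval; database structures; file system structures
G06F16/20	. Structured data, e.g. relational data
G06F16/21	. . Design, administration or maintenance of databases
G06F16/215	. . . Improving data quality; data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors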