Multi-Pass Duplicate Identification Using Sorted Neighborhoods and Aggregation Techniques

    Publication number: US20180210903A1

    Publication date: 2018-07-26

    Application number: US15413144

    Application date: 2017-01-23

    Applicant: SAP SE

    CPC classification number: G06F16/215

    Abstract: Systems and methods are provided herein for multi-pass duplicate identification using sorted neighborhoods. Data comprising a plurality of data records is received. Neighborhood records are generated by merging the plurality of data records with reference records stored in a remote data store. A resource identification field is assigned to each reference record. A pair distance, for each pair of neighborhood records having different resource identification fields, is determined by calculating a standard deviation of distances between each attribute of the pair scaled by a filled pairs quote value. Possible duplicate records are identified by evaluating each pair distance against a threshold, each possible duplicate having grouped attributes. Final duplicate records are identified by matching each group to a key.
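The pair-distance and thresholding steps in the abstract can be illustrated with a minimal Python sketch. The abstract does not give formulas, so several things here are assumptions rather than the patented method: the per-attribute distance (a `difflib` string dissimilarity), the reading of "standard deviation of distances" as a root mean square about zero (a perfect match), the "filled pairs quote" as the fraction of attributes filled in both records, and "scaled by" as division by that fraction. The names `attribute_distance`, `pair_distance`, `possible_duplicates`, `window`, and `threshold` are hypothetical. The final grouping and key-matching stage is omitted.

```python
from difflib import SequenceMatcher
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def attribute_distance(a: str, b: str) -> float:
    """Dissimilarity in [0, 1]; 0 means identical (assumed metric)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_distance(rec_a: Dict[str, str], rec_b: Dict[str, str],
                  attributes: Sequence[str]) -> float:
    # Only attributes filled in both records contribute a distance.
    filled = [k for k in attributes if rec_a.get(k) and rec_b.get(k)]
    if not filled:
        return float("inf")  # nothing comparable, never a candidate
    dists = [attribute_distance(rec_a[k], rec_b[k]) for k in filled]
    # Assumption: deviation is measured from 0 (a perfect match), i.e. RMS.
    rms = sqrt(sum(d * d for d in dists) / len(dists))
    # Assumption: "filled pairs quote" = share of attributes filled in both.
    filled_quote = len(filled) / len(attributes)
    return rms / filled_quote


def possible_duplicates(records: List[Dict[str, str]],
                        attributes: Sequence[str],
                        sort_keys: Sequence[str],
                        window: int = 3,
                        threshold: float = 0.25) -> List[Tuple[int, int]]:
    found = set()
    for key in sort_keys:  # one sorted-neighborhood pass per sort key
        order = sorted(range(len(records)),
                       key=lambda i: records[i].get(key) or "")
        for pos, first in enumerate(order):
            # Compare each record only with its next (window - 1) neighbors.
            for second in order[pos + 1:pos + window]:
                i, j = sorted((first, second))
                # Skip pairs that share a resource identification field.
                if records[i]["resource_id"] == records[j]["resource_id"]:
                    continue
                if pair_distance(records[i], records[j], attributes) <= threshold:
                    found.add((i, j))
    return sorted(found)
```

Calling `possible_duplicates(records, ["name", "city"], ["name", "city"])` runs two passes, one sorted by each attribute, and returns index pairs whose distance falls at or below the threshold. Under these assumptions a pair that matches perfectly on its few filled attributes still scores 0, so a production design would likely need a stricter treatment of sparse records.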

    Multi-pass duplicate identification using sorted neighborhoods and aggregation techniques

    Publication number: US10409788B2

    Publication date: 2019-09-10

    Application number: US15413144

    Application date: 2017-01-23

    Applicant: SAP SE

    Abstract: Systems and methods are provided herein for multi-pass duplicate identification using sorted neighborhoods. Data comprising a plurality of data records is received. Neighborhood records are generated by merging the plurality of data records with reference records stored in a remote data store. A resource identification field is assigned to each reference record. A pair distance, for each pair of neighborhood records having different resource identification fields, is determined by calculating a standard deviation of distances between each attribute of the pair scaled by a filled pairs quote value. Possible duplicate records are identified by evaluating each pair distance against a threshold, each possible duplicate having grouped attributes. Final duplicate records are identified by matching each group to a key.
