Invention Application
US20160162507A1 AUTOMATED DATA DUPLICATE IDENTIFICATION Under examination (published)

AUTOMATED DATA DUPLICATE IDENTIFICATION
Abstract:
In an approach to identifying duplicates in data, one or more computer processors receive a request from a user to identify duplicates in a data set. The one or more computer processors retrieve the data set utilizing data discovery. The one or more computer processors perform data profiling on the data set. The one or more computer processors determine one or more domain types of the data set, based, at least in part, on the performed data profiling. The one or more computer processors perform data standardization on the data set, based, at least in part, on the one or more determined domain types. Responsive to performing data standardization, the one or more computer processors perform probabilistic matching on the data set. The one or more computer processors identify two or more duplicates in the data set, based, at least in part, on the probabilistic matching.
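The sequence of steps in the abstract (profiling, domain type determination, standardization, probabilistic matching) can be illustrated with a minimal Python sketch. It is not the method claimed in the application, which does not specify algorithms at this level of detail. The regular expressions, the three domain types, the Fellegi-Sunter style agreement weights with their m/u probabilities, the threshold, and all function names are assumptions chosen for illustration. The data discovery step is skipped, so the records are supplied in memory.

```python
import math
import re
from itertools import combinations

# Hypothetical patterns used to profile a column (assumptions, not from the source).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_RE = re.compile(r"^[\d\s()+.-]{7,}$")

# Assumed (m, u) probabilities per domain type: m is the chance that a field
# agrees for a true duplicate pair, u the chance that it agrees by coincidence.
M_U = {"email": (0.95, 0.01), "phone": (0.90, 0.01), "text": (0.85, 0.10)}


def infer_domain_type(values):
    """Profiling step: guess a column's domain type from its values."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if non_empty and all(EMAIL_RE.match(v) for v in non_empty):
        return "email"
    if non_empty and all(PHONE_RE.match(v) for v in non_empty):
        return "phone"
    return "text"


def standardize(value, domain_type):
    """Standardization step: normalize a value according to its domain type."""
    value = (value or "").strip()
    if domain_type == "email":
        return value.lower()
    if domain_type == "phone":
        return re.sub(r"\D", "", value)  # keep digits only
    return re.sub(r"\s+", " ", value).lower()  # collapse whitespace, lowercase


def match_weight(a, b, domain_type):
    """Log-likelihood weight for one field: positive on agreement, negative otherwise."""
    if not a or not b:
        return 0.0  # a missing value gives no evidence either way
    m, u = M_U[domain_type]
    return math.log2(m / u) if a == b else math.log2((1 - m) / (1 - u))


def find_duplicates(records, threshold=6.0):
    """Run profiling, standardization, and pairwise scoring over a list of dicts."""
    columns = list(records[0].keys())
    types = {c: infer_domain_type([r[c] for r in records]) for c in columns}
    clean = [{c: standardize(r[c], types[c]) for c in columns} for r in records]
    pairs = []
    for i, j in combinations(range(len(clean)), 2):
        score = sum(match_weight(clean[i][c], clean[j][c], types[c]) for c in columns)
        if score >= threshold:
            pairs.append((i, j))
    return types, pairs


records = [
    {"name": "Ann  Lee", "email": "Ann.Lee@Example.com", "phone": "(555) 010-1234"},
    {"name": "ann lee", "email": "ann.lee@example.com", "phone": "555.010.1234"},
    {"name": "Bob Ray", "email": "bob@example.com", "phone": "555 010 9999"},
]
types, pairs = find_duplicates(records)
print(types)
print(pairs)
```

Standardizing before matching is what lets the first two records agree on every field even though their raw strings differ. Comparing every pair is quadratic in the number of records, so a larger data set would need some form of blocking, which this sketch leaves out.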