Invention Application
- Patent Title: AUTOMATED DATA DUPLICATE IDENTIFICATION
- Patent Title (中): 自动数据重复标识
- Application No.: US14561927
- Application Date: 2014-12-05
- Publication No.: US20160162507A1
- Publication Date: 2016-06-09
- Inventor: Ritesh K. Gupta, Namit Kabra, Manish Kumar, Srinivas K. Mittapalli
- Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
- Main IPC: G06F17/30
- IPC: G06F17/30

Abstract:
In an approach to identifying duplicates in data, one or more computer processors receive a request from a user to identify duplicates in a data set. The one or more computer processors retrieve the data set utilizing data discovery. The one or more computer processors perform data profiling on the data set. The one or more computer processors determine one or more domain types of the data set, based, at least in part, on the performed data profiling. The one or more computer processors perform data standardization on the data set, based, at least in part, on the one or more determined domain types. Responsive to performing data standardization, the one or more computer processors perform probabilistic matching on the data set. The one or more computer processors identify two or more duplicates in the data set, based, at least in part, on the probabilistic matching.
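
The sequence described in the abstract (profiling, domain type determination, standardization, probabilistic matching) can be illustrated with a minimal Python sketch. It is not the method claimed in the application: the abstract does not specify the profiling rules, the standardization rules, or the matching model. The regex patterns, the 0.8 profiling cutoff, the m/u probabilities, the threshold, and all function names below are hypothetical, and the scoring uses Fellegi-Sunter-style agreement weights as one common form of probabilistic matching.

```python
import math
import re
from itertools import combinations

# Hypothetical profiling rules: regex patterns used to guess a column's domain type.
DOMAIN_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^[\d\s().+-]{7,}$"),
}

# Assumed (m, u) probabilities per domain type: m is the chance a field agrees
# for a true duplicate pair, u the chance it agrees for an unrelated pair.
M_U = {"email": (0.95, 0.01), "phone": (0.90, 0.02), "text": (0.85, 0.10)}


def infer_domain(values):
    """Profile one column: return the first domain matching at least 80% of values."""
    for domain, pattern in DOMAIN_PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v.strip()))
        if hits / len(values) >= 0.8:
            return domain
    return "text"


def standardize(value, domain):
    """Normalize a value according to the domain type found by profiling."""
    value = value.strip()
    if domain == "email":
        return value.lower()
    if domain == "phone":
        return re.sub(r"\D", "", value)  # keep digits only
    return " ".join(value.lower().split())  # lowercase, collapse whitespace


def match_weight(a, b, domains):
    """Sum log2 agreement/disagreement weights over all fields of two records."""
    total = 0.0
    for column, domain in domains.items():
        m, u = M_U[domain]
        if a[column] == b[column]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def find_duplicates(records, threshold=5.0):
    """Return index pairs whose match weight meets the (assumed) threshold."""
    columns = list(records[0].keys())
    domains = {c: infer_domain([r[c] for r in records]) for c in columns}
    clean = [{c: standardize(r[c], domains[c]) for c in columns} for r in records]
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(clean), 2)
        if match_weight(a, b, domains) >= threshold
    ]


# Made-up sample records for illustration only.
records = [
    {"name": "Ann  Lee", "email": "Ann.Lee@Example.com", "phone": "(555) 010-2000"},
    {"name": "ann lee", "email": "ann.lee@example.com", "phone": "555-010-2000"},
    {"name": "Bob Ray", "email": "bob@example.com", "phone": "555 010 3000"},
]
duplicates = find_duplicates(records)
print(duplicates)
```

In this sketch the first two records differ only in case, spacing, and phone punctuation, so standardization makes every field agree and the pair clears the threshold. A production system would typically replace exact field equality with approximate comparison and estimate the m/u probabilities from data rather than fixing them by hand.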