AUTOMATED DATA DUPLICATE IDENTIFICATION

Invention Application

US20160162507A1 AUTOMATED DATA DUPLICATE IDENTIFICATION (Status: under examination, published)

Patent Title: AUTOMATED DATA DUPLICATE IDENTIFICATION
Application No.: US14561927

Application Date: 2014-12-05
Publication No.: US20160162507A1

Publication Date: 2016-06-09
Inventor: Ritesh K. Gupta, Namit Kabra, Manish Kumar, Srinivas K. Mittapalli
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Main IPC: G06F17/30
IPC: G06F17/30

Abstract:

In an approach to identifying duplicates in data, one or more computer processors receive a request from a user to identify duplicates in a data set. The one or more computer processors retrieve the data set utilizing data discovery. The one or more computer processors perform data profiling on the data set. The one or more computer processors determine one or more domain types of the data set, based, at least in part, on the performed data profiling. The one or more computer processors perform data standardization on the data set, based, at least in part, on the one or more determined domain types. Responsive to performing data standardization, the one or more computer processors perform probabilistic matching on the data set. The one or more computer processors identify two or more duplicates in the data set, based, at least in part, on the probabilistic matching.
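
The sequence in the abstract (profiling, domain-type determination, standardization, probabilistic matching) can be illustrated with a minimal Python sketch. This is not the applicants' implementation. The abstract does not specify a matching model, so Fellegi-Sunter-style agreement weights are swapped in here as one common form of probabilistic matching. The function names, the domain heuristics, the m/u probabilities in `M_U`, and the threshold are all hypothetical, chosen only for illustration. The data discovery step is not modeled; records are assumed to be already retrieved as dictionaries sharing the same keys.

```python
import math
import re
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical per-domain probabilities (illustrative values, not from the patent):
# m = P(field agrees | records are duplicates), u = P(field agrees | they are not).
M_U = {"email": (0.95, 0.01), "phone": (0.90, 0.02), "text": (0.85, 0.10)}


def infer_domain(values):
    """Profile one column and guess its domain type (assumed heuristics)."""
    non_empty = [v for v in values if v]
    if non_empty and all("@" in v for v in non_empty):
        return "email"
    if non_empty and all(len(re.sub(r"\D", "", v)) >= 7 for v in non_empty):
        return "phone"
    return "text"


def standardize(value, domain):
    """Normalize a value according to its inferred domain type."""
    if domain == "email":
        return value.strip().lower()
    if domain == "phone":
        return re.sub(r"\D", "", value)[-10:]  # keep up to the last 10 digits
    return " ".join(value.lower().split())  # lowercase, collapse whitespace


def agrees(a, b, domain):
    """Decide field agreement: fuzzy for free text, exact otherwise."""
    if domain == "text":
        return SequenceMatcher(None, a, b).ratio() >= 0.85
    return a == b and a != ""


def find_duplicates(records, threshold=5.0):
    """Return index pairs whose summed log-likelihood weight meets the threshold."""
    if not records:
        return []
    fields = list(records[0].keys())
    # Profiling and domain-type determination, one column at a time.
    domains = {f: infer_domain([r[f] for r in records]) for f in fields}
    # Standardization driven by the determined domain types.
    clean = [{f: standardize(r[f], domains[f]) for f in fields} for r in records]
    pairs = []
    for i, j in combinations(range(len(clean)), 2):
        score = 0.0
        for f in fields:
            m, u = M_U[domains[f]]
            if agrees(clean[i][f], clean[j][f], domains[f]):
                score += math.log2(m / u)  # agreement weight
            else:
                score += math.log2((1 - m) / (1 - u))  # disagreement weight
        if score >= threshold:
            pairs.append((i, j))
    return pairs
```

Calling `find_duplicates` on a list of records returns pairs of record indices judged to be duplicates. The pairwise comparison is quadratic in the number of records; a practical system would typically add a blocking step to limit candidate pairs, which the abstract does not discuss and this sketch omits.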
