Data de-duplication

Granted invention patent


Title: Data de-duplication
Application number: US14716910
Filing date: 2015-05-20
Publication number: US10467203B2
Publication date: 2019-11-05
Inventors: Namit Kabra, Yannick Saillet
Applicant: International Business Machines Corporation
Applicant address: US NY Armonk
Assignee: International Business Machines Corporation
Current assignee: International Business Machines Corporation
Current assignee address: US NY Armonk
Agents: Steven F. McDaniel; David S. Richart; Arnold B. Bangali
Main classification: G06F16/215
IPC classification: G06F16/215; G06F16/23

Abstract:

A method, executed by a computer, for de-duplicating data includes receiving a dataset, pivoting the dataset along a set of columns that have a common domain to provide a pivoted dataset, de-duplicating the pivoted dataset to provide a de-duplicated dataset, and using the de-duplicated dataset. De-duplicating the pivoted dataset may include computing similarity scores for records that have different primary keys and merging records that have a similarity score that exceeds a selected threshold value. The method may include determining the set of columns having a common domain by referencing a business catalog and/or conducting a data classification operation on some or all of the columns of the dataset. The method may also include pivoting the dataset along another set of columns that have a different common domain. A computer system and computer program product corresponding to the method are also disclosed herein.
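The flow described in the abstract (pivot, score, merge) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the sample records, the column names `home_phone` and `work_phone`, the helper names, the use of `difflib.SequenceMatcher` as the similarity score, the 0.9 threshold, and the choice to keep the smallest primary key as the surviving key of each merged group.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical input: two columns share the common domain "phone".
records = [
    {"id": 1, "name": "Ann", "home_phone": "555-0100", "work_phone": "555-0199"},
    {"id": 2, "name": "Ann B.", "home_phone": "555-0199", "work_phone": ""},
    {"id": 3, "name": "Carl", "home_phone": "555-0142", "work_phone": None},
]

def pivot(rows, key, columns, domain):
    """Emit one (key, domain value) row per non-empty value in `columns`."""
    return [
        {key: row[key], domain: row[col]}
        for row in rows
        for col in columns
        if row.get(col)
    ]

def similarity(a, b):
    """Similarity score in [0, 1]; SequenceMatcher is an assumed stand-in."""
    return SequenceMatcher(None, a, b).ratio()

def duplicate_key_pairs(pivoted, key, domain, threshold):
    """Pairs of different primary keys whose values score above the threshold."""
    pairs = set()
    for r1, r2 in combinations(pivoted, 2):
        if r1[key] != r2[key] and similarity(r1[domain], r2[domain]) > threshold:
            pairs.add(tuple(sorted((r1[key], r2[key]))))
    return pairs

def merge_keys(keys, pairs):
    """Group primary keys linked by duplicate pairs (simple union-find)."""
    parent = {k: k for k in keys}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path compression
            k = parent[k]
        return k

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smallest key survives

    groups = {}
    for k in keys:
        groups.setdefault(find(k), []).append(k)
    return groups

pivoted = pivot(records, "id", ["home_phone", "work_phone"], "phone")
pairs = duplicate_key_pairs(pivoted, "id", "phone", threshold=0.9)
groups = merge_keys([r["id"] for r in records], pairs)
print(groups)
```

Pivoting is what lets record 1's work phone be compared with record 2's home phone; a column-by-column comparison of the original rows would miss that match. The all-pairs comparison is quadratic in the number of pivoted rows, so a larger dataset would likely need blocking or indexing before scoring.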

Published/granted documents

US20160092479A1 DATA DE-DUPLICATION Publication/grant date: 2016-03-31


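IPC classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: see G06N)
G06F16/00	Information retrieval; database structures; file system structures
G06F16/20	. Structured data, e.g. relational data
G06F16/21	.. Design, administration or maintenance of databases
G06F16/215	... Improving data quality; data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors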