-
Publication No.: US09081817B2
Publication date: 2015-07-14
Application No.: US13084527
Filing date: 2011-04-11
Applicant: Arvind Arasu, Michaela Götz, Shriraghav Kaushik
Inventor: Arvind Arasu, Michaela Götz, Shriraghav Kaushik
CPC classification: G06F17/30507, G06N99/005
Abstract: An active learning record matching system and method for producing a record matching package that is used to identify pairs of duplicate records. Embodiments of the system and method allow a precision threshold to be specified and then generate a learned record matching package having precision greater than this threshold and a recall close to the best possible recall. Embodiments of the system and method use a blocking technique to restrict the space of record matching packages considered and to scale to large inputs. The learning method considers several record matching packages, estimates the precision and recall of the packages, and identifies the package with maximum recall having precision greater than or equal to the given precision threshold. A human domain expert labels a sample of record pairs in the output of the package as matches or non-matches, and this labeling is used to estimate the precision of the package.
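The selection step described in the abstract can be illustrated with a minimal sketch. It is not the patented method: the names `select_package`, `candidates`, and `label` are hypothetical, each candidate package is represented only by the list of record pairs it outputs, and blocking is omitted. Precision is estimated from a labeled random sample of a package's output. Because the total number of true duplicates is the same for every package, the sketch assumes that estimated precision times output size can stand in for recall when ranking packages.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Pair = Tuple[str, str]


def select_package(
    candidates: Dict[str, List[Pair]],
    label: Callable[[Pair], bool],
    precision_threshold: float,
    sample_size: int = 50,
    seed: int = 0,
) -> Optional[str]:
    """Return the name of the candidate with the highest estimated recall
    among those whose estimated precision meets the threshold.

    candidates: hypothetical mapping from package name to its output pairs.
    label: stands in for the human expert; True means the pair is a match.
    """
    rng = random.Random(seed)
    best_name: Optional[str] = None
    best_matches = -1.0
    for name, output_pairs in candidates.items():
        if not output_pairs:
            continue  # an empty output finds no duplicates at all
        # Ask the "expert" to label a random sample of the output.
        k = min(sample_size, len(output_pairs))
        sample = rng.sample(output_pairs, k)
        precision = sum(label(pair) for pair in sample) / k
        if precision < precision_threshold:
            continue
        # Assumption: estimated true matches is a proxy for recall,
        # since every package shares the same recall denominator.
        estimated_matches = precision * len(output_pairs)
        if estimated_matches > best_matches:
            best_name, best_matches = name, estimated_matches
    return best_name
```

With a looser threshold the sketch tends to pick a package that outputs more pairs, and with a stricter one it falls back to a smaller, cleaner output. It returns `None` when no candidate reaches the threshold.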
-
Publication No.: US20120259802A1
Publication date: 2012-10-11
Application No.: US13084527
Filing date: 2011-04-11
Applicant: Arvind Arasu, Michaela Götz, Shriraghav Kaushik
Inventor: Arvind Arasu, Michaela Götz, Shriraghav Kaushik
IPC classification: G06F15/18
CPC classification: G06F17/30507, G06N99/005
Abstract: An active learning record matching system and method for producing a record matching package that is used to identify pairs of duplicate records. Embodiments of the system and method allow a precision threshold to be specified and then generate a learned record matching package having precision greater than this threshold and a recall close to the best possible recall. Embodiments of the system and method use a blocking technique to restrict the space of record matching packages considered and to scale to large inputs. The learning method considers several record matching packages, estimates the precision and recall of the packages, and identifies the package with maximum recall having precision greater than or equal to the given precision threshold. A human domain expert labels a sample of record pairs in the output of the package as matches or non-matches, and this labeling is used to estimate the precision of the package.
-