-
Publication number: US06961721B2
Publication date: 2005-11-01
Application number: US10186031
Application date: 2002-06-28
Applicant: Surajit Chaudhuri , Venkatesh Ganti , Rohit Ananthakrishna
Inventor: Surajit Chaudhuri , Venkatesh Ganti , Rohit Ananthakrishna
CPC classification number: G06F17/30303 , Y10S707/99931 , Y10S707/99942
Abstract: The invention concerns the detection of duplicate tuples in a database. Previous domain-independent detection of duplicated tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches result in large numbers of false positives if they are used to identify domain-specific abbreviations and conventions. In accordance with the invention, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high quality, scalable duplicate detection process.
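The idea of using the table hierarchy as extra evidence can be illustrated with a minimal sketch. It is not the patented process: the abstract does not say how the hierarchy is used, so the sketch assumes that two tuples in a parent dimension table are flagged as likely duplicates when either their names are textually similar or the sets of child tuples that reference them overlap strongly. The function names, the thresholds, and the sample data are all hypothetical.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (stand-in for edit distance)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def child_overlap(children_a: set, children_b: set) -> float:
    """Jaccard overlap of the child tuples that reference each parent tuple."""
    union = children_a | children_b
    if not union:
        return 0.0
    return len(children_a & children_b) / len(union)


def likely_duplicates(name_a, name_b, children_a, children_b,
                      text_threshold=0.8, overlap_threshold=0.5):
    """Flag a pair when the names match textually OR the children co-occur.

    Both thresholds are illustrative assumptions, not values from the patent.
    """
    return (text_similarity(name_a, name_b) >= text_threshold
            or child_overlap(children_a, children_b) >= overlap_threshold)


# Hypothetical State dimension: each state maps to the cities that reference it
# through a key-foreign key relationship.
state_children = {
    "MO": {"st. louis", "kansas city", "springfield"},
    "Missouri": {"st. louis", "kansas city", "columbia"},
    "Montana": {"billings", "missoula"},
}

mo_vs_missouri = likely_duplicates(
    "MO", "Missouri", state_children["MO"], state_children["Missouri"])
mo_vs_montana = likely_duplicates(
    "MO", "Montana", state_children["MO"], state_children["Montana"])
```

In this sketch "MO" and "Missouri" score poorly on string similarity alone, yet they are flagged because they share most of their cities, while "MO" and "Montana" share none and are not flagged. That is the kind of domain-specific abbreviation the abstract says pure string similarity handles badly.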
-
Publication number: US08788502B1
Publication date: 2014-07-22
Application number: US13191345
Application date: 2011-07-26
Applicant: Chase Hensel , Jayakumar Hoskere , Rohit Ananthakrishna
Inventor: Chase Hensel , Jayakumar Hoskere , Rohit Ananthakrishna
CPC classification number: G06F17/30864 , G06F17/30011 , G06F17/30241
Abstract: A server may receive an article that is retrieved from a different server; determine whether the article satisfies first criteria based on content of the article; annotate the article with a first article type when the article satisfies the first criteria; determine whether the article satisfies second criteria based on information associated with the article; annotate the article with a second article type when the article satisfies the second criteria; and store the article in association with a topic and at least one of the first article type or the second article type. The different server may be associated with a news website.
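The annotate-then-store flow in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the criteria or the article types, so the word-count check, the "section" metadata field, the type labels "in_depth" and "opinion", and every identifier below are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Article:
    url: str
    content: str
    metadata: dict
    types: set = field(default_factory=set)


def satisfies_content_criteria(article: Article, min_words: int = 500) -> bool:
    """First criteria: based on the article's own content (assumed: length)."""
    return len(article.content.split()) >= min_words


def satisfies_metadata_criteria(article: Article) -> bool:
    """Second criteria: based on information associated with the article
    (assumed: a 'section' field supplied by the news website)."""
    return article.metadata.get("section") == "opinion"


def annotate_and_store(article: Article, topic: str, store: dict) -> Article:
    """Annotate with each type whose criteria hold, then store under the topic
    only if at least one type was assigned."""
    if satisfies_content_criteria(article):
        article.types.add("in_depth")
    if satisfies_metadata_criteria(article):
        article.types.add("opinion")
    if article.types:
        store[topic].append(article)
    return article


store = defaultdict(list)
long_opinion = Article(
    url="https://news.example/a1",
    content="word " * 600,
    metadata={"section": "opinion"},
)
short_brief = Article(
    url="https://news.example/a2",
    content="a short brief",
    metadata={},
)
annotate_and_store(long_opinion, "elections", store)
annotate_and_store(short_brief, "elections", store)
```

Here the first article meets both sets of criteria and is stored under the topic with two types, while the second meets neither and is not stored.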
-
Publication number: US07685090B2
Publication date: 2010-03-23
Application number: US11182590
Application date: 2005-07-14
Applicant: Surajit Chaudhuri , Venkatesh Ganti , Rohit Ananthakrishna
Inventor: Surajit Chaudhuri , Venkatesh Ganti , Rohit Ananthakrishna
IPC: G06F17/30
CPC classification number: G06F17/30303 , Y10S707/99931 , Y10S707/99942
Abstract: The invention concerns the detection of duplicate tuples in a database. Previous domain-independent detection of duplicated tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches result in large numbers of false positives if they are used to identify domain-specific abbreviations and conventions. In accordance with the invention, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high quality, scalable duplicate detection process.
-
Publication number: US20050262044A1
Publication date: 2005-11-24
Application number: US11182590
Application date: 2005-07-14
Applicant: Surajit Chaudhuri , Venkatesh Ganti , Rohit Ananthakrishna
Inventor: Surajit Chaudhuri , Venkatesh Ganti , Rohit Ananthakrishna
CPC classification number: G06F17/30303 , Y10S707/99931 , Y10S707/99942
Abstract: The invention concerns the detection of duplicate tuples in a database. Previous domain-independent detection of duplicated tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches result in large numbers of false positives if they are used to identify domain-specific abbreviations and conventions. In accordance with the invention, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high quality, scalable duplicate detection process.
-