Leveraging constraints for deduplication
    61.
    Patent application
    Leveraging constraints for deduplication (in force)

    Publication No.: US20080288482A1

    Publication date: 2008-11-20

    Application No.: US11804400

    Filing date: 2007-05-18

    IPC classification: G06F17/30

    CPC classification: G06F17/30489

    Abstract: A deduplication algorithm that provides improved accuracy in data deduplication by using aggregate and/or groupwise constraints. Deduplication satisfies as many of these constraints as possible rather than imposing them inflexibly as hard constraints. Additionally, textual similarity between tuples is leveraged to restrict the search space. The algorithm begins with a coarse initial partition of data records and continues by raising the similarity threshold until the threshold splits a given partition. This sequence of splits defines a rich space of alternatives. Over this space, an algorithm finds a partition of the input that maximizes constraint satisfaction. In the context of groupwise aggregation constraints for deduplication, all SQL (Structured Query Language) aggregates are allowed, including summation.

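    Illustrative sketch (not taken from the patent text): the split-and-select idea in the abstract might look roughly like the Python below. Everything concrete here is an assumption made for illustration: the word-level Jaccard similarity, the (name, amount) record layout, the "amounts sum to 100" constraint, and the objective of counting how many groups satisfy the constraint. All function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Word-level Jaccard similarity between two strings (assumed similarity measure)."""
    x, y = set(a.split()), set(b.split())
    return len(x & y) / len(x | y) if x | y else 1.0


def components(records, threshold):
    """Connected components of the graph linking records whose names are
    at least `threshold` similar (simple union-find)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if jaccard(records[i][0], records[j][0]) >= threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)
    return list(groups.values())


def best_partition(group, thresholds, satisfied):
    """Either keep `group` whole or split it at the first higher threshold
    that separates it, whichever yields more constraint-satisfying groups."""
    keep = (int(satisfied(group)), [group])
    for k, t in enumerate(thresholds):
        parts = components(group, t)
        if len(parts) > 1:  # this threshold splits the group
            results = [best_partition(p, thresholds[k + 1:], satisfied) for p in parts]
            split = (sum(s for s, _ in results), [g for _, gs in results for g in gs])
            return split if split[0] > keep[0] else keep
    return keep


def deduplicate(records, thresholds, satisfied):
    """Start from the coarse partition at thresholds[0], then refine each group."""
    score, partition = 0, []
    for group in components(records, thresholds[0]):
        s, groups = best_partition(group, thresholds[1:], satisfied)
        score += s
        partition += groups
    return score, partition


# Hypothetical data: (name, billed amount); a group is "satisfied" if it sums to 100.
records = [("acme corp", 60), ("acme corp ltd", 40), ("acme inc", 100), ("zeta llc", 100)]
total_is_100 = lambda group: sum(amount for _, amount in group) == 100
score, partition = deduplicate(records, [0.3, 0.5, 0.9], total_is_100)
```

    In this toy run the coarse group of three "acme" records violates the constraint, so it is split once; the pair "acme corp" / "acme corp ltd" is then kept together because splitting it further would satisfy fewer groups.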

    EFFICIENT EVALUATION OF OBJECT FINDER QUERIES
    62.
    Patent application
    EFFICIENT EVALUATION OF OBJECT FINDER QUERIES (lapsed)

    Publication No.: US20070288421A1

    Publication date: 2007-12-13

    Application No.: US11423303

    Filing date: 2006-06-09

    IPC classification: G06F17/30

    CPC classification: G06F17/30964

    Abstract: The subject disclosure pertains to a class of object finder queries that return the best target objects that match a set of given keywords. Mechanisms are provided that facilitate identification of target objects related to search objects that match a set of query keywords. Scoring mechanisms/functions are also disclosed that compute relevance scores of target objects. Further, efficient early termination techniques are provided to compute the top K target objects based on a scoring function.

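    Illustrative sketch (not taken from the patent text): one well-known way to compute a top-K result with early termination is a threshold-style algorithm over per-keyword ranked lists. The Python below uses that generic technique as a stand-in; the patent's own scoring functions and termination conditions are not reproduced. It assumes non-negative integer scores, a sum as the scoring function, at least one input list, and target objects already scored per keyword. The name `top_k` and the sample data are hypothetical.

```python
import heapq


def top_k(ranked_lists, k):
    """ranked_lists: one list per keyword of (target, score) pairs, sorted by
    score descending. Returns (top-k pairs, number of rows read per list)."""
    lookup = [dict(lst) for lst in ranked_lists]  # random access by target
    totals = {}                                   # full score of every target seen so far
    longest = max(len(lst) for lst in ranked_lists)
    for depth in range(longest):
        threshold = 0  # upper bound on the total score of any unseen target
        for lst in ranked_lists:
            if depth < len(lst):
                obj, score = lst[depth]
                threshold += score
                if obj not in totals:
                    totals[obj] = sum(d.get(obj, 0) for d in lookup)
        best = heapq.nlargest(k, totals.items(), key=lambda item: item[1])
        # Stop early once the k-th best seen score cannot be beaten by unseen targets.
        if len(best) == k and best[-1][1] >= threshold:
            return best, depth + 1
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1]), longest


# Hypothetical relevance scores of four target objects for two keywords.
by_database = [("p1", 9), ("p2", 8), ("p3", 1), ("p4", 1)]
by_cleaning = [("p2", 9), ("p1", 7), ("p4", 2), ("p3", 1)]
best, rows_read = top_k([by_database, by_cleaning], k=2)
```

    Here the scan stops after two of the four rows in each list, because no unseen target can score higher than the current second-best total.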

    Efficient fuzzy match for evaluating data records
    63.
    Granted patent
    Efficient fuzzy match for evaluating data records (in force)

    Publication No.: US07296011B2

    Publication date: 2007-11-13

    Application No.: US10600083

    Filing date: 2003-06-20

    IPC classification: G06F7/00 G06F17/30

    Abstract: To help ensure high data quality, data warehouses validate and, if needed, clean incoming data tuples from external sources. In many situations, input tuples or portions of input tuples must match acceptable tuples in a reference table. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A disclosed system implements an efficient and accurate approximate or fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any of the multiple tuples in the reference relation. A disclosed similarity function that utilizes token substrings referred to as q-grams overcomes limitations of prior art similarity functions while efficiently performing a fuzzy match process.

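    Illustrative sketch (not taken from the patent text): the Python below shows the general idea of matching an incoming tuple against a reference table through q-grams of its tokens. It uses plain Jaccard similarity over 3-gram sets and a linear scan, which is a deliberate simplification and not the similarity function or index disclosed in the patent. The names `qgrams` and `fuzzy_match`, the 0.5 cut-off, and the sample reference rows are hypothetical.

```python
def qgrams(text, q=3):
    """Set of q-grams taken from each lower-cased token; short tokens are kept whole."""
    grams = set()
    for token in text.lower().split():
        if len(token) < q:
            grams.add(token)
        else:
            grams.update(token[i:i + q] for i in range(len(token) - q + 1))
    return grams


def fuzzy_match(input_tuple, reference, q=3, min_sim=0.5):
    """Return (closest reference tuple, similarity), or (None, best similarity)
    when nothing reaches `min_sim`. Similarity is q-gram Jaccard (an assumption)."""
    query = qgrams(" ".join(input_tuple), q)
    best, best_sim = None, 0.0
    for ref in reference:
        grams = qgrams(" ".join(ref), q)
        union = query | grams
        sim = len(query & grams) / len(union) if union else 0.0
        if sim > best_sim:
            best, best_sim = ref, sim
    return (best, best_sim) if best_sim >= min_sim else (None, best_sim)


# Hypothetical reference relation and a misspelled incoming tuple.
reference = [
    ("Boeing Company", "Seattle"),
    ("Bon Corporation", "Seattle"),
    ("Companions", "Seattle"),
]
match, sim = fuzzy_match(("Beoing Company", "Seattle"), reference)
no_match, _ = fuzzy_match(("Zzz", "Nowhere"), reference)
```

    The misspelled "Beoing" still shares most of its q-grams with the "Boeing Company" row, so that row is returned as the cleaned value, while the unrelated tuple finds no acceptable match.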

    Primitive operator for similarity joins in data cleaning
    64.
    Patent application
    Primitive operator for similarity joins in data cleaning (in force)

    Publication No.: US20070192342A1

    Publication date: 2007-08-16

    Application No.: US11352141

    Filing date: 2006-02-10

    IPC classification: G06F7/00

    Abstract: A set similarity join system and method are provided. The system can be employed to facilitate data cleaning based on similarities through the identification of “close” tuples (e.g., records and/or rows). “Closeness” can be evaluated using a similarity function(s) chosen to suit the domain and/or application. Thus, the system facilitates generic domain-independent data cleansing. The system can be employed with a foundational primitive, the set similarity join (SSJoin) operator, which can be used as a building block to implement a broad variety of notions of similarity (e.g., edit similarity, Jaccard similarity, generalized edit similarity, Hamming distance, Soundex, etc.) as well as similarity based on co-occurrences. The SSJoin operator can exploit the observation that set overlap can be used effectively to support a variety of similarity functions. The SSJoin operator compares values based on “sets” associated with (or explicitly constructed for) each one of them.

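    Illustrative sketch (not taken from the patent text): the Python below shows how a set-overlap join could serve as a building block for a Jaccard similarity join, in the spirit of the SSJoin operator described above. The inverted-index implementation, the function names `ssjoin` and `jaccard_join`, and the sample data are assumptions; the operator's actual definition and physical implementation in the patent are not reproduced.

```python
from collections import defaultdict


def ssjoin(left, right, overlap_ok):
    """left/right: dict of id -> set. Returns (left_id, right_id, overlap) for
    pairs sharing at least one element for which
    overlap_ok(overlap, len(left_set), len(right_set)) holds."""
    index = defaultdict(list)  # element -> ids of right-hand sets containing it
    for rid, elems in right.items():
        for e in elems:
            index[e].append(rid)
    results = []
    for lid, elems in left.items():
        counts = defaultdict(int)  # right id -> number of shared elements
        for e in elems:
            for rid in index.get(e, ()):
                counts[rid] += 1
        for rid, overlap in counts.items():
            if overlap_ok(overlap, len(elems), len(right[rid])):
                results.append((lid, rid, overlap))
    return results


def jaccard_join(left, right, t):
    """Jaccard >= t implies overlap >= t * max(|a|, |b|), so the overlap join
    prunes candidates before the exact Jaccard check."""
    candidates = ssjoin(left, right, lambda o, a, b: o >= t * max(a, b))
    return [
        (l, r) for l, r, o in candidates
        if o / (len(left[l]) + len(right[r]) - o) >= t
    ]


# Hypothetical word sets built from organization names.
left = {"a1": {"microsoft", "corp"}, "a2": {"oracle", "inc"}}
right = {
    "b1": {"microsoft", "corporation"},
    "b2": {"microsoft", "corp", "usa"},
    "b3": {"oracle", "inc"},
}
pairs = sorted(jaccard_join(left, right, 0.6))
```

    Other similarity notions would plug in a different overlap predicate and a different way of constructing the sets (for example q-grams for edit similarity), while reusing the same overlap join.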