Text joins for data cleansing and integration in a relational database management system
    1.
    Patent application
    Status: under examination (published)

    Publication No.: US20050027717A1

    Publication date: 2005-02-03

    Application No.: US10828819

    Filing date: 2004-04-21

    IPC classification: G06F7/02 G06F17/30

    Abstract: An organization's data records are often noisy because of transcription errors, incomplete information, and a lack of standard formats for textual data. A fundamental task during data cleansing and integration is matching strings, perhaps across multiple relations, that refer to the same entity (e.g., organization name or address). Furthermore, it is desirable to perform this matching within an RDBMS, which is where the data is likely to reside. In this paper, we adapt the widely used and established cosine similarity metric from the information retrieval field to the relational database context in order to identify potential string matches across relations. We then use this similarity metric to characterize this key aspect of data cleansing and integration as a join between relations on textual attributes, where the similarity of matches exceeds a specified threshold. Computing an exact answer to the text join can be expensive. For query processing efficiency, we propose an approximate, sampling-based approach to the join problem that can be easily and efficiently executed in a standard, unmodified RDBMS. Therefore, the present invention includes a system for string matching across multiple relations in a relational database management system, comprising: generating a set of strings from a set of characters; decomposing each string into a subset of tokens; establishing at least two relations within the strings; establishing a similarity threshold for the relations; sampling the at least two relations; correlating the relations for the similarity threshold; and returning all of the tokens which meet the criteria of the similarity threshold.
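
The exact text join described in this abstract (pairs of strings whose cosine similarity exceeds a threshold) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented method: tokens are assumed to be character 3-grams, weights are assumed to be tf-idf values computed over both relations together, and the join is computed exactly with a nested loop rather than by the sampling-based approximation inside an RDBMS. The names `qgrams`, `tfidf_vectors`, and `text_join` are hypothetical.

```python
import math
from collections import Counter


def qgrams(s, q=3):
    """Split a string into overlapping character q-grams (assumed tokenization)."""
    padded = "#" * (q - 1) + s.lower() + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def tfidf_vectors(strings, q=3):
    """Return one unit-length tf-idf vector (dict: token -> weight) per string."""
    token_lists = [qgrams(s, q) for s in strings]
    n = len(strings)
    # Document frequency: in how many strings does each token occur?
    df = Counter(tok for toks in token_lists for tok in set(toks))
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        vec = {t: c * math.log(1 + n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Guard against an empty token list (possible only when q == 1).
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def text_join(r1, r2, threshold, q=3):
    """Exact text join: all pairs (s1, s2, sim) with cosine similarity > threshold."""
    # Assumption: weights are computed over the union of both relations.
    vecs = tfidf_vectors(list(r1) + list(r2), q)
    v1, v2 = vecs[:len(r1)], vecs[len(r1):]
    result = []
    for s1, a in zip(r1, v1):
        for s2, b in zip(r2, v2):
            # Dot product of two unit vectors is their cosine similarity.
            sim = sum(w * b[t] for t, w in a.items() if t in b)
            if sim > threshold:
                result.append((s1, s2, sim))
    return result
```

Raising `threshold` toward 1.0 keeps only near-identical strings. The nested loop compares every pair, which is the cost that an approximate, sampling-based evaluation is meant to avoid.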

    Method and apparatus for ranked join indices
    4.
    Granted patent
    Status: in force

    Publication No.: US07185012B1

    Publication date: 2007-02-27

    Application No.: US10775056

    Filing date: 2004-02-09

    IPC classification: G06F17/30

    Abstract: A method and apparatus for ranked join indices includes a solution providing performance guarantees for top-k join queries over two relations, when preprocessing to construct a ranked join index for a specific join condition is permitted. The concepts of ranked join indices presented herein are also applicable in the case of a single relation. In this case, the concepts herein provide a solution to the top-k selection problem with monotone linear functions, having guaranteed worst-case search performance for the case of two ranked attributes and arbitrary preference vectors.
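
To make the top-k selection problem with monotone linear functions concrete, here is a small Python sketch for tuples with two ranked attributes. The abstract does not describe the index structure, so the preprocessing shown here is a swapped-in, well-known pruning idea (keeping only tuples dominated by fewer than k others, often called a k-skyband), assuming non-negative preference weights and ignoring ties. The names `top_k` and `k_skyband` are hypothetical.

```python
import heapq


def top_k(tuples, weights, k):
    """Return the k tuples with the highest linear score w1*a + w2*b."""
    w1, w2 = weights
    return heapq.nlargest(k, tuples, key=lambda t: w1 * t[0] + w2 * t[1])


def k_skyband(tuples, k):
    """Preprocessing: keep tuples dominated by fewer than k other tuples.

    u dominates t when u is at least as large as t on both attributes
    (and u differs from t). With non-negative weights, a dominated tuple
    never scores higher than its dominator, so tuples with k or more
    dominators are not needed for any top-k answer.
    """
    kept = []
    for t in tuples:
        dominators = sum(
            1 for u in tuples
            if u != t and u[0] >= t[0] and u[1] >= t[1]
        )
        if dominators < k:
            kept.append(t)
    return kept
```

A query with an arbitrary preference vector can then be answered from the pruned set, e.g. `top_k(k_skyband(data, 2), (3, 1), 2)`, instead of scanning every tuple.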

    Method and system for performing queries on data streams
    5.
    Granted patent
    Status: in force

    Publication No.: US07904444B1

    Publication date: 2011-03-08

    Application No.: US11411478

    Filing date: 2006-04-26

    IPC classification: G06F7/00

    CPC classification: G06F17/30516 Y10S707/922

    Abstract: A method and system for performing a data stream query. A data stream query requiring a join operation on multiple data streams is approximated without performing the join operation. It is determined whether conditions of the query are proper to accurately approximate the join operation, and if the conditions are proper, the join operation is approximated. The join operation is approximated by independently aggregating values of the data streams and comparing the independently aggregated values.
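
The idea of replacing a join with independent aggregation can be illustrated with a toy Python example. Everything concrete here is an assumption for illustration, not taken from the patent: the scenario (counting tuples sent at one monitoring point that never arrive at another), the tuple layout `(flow, seq)`, and the function names. The "proper condition" assumed is that every received tuple matches exactly one sent tuple, in which case the two answers coincide.

```python
from collections import Counter


def lost_per_flow_join(sent, received):
    """Reference answer: match tuples on (flow, seq), i.e. perform the join."""
    received_set = set(received)
    lost = Counter()
    for flow, seq in sent:
        if (flow, seq) not in received_set:
            lost[flow] += 1
    return dict(lost)


def lost_per_flow_aggregate(sent, received):
    """Join-free answer: aggregate each stream on its own, then compare."""
    sent_counts = Counter(flow for flow, _ in sent)
    recv_counts = Counter(flow for flow, _ in received)
    return {
        flow: sent_counts[flow] - recv_counts[flow]
        for flow in sent_counts
        if sent_counts[flow] > recv_counts[flow]
    }
```

The aggregate version keeps only one counter per flow for each stream, so it does not need to buffer individual tuples while waiting for a join partner.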

    Routing XML queries
    6.
    Granted patent
    Status: lapsed

    Publication No.: US07664806B1

    Publication date: 2010-02-16

    Application No.: US10830285

    Filing date: 2004-04-22

    IPC classification: G06F7/00 G06F15/16

    CPC classification: G06F17/30929 G06F17/30545

    Abstract: A vast amount of information currently accessible over the Web, and in corporate networks, is stored in a variety of databases, and is being exported as XML data. However, querying this totality of information in a declarative and timely fashion is problematic because this set of databases is dynamic, and a common schema is difficult to maintain. The present invention provides a solution to the problem of issuing declarative, ad hoc XPath queries against such a dynamic collection of XML databases, and receiving timely answers. Decentralized architectures are proposed, under the open and the agreement cooperation models between a set of sites, for processing queries and updates to XML data. Each site consists of XML data nodes (which export their data as XML, and also pose queries) and one XML router node (which manages the query and update interactions between sites). The architectures differ in the degree of knowledge individual router nodes have about data nodes containing specific XML data. There is therefore provided a method for accessing data over a wide area network comprising: providing a decentralized architecture comprising a plurality of data nodes each having a database, a query processor and a path index, and a plurality of router nodes each having a routing state; maintaining a routing state in each of the router nodes; broadcasting routing state updates from each of the databases to the router nodes; and routing path queries to each of the databases by accessing the routing state.
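
The interaction between routing state updates and query routing can be sketched as follows. This Python fragment is a deliberately simplified illustration, not the claimed architecture: it assumes the routing state is a plain map from an exact path string to the set of data nodes advertising that path, and that queries are simple child-axis paths matched exactly. The class `RouterNode` and its methods are hypothetical names.

```python
class RouterNode:
    """Hypothetical router node whose routing state maps path -> data node ids."""

    def __init__(self):
        self.routing_state = {}

    def apply_update(self, node_id, added=(), removed=()):
        """Apply a routing state update broadcast on behalf of one data node."""
        for path in added:
            self.routing_state.setdefault(path, set()).add(node_id)
        for path in removed:
            nodes = self.routing_state.get(path)
            if nodes is not None:
                nodes.discard(node_id)
                if not nodes:
                    # Drop paths that no data node advertises any more.
                    del self.routing_state[path]

    def route(self, query_path):
        """Return the ids of data nodes that advertise the queried path."""
        return sorted(self.routing_state.get(query_path, set()))
```

A query is forwarded only to the data nodes returned by `route`, so data nodes that hold no matching data are not contacted.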

    Systems and associated computer program products that disguise partitioned data structures using transformations having targeted distributions
    7.
    Granted patent
    Status: in force

    Publication No.: US08209342B2

    Publication date: 2012-06-26

    Application No.: US12262706

    Filing date: 2008-10-31

    IPC classification: G06F17/30

    CPC classification: G06F17/30536 G06F21/6254

    Abstract: A data structure that includes at least one partition containing non-confidential quasi-identifier microdata and at least one other partition containing confidential microdata is formed. The partitioned confidential microdata is disguised by transforming the confidential microdata to conform to a target distribution. The disguised confidential microdata and the quasi-identifier microdata are combined to generate a disguised data structure. The disguised data structure is used to carry out statistical analysis and to respond to a statistical query that is directed to the use of confidential microdata. In this manner, the privacy of the confidential microdata is preserved.
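
One simple way to make a confidential column conform to a target distribution is a rank-based (quantile) mapping, sketched below in Python. The abstract does not say which transformation is used, so this choice is an assumption, as are the record layout (a list of dicts), the function name `disguise`, and the parameter `target_quantile` (the inverse CDF of whatever target distribution is chosen).

```python
def disguise(records, confidential_key, target_quantile):
    """Return copies of records with the confidential column disguised.

    Each confidential value is replaced by the target-distribution quantile
    of its rank, so the disguised column follows the target distribution
    while the quasi-identifier columns are left unchanged.
    """
    n = len(records)
    # Indices of the records, ordered by their confidential value.
    order = sorted(range(n), key=lambda i: records[i][confidential_key])
    disguised = [dict(r) for r in records]  # copy; originals stay intact
    for rank, i in enumerate(order):
        p = (rank + 0.5) / n  # mid-rank probability in (0, 1)
        disguised[i][confidential_key] = target_quantile(p)
    return disguised
```

For instance, `disguise(records, "salary", lambda p: 100 * p)` maps salaries onto a uniform distribution over the interval from 0 to 100 while preserving their relative order.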
