Method and system for filtering of information entities
    12.
    Granted invention patent
    Status: no longer in force

    Publication No.: US06996572B1

    Publication date: 2006-02-07

    Application No.: US08947221

    Filing date: 1997-10-08

    IPC classification: G06F17/00

    Abstract: A system and method are provided for eliciting interesting structure from a collection of entities or resources with explicit and/or implicit, static and/or dynamic relations, called “affinities,” between them. Interesting structure includes (1) notions of quality, authority, or definitiveness of information, (2) notions of relevance to a user's information need, (3) notions of similarity among the plurality of resources retrieved from a universe of resources by a query process, and (4) notions of similarity among the usages of resources by different users/servers. Similarities between entities are computed, based on similarities between the affinity values for the entities. That is, where the affinity values for two entities resemble each other, the two entities have a high degree of similarity. Using the similarities, the entities are ranked, clustered, etc., based on a significance derived from the similarities. The ranking, clustering, etc., makes up the interesting structure which is sought.
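
The similarity step in this abstract can be illustrated with a minimal sketch. The abstract does not name a similarity measure or a ranking rule, so cosine similarity, the summed-similarity score, and all names below (`cosine`, `rank_by_similarity`, the sample `affinity` table) are assumptions chosen for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length affinity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_by_similarity(affinity):
    """Rank entities by their total similarity to all other entities.

    `affinity` maps an entity name to its row of affinity values.
    Summing similarities is an assumed stand-in for "significance".
    """
    names = list(affinity)
    score = {
        a: sum(cosine(affinity[a], affinity[b]) for b in names if b != a)
        for a in names
    }
    return sorted(names, key=lambda n: score[n], reverse=True)

# Hypothetical affinity values: A and B resemble each other, C does not.
affinity = {
    "A": [1.0, 1.0, 0.0],
    "B": [1.0, 0.9, 0.0],
    "C": [0.0, 0.0, 1.0],
}
ranking = rank_by_similarity(affinity)
```

Here entities A and B have nearly parallel affinity vectors and so score as highly similar, while C shares no affinities with either and ranks last.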

    Method and system for trawling the World-wide Web to identify implicitly-defined communities of web pages
    13.
    Granted invention patent
    Status: no longer in force

    Publication No.: US06886129B1

    Publication date: 2005-04-26

    Application No.: US09449697

    Filing date: 1999-11-24

    IPC classification: G06F17/30

    Abstract: A method and system for identifying groups of pages of common interest from a collection of hyper-linked pages are disclosed. A plurality of community cores are identified from the collection where each core includes first and second sets of pages, and each page in the first set points to every page in the second set. Each identified core is expanded into a full community which is a subset of the pages regarding a particular topic. The identification of community cores is based on the analysis of the Web graph in which the communities correspond to instances of Web subgraphs. Extraneous pages are then pruned to improve the quality of the resulting communities.
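
The core condition described above (every page in the first set points to every page in the second set) is that of a complete bipartite subgraph. The following is a brute-force sketch of that condition only, not the patent's trawling procedure; the names `find_cores`, "fans" (first set) and "centers" (second set), and the sample `links` graph are hypothetical.

```python
from itertools import combinations

def find_cores(links, i, j):
    """Return (fans, centers) pairs where at least i fans link to all j centers.

    `links` maps a page to the set of pages it points to. The search
    enumerates every candidate center set, so it only suits tiny graphs.
    """
    targets = sorted(set().union(*links.values()))
    cores = []
    for centers in combinations(targets, j):
        # A fan qualifies only if it points to every page in the center set.
        fans = sorted(p for p, out in links.items() if set(centers) <= out)
        if len(fans) >= i:
            cores.append((fans, list(centers)))
    return cores

# Hypothetical link graph: p1 and p2 both point to x and y.
links = {"p1": {"x", "y"}, "p2": {"x", "y", "z"}, "p3": {"z"}}
cores = find_cores(links, 2, 2)
```

In this example the only core with two fans and two centers is ({p1, p2}, {x, y}); expansion into a full community and pruning of extraneous pages are not shown.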

    System and method for hybrid hash join using over-partitioning to respond to database query
    15.
    Granted invention patent
    Status: no longer in force

    Publication No.: US06226639B1

    Publication date: 2001-05-01

    Application No.: US09158741

    Filing date: 1998-09-22

    IPC classification: G06F17/30

    Abstract: A system and method for joining a build table to a probe table in response to a query for data includes over-partitioning the build table into “N” build partitions using a uniform hash function and writing the build partitions into main memory of a database computer. When the main memory becomes full, one or more partitions are selected as victim partitions to be written to disk storage, and the process continues until all build table rows or tuples have either been written into main memory or spilled to disk. Then, a packing algorithm is used to initially designate never-spilled partitions as “winners” and spilled partitions as “losers”, and then to randomly select one or more winners for prospective swapping with one or more losers. The I/O savings associated with each prospective swap is determined and, if any savings would be realized, the winners are designated as losers and the losers are designated as winners. The swap determination can be made multiple times, e.g., 256, after which losers are moved entirely to disk and winners are moved entirely to memory. At the end of the swapping, probe table rows associated with winner partitions are joined to rows in the winner build partitions while probe table rows associated with loser partitions are spilled to disk. Then, the loser build partitions are written to main memory for joining with corresponding probe table partitions, to undertake the requested join of the build table and probe table in an I/O- and memory-efficient manner.
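
The build phase described above (hash-partition the build rows, spill a victim partition when memory fills) can be sketched as follows. This covers only the partition-and-spill step; the winner/loser swapping is omitted. The function name `partition_build`, the row-count memory budget, and the victim policy (evict the largest in-memory partition) are assumptions, since the abstract does not say how victims are chosen. The sketch assumes `memory_rows` is at least 1.

```python
def partition_build(rows, key, n_partitions, memory_rows):
    """Hash-partition build rows, spilling partitions when memory is full.

    Returns (in_memory, spilled): two dicts mapping partition id to rows.
    Lists stand in for main memory and disk storage respectively.
    """
    in_memory = {p: [] for p in range(n_partitions)}
    spilled = {}
    used = 0                                # rows currently held in memory
    for row in rows:
        p = hash(key(row)) % n_partitions
        if p in spilled:
            spilled[p].append(row)          # partition already lives on disk
            continue
        if used >= memory_rows:
            # Memory is full: evict the largest in-memory partition (assumed policy).
            victim = max(in_memory, key=lambda q: len(in_memory[q]))
            used -= len(in_memory[victim])
            spilled[victim] = in_memory.pop(victim)
            if p == victim:
                spilled[p].append(row)
                continue
        in_memory[p].append(row)
        used += 1
    return in_memory, spilled

# Ten integer rows, four partitions, room for four rows in memory.
in_memory, spilled = partition_build(range(10), lambda r: r, 4, 4)
```

After this phase, the never-spilled partitions in `in_memory` would be the initial "winners" and those in `spilled` the initial "losers" in the abstract's terminology.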

    Index partition maintenance over monotonically addressed document sequences
    17.
    Granted invention patent
    Status: in force

    Publication No.: US08738673B2

    Publication date: 2014-05-27

    Application No.: US12875615

    Filing date: 2010-09-03

    IPC classification: G06F17/30

    Abstract: Provided are techniques for partitioning a physical index into one or more physical partitions; assigning each of the one or more physical partitions to a node in a cluster of nodes; for each received document, assigning an assigned-doc-ID comprising an integer document identifier; and, in response to assigning the assigned-doc-ID to a document, determining a cut-off of assignment of new documents to a current virtual-index-epoch comprising a first set of physical partitions and placing the new documents into a new virtual-index-epoch comprising a second set of physical partitions by inserting each new document to a specific one of the physical partitions in the second set using one or more functions that direct the placement based on one of the assigned-doc-id, a field value derived from a set of fields obtained from the document, and a combination of the assigned-doc-id and the field value.
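
One way to picture the epoch mechanism is a router that records the first document ID of each epoch and places documents with a function of the ID. The sketch below is an assumption-laden illustration: the class name `EpochRouter`, its methods, and the modulo placement function are all hypothetical, and placement by field value is not shown.

```python
import bisect

class EpochRouter:
    """Route monotonically assigned doc IDs to partitions, epoch by epoch."""

    def __init__(self, first_partitions):
        self.starts = [0]                       # first doc ID of each epoch
        self.partitions = [list(first_partitions)]
        self.next_id = 0

    def start_epoch(self, new_partitions):
        """Cut off the current epoch; later documents go to new_partitions."""
        self.starts.append(self.next_id)
        self.partitions.append(list(new_partitions))

    def locate(self, doc_id):
        """Return the partition holding doc_id (modulo placement is assumed)."""
        epoch = bisect.bisect_right(self.starts, doc_id) - 1
        parts = self.partitions[epoch]
        return parts[doc_id % len(parts)]

    def insert(self):
        """Assign the next integer doc ID and return (doc_id, partition)."""
        doc_id = self.next_id
        self.next_id += 1
        return doc_id, self.locate(doc_id)

router = EpochRouter(["p0", "p1"])
first = [router.insert() for _ in range(4)]     # doc IDs 0..3 in epoch 0
router.start_epoch(["p2", "p3", "p4"])          # cut-off at doc ID 4
later = router.insert()                         # doc ID 4 lands in epoch 1
```

Because IDs only grow, a binary search over the epoch start IDs is enough to find which set of partitions an older document belongs to, so earlier epochs never need to be re-partitioned.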

    Generating and using a dynamic bloom filter
    18.
    Granted invention patent
    Status: no longer in force

    Publication No.: US08209368B2

    Publication date: 2012-06-26

    Application No.: US12134148

    Filing date: 2008-06-05

    IPC classification: G06F17/10

    CPC classification: G06F12/0864

    Abstract: A dynamic Bloom filter comprises a cascaded set of Bloom filters. The system estimates or guesses a cardinality of input items, selects a number of hash functions based on the desired false positive rate, and allocates memory for an initial Bloom filter based on the estimated cardinality and desired false positive rate. The system inserts items into the initial Bloom filter and counts the bits set as they are inserted. If the number of bits set in the current Bloom filter reaches a predetermined target, the system declares the current Bloom filter full. The system recursively generates additional Bloom filters as needed for items remaining after the initial Bloom filter is filled; items are checked to eliminate duplicates. Each of the set of Bloom filters is individually queried to identify a positive or negative in response to a query. When the system is configured such that the false positive rate of each successive Bloom filter is decreased by one half, the system guarantees a false positive rate of at most twice the desired false positive rate.
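
The factor-of-two guarantee follows from a geometric series: if the filters have false positive rates p, p/2, p/4, and so on, a query that checks all of them has an overall rate bounded by their sum, which is less than 2p. A minimal sketch of such a cascade is given below. The class names, the SHA-256 double hashing, the half-full target, and the reuse of the same capacity for later filters are assumptions, not details taken from the patent.

```python
import hashlib
import math

class Bloom:
    def __init__(self, capacity, fp_rate):
        # Standard sizing: m = -n ln p / (ln 2)^2 bits, k = log2(1/p) hashes.
        self.m = max(8, math.ceil(-capacity * math.log(fp_rate) / math.log(2) ** 2))
        self.k = max(1, round(-math.log2(fp_rate)))
        self.bits = bytearray(self.m)       # one byte per bit, for clarity
        self.bits_set = 0

    def _positions(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            if not self.bits[pos]:
                self.bits[pos] = 1
                self.bits_set += 1          # count bits as they are set

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

    def full(self):
        # Assumed target: declare the filter full once half its bits are set.
        return self.bits_set >= self.m // 2


class DynamicBloom:
    def __init__(self, estimated_items, fp_rate):
        self.capacity = estimated_items
        self.filters = [Bloom(estimated_items, fp_rate)]
        self.next_rate = fp_rate / 2        # each new filter halves the rate

    def add(self, item):
        if item in self:                    # skip duplicates already recorded
            return
        if self.filters[-1].full():
            self.filters.append(Bloom(self.capacity, self.next_rate))
            self.next_rate /= 2
        self.filters[-1].add(item)

    def __contains__(self, item):
        # Query every filter in the cascade individually.
        return any(item in f for f in self.filters)

db = DynamicBloom(estimated_items=100, fp_rate=0.01)
for n in range(500):                        # five times the estimated cardinality
    db.add(n)
```

Underestimating the cardinality here simply grows the cascade instead of degrading the first filter, which is the behavior the abstract describes.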

    INDEX PARTITION MAINTENANCE OVER MONOTONICALLY ADDRESSED DOCUMENT SEQUENCES
    19.
    Published invention application
    Status: in force

    Publication No.: US20120059823A1

    Publication date: 2012-03-08

    Application No.: US12875615

    Filing date: 2010-09-03

    IPC classification: G06F17/30

    Abstract: Provided are techniques for partitioning a physical index into one or more physical partitions; assigning each of the one or more physical partitions to a node in a cluster of nodes; for each received document, assigning an assigned-doc-ID comprising an integer document identifier; and, in response to assigning the assigned-doc-ID to a document, determining a cut-off of assignment of new documents to a current virtual-index-epoch comprising a first set of physical partitions and placing the new documents into a new virtual-index-epoch comprising a second set of physical partitions by inserting each new document to a specific one of the physical partitions in the second set using one or more functions that direct the placement based on one of the assigned-doc-id, a field value derived from a set of fields obtained from the document, and a combination of the assigned-doc-id and the field value.

    System and method for generating a cache-aware bloom filter
    20.
    Granted invention patent
    Status: no longer in force

    Publication No.: US08032732B2

    Publication date: 2011-10-04

    Application No.: US12134125

    Filing date: 2008-06-05

    IPC classification: G06F12/00

    CPC classification: G06F17/10

    Abstract: A cache-aware Bloom filter system segments a bit vector of a cache-aware Bloom filter into fixed-size blocks. The system hashes an item to be inserted into the cache-aware Bloom filter to identify one of the fixed-size blocks as a selected block for receiving the item and hashes the item k times to generate k hashed values for encoding the item for insertion in the selected block. The system sets bits within the selected block with addresses corresponding to the k hashed values such that accessing the item in the cache-aware Bloom filter requires accessing only the selected block to check the k hashed values. The size of the fixed-size block corresponds to a cache-line size of an associated computer architecture on which the cache-aware Bloom filter is installed.
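
The block-selection idea can be sketched in a few lines: one hash picks the block, and the k probe bits are then confined to that block. Python cannot control cache-line placement, so this sketch only shows the addressing scheme. The class name `BlockedBloom`, the 512-bit block (64 bytes, a common cache-line size), the default k, and the SHA-256 based hashing are assumptions.

```python
import hashlib

class BlockedBloom:
    """Bloom filter whose k probe bits for an item all fall in one block."""

    def __init__(self, n_blocks, block_bits=512, k=4):
        # 512 bits = 64 bytes, a common cache-line size (an assumption here).
        self.n_blocks, self.block_bits, self.k = n_blocks, block_bits, k
        self.blocks = [0] * n_blocks        # each block stored as a Python int

    def _hashes(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        # First hash selects the block; the rest derive k bit offsets in it.
        block = int.from_bytes(digest[:8], "big") % self.n_blocks
        h1 = int.from_bytes(digest[8:16], "big")
        h2 = int.from_bytes(digest[16:24], "big") | 1
        mask = 0
        for i in range(self.k):
            mask |= 1 << ((h1 + i * h2) % self.block_bits)
        return block, mask

    def add(self, item):
        block, mask = self._hashes(item)
        self.blocks[block] |= mask          # touches a single block only

    def __contains__(self, item):
        block, mask = self._hashes(item)
        return (self.blocks[block] & mask) == mask

bf = BlockedBloom(n_blocks=16)
bf.add("apple")
bf.add("pear")
```

A lookup reads exactly one block regardless of k, which is the property the abstract attributes to matching the block size to the cache-line size.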
