Large scale data storage in sparse tables
    1.
    Granted patent (in force)

    Publication No.: US07428524B2

    Publication date: 2008-09-23

    Application No.: US11197925

    Filing date: 2005-08-05

    IPC classification: G06F17/30

    Abstract: Each of a plurality of data items is stored in a table data structure. A row identifier and column identifier are associated with each respective data item, and each respective item is stored at a logical location in the table data structure specified by its row identifier and column identifier. A plurality of data items is stored in a cell of the table data structure, and a timestamp is associated with each of the plurality of data items stored in the cell. Each of the data items stored in the cell has the same row identifier, the same column identifier, and a distinct timestamp. In some embodiments, each row identifier is a string of arbitrary length and arbitrary value. Similarly, in some embodiments each column identifier is a string of arbitrary length and arbitrary value.
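
    The cell model in this abstract can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the class name `SparseTable`, its methods, and the choice of a dictionary keyed by (row, column) are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Optional


class SparseTable:
    """Toy sparse table: (row id, column id) -> {timestamp: value}."""

    def __init__(self):
        # Only cells that were written take up space, so the table is sparse.
        self._cells = defaultdict(dict)

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        # Items in one cell share row and column and differ only by timestamp.
        self._cells[(row, column)][timestamp] = value

    def get(self, row: str, column: str, timestamp: Optional[int] = None):
        """Return the value at `timestamp`, or the newest value if none is given."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def versions(self, row: str, column: str):
        """Return all (timestamp, value) pairs of a cell, newest first."""
        return sorted(self._cells.get((row, column), {}).items(), reverse=True)


table = SparseTable()
table.put("com.example/index", "contents", 1, b"v1")
table.put("com.example/index", "contents", 2, b"v2")
print(table.get("com.example/index", "contents"))
```

    Row and column identifiers are plain strings of any length here, matching the "arbitrary length and arbitrary value" wording of the abstract.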


    Storing a sparse table using locality groups
    2.
    Granted patent (in force)

    Publication No.: US07567973B1

    Publication date: 2009-07-28

    Application No.: US11197924

    Filing date: 2005-08-05

    IPC classification: G06F17/30

    Abstract: Each of a plurality of data items is stored in a table data structure. The table structure includes a plurality of columns. Each of the columns is associated with one of a plurality of locality groups. Each locality group is stored as one or more corresponding locality group files that include the data items in the columns associated with the respective locality group. In some embodiments, the columns of the table data structure may be grouped into groups of columns and each group of columns is associated with one of a plurality of locality groups. Each locality group is stored as one or more corresponding locality group files that include the data items in the group of columns associated with the respective locality group.
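
    As a rough sketch of the idea, the snippet below splits data items into one bucket per locality group, where each bucket stands in for a locality group file. The column names, group names, and the function `partition_by_locality_group` are hypothetical, and sorting each bucket by row is an assumption that the abstract does not state.

```python
from collections import defaultdict

# Hypothetical assignment of columns to locality groups.
COLUMN_TO_GROUP = {
    "contents": "page_data",
    "language": "metadata",
    "pagerank": "metadata",
}


def partition_by_locality_group(items, column_to_group):
    """Split (row, column, value) items into one bucket per locality group."""
    groups = defaultdict(list)
    for row, column, value in items:
        groups[column_to_group[column]].append((row, column, value))
    # Assumption: keep each bucket sorted so a group can be scanned in row order.
    return {name: sorted(bucket) for name, bucket in groups.items()}


items = [
    ("r2", "language", "en"),
    ("r1", "contents", "<html>"),
    ("r1", "pagerank", "0.5"),
]
for group, bucket in partition_by_locality_group(items, COLUMN_TO_GROUP).items():
    print(group, bucket)
```

    A reader that only needs the small metadata columns can then scan the `metadata` bucket without touching the much larger page contents.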


    Representative document selection for a set of duplicate documents
    3.
    Granted patent (in force)

    Publication No.: US08868559B2

    Publication date: 2014-10-21

    Application No.: US13599707

    Filing date: 2012-08-30

    IPC classification: G06F7/00 G06F17/30

    Abstract: Systems and methods for indexing a representative document from a set of duplicate documents are disclosed. Disclosed systems and methods comprise selecting a first document in a plurality of documents on the basis that the first document is associated with a query independent score. Each respective document in the plurality of documents has a fingerprint that indicates that the respective document has substantially identical content to every other document in the plurality of documents. Disclosed systems and methods further comprise indexing, in accordance with the query independent score, the first document thereby producing an indexed first document. With respect to the plurality of documents, only the indexed first document is included in a document index.
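
    One way to picture the selection step is sketched below. Two simplifications are assumed and are not taken from the patent: the fingerprint is an exact SHA-256 hash of whitespace- and case-normalized text (the abstract only requires "substantially identical" content), and the representative is the duplicate with the highest query-independent score. The function names are hypothetical.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Assumed fingerprint: hash of lowercased, whitespace-normalized text."""
    normalized = " ".join(content.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def select_representatives(docs):
    """docs: iterable of (url, content, score).

    Returns {fingerprint: (url, score)} with one entry per duplicate set,
    keeping the document with the highest query-independent score.
    """
    best = {}
    for url, content, score in docs:
        fp = fingerprint(content)
        if fp not in best or score > best[fp][1]:
            best[fp] = (url, score)
    return best


docs = [
    ("a.example/x", "Hello  World", 0.2),
    ("b.example/x", "hello world", 0.9),
    ("c.example/y", "Other", 0.1),
]
for url, score in select_representatives(docs).values():
    print(url, score)
```

    Only the documents returned by `select_representatives` would be passed on to the indexer; the other duplicates are left out of the index.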


    Using text surrounding hypertext links when indexing and generating page summaries
    4.
    Granted patent (in force)

    Publication No.: US08495483B1

    Publication date: 2013-07-23

    Application No.: US10386110

    Filing date: 2003-03-12

    IPC classification: G06F17/00 G06F17/30

    CPC classification: G06F17/30864

    Abstract: Web quotes are gathered from web pages that link to a web page of interest. The web quote may include text from the paragraphs that contain the hypertext links to the page of interest as well as text from other portions of the linked web page, such as text from a nearby header. The obtained web quotes may be ranked based on quality or relevance and may then be incorporated into a search engine's document index or into summary information returned to users in response to a search query.
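
    The gathering step can be sketched with Python's standard `html.parser` module. The class `WebQuoteCollector` is hypothetical, it only handles links inside `<p>` elements with an exactly matching `href`, and it leaves out the nearby-header text and the ranking that the abstract also mentions.

```python
from html.parser import HTMLParser


class WebQuoteCollector(HTMLParser):
    """Collect the text of every paragraph that links to `target_url`."""

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.quotes = []
        self._in_paragraph = False
        self._text = []
        self._links_to_target = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # Start collecting a new paragraph.
            self._in_paragraph = True
            self._text = []
            self._links_to_target = False
        elif tag == "a" and self._in_paragraph:
            if dict(attrs).get("href") == self.target_url:
                self._links_to_target = True

    def handle_data(self, data):
        if self._in_paragraph:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            if self._links_to_target:
                # Collapse runs of whitespace before storing the quote.
                self.quotes.append(" ".join("".join(self._text).split()))
            self._in_paragraph = False


page = (
    '<p>See the <a href="http://example.com/target">great  guide</a> here.</p>'
    '<p>Unrelated <a href="http://example.com/other">link</a>.</p>'
)
collector = WebQuoteCollector("http://example.com/target")
collector.feed(page)
print(collector.quotes)
```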


    REPRESENTATIVE DOCUMENT SELECTION FOR A SET OF DUPLICATE DOCUMENTS
    5.
    Patent application (in force)

    Publication No.: US20120323896A1

    Publication date: 2012-12-20

    Application No.: US13599707

    Filing date: 2012-08-30

    IPC classification: G06F17/30

    Abstract: Systems and methods for indexing a representative document from a set of duplicate documents are disclosed. Disclosed systems and methods comprise selecting a first document in a plurality of documents on the basis that the first document is associated with a query independent score. Each respective document in the plurality of documents has a fingerprint that indicates that the respective document has substantially identical content to every other document in the plurality of documents. Disclosed systems and methods further comprise indexing, in accordance with the query independent score, the first document thereby producing an indexed first document. With respect to the plurality of documents, only the indexed first document is included in a document index.


    Efficient indexing of documents with similar content
    8.
    Granted patent (in force)

    Publication No.: US08244530B2

    Publication date: 2012-08-14

    Application No.: US13249136

    Filing date: 2011-09-29

    IPC classification: G10L15/06

    CPC classification: G06F17/3071

    Abstract: A set of documents may be stored and indexed as a compressed sequence of tokens. A set of documents is grouped into clusters. Sequences of tokens representing the clusters of documents are encoded to elide some repeating instances of tokens. A compressed sequence of tokens is generated from the compressed cluster sequences of tokens. Queries on the compressed sequence are performed by identifying cluster sequences within the compressed sequence that are likely to have documents that satisfy the query and then identifying, within these identified clusters, the documents that actually satisfy the query.


    Data compression of large scale data stored in sparse tables
    9.
    Granted patent (in force)

    Publication No.: US07548928B1

    Publication date: 2009-06-16

    Application No.: US11197922

    Filing date: 2005-08-05

    IPC classification: G06F17/30

    Abstract: A method of compressing data in a table data structure begins by accessing a data set within the table data structure, the data set having associated therewith a range of rows of the table data structure. Data items in the data set are represented by key-value pairs. The method includes applying a first compression to the values of the key-value pairs in the data set to produce a first compressed output; applying a second compression, distinct from the first compression, to the keys of the key-value pairs in the data set to produce a second compressed output; and applying a third compression to the first compressed output and second compressed output to produce a first compressed output block, wherein the third compression is distinct from the first compression and second compression.
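
    The three-stage structure described in this abstract can be sketched as follows. The concrete algorithms are assumptions chosen only because they are distinct and readily available: zlib for the values, shared-prefix (front) coding for the sorted keys, and bz2 for the final pass. The abstract does not name any algorithm, and the function names and framing format are hypothetical.

```python
import bz2
import struct
import zlib


def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def compress_block(pairs):
    """pairs: list of (key, value) byte strings, sorted by key."""
    # First compression: zlib over the length-prefixed values.
    values_blob = b"".join(struct.pack(">I", len(v)) + v for _, v in pairs)
    first = zlib.compress(values_blob)

    # Second compression: front coding, each key stores only the suffix
    # that differs from the previous key.
    parts, prev = [], b""
    for key, _ in pairs:
        shared = shared_prefix_len(prev, key)
        suffix = key[shared:]
        parts.append(struct.pack(">II", shared, len(suffix)) + suffix)
        prev = key
    second = b"".join(parts)

    # Third compression: bz2 over both earlier outputs, framed together.
    return bz2.compress(struct.pack(">I", len(first)) + first + second)


def decompress_block(block):
    """Inverse of compress_block; returns the list of (key, value) pairs."""
    framed = bz2.decompress(block)
    (first_len,) = struct.unpack_from(">I", framed, 0)
    values_blob = zlib.decompress(framed[4:4 + first_len])
    keys_blob = framed[4 + first_len:]

    values, pos = [], 0
    while pos < len(values_blob):
        (n,) = struct.unpack_from(">I", values_blob, pos)
        values.append(values_blob[pos + 4:pos + 4 + n])
        pos += 4 + n

    keys, pos, prev = [], 0, b""
    while pos < len(keys_blob):
        shared, n = struct.unpack_from(">II", keys_blob, pos)
        prev = prev[:shared] + keys_blob[pos + 8:pos + 8 + n]
        keys.append(prev)
        pos += 8 + n
    return list(zip(keys, values))
```

    Treating keys and values separately lets each stage exploit a different kind of redundancy: neighbouring keys in a row range tend to share long prefixes, while values tend to repeat content across rows.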


    Distributed crawling of hyperlinked documents
    10.
    Granted patent (in force)

    Publication No.: US08812478B1

    Publication date: 2014-08-19

    Application No.: US13608598

    Filing date: 2012-09-10

    IPC classification: G06F17/30

    CPC classification: G06F17/30864

    Abstract: Techniques for crawling hyperlinked documents are provided. Hyperlinked documents to be crawled are grouped by host and the host to be crawled next is selected according to a stall time of the host. The stall time can indicate the earliest time that the host should be crawled and the stall times can be a predetermined amount of time, vary by host and be adjusted according to actual retrieval times from the host.
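
    A small scheduler sketch shows how per-host stall times could drive host selection. The class `HostScheduler`, its method names, and the rule "stall for at least `min_delay`, or twice the last retrieval time" are assumptions for illustration; the abstract only says that stall times may be fixed, vary by host, and be adjusted from actual retrieval times.

```python
import heapq
from collections import defaultdict, deque


class HostScheduler:
    """Pick the next host to crawl by earliest stall time."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self._pending = defaultdict(deque)  # host -> URLs waiting to be crawled
        self._stall = {}                    # host -> earliest time of next crawl
        self._heap = []                     # (stall_time, host) for idle hosts
        self._queued = set()                # hosts currently in the heap
        self._busy = set()                  # hosts with a fetch in flight

    def add(self, host, url):
        self._pending[host].append(url)
        self._enqueue(host)

    def _enqueue(self, host):
        if host in self._queued or host in self._busy or not self._pending[host]:
            return
        heapq.heappush(self._heap, (self._stall.get(host, 0.0), host))
        self._queued.add(host)

    def next_url(self, now):
        """Return (host, url) for the earliest due host, or None if none is due."""
        if not self._heap or self._heap[0][0] > now:
            return None
        _, host = heapq.heappop(self._heap)
        self._queued.discard(host)
        self._busy.add(host)
        return host, self._pending[host].popleft()

    def done(self, host, now, retrieval_time):
        """Record a finished fetch and push the host's stall time forward."""
        # Assumed policy: slow hosts are given a proportionally longer pause.
        self._stall[host] = now + max(self.min_delay, 2.0 * retrieval_time)
        self._busy.discard(host)
        self._enqueue(host)
```

    A crawler loop would call `next_url(now)` to get work, fetch the page, and then call `done(host, now, retrieval_time)` so that a slow host is revisited later than a fast one.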
