Back-off language model compression
    11.
    Granted invention patent
    Back-off language model compression (Status: in force)

    Publication number: US08725509B1

    Publication date: 2014-05-13

    Application number: US12486358

    Filing date: 2009-06-17

    CPC classification: G10L15/183 G06F17/277

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, relating to language models stored for digital language processing. In one aspect, a method includes the actions of generating a language model, including: receiving a collection of n-grams from a corpus, each n-gram of the collection having a corresponding first probability of occurring in the corpus; generating a trie representing the collection of n-grams, the trie being represented using one or more arrays of integers; and compressing an array representation of the trie using block encoding; and using the language model to identify a second probability of a particular string of words occurring.
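A minimal sketch of block-encoding one integer array, such as a sorted array of word IDs that a trie level might hold. The abstract does not say which block encoding the patent uses; the fixed block size, the per-block anchor value, and the varint-coded deltas below are all illustrative assumptions, and the function names are hypothetical.

```python
from typing import List, Tuple


def varint_encode(n: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, low group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def varint_decode(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def block_encode(values: List[int], block_size: int = 4) -> List[Tuple[int, bytes]]:
    """Split a non-decreasing array into blocks of (anchor, varint deltas).

    Assumption: values are sorted, so every delta is non-negative.
    """
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        deltas = b"".join(varint_encode(b - a) for a, b in zip(chunk, chunk[1:]))
        blocks.append((chunk[0], deltas))
    return blocks


def block_decode(blocks: List[Tuple[int, bytes]]) -> List[int]:
    """Rebuild the original array from its blocks."""
    values = []
    for anchor, deltas in blocks:
        values.append(anchor)
        current, pos = anchor, 0
        while pos < len(deltas):
            delta, pos = varint_decode(deltas, pos)
            current += delta
            values.append(current)
    return values
```

Because each block stores its own anchor, a lookup can decode a single block without touching the rest of the array, which is one common motivation for block-wise schemes.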

    Efficient indexing of documents with similar content
    12.
    Granted invention patent
    Efficient indexing of documents with similar content (Status: in force)

    Publication number: US08554561B2

    Publication date: 2013-10-08

    Application number: US13571316

    Filing date: 2012-08-09

    IPC classification: G10L15/06

    CPC classification: G06F17/3071

    Abstract: A computer system comprising one or more processors and memory groups a set of documents into a plurality of clusters. Each cluster includes one or more documents of the set of documents, and a respective cluster of documents of the plurality of clusters includes respective cluster data corresponding to a plurality of documents including a first document and a second document. The computer system determines that the second document includes duplicate data that is duplicative of corresponding data in the first document, identifies a respective subset of the respective cluster data that excludes at least a subset of the duplicate data, and generates an index of the respective subset of the respective cluster data.

    Distributed crawling of hyperlinked documents
    13.
    Granted invention patent
    Distributed crawling of hyperlinked documents (Status: in force)

    Publication number: US08266134B1

    Publication date: 2012-09-11

    Application number: US11923240

    Filing date: 2007-10-24

    IPC classification: G06F17/30

    CPC classification: G06F17/30864

    Abstract: Techniques for crawling hyperlinked documents are provided. Hyperlinked documents to be crawled are grouped by host, and the host to be crawled next is selected according to a stall time of the host. The stall time can indicate the earliest time that the host should be crawled. Stall times can be a predetermined amount of time, can vary by host, and can be adjusted according to actual retrieval times from the host.
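The host selection described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the patented method: the class name `HostScheduler`, the parameters `min_delay` and `factor`, and the rule "next stall time = finish time + max(min_delay, factor x retrieval time)" are hypothetical choices made for the example.

```python
from collections import defaultdict, deque
from typing import Optional
from urllib.parse import urlsplit


class HostScheduler:
    """Group pending URLs by host and pick the host with the earliest stall time."""

    def __init__(self, min_delay: float = 1.0, factor: float = 2.0):
        self.min_delay = min_delay         # assumed fixed politeness delay, in seconds
        self.factor = factor               # assumed multiplier on observed retrieval time
        self.pending = defaultdict(deque)  # host -> URLs waiting to be crawled
        self.stall = {}                    # host -> earliest time it may be crawled

    def add_url(self, url: str) -> None:
        host = urlsplit(url).netloc
        self.pending[host].append(url)
        self.stall.setdefault(host, 0.0)

    def next_url(self, now: float) -> Optional[str]:
        """Return a URL from the host with the smallest stall time, if that time has passed."""
        ready = [host for host, queue in self.pending.items() if queue]
        if not ready:
            return None
        host = min(ready, key=lambda h: self.stall[h])
        if self.stall[host] > now:
            return None  # even the earliest host must still wait
        return self.pending[host].popleft()

    def record_fetch(self, url: str, finished_at: float, retrieval_seconds: float) -> None:
        """Push the host's stall time back, scaled by how slow the fetch was."""
        host = urlsplit(url).netloc
        delay = max(self.min_delay, self.factor * retrieval_seconds)
        self.stall[host] = finished_at + delay
```

A linear scan over hosts keeps the sketch short; a crawler tracking many hosts would more plausibly keep them in a priority queue keyed by stall time.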

    Efficient indexing of documents with similar content
    14.
    Granted invention patent
    Efficient indexing of documents with similar content (Status: in force)

    Publication number: US08244530B2

    Publication date: 2012-08-14

    Application number: US13249136

    Filing date: 2011-09-29

    IPC classification: G10L15/06

    CPC classification: G06F17/3071

    Abstract: A set of documents may be stored and indexed as a compressed sequence of tokens. A set of documents is grouped into clusters. Sequences of tokens representing the clusters of documents are encoded to elide some repeating instances of tokens. A compressed sequence of tokens is generated from the compressed cluster sequences of tokens. Queries on the compressed sequence are performed by identifying cluster sequences within the compressed sequence that are likely to have documents that satisfy the query and then identifying, within these identified clusters, the documents that actually satisfy the query.

    Data compression of large scale data stored in sparse tables
    15.
    Granted invention patent
    Data compression of large scale data stored in sparse tables (Status: in force)

    Publication number: US07548928B1

    Publication date: 2009-06-16

    Application number: US11197922

    Filing date: 2005-08-05

    IPC classification: G06F17/30

    Abstract: A method of compressing data in a table data structure begins by accessing a data set within the table data structure, the data set having associated therewith a range of rows of the table data structure. Data items in the data set are represented by key-value pairs. The method includes applying a first compression to the values of the key-value pairs in the data set to produce a first compressed output; applying a second compression, distinct from the first compression, to the keys of the key-value pairs in the data set to produce a second compressed output; and applying a third compression to the first compressed output and second compressed output to produce a first compressed output block, wherein the third compression is distinct from the first compression and second compression.
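The three-stage scheme in this abstract can be sketched as below. The abstract does not name the individual compression algorithms, so the choices here are assumptions made only to keep the three stages distinct: zlib for values, front coding (shared-prefix elision) for sorted keys, and bz2 for the combined block. The function names are hypothetical.

```python
import bz2
import json
import struct
import zlib
from os.path import commonprefix
from typing import List, Tuple


def compress_block(pairs: List[Tuple[str, str]]) -> bytes:
    """Compress key-value pairs in three distinct stages."""
    pairs = sorted(pairs)
    keys = [key for key, _ in pairs]
    values = [value for _, value in pairs]

    # Stage 1 (assumed: zlib) applied to the values only.
    first = zlib.compress(json.dumps(values).encode("utf-8"))

    # Stage 2 (assumed: front coding) applied to the keys only.
    # Each key is stored as (length of prefix shared with previous key, suffix).
    front, prev = [], ""
    for key in keys:
        shared = len(commonprefix([prev, key]))
        front.append([shared, key[shared:]])
        prev = key
    second = json.dumps(front).encode("utf-8")

    # Stage 3 (assumed: bz2) applied to both earlier outputs together.
    # A 4-byte length prefix marks where the first output ends.
    return bz2.compress(struct.pack(">I", len(first)) + first + second)


def decompress_block(block: bytes) -> List[Tuple[str, str]]:
    """Invert compress_block and return the pairs sorted by key."""
    raw = bz2.decompress(block)
    (first_len,) = struct.unpack(">I", raw[:4])
    values = json.loads(zlib.decompress(raw[4:4 + first_len]))
    keys, prev = [], ""
    for shared, suffix in json.loads(raw[4 + first_len:]):
        prev = prev[:shared] + suffix
        keys.append(prev)
    return list(zip(keys, values))
```

Treating keys and values separately lets each stage exploit a different kind of redundancy: sorted row keys tend to share long prefixes, while values are more likely to repeat substrings across rows.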

    Large scale data storage in sparse tables
    16.
    Granted invention patent
    Large scale data storage in sparse tables (Status: in force)

    Publication number: US07428524B2

    Publication date: 2008-09-23

    Application number: US11197925

    Filing date: 2005-08-05

    IPC classification: G06F17/30

    Abstract: Each of a plurality of data items is stored in a table data structure. A row identifier and column identifier are associated with each respective data item, and each respective item is stored at a logical location in the table data structure specified by its row identifier and column identifier. A plurality of data items is stored in a cell of the table data structure, and a timestamp is associated with each of the plurality of data items stored in the cell. Each of the data items stored in the cell has the same row identifier, the same column identifier, and a distinct timestamp. In some embodiments, each row identifier is a string of arbitrary length and arbitrary value. Similarly, in some embodiments each column identifier is a string of arbitrary length and arbitrary value.
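The cell model in this abstract (one cell per row and column identifier, holding several items that differ only by timestamp) can be illustrated with an in-memory sketch. The class `SparseTable` and its method names are hypothetical, and a nested dictionary stands in for whatever storage layout the patent actually describes.

```python
from collections import defaultdict
from typing import List, Optional, Tuple


class SparseTable:
    """Cells are addressed by (row, column); each cell maps timestamp -> value."""

    def __init__(self) -> None:
        # Only cells that were written exist, which is what makes the table sparse.
        self._cells = defaultdict(dict)

    def put(self, row: str, column: str, timestamp: int, value: str) -> None:
        """Store one data item; a distinct timestamp adds a new version to the cell."""
        self._cells[(row, column)][timestamp] = value

    def get(self, row: str, column: str, timestamp: Optional[int] = None) -> Optional[str]:
        """Return the value at a timestamp, or the newest value if none is given."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def versions(self, row: str, column: str) -> List[Tuple[int, str]]:
        """List all (timestamp, value) items in a cell, newest first."""
        return sorted(self._cells.get((row, column), {}).items(), reverse=True)
```

Row and column identifiers are plain strings here, matching the abstract's note that in some embodiments they are strings of arbitrary length and value.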

    Efficient Indexing of Documents with Similar Content
    17.
    Invention patent application
    Efficient Indexing of Documents with Similar Content (Status: in force)

    Publication number: US20120303622A1

    Publication date: 2012-11-29

    Application number: US13571316

    Filing date: 2012-08-09

    IPC classification: G06F17/30

    CPC classification: G06F17/3071

    Abstract: A computer system comprising one or more processors and memory groups a set of documents into a plurality of clusters. Each cluster includes one or more documents of the set of documents, and a respective cluster of documents of the plurality of clusters includes respective cluster data corresponding to a plurality of documents including a first document and a second document. The computer system determines that the second document includes duplicate data that is duplicative of corresponding data in the first document, identifies a respective subset of the respective cluster data that excludes at least a subset of the duplicate data, and generates an index of the respective subset of the respective cluster data.

    Efficient indexing of documents with similar content

    Publication number: US08175875B1

    Publication date: 2012-05-08

    Application number: US11419423

    Filing date: 2006-05-19

    IPC classification: G10L15/06

    CPC classification: G06F17/3071

    Abstract: A set of documents may be stored and indexed as a compressed sequence of tokens. A set of documents is grouped into clusters. Sequences of tokens representing the clusters of documents are encoded to elide some repeating instances of tokens. A compressed sequence of tokens is generated from the compressed cluster sequences of tokens. Queries on the compressed sequence are performed by identifying cluster sequences within the compressed sequence that are likely to have documents that satisfy the query and then identifying, within these identified clusters, the documents that actually satisfy the query.

    Storing a sparse table using locality groups
    19.
    Granted invention patent
    Storing a sparse table using locality groups (Status: in force)

    Publication number: US07567973B1

    Publication date: 2009-07-28

    Application number: US11197924

    Filing date: 2005-08-05

    IPC classification: G06F17/30

    Abstract: Each of a plurality of data items is stored in a table data structure. The table structure includes a plurality of columns. Each of the columns is associated with one of a plurality of locality groups. Each locality group is stored as one or more corresponding locality group files that include the data items in the columns associated with the respective locality group. In some embodiments, the columns of the table data structure may be grouped into groups of columns and each group of columns is associated with one of a plurality of locality groups. Each locality group is stored as one or more corresponding locality group files that include the data items in the group of columns associated with the respective locality group.
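As a rough illustration of the column-to-locality-group mapping in this abstract, the sketch below partitions a table so that each locality group holds only its own columns. The function name and the in-memory dictionaries are assumptions for the example; in the abstract each group is written to one or more files, which is not modeled here.

```python
from collections import defaultdict
from typing import Dict

Row = Dict[str, str]  # column name -> value


def split_by_locality_group(
    table: Dict[str, Row],
    column_group: Dict[str, str],
) -> Dict[str, Dict[str, Row]]:
    """Partition a table into one sub-table per locality group.

    table maps row key -> {column: value}; column_group maps column -> group name.
    Each returned sub-table stands in for the contents of one locality group file.
    """
    groups = defaultdict(dict)
    for row_key, columns in table.items():
        for column, value in columns.items():
            group = column_group[column]
            # Rows appear in a group only if they have data in that group's columns.
            groups[group].setdefault(row_key, {})[column] = value
    return dict(groups)
```

A read that touches only small metadata columns then needs only the sub-table for that group and never loads the large content columns stored in another group.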

    Associating application-specific methods with tables used for data storage
    20.
    Granted invention patent
    Associating application-specific methods with tables used for data storage (Status: in force)

    Publication number: US08484351B1

    Publication date: 2013-07-09

    Application number: US12247984

    Filing date: 2008-10-08

    IPC classification: G06F15/173

    Abstract: A method of accessing data includes storing a table that includes a plurality of tablets corresponding to distinct non-overlapping table portions. Respective pluralities of tablet access objects and application objects are stored in a plurality of servers. A distinct application object and distinct tablet are associated with each tablet access object. Each application object corresponds to a distinct instantiation of an application associated with the table. The tablet access objects and associated application objects are redistributed among the servers in accordance with a first load-balancing criterion. A first request directed to a respective tablet is received from a client. In response, the tablet access object associated with the respective tablet is used to perform a data access operation on the respective tablet, and the application object associated with the respective tablet is used to perform an additional computational operation to produce a result to be returned to the client.
