Generating content snippets using a tokenspace repository
    1.
    Granted patent
    Legal status: in force

    Publication number: US08321445B2

    Publication date: 2012-11-27

    Application number: US13040220

    Filing date: 2011-03-03

    IPC classification: G06F7/00 G06F17/30

    Abstract: A search engine server system receives from a client system a search query and identifies a set of documents in accordance with the search query. A content snippet corresponding to content in a respective document of the identified set of documents is generated, the content snippet associated with at least one query term of the one or more query terms in the search query. A response to the search query is returned to the client system, the response including information identifying at least the respective document and including the content snippet. Generating the content snippet includes performing a first decompression operation on first token identifiers, from a compressed document repository, to provide a set of second token identifiers, and performing a second decompression operation on the set of second token identifiers to recover uncompressed content comprising a portion of the respective document.
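
The two decompression operations named in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes the first token identifiers are small local IDs resolved through a lookup table, and the second token identifiers are global IDs resolved through a dictionary of token strings. `MINI_LEXICON`, `GLOBAL_LEXICON`, and every function name here are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical toy lexicons (assumed for illustration only).
GLOBAL_LEXICON: Dict[int, str] = {
    100: "the", 101: "quick", 102: "brown", 103: "fox", 104: "jumps",
}
# Index = first (local) token identifier, value = second (global) identifier.
MINI_LEXICON: List[int] = [100, 101, 102, 103, 104]


def first_decompression(local_ids: bytes, mini_lexicon: Sequence[int]) -> List[int]:
    """Map stored first token identifiers to second token identifiers."""
    return [mini_lexicon[i] for i in local_ids]


def second_decompression(global_ids: Sequence[int], lexicon: Dict[int, str]) -> List[str]:
    """Map second token identifiers back to uncompressed token text."""
    return [lexicon[g] for g in global_ids]


def generate_snippet(compressed: bytes, query_terms: List[str], window: int = 2) -> str:
    """Return the text around the first query-term hit, or '' if there is none."""
    global_ids = first_decompression(compressed, MINI_LEXICON)
    # Look up which global identifiers correspond to the query terms.
    wanted = {gid for gid, tok in GLOBAL_LEXICON.items() if tok in query_terms}
    for pos, gid in enumerate(global_ids):
        if gid in wanted:
            start = max(0, pos - window)
            # Only this portion of the document is fully decompressed.
            portion = global_ids[start:pos + window + 1]
            return " ".join(second_decompression(portion, GLOBAL_LEXICON))
    return ""


document = bytes([0, 1, 2, 3, 4])  # compressed form of "the quick brown fox jumps"
snippet = generate_snippet(document, ["fox"], window=1)
print(snippet)
```

In this sketch the query term is matched on identifiers, so only the window around the hit passes through the second stage, which mirrors the abstract's wording that the recovered content comprises a portion of the document.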

    Query Processing System and Method for Use with Tokenspace Repository
    2.
    Patent application
    Legal status: in force

    Publication number: US20110153577A1

    Publication date: 2011-06-23

    Application number: US13040220

    Filing date: 2011-03-03

    IPC classification: G06F17/30

    Abstract: A search engine server system receives from a client system a search query and identifies a set of documents in accordance with the search query. A content snippet corresponding to content in a respective document of the identified set of documents is generated, the content snippet associated with at least one query term of the one or more query terms in the search query. A response to the search query is returned to the client system, the response including information identifying at least the respective document and including the content snippet. Generating the content snippet includes performing a first decompression operation on first token identifiers, from a compressed document repository, to provide a set of second token identifiers, and performing a second decompression operation on the set of second token identifiers to recover uncompressed content comprising a portion of the respective document.

    Generating Content Snippets Using a Tokenspace Repository
    3.
    Patent application
    Legal status: in force

    Publication number: US20130212076A1

    Publication date: 2013-08-15

    Application number: US13685581

    Filing date: 2012-11-26

    IPC classification: G06F17/30

    Abstract: A search engine server system receives from a client system a search query and identifies a set of documents in accordance with the search query. A content snippet corresponding to content in a respective document of the identified set of documents is generated, the content snippet associated with at least one query term of the one or more query terms in the search query. A response to the search query is returned to the client system, the response including information identifying at least the respective document and including the content snippet. Generating the content snippet includes performing a first decompression operation on first token identifiers, from a compressed document repository, to provide a set of second token identifiers, and performing a second decompression operation on the set of second token identifiers to recover uncompressed content comprising a portion of the respective document.

    Detecting query-specific duplicate documents
    4.
    Granted patent
    Legal status: in force

    Publication number: US07779002B1

    Publication date: 2010-08-17

    Application number: US10602965

    Filing date: 2003-06-24

    IPC classification: G06F7/00

    Abstract: An improved duplicate detection technique that uses query-relevant information to limit the portion(s) of documents to be compared for similarity is described. Before comparing two documents for similarity, the content of these documents may be condensed based on the query. In one embodiment, query-relevant information or text (also referred to as “snippets”) is extracted from the documents and only the extracted snippets, rather than the entire documents, are compared for purposes of determining similarity.
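
The idea of comparing only query-relevant snippets can be sketched as below. The abstract does not say how snippets are extracted or how similarity is measured, so the fixed word window, the Jaccard measure, and the 0.8 threshold are all assumptions chosen for illustration; the function names are hypothetical.

```python
import re
from typing import List, Set


def extract_snippet(text: str, query_terms: List[str], window: int = 3) -> List[str]:
    """Keep only the words within `window` positions of any query term."""
    words = re.findall(r"\w+", text.lower())
    terms = {t.lower() for t in query_terms}
    keep: Set[int] = set()
    for i, word in enumerate(words):
        if word in terms:
            keep.update(range(max(0, i - window), min(len(words), i + window + 1)))
    return [words[i] for i in sorted(keep)]


def jaccard(a: List[str], b: List[str]) -> float:
    """Jaccard similarity of two word lists (assumed measure, not from the source)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # no query-relevant text in either document
    return len(set_a & set_b) / len(set_a | set_b)


def is_query_specific_duplicate(doc_a: str, doc_b: str, query_terms: List[str],
                                window: int = 3, threshold: float = 0.8) -> bool:
    """Compare the extracted snippets rather than the entire documents."""
    snippet_a = extract_snippet(doc_a, query_terms, window)
    snippet_b = extract_snippet(doc_b, query_terms, window)
    return jaccard(snippet_a, snippet_b) >= threshold


doc_a = ("Mirror A home page. Fast snippet generation with a tokenspace "
         "repository for search engines. Contact mirror A.")
doc_b = ("Welcome to site B. Fast snippet generation with a tokenspace "
         "repository for search engines. Email site B.")

print(is_query_specific_duplicate(doc_a, doc_b, ["tokenspace"]))
print(is_query_specific_duplicate(doc_a, doc_b, ["site"]))
```

The two sample pages differ in their surrounding text, so a whole-document comparison would score them as fairly different, while the snippets around "tokenspace" are identical. Whether two documents count as duplicates therefore depends on the query.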

    Document compression system and method for use with tokenspace repository
    5.
    Granted patent
    Legal status: in force

    Publication number: US07917480B2

    Publication date: 2011-03-29

    Application number: US10917739

    Filing date: 2004-08-13

    IPC classification: G06F7/00 G06F17/00 G06F15/18

    Abstract: The disclosed embodiments enable multi-stage query scoring, including “snippet” generation, through incremental document reconstruction facilitated by a multi-tiered mapping scheme. The mapping scheme includes a first mapping between unique tokens contained in a set of documents and unique global token identifiers (e.g., 32-bit integers) contained in a global-lexicon (i.e., dictionary). The mapping scheme also includes a second mapping between the global token identifiers and a set of fixed-length local token identifiers (e.g., 8-bit integers) contained in one or more mini-lexicons (i.e., sub-dictionaries). Each mini-lexicon is associated with a range of token positions in the tokenized documents. The first and second mappings are used to encode/decode documents into local token identifiers having fixed widths which can be compactly stored in the tokenspace repository. The use of fixed-length local token identifiers allows for fast and efficient decoding of tokenized documents.
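
A rough sketch of the two-tier mapping described in this abstract follows. The abstract does not state how position ranges are chosen, so the greedy rule used here (start a new mini-lexicon once the current one is full) is an assumption, as are all function names and the tuple layout of a mini-lexicon. `max_local` stands in for the 8-bit limit of 256 entries and is set to 2 in the example only to force several mini-lexicons on a tiny input.

```python
import bisect
from typing import Dict, List, Tuple

MiniLexicon = Tuple[int, List[int]]  # (first token position covered, local id -> global id)


def build_global_lexicon(tokens: List[str]) -> Dict[str, int]:
    """First mapping: unique token -> global token identifier."""
    lexicon: Dict[str, int] = {}
    for tok in tokens:
        lexicon.setdefault(tok, len(lexicon))
    return lexicon


def encode(tokens: List[str], lexicon: Dict[str, int],
           max_local: int = 256) -> Tuple[bytes, List[MiniLexicon]]:
    """Second mapping: global ids -> fixed-width local ids, one byte per token."""
    encoded = bytearray()
    minis: List[MiniLexicon] = []
    current: Dict[int, int] = {}  # global id -> local id within the open mini-lexicon
    for pos, tok in enumerate(tokens):
        gid = lexicon[tok]
        # Assumed policy: open a new mini-lexicon when the current one is full.
        if not minis or (gid not in current and len(current) >= max_local):
            minis.append((pos, []))
            current = {}
        if gid not in current:
            current[gid] = len(current)
            minis[-1][1].append(gid)
        encoded.append(current[gid])
    return bytes(encoded), minis


def decode(encoded: bytes, minis: List[MiniLexicon], lexicon: Dict[str, int],
           start: int, end: int) -> List[str]:
    """Recover tokens for positions [start, end) without decoding the rest."""
    id_to_token = {gid: tok for tok, gid in lexicon.items()}
    starts = [first_pos for first_pos, _ in minis]
    out: List[str] = []
    for pos in range(start, end):
        # The mini-lexicon for a position is the last one starting at or before it.
        _, table = minis[bisect.bisect_right(starts, pos) - 1]
        out.append(id_to_token[table[encoded[pos]]])
    return out


tokens = "a b a c b d".split()
lexicon = build_global_lexicon(tokens)
encoded, minis = encode(tokens, lexicon, max_local=2)
print(list(encoded), minis)
print(decode(encoded, minis, lexicon, 3, 5))
```

Because every local identifier occupies exactly one byte, the token at position `pos` is found by direct indexing into `encoded`. This is one way to read the abstract's claim that fixed-length identifiers allow fast decoding.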
