Document compression system and method for use with tokenspace repository

Patent application

US20070220023A1 Document compression system and method for use with tokenspace repository (in force)


Patent title: Document compression system and method for use with tokenspace repository
Application number: US10917739

Filing date: 2004-08-13
Publication number: US20070220023A1

Publication date: 2007-09-20
Inventors: Jeffrey Dean, Gautham Thambidorai, Sanjay Ghemawat, Benedict Gomes, Olcan Sercinoglu
Applicants: Jeffrey Dean, Gautham Thambidorai, Sanjay Ghemawat, Benedict Gomes, Olcan Sercinoglu
Main classification: G06F7/00
IPC classification: G06F7/00


Abstract:

The disclosed embodiments enable multi-stage query scoring, including “snippet” generation, through incremental document reconstruction facilitated by a multi-tiered mapping scheme. The mapping scheme includes a first mapping between unique tokens contained in a set of documents and unique global token identifiers (e.g., 32-bit integers) contained in a global-lexicon (i.e., dictionary). The mapping scheme also includes a second mapping between the global token identifiers and a set of fixed-length local token identifiers (e.g., 8-bit integers) contained in one or more mini-lexicons (i.e., sub-dictionaries). Each mini-lexicon is associated with a range of token positions in the tokenized documents. The first and second mappings are used to encode/decode documents into local token identifiers having fixed widths which can be compactly stored in the tokenspace repository. The use of fixed-length local token identifiers allows for fast and efficient decoding of tokenized documents.
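The two-tier mapping described in the abstract can be illustrated with a minimal Python sketch. Every name here (`build_global_lexicon`, `encode`, `decode_range`, `max_local_ids`) is hypothetical and not taken from the patent. The rule of starting a new mini-lexicon once the current one holds 256 distinct tokens is an assumption made for illustration; the abstract says only that each mini-lexicon is associated with a range of token positions.

```python
from bisect import bisect_right

MAX_LOCAL_IDS = 256  # an 8-bit local identifier can name 256 distinct tokens


def build_global_lexicon(tokens):
    """First mapping: each unique token gets a global ID in first-seen order."""
    lexicon = {}
    for tok in tokens:
        lexicon.setdefault(tok, len(lexicon))
    return lexicon


def encode(tokens, lexicon, max_local_ids=MAX_LOCAL_IDS):
    """Second mapping: rewrite global IDs as 1-byte local IDs.

    Returns (encoded, starts, minis). minis[k] lists the global IDs of the
    k-th mini-lexicon, indexed by local ID; starts[k] is the first token
    position that mini-lexicon covers.
    """
    encoded = bytearray()
    starts, minis = [], []
    local_of = {}  # global ID -> local ID within the current mini-lexicon
    for pos, tok in enumerate(tokens):
        gid = lexicon[tok]
        if not minis or (gid not in local_of and len(local_of) == max_local_ids):
            # Assumed policy: open a new mini-lexicon when none exists yet
            # or the current one has no free local IDs left.
            starts.append(pos)
            minis.append([])
            local_of = {}
        if gid not in local_of:
            local_of[gid] = len(minis[-1])
            minis[-1].append(gid)
        encoded.append(local_of[gid])
    return bytes(encoded), starts, minis


def decode_range(encoded, starts, minis, id_to_token, begin, end):
    """Reconstruct tokens in positions [begin, end) without decoding the rest."""
    out = []
    for pos in range(begin, end):
        k = bisect_right(starts, pos) - 1  # mini-lexicon covering this position
        out.append(id_to_token[minis[k][encoded[pos]]])
    return out
```

Because every local identifier occupies exactly one byte, token position `pos` sits at byte offset `pos` in the encoded document. A snippet can therefore be rebuilt by calling `decode_range` on just the positions of interest, with `id_to_token` obtained as `list(lexicon)`; this is one way to read the abstract's "incremental document reconstruction".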


Published/granted documents

US07917480B2 Document compression system and method for use with tokenspace repository. Publication/grant date: 2011-03-29


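IPC classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: see G06N)
G06F7/00	Methods or arrangements for processing data by operating upon the order or content of the data handled (logic circuits: see H03K19/00)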