DETECTING DUPLICATE AND NEAR-DUPLICATE FILES
    21.
    Invention application
    Status: under examination (published)

    Publication No.: US20120078871A1

    Publication date: 2012-03-29

    Application No.: US13313913

    Filing date: 2011-12-07

    IPC classification: G06F17/30

    Abstract: Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints matches.
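
    The three steps in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the parts are lowercase words, each part goes to exactly one list chosen by a hash of the part, NUM_LISTS is 4, and a fingerprint is a SHA-1 based hash of the list index plus the list contents.

```python
import hashlib
import re

NUM_LISTS = 4  # assumed value for the "predetermined number of lists"


def extract_parts(text):
    """Step (i): extract parts. Here a part is a lowercase word (assumption)."""
    return re.findall(r"\w+", text.lower())


def stable_hash(data):
    """Deterministic 64-bit hash built on SHA-1 (illustrative choice)."""
    digest = hashlib.sha1(data.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprints(text, num_lists=NUM_LISTS):
    """Steps (ii) and (iii): fill the lists, then hash each populated list."""
    lists = [[] for _ in range(num_lists)]
    for part in extract_parts(text):
        # Step (ii): a part is assigned to one list, chosen by its own hash.
        lists[stable_hash(part) % num_lists].append(part)
    result = set()
    for index, parts in enumerate(lists):
        if parts:  # only populated lists yield a fingerprint
            result.add(stable_hash(f"{index}:" + " ".join(parts)))
    return result


def near_duplicates(doc_a, doc_b, num_lists=NUM_LISTS):
    """Two documents are near-duplicates if any fingerprint is shared."""
    return bool(fingerprints(doc_a, num_lists) & fingerprints(doc_b, num_lists))
```

    Under these assumptions, replacing a single word changes at most two lists (the one that held the old word and the one that receives the new word), so the fingerprints of all other populated lists still match.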

    Systems and methods for using anchor text as parallel corpora for cross-language information retrieval
    24.
    Granted invention patent
    Status: in force

    Publication No.: US07996402B1

    Publication date: 2011-08-09

    Application No.: US12872755

    Filing date: 2010-08-31

    IPC classification: G06F17/30

    Abstract: A system performs cross-language query translations. The system receives a search query that includes terms in a first language and determines possible translations of the terms of the search query into a second language. The system also locates documents for use as parallel corpora to aid in the translation by: (1) locating documents in the first language that contain references that match the terms of the search query and identify documents in the second language; (2) locating documents in the first language that contain references that match the terms of the query and refer to other documents in the first language and identify documents in the second language that contain references to the other documents; or (3) locating documents in the first language that match the terms of the query and identify documents in the second language that contain references to the documents in the first language. The system may use the second language documents as parallel corpora to disambiguate among the possible translations of the terms of the search query and identify one of the possible translations as a likely translation of the search query into the second language.
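
    The final disambiguation step can be sketched in Python as follows. This is a simplified illustration, not the patented method: it assumes the second-language documents have already been located, represents them as plain strings, and scores each combination of candidate translations by how many documents contain all of its words. The function name, the candidate dictionary, and the scoring rule are hypothetical.

```python
from itertools import product


def likely_translation(query_terms, candidates, corpus):
    """Pick the candidate combination best supported by the corpus.

    query_terms: first-language terms of the query.
    candidates:  dict mapping each term to its possible translations
                 (lowercase single words, an assumption of this sketch).
    corpus:      second-language texts standing in for the parallel corpora.
    """
    docs = [set(text.lower().split()) for text in corpus]
    best, best_score = None, -1
    # Try every combination of one candidate translation per query term.
    for combo in product(*(candidates[term] for term in query_terms)):
        # Score: number of documents in which all chosen words co-occur.
        score = sum(1 for words in docs if all(w in words for w in combo))
        if score > best_score:
            best, best_score = combo, score
    return list(best) if best is not None else []


# Hypothetical example: Spanish query terms with ambiguous English candidates.
corpus = ["the river bank flooded", "bank of the river", "park bench"]
candidates = {"banco": ["bench", "bank"], "rio": ["river"]}
print(likely_translation(["banco", "rio"], candidates, corpus))
```

    In this example "bank" and "river" co-occur in two of the three texts while "bench" and "river" never do, so the sketch selects "bank" as the translation of "banco".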

    Detecting duplicate and near-duplicate files
    25.
    Granted invention patent
    Status: in force

    Publication No.: US07366718B1

    Publication date: 2008-04-29

    Application No.: US10608468

    Filing date: 2003-06-27

    IPC classification: G06F7/00 G06F17/30

    Abstract: Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints matches.

    Connectivity server for locating linkage information between Web pages
    27.
    Granted invention patent
    Status: lapsed

    Publication No.: US6073135A

    Publication date: 2000-06-06

    Application No.: US37350

    Filing date: 1998-03-10

    IPC classification: G06F17/30

    Abstract: A server computer is provided for representing and navigating the connectivity of Web pages. The Web pages include links to other Web pages. The links and Web pages have associated names (URLs). The names of the Web pages are sorted in a memory of the connectivity server. The sorted names are delta encoded while periodically storing full names as checkpoints in the memory. Each delta encoded name and checkpoint has a unique identification. A list of pairs of identifications representing existent links is sorted twice, first according to the first identification of each pair to produce an inlist, and second according to the second identification of each pair to produce an outlist. An array of elements is stored in the memory, with one array element for each Web page. Each element includes a first pointer to one of the checkpoints, a second pointer to an associated inlist of the Web page, and a third pointer to an associated outlist of the Web page. The array is indexed by a particular identification to locate connected Web pages.
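
    The data structures named in the abstract can be sketched in Python. The details below are assumptions for illustration only: a checkpoint is stored every CHECKPOINT_EVERY names, a delta is the length of the prefix shared with the previous name plus the remaining suffix, and the identification of a page is its position in the sorted name list. The abstract does not say which member of each pair is the link source, so the sketch assumes (source, target) pairs and names the two lists by link direction.

```python
from os.path import commonprefix

CHECKPOINT_EVERY = 3  # assumed checkpoint interval


def delta_encode(sorted_names, every=CHECKPOINT_EVERY):
    """Encode sorted names as checkpoints ('full') and prefix deltas ('delta')."""
    encoded, prev = [], ""
    for ident, name in enumerate(sorted_names):
        if ident % every == 0:
            encoded.append(("full", name))  # periodic full-name checkpoint
        else:
            shared = len(commonprefix([prev, name]))
            encoded.append(("delta", shared, name[shared:]))
        prev = name
    return encoded


def decode(encoded, ident, every=CHECKPOINT_EVERY):
    """Rebuild one name by replaying deltas from the nearest checkpoint."""
    start = ident - ident % every
    name = encoded[start][1]
    for _, shared, suffix in encoded[start + 1:ident + 1]:
        name = name[:shared] + suffix
    return name


def build_lists(links, num_pages):
    """Sort the (source, target) pairs twice to build per-page link lists."""
    inlist = [[] for _ in range(num_pages)]
    outlist = [[] for _ in range(num_pages)]
    for src, dst in sorted(links):  # ordered by first identification
        outlist[src].append(dst)
    for src, dst in sorted(links, key=lambda p: (p[1], p[0])):  # by second
        inlist[dst].append(src)
    return inlist, outlist
```

    With names = ["a.com/", "a.com/x", "a.com/y", "b.org/"], decode(delta_encode(names), 2) rebuilds "a.com/y" from the checkpoint "a.com/" and two deltas. Indexing inlist and outlist by a page identification then gives the pages that link to it and the pages it links to.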

    Finding web pages relevant to multimedia streams
    30.
    Granted invention patent
    Status: in force

    Publication No.: US08868543B1

    Publication date: 2014-10-21

    Application No.: US10408784

    Filing date: 2003-04-08

    IPC classification: G06F7/00

    CPC classification: G06F17/30864 G06F17/30867

    Abstract: A media stream, such as a news broadcast, is supplemented with documents that are relevant to the media stream. The documents may be web pages returned from a search engine. A search query generation component generates search queries for the search engine based on the media stream. A post-processing component may re-rank and/or filter the documents to enhance the viewing experience for the user.
