Systems and methods for using anchor text as parallel corpora for cross-language information retrieval
    7.
    Type: Granted invention patent
    Status: In force

    Publication No.: US08631010B1

    Publication date: 2014-01-14

    Application No.: US13474957

    Filing date: 2012-05-18

    IPC classification: G06F17/30

    Abstract: A method may include obtaining, based on a content of a search query, one or more documents in a first language; identifying one or more documents in a second language that contain an anchor that links to the one or more documents in the first language, the second language being different than the first language; and translating one or more terms of the search query into the second language using content included in the one or more documents in the second language.
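
The three steps in this abstract (retrieve first-language documents, find second-language documents whose anchors link to them, translate using their content) can be illustrated with a minimal sketch. It assumes the anchor text itself serves as the translation evidence and that the most frequent anchor text is the best candidate; the function name, the record layout, and the sample URLs are hypothetical and do not come from the patent.

```python
from collections import Counter


def candidate_translations(result_urls, anchors, target_lang):
    """Rank target-language anchor texts that link to the retrieved documents.

    `anchors` is a hypothetical list of dicts with keys "lang", "text", "href".
    """
    wanted = set(result_urls)
    counts = Counter(
        a["text"].strip().lower()
        for a in anchors
        if a["lang"] == target_lang and a["href"] in wanted
    )
    # Most frequent anchor text first: treated here as the strongest candidate.
    return [text for text, _ in counts.most_common()]


# Hypothetical anchor records harvested from second-language pages.
ANCHORS = [
    {"lang": "es", "text": "Banco de Inglaterra", "href": "http://example.org/boe"},
    {"lang": "es", "text": "banco de inglaterra", "href": "http://example.org/boe"},
    {"lang": "es", "text": "banco central", "href": "http://example.org/boe"},
    {"lang": "fr", "text": "Banque d'Angleterre", "href": "http://example.org/boe"},
    {"lang": "es", "text": "inicio", "href": "http://example.org/other"},
]

# Documents retrieved for the first-language query "bank of england".
RESULTS = ["http://example.org/boe"]
print(candidate_translations(RESULTS, ANCHORS, "es"))
```

A production system would presumably weight candidates by more than raw frequency; the counter is only the simplest choice that makes the idea concrete.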

    Detecting Duplicate and Near-Duplicate Files
    8.
    Type: Invention patent application
    Status: Under examination (published)

    Publication No.: US20120290597A1

    Publication date: 2012-11-15

    Application No.: US13225342

    Filing date: 2011-09-02

    IPC classification: G06F17/30

    CPC classification: G06F17/2211 G06F16/958

    Abstract: Near-duplicate documents may be identified by (a) accepting a set of documents, (b) processing the set of documents to determine a first set of near-duplicate documents using a first document similarity technique, and (c) processing the first set of near-duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique. The first document similarity technique might be token order dependent, and the second document similarity technique might be order independent. The first document similarity technique might be token frequency independent, and the second document similarity technique might be frequency dependent. The first document similarity technique might determine whether two documents are near-duplicates using representations based on a subset of the words or tokens of the documents, and the second document similarity technique might determine whether two documents are near-duplicates using representations based on all of the words or tokens of the documents. The first document similarity technique might use set intersection to determine whether or not documents are near-duplicates, and the second document similarity technique might use random projections to determine whether or not documents are near-duplicates.
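
The two-stage filter described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: stage one is approximated with a set-intersection test over a sampled subset of word shingles (order dependent, frequency independent), and stage two with a SimHash-style signature, one common random-projection scheme, built from all token counts (order independent, frequency dependent). All function names and thresholds are hypothetical.

```python
import hashlib
from collections import Counter
from itertools import combinations


def _hash(text):
    # Deterministic 128-bit hash so results are stable across runs.
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


def sketch(tokens, k=3, keep=8):
    """Stage 1 representation: the `keep` smallest hashes of k-word shingles."""
    shingles = {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}
    return set(sorted(_hash(s) for s in shingles)[:keep])


def overlap(a, b):
    """Set-intersection similarity (Jaccard) between two sketches."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def signature(tokens, bits=64):
    """Stage 2 representation: sign of a pseudo-random projection of term counts."""
    totals = [0] * bits
    for tok, count in Counter(tokens).items():
        h = _hash(tok)
        for i in range(bits):
            totals[i] += count if (h >> i) & 1 else -count
    return sum(1 << i for i, t in enumerate(totals) if t > 0)


def hamming(a, b):
    return bin(a ^ b).count("1")


def near_duplicates(docs, min_overlap=0.5, max_hamming=6):
    """Return name pairs that pass stage 1 and are then confirmed by stage 2."""
    toks = {name: text.lower().split() for name, text in docs.items()}
    sk = {name: sketch(t) for name, t in toks.items()}
    candidates = [
        (a, b) for a, b in combinations(sorted(docs), 2)
        if overlap(sk[a], sk[b]) >= min_overlap
    ]
    sig = {name: signature(t) for name, t in toks.items()}
    return [(a, b) for a, b in candidates if hamming(sig[a], sig[b]) <= max_hamming]
```

Running the cheaper subset-based test first means the full-vocabulary signature comparison only has to judge pairs that already look similar, which is one plausible reason for ordering the stages this way.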

    Search queries improved based on query semantic information
    9.
    Type: Granted invention patent
    Status: In force

    Publication No.: US08055669B1

    Publication date: 2011-11-08

    Application No.: US10377117

    Filing date: 2003-03-03

    IPC classification: G06F7/00 G06F17/30

    CPC classification: G06F17/3064

    Abstract: A search query for a search engine may be improved by incorporating alternate terms into the search query that are semantically similar to terms of the search query, taking into account information derived from the search query. An initial set of alternate terms that may be semantically similar to the original terms in the search query is generated. The initial set of alternate terms may be compared to information derived from the original search query. One example of such information is a set of documents retrieved in response to a search performed using the initial search query. One or more of the alternate terms may be added to the original search query based on their relationship to the information derived from the original search query.
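
As a rough illustration of the expansion step in this abstract, the sketch below keeps an alternate term only if it appears in a sufficient fraction of the documents retrieved for the original query. The candidate table, the `min_fraction` threshold, and the whitespace tokenization are assumptions made for the example; the abstract does not specify how the relationship between alternate terms and retrieved documents is measured.

```python
def expand_query(query_terms, candidates, retrieved_docs, min_fraction=0.5):
    """Add alternate terms that are supported by the originally retrieved documents.

    `candidates` is a hypothetical mapping from a query term to alternate terms
    that may be semantically similar to it.
    """
    doc_tokens = [set(doc.lower().split()) for doc in retrieved_docs]
    expanded = list(query_terms)
    if not doc_tokens:
        return expanded
    for term in query_terms:
        for alt in candidates.get(term, []):
            # Fraction of retrieved documents that contain the alternate term.
            support = sum(alt in toks for toks in doc_tokens) / len(doc_tokens)
            if support >= min_fraction and alt not in expanded:
                expanded.append(alt)
    return expanded


query = ["jaguar", "speed"]
alternates = {"jaguar": ["cat", "car"], "speed": ["velocity"]}
docs = [
    "the jaguar car reaches top speed quickly",
    "jaguar car speed and velocity tests",
    "a fast car review",
]
print(expand_query(query, alternates, docs))  # ['jaguar', 'speed', 'car']
```

Here "car" is kept because every retrieved document mentions it, while "cat" and "velocity" are dropped: the retrieved documents suggest the query is about the vehicle, which is the kind of disambiguation the abstract describes.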
