Identifying nearest neighbors for machine translation
    1.
    Granted patent
    Identifying nearest neighbors for machine translation (in force)

    Publication No.: US08175864B1

    Publication date: 2012-05-08

    Application No.: US12060126

    Filing date: 2008-03-31

    Applicant: Moshe Dubiner

    Inventor: Moshe Dubiner

    IPC classification: G06F17/28 G06F17/27

    CPC classification: G06F17/2827 G06F17/2818

    Abstract: This specification describes technologies relating to identifying nearest neighbors. In one implementation, a method includes using first and second collections of n-grams and their associated probabilities to generate a plurality of randomized ranked collections of n-grams for each of a first natural language and a second natural language, each of the plurality of randomized ranked collections having an ordering of n-grams according to a rarity of the n-grams in the respective first or second collection of n-grams; using each of the plurality of ranked collections of n-grams to determine a plurality of signatures, each signature corresponding to a text of a collection of texts; and using the plurality of signatures to identify candidate text pairs within the collection of texts, which includes a plurality of texts in the first and the second natural languages.

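
    Illustrative sketch (not taken from the patent): the abstract names three steps, namely building randomized rarity-ordered rankings, deriving one signature per text from each ranking, and matching signatures to find candidate pairs. The Python below is one possible reading of those steps under several assumptions that the abstract does not state: n-grams from both languages have already been mapped to shared keys, rarity is -log(p) scaled by a random factor, and a signature is the two highest-ranked keys of a text. All function names and parameters are hypothetical.

```python
import math
import random
from collections import defaultdict


def randomized_rankings(probs, num_rankings, seed=0):
    """Return several rankings (key -> rank, rarest first) with random jitter.

    Assumption: rarity is -log(p) times a factor drawn from [0.5, 1.5].
    The jitter is seeded per (ranking, key) so both languages perturb
    the same shared key in the same way.
    """
    rankings = []
    for r in range(num_rankings):
        scored = []
        for key, p in probs.items():
            jitter = random.Random(f"{seed}:{r}:{key}").uniform(0.5, 1.5)
            scored.append((-math.log(p) * jitter, key))
        scored.sort(reverse=True)  # highest jittered rarity first
        rankings.append({key: rank for rank, (_, key) in enumerate(scored)})
    return rankings


def signature(text_keys, ranking, sig_len=2):
    """Signature of a text: its sig_len best-ranked keys, in sorted order."""
    known = [k for k in text_keys if k in ranking]
    top = sorted(known, key=ranking.__getitem__)[:sig_len]
    return tuple(sorted(top))


def candidate_pairs(texts_a, texts_b, probs_a, probs_b, num_rankings=4):
    """Pair texts across the two languages that share a signature in any ranking."""
    ranks_a = randomized_rankings(probs_a, num_rankings)
    ranks_b = randomized_rankings(probs_b, num_rankings)
    pairs = set()
    for r in range(num_rankings):
        buckets = defaultdict(list)
        for name, keys in texts_a.items():
            sig = signature(keys, ranks_a[r])
            if sig:  # skip texts with no known keys
                buckets[sig].append(name)
        for name, keys in texts_b.items():
            sig = signature(keys, ranks_b[r])
            if sig:
                for match in buckets.get(sig, []):
                    pairs.add((match, name))
    return pairs


# Toy data: "the" is common, the other keys are rare in both collections.
probs_a = {"the": 0.5, "treaty": 1e-3, "fisheries": 1e-3, "recipe": 1e-3, "saffron": 1e-3}
probs_b = {"the": 0.4, "treaty": 2e-3, "fisheries": 2e-3, "recipe": 2e-3, "saffron": 2e-3}
texts_a = {"a1": {"the", "treaty", "fisheries"}, "a2": {"the", "recipe", "saffron"}}
texts_b = {"b1": {"the", "fisheries", "treaty"}, "b2": {"the", "saffron", "recipe"}}
pairs = candidate_pairs(texts_a, texts_b, probs_a, probs_b)
```

    In this toy data the common key can never outrank a rare one, even after jitter, so each text's signature consists of its two rare keys and only the matching texts are paired. Using several rankings gives a text several chances to collide with its counterpart when the rarity scores are closer together.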

    PARALLEL DOCUMENT MINING
    2.
    Patent application
    PARALLEL DOCUMENT MINING (under examination, published)

    Publication No.: US20120047172A1

    Publication date: 2012-02-23

    Application No.: US13214941

    Filing date: 2011-08-22

    IPC classification: G06F17/30

    CPC classification: G06F17/2827 G06F16/30

    Abstract: A technique includes providing a collection of documents in multiple languages, identifying, from the collection of documents, a group of candidate documents, where each candidate document in the group shares multiple corresponding rare features, evaluating pairs of candidate documents in the group using multiple common features present in the collection of documents, and determining, based on evaluating the pairs of candidate documents, whether each pair of candidate documents corresponds to a translated pair of documents.

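
    Illustrative sketch (not taken from the application): the abstract describes a two-stage filter, in which rare features select candidate documents and common features then score each candidate pair. The Python below is a minimal reading of that idea under assumptions the abstract does not state: features are already language-neutral identifiers, "rare" means a document frequency of at most `rare_max_df`, and pairs are scored by cosine similarity over common-feature counts. All names and thresholds are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two Counter feature vectors."""
    dot = sum(a[f] * b[f] for f in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mine_pairs(docs, rare_max_df=2, min_shared_rare=2, threshold=0.5):
    """docs maps name -> (language, list of features).

    Returns {(name_a, name_b): bool}, True when the pair is judged a translation.
    """
    # Document frequency of every feature across the whole collection.
    df = Counter()
    for _, feats in docs.values():
        df.update(set(feats))
    rare = {f for f, n in df.items() if n <= rare_max_df}

    # Stage 1: count rare features shared by cross-language document pairs.
    index = defaultdict(set)
    for name, (_, feats) in docs.items():
        for f in set(feats) & rare:
            index[f].add(name)
    shared = Counter()
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            if docs[a][0] != docs[b][0]:
                shared[(a, b)] += 1

    # Stage 2: score the surviving candidates on common (non-rare) features.
    results = {}
    for (a, b), n in shared.items():
        if n < min_shared_rare:
            continue
        vec_a = Counter(f for f in docs[a][1] if f not in rare)
        vec_b = Counter(f for f in docs[b][1] if f not in rare)
        results[(a, b)] = cosine(vec_a, vec_b) >= threshold
    return results


# Toy collection: c1 and c2 are common features, r1..r4 are rare ones.
docs = {
    "en1": ("en", ["c1", "c1", "c2", "r1", "r2"]),
    "fr1": ("fr", ["c1", "c1", "c2", "r1", "r2"]),
    "en2": ("en", ["c2", "c2", "c2", "r3", "r4"]),
    "fr2": ("fr", ["c1", "c1", "c1", "r3", "r4"]),
}
results = mine_pairs(docs)
```

    Here both cross-language pairs pass the rare-feature stage, but only en1/fr1 has matching common-feature counts, so en2/fr2 is rejected in the second stage. The rare-feature index keeps the candidate set small, which avoids comparing every document with every other one.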