-
Publication No.: US10460040B2
Publication date: 2019-10-29
Application No.: US15194249
Filing date: 2016-06-27
Applicant: Facebook, Inc.
Inventor: Matthias Gerhard Eck
Abstract: Exemplary embodiments relate to techniques for improving machine translation systems. The machine translation system may apply one or more models for translating material from a source language into a destination language. The models are initially trained using training data. According to exemplary embodiments, supplemental training data is used to train the models, where the supplemental training data uses in-domain material to improve the quality of output translations. In-domain data may include data that relates to the same or similar topics as those expected to be encountered in a translation of material from the source language into the destination language. In-domain data may include material previously translated from the source language into the destination language, material similar to previous translations, and destination language material that has previously been the subject of a request for translation into the source language.
-
Publication No.: US09916299B2
Publication date: 2018-03-13
Application No.: US15416186
Filing date: 2017-01-26
Applicant: Facebook, Inc.
Inventor: Matthias Gerhard Eck
CPC classification number: G06F17/274 , G06F17/218 , G06F17/2715 , G06F17/273 , G06F17/28 , G06F17/2818 , G06N5/022 , G06Q50/01 , H04L51/32
Abstract: Technology is disclosed that improves language coverage by selecting sentences to be used as training data for a language processing engine. The technology accomplishes the selection of a number of sentences by obtaining a group of sentences, computing a score for each sentence, sorting the sentences based on their scores, and selecting a number of sentences with the highest scores. The scores can be computed by dividing a sum of frequency values of unseen words (or n-grams) in the sentence by a length of the sentence. The frequency values can be based on posts in one or more particular domains, such as the public domain, the private domain, or other specialized domains.
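The scoring and selection steps in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `domain_freq` table, and the `seen_words` set are hypothetical, and counting each distinct unseen word once per sentence is an assumption the abstract does not settle.

```python
from collections import Counter


def score_sentence(tokens, domain_freq, seen_words):
    """Sum the domain frequencies of unseen words, divided by sentence length."""
    if not tokens:
        return 0.0
    # Assumption: each distinct unseen word contributes once per sentence.
    unseen = {tok for tok in tokens if tok not in seen_words}
    return sum(domain_freq.get(tok, 0) for tok in unseen) / len(tokens)


def select_sentences(sentences, domain_freq, seen_words, k):
    """Score every sentence, sort by score (highest first), keep the top k."""
    ranked = sorted(
        sentences,
        key=lambda toks: score_sentence(toks, domain_freq, seen_words),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical word frequencies gathered from posts in one domain.
domain_freq = Counter({"selfie": 50, "lol": 30, "the": 1000})
# Hypothetical vocabulary already covered by the existing training data.
seen_words = {"the", "a", "cat"}
candidates = [
    ["the", "cat"],
    ["lol", "selfie"],
    ["a", "selfie", "the", "cat"],
]
top_two = select_sentences(candidates, domain_freq, seen_words, k=2)
```

Dividing by sentence length keeps long sentences from winning merely because they contain more words.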
-
Publication No.: US10318640B2
Publication date: 2019-06-11
Application No.: US15192076
Filing date: 2016-06-24
Applicant: Facebook, Inc.
Inventor: William Arthur Hughes , Matthias Gerhard Eck , Kay Rottmann
Abstract: Exemplary embodiments provide techniques for evaluating when words or phrases of a translation were generated with a low degree of confidence, and conveying this information when the translation is presented. For example, if a source language word is encountered in source material for translation, but the source language word was only encountered a few times (or not at all) in the training data used to train the translation system, then the resulting translation may be flagged as being of low confidence. Other situations, such as the generation of two equally-likely translations, or translation system model disagreement, may also indicate a questionable translation. When the translation is displayed, questionable words and phrases may be flagged, and possible alternative translations may be presented. If one of the alternatives is selected, this information may be used to update the translation system's models in order to improve translation quality in the future.
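Two of the low-confidence signals named in this abstract (rare source words and near-tied candidate translations) could be checked roughly as below. This is a sketch under stated assumptions: the function names, the `min_count` and `margin` thresholds, and the candidate-score dictionary are all hypothetical and do not come from the patent.

```python
from collections import Counter


def flag_rare_source_words(source_tokens, training_counts, min_count=3):
    """Pair each source word with True when it was seen fewer than
    min_count times (including never) in the training data."""
    return [(tok, training_counts.get(tok, 0) < min_count) for tok in source_tokens]


def is_ambiguous(candidate_scores, margin=0.05):
    """True when the two best candidate translations score within margin."""
    top = sorted(candidate_scores.values(), reverse=True)[:2]
    return len(top) == 2 and (top[0] - top[1]) < margin


# Hypothetical training-data word counts.
training_counts = Counter("the cat sat on the mat the end".split())
flags = flag_rare_source_words(["the", "cat", "zebra"], training_counts, min_count=2)
```

Words flagged here would be the ones to highlight when the translation is displayed, alongside any alternative translations.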
-
Publication No.: US20190018837A1
Publication date: 2019-01-17
Application No.: US15868970
Filing date: 2018-01-11
Applicant: Facebook, Inc.
Inventor: Juan Miguel Pino , Matthias Gerhard Eck , Rui Andre Augusto Ferreira
IPC: G06F17/27
CPC classification number: G06F17/2775 , G06F17/273
Abstract: Technology is disclosed for building correction models that correct natural language snippets. Correction models can include rules comprising pairs of word sequences identified from viable correction snippet pairs, where a first sequence of words in the pair should be replaced with a second sequence of words in the pair. Viable correction snippet pairs can be identified from among pairs of language snippets, such as a post to a social media website and a subsequent update to that post. Viable corrections can be the snippet pairs that both have no more unaligned words than a word alignment threshold and have no aligned word pair with a character edit difference above an edit distance threshold. In some implementations, word alignments can be found by aligning all the characters between a pair of language snippets, and identifying aligned words as those that have at least one aligned letter in common.
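The viability test in this abstract can be sketched as follows. It is an illustration only: Python's `difflib.SequenceMatcher` stands in for the character alignment (the abstract does not name an aligner), plain Levenshtein distance stands in for the "character edit difference", and the function names and default thresholds are hypothetical.

```python
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Character edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def char_to_word_index(words):
    """Map each character of ' '.join(words) to its word index (None for spaces)."""
    index = []
    for w_i, word in enumerate(words):
        if w_i:
            index.append(None)  # the joining space belongs to no word
        index.extend([w_i] * len(word))
    return index


def align_words(src_words, dst_words):
    """Align characters, then treat two words as aligned when they share
    at least one aligned character."""
    src_map, dst_map = char_to_word_index(src_words), char_to_word_index(dst_words)
    matcher = SequenceMatcher(None, " ".join(src_words), " ".join(dst_words), autojunk=False)
    pairs = set()
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            s, d = src_map[block.a + k], dst_map[block.b + k]
            if s is not None and d is not None:
                pairs.add((s, d))
    return pairs


def is_viable(src_words, dst_words, max_unaligned=1, max_edit=2):
    """Viable when few words are unaligned and no aligned pair differs too much."""
    pairs = align_words(src_words, dst_words)
    unaligned = (len(src_words) - len({s for s, _ in pairs})) + (
        len(dst_words) - len({d for _, d in pairs})
    )
    if unaligned > max_unaligned:
        return False
    return all(levenshtein(src_words[s], dst_words[d]) <= max_edit for s, d in pairs)


post = "i love teh beach".split()
update = "i love the beach".split()
viable = is_viable(post, update)
```

Snippet pairs that pass this check would then be mined for replacement rules such as "teh" to "the"; that rule-extraction step is not shown here.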
-
Publication No.: US09904672B2
Publication date: 2018-02-27
Application No.: US14788679
Filing date: 2015-06-30
Applicant: Facebook, Inc.
Inventor: Juan Miguel Pino , Matthias Gerhard Eck , Rui Andre Augusto Ferreira
IPC: G06F17/27
CPC classification number: G06F17/2775 , G06F17/273
Abstract: Technology is disclosed for building correction models that correct natural language snippets. Correction models can include rules comprising pairs of word sequences identified from viable correction snippet pairs, where a first sequence of words in the pair should be replaced with a second sequence of words in the pair. Viable correction snippet pairs can be identified from among pairs of language snippets, such as a post to a social media website and a subsequent update to that post. Viable corrections can be the snippet pairs that both have no more unaligned words than a word alignment threshold and have no aligned word pair with a character edit difference above an edit distance threshold. In some implementations, word alignments can be found by aligning all the characters between a pair of language snippets, and identifying aligned words as those that have at least one aligned letter in common.
-
Publication No.: US20170004120A1
Publication date: 2017-01-05
Application No.: US14788578
Filing date: 2015-06-30
Applicant: Facebook, Inc.
Inventor: Matthias Gerhard Eck , Fei Huang , Kay Rottmann
CPC classification number: G06F17/2775 , G06F17/273
Abstract: Technology is disclosed for correcting items containing natural language words that match qualified corrections. Qualified corrections can be identified from language snippet sets, which can include, for example, a post to a social media website and one or more updates to that post. Qualified corrections can be word pairs identified in one of these language snippet sets by aligning words between the language snippets according to a minimum word edit distance and computing that the word edit distance is below a first threshold. Based on this word alignment, word pairs can be selected and analyzed to identify qualified corrections as the word pairs that have a minimum character edit distance below a second threshold. In some cases, such as where both words in the qualified correction word pair are known words, a context can be associated with the qualified correction to control when the qualified correction should be applied.
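The two-threshold procedure in this abstract (word-level alignment first, then a character-level check on each aligned word pair) might look like the sketch below. The function names and both default thresholds are hypothetical, and the backtrace picks one minimum-cost alignment when several exist. The context-association step for known-word pairs is not covered.

```python
def edit_distance_table(a, b):
    """Levenshtein DP table between two sequences (strings or word lists)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        table[i][0] = i
    for j in range(len(b) + 1):
        table[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            table[i][j] = min(
                table[i - 1][j] + 1,  # deletion
                table[i][j - 1] + 1,  # insertion
                table[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match or substitution
            )
    return table


def substituted_pairs(src_words, dst_words):
    """Backtrace one minimum word-edit alignment.

    Returns (word edit distance, list of substituted word pairs)."""
    table = edit_distance_table(src_words, dst_words)
    i, j, pairs = len(src_words), len(dst_words), []
    while i > 0 and j > 0:
        differs = src_words[i - 1] != dst_words[j - 1]
        if table[i][j] == table[i - 1][j - 1] + differs:
            if differs:
                pairs.append((src_words[i - 1], dst_words[j - 1]))
            i, j = i - 1, j - 1
        elif table[i][j] == table[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return table[-1][-1], pairs[::-1]


def qualified_corrections(src_words, dst_words, max_word_dist=3, max_char_dist=3):
    """Keep substituted word pairs whose character edit distance is below
    max_char_dist, provided the word edit distance is below max_word_dist."""
    word_dist, pairs = substituted_pairs(src_words, dst_words)
    if word_dist >= max_word_dist:
        return []
    return [
        (s, d) for s, d in pairs
        if edit_distance_table(s, d)[-1][-1] < max_char_dist
    ]


post = "i love teh beach".split()
update = "i love the beach".split()
corrections = qualified_corrections(post, update)
```

The first threshold rejects updates that rewrite the post wholesale; the second rejects word swaps that change meaning rather than fix a typo.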
-
Publication No.: US10474751B2
Publication date: 2019-11-12
Application No.: US15868970
Filing date: 2018-01-11
Applicant: Facebook, Inc.
Inventor: Juan Miguel Pino , Matthias Gerhard Eck , Rui Andre Augusto Ferreira
IPC: G06F17/27
Abstract: Technology is disclosed for building correction models that correct natural language snippets. Correction models can include rules comprising pairs of word sequences identified from viable correction snippet pairs, where a first sequence of words in the pair should be replaced with a second sequence of words in the pair. Viable correction snippet pairs can be identified from among pairs of language snippets, such as a post to a social media website and a subsequent update to that post. Viable corrections can be the snippet pairs that both have no more unaligned words than a word alignment threshold and have no aligned word pair with a character edit difference above an edit distance threshold. In some implementations, word alignments can be found by aligning all the characters between a pair of language snippets, and identifying aligned words as those that have at least one aligned letter in common.
-
Publication No.: US10268686B2
Publication date: 2019-04-23
Application No.: US15192170
Filing date: 2016-06-24
Applicant: Facebook, Inc.
Inventor: Matthias Gerhard Eck , Priya Goyal
Abstract: Exemplary embodiments relate to detecting, removing, and/or replacing objectionable words and phrases in a machine-generated translation. A classifier identifies translations containing target words or phrases. The classifier may be applied to the output translation to remove target words and phrases from the translation, or to prevent target words and phrases from being automatically presented. Further, the classifier may be applied to a translation model to prevent the target words and phrases from appearing in the output translation. Still further, the classifier may be applied to training data so that the translation model is not trained using the target words or phrases. The classifier may remove target words or phrases only when the target words or phrases appear in the output translation but not the source language input data. The classifier may be provided as a standalone service, or may be employed in the context of a machine translation system.
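The "output but not source" condition in this abstract can be illustrated with a simple word-list stand-in for the classifier. Everything here is an assumption made for illustration: the patent describes a classifier, not a lookup table, and the function name, the per-language term sets, and the mask string are hypothetical.

```python
def filter_translation(source_tokens, output_tokens, src_terms, dst_terms, mask="***"):
    """Mask target terms in the output only when the source had none.

    src_terms / dst_terms are hypothetical sets of target terms in the
    source and destination languages respectively."""
    source_has_target = any(tok.lower() in src_terms for tok in source_tokens)
    if source_has_target:
        # The author used such a term; the translation is faithful, so keep it.
        return list(output_tokens)
    return [mask if tok.lower() in dst_terms else tok for tok in output_tokens]


src_terms = {"maldito"}  # hypothetical source-language list
dst_terms = {"darn"}     # hypothetical destination-language list
output = ["what", "a", "darn", "day"]

faithful = filter_translation(["qué", "día", "maldito"], output, src_terms, dst_terms)
masked = filter_translation(["qué", "día", "difícil"], output, src_terms, dst_terms)
```

Checking the source first avoids censoring text the author wrote deliberately, and still catches terms the translation system introduced on its own.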
-
Publication No.: US20180089178A1
Publication date: 2018-03-29
Application No.: US15823492
Filing date: 2017-11-27
Applicant: Facebook, Inc.
Inventor: Matthias Gerhard Eck , Ying Zhang , Yury Andreyevich Zemlyanskiy , Alexander Waibel
CPC classification number: G06F17/289 , G06F16/951 , G06F17/2818 , G06F17/2827
Abstract: Technology is disclosed for mining training data to create machine translation engines. Training data can be mined as translation pairs from single content items that contain multiple languages; multiple content items in different languages that are related to the same or similar target; or multiple content items that are generated by the same author in different languages. Locating content items can include identifying potential sources of translation pairs that fall into these categories and applying filtering techniques to quickly gather those that are good candidates for being actual translation pairs. When actual translation pairs are located, they can be used to retrain a machine translation engine as in-domain for social media content items.
-
Publication No.: US20170371867A1
Publication date: 2017-12-28
Application No.: US15192076
Filing date: 2016-06-24
Applicant: Facebook, Inc.
Inventor: William Arthur Hughes , Matthias Gerhard Eck , Kay Rottmann
CPC classification number: G06F17/2854 , G06F17/2818
Abstract: Exemplary embodiments provide techniques for evaluating when words or phrases of a translation were generated with a low degree of confidence, and conveying this information when the translation is presented. For example, if a source language word is encountered in source material for translation, but the source language word was only encountered a few times (or not at all) in the training data used to train the translation system, then the resulting translation may be flagged as being of low confidence. Other situations, such as the generation of two equally-likely translations, or translation system model disagreement, may also indicate a questionable translation. When the translation is displayed, questionable words and phrases may be flagged, and possible alternative translations may be presented. If one of the alternatives is selected, this information may be used to update the translation system's models in order to improve translation quality in the future.