Chunk-based statistical machine translation system
    1.
    Patent application
    Legal status: under examination (published)

    Publication number: US20080154577A1

    Publication date: 2008-06-26

    Application number: US11645926

    Filing date: 2006-12-26

    IPC classification: G06F17/28

    CPC classification: G06F17/2827 G06F17/2775

    Abstract: Traditional statistical machine translation systems learn all information from a sentence-aligned parallel text and are known to have problems translating between structurally diverse languages. To overcome this limitation, the present invention introduces two-level training, which incorporates syntactic chunking into statistical translation. A chunk-alignment step is inserted between the sentence-level and word-level training, which allows these two sources of information to be trained differently: lexical properties are learned from the aligned chunks, and structural properties from chunk sequences. The system consists of a linguistic processing step, two-level training, and a decoding step that combines chunk translations from multiple sources and multiple language models.
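
The split between lexical and structural learning described in the abstract can be illustrated with a minimal sketch. It assumes chunking and chunk alignment have already been done; the function name `two_level_counts` and the data layout are hypothetical and are not taken from the patent.

```python
from collections import Counter

def two_level_counts(corpus):
    """Collect two kinds of statistics from chunk-aligned sentence pairs.

    corpus: list of (src_chunks, tgt_chunks, alignment), where each chunk is a
    (label, text) pair and alignment is a list of (src_index, tgt_index).
    This layout is an assumption made for illustration.
    """
    lexical = Counter()     # aligned chunk pairs: source text -> target text
    structural = Counter()  # chunk-label sequences: source order -> target order
    for src_chunks, tgt_chunks, alignment in corpus:
        for i, j in alignment:
            lexical[(src_chunks[i][1], tgt_chunks[j][1])] += 1
        src_labels = tuple(label for label, _ in src_chunks)
        tgt_labels = tuple(label for label, _ in tgt_chunks)
        structural[(src_labels, tgt_labels)] += 1
    return lexical, structural

corpus = [(
    [("NP", "the red car"), ("VP", "was sold")],
    [("VP", "se vendio"), ("NP", "el coche rojo")],
    [(0, 1), (1, 0)],
)]
lexical, structural = two_level_counts(corpus)
print(lexical)
print(structural)
```

Word-level training would then run inside each aligned chunk pair, while the label-sequence counts capture reordering between the two languages.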

    Robust information extraction from utterances
    2.
    Granted patent
    Legal status: in force

    Publication number: US08583416B2

    Publication date: 2013-11-12

    Application number: US11965711

    Filing date: 2007-12-27

    IPC classification: G06F17/28 G10L15/00 G10L21/00

    CPC classification: G10L15/1822 G10L15/1815

    Abstract: The performance of traditional speech recognition systems (as applied to information extraction or translation) decreases significantly with larger domain size, scarce training data, and noisy environmental conditions. This invention mitigates these problems by introducing a novel predictive feature extraction method that combines linguistic and statistical information to represent information embedded in a noisy source language. The predictive features are combined with text classifiers to map the noisy text to one of the semantically or functionally similar groups. The features used by the classifier can be syntactic, semantic, and statistical.
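
As a rough illustration of mapping noisy text to a semantic group, the sketch below uses a small multinomial naive Bayes classifier over word counts. This is a stand-in: the patent's predictive features are not specified here, and the function names and training examples are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, group). Returns per-group word counts and group counts."""
    word_counts = defaultdict(Counter)
    group_counts = Counter()
    for text, group in examples:
        group_counts[group] += 1
        word_counts[group].update(text.lower().split())
    return word_counts, group_counts

def classify(text, word_counts, group_counts):
    """Return the most probable group under naive Bayes with add-one smoothing."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(group_counts.values())
    best_group, best_score = None, -math.inf
    for group in sorted(group_counts):
        counts = word_counts[group]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(group_counts[group] / total)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training (e.g. recognition noise)
                score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_group, best_score = group, score
    return best_group

examples = [
    ("where is the hospital", "directions"),
    ("how do i get to the station", "directions"),
    ("i need a doctor", "medical"),
    ("my arm hurts", "medical"),
]
model = train(examples)
print(classify("uh where hospital is", *model))
```

The noisy input has a filler word and scrambled word order, yet the word-level evidence still points to a single group, which is the kind of robustness the abstract aims for.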

    Robust Information Extraction from Utterances
    3.
    Patent application
    Legal status: in force

    Publication number: US20090171662A1

    Publication date: 2009-07-02

    Application number: US11965711

    Filing date: 2007-12-27

    IPC classification: G10L15/00

    CPC classification: G10L15/1822 G10L15/1815

    Abstract: The performance of traditional speech recognition systems (as applied to information extraction or translation) decreases significantly with larger domain size, scarce training data, and noisy environmental conditions. This invention mitigates these problems by introducing a novel predictive feature extraction method that combines linguistic and statistical information to represent information embedded in a noisy source language. The predictive features are combined with text classifiers to map the noisy text to one of the semantically or functionally similar groups. The features used by the classifier can be syntactic, semantic, and statistical.

    Methods for speech-to-speech translation
    4.
    Patent application
    Legal status: under examination (published)

    Publication number: US20080133245A1

    Publication date: 2008-06-05

    Application number: US11633859

    Filing date: 2006-12-04

    IPC classification: G10L21/00 G06F17/28 G10L11/00

    Abstract: The present invention discloses modular speech-to-speech translation systems and methods that provide adaptable platforms to enable verbal communication between speakers of different languages within the context of specific domains. The components of the preferred embodiments of the present invention include: (1) speech recognition; (2) machine translation; (3) an N-best merging module; (4) verification; and (5) text-to-speech. The speech recognition modules are structured to provide N-best selections and multi-stream processing, where multiple speech recognition engines may be active at any one time. The N-best lists from the one or more speech recognition engines may be handled either separately or collectively to improve both recognition and translation results. A merge module is responsible for integrating the N-best outputs of the translation engines along with confidence/translation scores to create a ranked list of recognition-translation pairs.
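
The merge step can be sketched as follows. The scoring rule (a weighted sum of recognition confidence and translation score) and the tuple layout are assumptions made for illustration; the abstract does not state how the scores are combined.

```python
def merge_nbest(nbest_lists, weight=0.5):
    """Merge N-best lists from several engines into one ranked list.

    nbest_lists: iterable of lists of
        (recognition, translation, recognition_confidence, translation_score).
    Returns [((recognition, translation), combined_score), ...], best first.
    """
    best = {}
    for nbest in nbest_lists:
        for recognition, translation, rec_conf, trans_score in nbest:
            combined = weight * rec_conf + (1 - weight) * trans_score
            key = (recognition, translation)
            # Keep the highest score when several engines propose the same pair.
            if combined > best.get(key, float("-inf")):
                best[key] = combined
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

engine_a = [
    ("where is the bank", "donde esta el banco", 0.9, 0.8),
    ("wear is the bank", "vestir es el banco", 0.4, 0.3),
]
engine_b = [("where is the bank", "donde esta el banco", 0.7, 0.8)]
for pair, score in merge_nbest([engine_a, engine_b]):
    print(round(score, 3), pair)
```

Keying on the (recognition, translation) pair means duplicates from different engines collapse into one entry, so the ranked list contains each pair once.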

    Methods for using manual phrase alignment data to generate translation models for statistical machine translation
    5.
    Granted patent
    Legal status: in force

    Publication number: US08229728B2

    Publication date: 2012-07-24

    Application number: US11969518

    Filing date: 2008-01-04

    IPC classification: G06F17/20 G06F17/21 G06F17/28

    CPC classification: G06F17/2818 G06F17/2827

    Abstract: The present invention adopts the fundamental architecture of a statistical machine translation system, which uses statistical models learned from the training data and does not require the expert knowledge needed for rule-based machine translation systems. From the parallel training data, a certain number of sentence pairs are selected for manual alignment. These sentences are aligned at the phrase level instead of at the word level. Depending on the size of the training data, the optimal amount for manual alignment may vary. The alignment is done using an alignment tool with a graphical user interface that is convenient and intuitive for users. The manually aligned data are then used to improve the automatic word alignment component. Model combination methods are also introduced to improve the accuracy and coverage of the statistical models for the task of statistical machine translation.
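
One common way to combine two translation models is linear interpolation with back-off, sketched below. The abstract does not say which combination methods the patent uses, so this technique, the function name `combine_models`, and the weight `lam` are all assumptions for illustration.

```python
def combine_models(manual, automatic, lam=0.7):
    """Combine two phrase tables of the form {source: {target: probability}}.

    Where both tables cover a source phrase, interpolate with weight lam on the
    manually aligned model (accuracy); where only one covers it, back off to
    that table (coverage).
    """
    combined = {}
    for src in manual.keys() | automatic.keys():
        m, a = manual.get(src), automatic.get(src)
        if m is None:
            combined[src] = dict(a)
        elif a is None:
            combined[src] = dict(m)
        else:
            combined[src] = {
                tgt: lam * m.get(tgt, 0.0) + (1 - lam) * a.get(tgt, 0.0)
                for tgt in m.keys() | a.keys()
            }
    return combined

manual = {"new york": {"nueva york": 1.0}}
automatic = {
    "new york": {"nueva york": 0.6, "nuevo york": 0.4},
    "bank": {"banco": 1.0},
}
print(combine_models(manual, automatic))
```

If each input distribution sums to one, each combined distribution does too, so the result can be used directly as a translation model.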

    Methods for Using Manual Phrase Alignment Data to Generate Translation Models for Statistical Machine Translation
    6.
    Patent application
    Legal status: in force

    Publication number: US20090177460A1

    Publication date: 2009-07-09

    Application number: US11969518

    Filing date: 2008-01-04

    IPC classification: G06F17/28

    CPC classification: G06F17/2818 G06F17/2827

    Abstract: The present invention adopts the fundamental architecture of a statistical machine translation system, which uses statistical models learned from the training data and does not require the expert knowledge needed for rule-based machine translation systems. From the parallel training data, a certain number of sentence pairs are selected for manual alignment. These sentences are aligned at the phrase level instead of at the word level. Depending on the size of the training data, the optimal amount for manual alignment may vary. The alignment is done using an alignment tool with a graphical user interface that is convenient and intuitive for users. The manually aligned data are then used to improve the automatic word alignment component. Model combination methods are also introduced to improve the accuracy and coverage of the statistical models for the task of statistical machine translation.

    Short text language detection using geographic information
    7.
    Granted patent
    Legal status: in force

    Publication number: US08548797B2

    Publication date: 2013-10-01

    Application number: US12262145

    Filing date: 2008-10-30

    CPC classification: G06F17/275

    Abstract: A content-providing entity receives a relatively short text from a user and attempts to determine, automatically, based on that short text (and on other available clues), a language that the user can read and understand. The content-providing entity may then provide, to the user, documents that are written in the determined language. The content-providing entity may determine a language of the input text based on several factors in combination: (a) the service provider's “market,” which is determined based on at least a portion of the URL of the Internet site to which the user directed his browser; (b) the user's “region,” which is determined based on the source Internet Protocol (IP) address of the IP packets that the user sends to the Internet site; (c) the “script” in which the short user-entered text is written; and (d) a statistical analysis of the frequency of the characters present in the short user-entered text.
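
The four factors (a) to (d) can be combined as in the sketch below. Everything concrete here is an assumption for illustration: the lookup tables are tiny and hypothetical, the market is read from the top-level domain only, the region is passed in as a country code (resolving an IP address to a region would need a geolocation database), and the factors are combined by a simple weighted sum.

```python
import unicodedata
from collections import Counter
from urllib.parse import urlparse

# Hypothetical, deliberately tiny lookup tables.
MARKET_LANGS = {"fr": {"fr"}, "de": {"de"}, "jp": {"ja"}}
REGION_LANGS = {"FR": {"fr"}, "DE": {"de"}, "CH": {"de", "fr"}, "JP": {"ja"}}
SCRIPT_LANGS = {"LATIN": {"fr", "de"}, "HIRAGANA": {"ja"}, "KATAKANA": {"ja"}, "CJK": {"ja"}}
# Illustrative sets of frequent characters per language.
CHAR_PROFILES = {"fr": "esaitnrulo", "de": "enisratdhu", "ja": "のにはをたがでてとし"}

def detect_language(url, region, text, weights=(1.0, 1.0, 1.0, 1.0)):
    """Score candidate languages from market, region, script, and character frequency."""
    w_market, w_region, w_script, w_freq = weights
    scores = Counter()
    # (a) market: taken from the top-level domain of the site URL.
    tld = (urlparse(url).hostname or "").rsplit(".", 1)[-1]
    for lang in MARKET_LANGS.get(tld, ()):
        scores[lang] += w_market
    # (b) region: country code assumed to be resolved from the source IP elsewhere.
    for lang in REGION_LANGS.get(region, ()):
        scores[lang] += w_region
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if letters:
        # (c) script: first word of the Unicode character name, by majority.
        names = Counter(unicodedata.name(ch, "UNKNOWN").split()[0] for ch in letters)
        script = names.most_common(1)[0][0]
        for lang in SCRIPT_LANGS.get(script, ()):
            scores[lang] += w_script
        # (d) character frequency: share of letters found in each profile.
        for lang, profile in CHAR_PROFILES.items():
            scores[lang] += w_freq * sum(ch in profile for ch in letters) / len(letters)
    return max(sorted(scores), key=scores.get) if scores else None

print(detect_language("https://www.example.fr/search", "CH", "château"))
```

In the usage line, the character statistics of the single word slightly favor German, but the market signal from the URL tips the combined score toward French, which shows why the abstract combines the factors instead of relying on the short text alone.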

    Short text language detection using geographic information
    8.
    Patent application
    Legal status: in force

    Publication number: US20100114559A1

    Publication date: 2010-05-06

    Application number: US12262145

    Filing date: 2008-10-30

    IPC classification: G06F17/20

    CPC classification: G06F17/275

    Abstract: A content-providing entity receives a relatively short text from a user and attempts to determine, automatically, based on that short text (and on other available clues), a language that the user can read and understand. The content-providing entity may then provide, to the user, documents that are written in the determined language. The content-providing entity may determine a language of the input text based on several factors in combination: (a) the service provider's “market,” which is determined based on at least a portion of the URL of the Internet site to which the user directed his browser; (b) the user's “region,” which is determined based on the source Internet Protocol (IP) address of the IP packets that the user sends to the Internet site; (c) the “script” in which the short user-entered text is written; and (d) a statistical analysis of the frequency of the characters present in the short user-entered text.

    Infinite browse
    9.
    Granted patent
    Legal status: in force

    Publication number: US08600979B2

    Publication date: 2013-12-03

    Application number: US12825304

    Filing date: 2010-06-28

    IPC classification: G06F7/00

    CPC classification: G06F17/3089 G06F17/30522

    Abstract: An online article is enhanced by displaying, in association with the article, supplemental content that includes entities that are extracted from the article and/or entities that are related to entities that are extracted from the article. The supplemental content further includes information about each of the entities. The information about an entity may be obtained by searching for the entity in one or more searchable repositories of data. For example, the supplemental content may include, for each entity, video, image, web, and/or news search results. The supplemental content may further include information such as stock quotes, abstracts, maps, scores, and so on. The entities are selected using a variety of analyses and ranking techniques based on contextual factors such as user-specific information, time-sensitive popularity trends, grammatical features, search result quality, and so on. The entities may further be selected for purposes such as generating ad-based revenue.
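
A minimal sketch of extracting and ranking entities follows. The extractor is a crude stand-in (runs of two or more capitalized words), and the ranking multiplies in-article frequency by a hypothetical popularity score; the patent's actual analyses and ranking factors are not specified here.

```python
import re
from collections import Counter

# Stand-in extractor: two or more consecutive capitalized words.
ENTITY_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)+\b")

def rank_entities(article, popularity=None):
    """Return candidate entities ordered by frequency times popularity.

    popularity: optional {entity: score} standing in for contextual factors
    such as time-sensitive trends; entities without a score default to 1.0.
    """
    popularity = popularity or {}
    counts = Counter(ENTITY_PATTERN.findall(article))
    scored = {e: n * popularity.get(e, 1.0) for e, n in counts.items()}
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scored, key=lambda e: (-scored[e], e))

article = ("Shares of Acme Motors rose today. "
           "Analysts say Acme Motors may acquire Globex Energy.")
print(rank_entities(article))
print(rank_entities(article, popularity={"Globex Energy": 5.0}))
```

The second call shows how a contextual signal can reorder the list: an entity mentioned only once can outrank a more frequent one when it is currently trending.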

    Infinite Browse
    10.
    Patent application
    Legal status: in force

    Publication number: US20110320437A1

    Publication date: 2011-12-29

    Application number: US12825304

    Filing date: 2010-06-28

    IPC classification: G06F17/30

    CPC classification: G06F17/3089 G06F17/30522

    Abstract: An online article is enhanced by displaying, in association with the article, supplemental content that includes entities that are extracted from the article and/or entities that are related to entities that are extracted from the article. The supplemental content further includes information about each of the entities. The information about an entity may be obtained by searching for the entity in one or more searchable repositories of data. For example, the supplemental content may include, for each entity, video, image, web, and/or news search results. The supplemental content may further include information such as stock quotes, abstracts, maps, scores, and so on. The entities are selected using a variety of analyses and ranking techniques based on contextual factors such as user-specific information, time-sensitive popularity trends, grammatical features, search result quality, and so on. The entities may further be selected for purposes such as generating ad-based revenue.
