Abstract:
Traditional statistical machine translation systems learn all of their information from sentence-aligned parallel text and are known to have problems translating between structurally diverse languages. To overcome this limitation, the present invention introduces two-level training, which incorporates syntactic chunking into statistical translation. A chunk-alignment step is inserted between sentence-level and word-level training, which allows the two sources of information to be trained differently: lexical properties are learned from the aligned chunks, and structural properties are learned from chunk sequences. The system consists of a linguistic processing step, two-level training, and a decoding step that combines chunk translations from multiple sources and multiple language models.
Abstract:
The performance of traditional speech recognition systems (as applied to information extraction or translation) decreases significantly with larger domain size, scarce training data, and noisy environmental conditions. This invention mitigates these problems by introducing a novel predictive feature extraction method that combines linguistic and statistical information to represent the information embedded in a noisy source language. The predictive features are combined with text classifiers to map the noisy text to one of a set of semantically or functionally similar groups. The features used by the classifier can be syntactic, semantic, and statistical.
Abstract:
The present invention discloses modular speech-to-speech translation systems and methods that provide adaptable platforms to enable verbal communication between speakers of different languages within the context of specific domains. The components of the preferred embodiments of the present invention include: (1) speech recognition; (2) machine translation; (3) an N-best merging module; (4) verification; and (5) text-to-speech. The speech recognition modules are structured to provide N-best selections and multi-stream processing, where multiple speech recognition engines may be active at any one time. The N-best lists from the one or more speech recognition engines may be handled either separately or collectively to improve both recognition and translation results. A merge module is responsible for integrating the N-best outputs of the translation engines along with confidence/translation scores to create a ranked list of recognition-translation pairs.
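The merge step described in this abstract can be illustrated with a minimal sketch. It is not the patented method: the `Hypothesis` record, the `merge_nbest` function, the linear weighting of the two scores, and the rule of keeping the best score for a duplicated pair are all assumptions made for illustration, since the abstract does not specify how the scores are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One recognition-translation pair (hypothetical record layout)."""
    recognized: str        # text produced by a speech recognition engine
    translated: str        # translation of that text
    asr_confidence: float  # recognition confidence, assumed to lie in [0, 1]
    mt_score: float        # translation score, assumed to lie in [0, 1]


def merge_nbest(streams, asr_weight=0.5):
    """Merge the N-best lists of several engines into one ranked list.

    Assumption: the combined score is a weighted sum of the recognition
    confidence and the translation score. When the same pair appears in
    more than one stream, only its best combined score is kept.
    """
    best = {}
    for stream in streams:
        for hyp in stream:
            score = (asr_weight * hyp.asr_confidence
                     + (1.0 - asr_weight) * hyp.mt_score)
            key = (hyp.recognized, hyp.translated)
            if key not in best or score > best[key]:
                best[key] = score
    # Highest combined score first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    engine_a = [
        Hypothesis("where is the station", "donde esta la estacion", 0.9, 0.8),
        Hypothesis("wear is the station", "vestir es la estacion", 0.4, 0.3),
    ]
    engine_b = [
        Hypothesis("where is the station", "donde esta la estacion", 0.7, 0.6),
    ]
    for (recognized, translated), score in merge_nbest([engine_a, engine_b]):
        print(f"{score:.2f}  {recognized} -> {translated}")
```

Handling the streams "collectively", as the abstract puts it, corresponds here to passing all lists to one call; handling them separately would mean calling the function once per engine.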
Abstract:
The present invention adopts the fundamental architecture of a statistical machine translation system, which utilizes statistical models learned from the training data and does not require the expert knowledge needed by rule-based machine translation systems. From the parallel training data, a certain number of sentence pairs are selected for manual alignment. These sentences are aligned at the phrase level instead of at the word level. The optimal amount of data to align manually may vary with the size of the training data. The alignment is done using an alignment tool with a graphical user interface that is convenient and intuitive for users. The manually aligned data are then used to improve the automatic word alignment component. Model combination methods are also introduced to improve the accuracy and the coverage of the statistical models for the task of statistical machine translation.
Abstract:
A content-providing entity receives a relatively short text from a user and attempts to determine, automatically, based on that short text (and on other available clues), a language that the user can read and understand. The content-providing entity may then provide, to the user, documents that are written in the determined language. The content-providing entity may determine a language of the input text based on several factors in combination: (a) the service provider's “market,” which is determined based on at least a portion of the URL of the Internet site to which the user directed his browser; (b) the user's “region,” which is determined based on the source Internet Protocol (IP) address of the IP packets that the user sends to the Internet site; (c) the “script” in which the short user-entered text is written; and (d) a statistical analysis of the frequency of the characters present in the short user-entered text.
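The way the four clues (a) to (d) might be combined can be sketched as follows. This is an illustrative sketch only, not the method claimed in the abstract: the lookup tables, the weights, and the function names `market_from_url`, `dominant_script`, and `guess_language` are all hypothetical, and the character statistics are reduced to a toy set of distinctive letters per language. The region is passed in as a language code because resolving an IP address to a region needs a geolocation database that is outside the scope of this sketch.

```python
import unicodedata
from collections import Counter
from urllib.parse import urlparse

# Hypothetical toy tables; a real system would use far richer data.
MARKET_BY_TLD = {"de": "de", "fr": "fr", "ru": "ru", "com": "en"}
LANGS_BY_SCRIPT = {"LATIN": ["de", "en", "fr"], "CYRILLIC": ["ru", "uk"]}
DISTINCTIVE_CHARS = {"de": set("äöüß"), "fr": set("éèêàçù"), "uk": set("іїєґ")}


def market_from_url(url, default="en"):
    """Clue (a): derive the market language from the host's last label."""
    host = urlparse(url).hostname or ""
    return MARKET_BY_TLD.get(host.rsplit(".", 1)[-1], default)


def dominant_script(text):
    """Clue (c): most common script among the alphabetic characters."""
    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # Unicode names start with the script, e.g. "CYRILLIC ...".
                scripts[name.split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else None


def char_score(text, lang):
    """Clue (d): share of letters that are distinctive for the language."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    distinctive = DISTINCTIVE_CHARS.get(lang, set())
    return sum(ch in distinctive for ch in letters) / len(letters)


def guess_language(text, market_lang, region_lang,
                   market_weight=1.0, region_weight=1.0, char_weight=3.0):
    """Combine the clues; the script restricts the candidate languages."""
    candidates = LANGS_BY_SCRIPT.get(dominant_script(text), [])
    if not candidates:
        return market_lang  # assumed fallback when the script is unknown
    scores = {}
    for lang in candidates:
        score = char_weight * char_score(text, lang)
        if lang == market_lang:
            score += market_weight
        if lang == region_lang:  # clue (b), resolved elsewhere from the IP
            score += region_weight
        scores[lang] = score
    return max(candidates, key=lambda lang: scores[lang])


if __name__ == "__main__":
    market = market_from_url("https://example.de/search")
    print(guess_language("grüße", market_lang=market, region_lang="fr"))
```

Using the script as a hard filter and the other clues as additive scores is one possible design; the abstract states only that the factors are used in combination.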
Abstract:
An online article is enhanced by displaying, in association with the article, supplemental content that includes entities that are extracted from the article and/or entities that are related to entities that are extracted from the article. The supplemental content further includes information about each of the entities. The information about an entity may be obtained by searching for the entity in one or more searchable repositories of data. For example, the supplemental content may include, for each entity, video, image, web, and/or news search results. The supplemental content may further include information such as stock quotes, abstracts, maps, scores, and so on. The entities are selected using a variety of analyses and ranking techniques based on contextual factors such as user-specific information, time-sensitive popularity trends, grammatical features, search result quality, and so on. The entities may further be selected for purposes such as generating ad-based revenue.