Abstract:
Methods, systems, and apparatus, including computer program products, for language translation are disclosed. In one implementation, a method is provided. The method includes accessing a hypothesis space; performing decoding on the hypothesis space to obtain a translation hypothesis that minimizes an expected error in classification calculated relative to an evidence space; and providing the obtained translation hypothesis for use by a user as a suggested translation in a target translation.
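The decoding step described above resembles minimum-risk decoding: choose the hypothesis whose expected loss, averaged over a weighted evidence space, is smallest. The following is only a minimal sketch under stated assumptions. The names `position_error` and `min_risk_decode`, the toy position-wise loss, and the example weights are hypothetical and are not taken from the disclosure.

```python
from typing import Callable, Dict, List


def position_error(hyp: str, ref: str) -> float:
    """Toy loss (an assumption): fraction of token positions that differ."""
    h, r = hyp.split(), ref.split()
    n = max(len(h), len(r))
    if n == 0:
        return 0.0
    mismatches = sum(
        1 for i in range(n) if i >= len(h) or i >= len(r) or h[i] != r[i]
    )
    return mismatches / n


def min_risk_decode(
    hypotheses: List[str],
    evidence: Dict[str, float],
    loss: Callable[[str, str], float] = position_error,
) -> str:
    """Return the hypothesis with the lowest expected loss under the evidence weights."""
    total = sum(evidence.values())

    def expected_loss(hyp: str) -> float:
        return sum(w / total * loss(hyp, ref) for ref, w in evidence.items())

    return min(hypotheses, key=expected_loss)


# Hypothetical evidence space: candidate translations with model weights.
evidence = {"red car": 0.40, "blue car fast": 0.35, "blue car quick": 0.25}
best = min_risk_decode(list(evidence), evidence)
```

In this toy example the highest-weighted candidate ("red car") is not selected, because the two similar candidates support each other and lower the expected loss of "blue car fast".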
Abstract:
A method and apparatus are provided for updating the vocabulary of a speech translation system for translating a first language into a second language including written and spoken words. The method includes adding a new word in the first language to a first recognition lexicon of the first language and associating a description with the new word, wherein the description contains pronunciation and word class information. The new word and description are then updated in a first machine translation module associated with the first language. The first machine translation module contains a first tagging module, a first translation model and a first language module, and is configured to translate the new word to a corresponding translated word in the second language. Optionally, the invention may be used for bidirectional or multi-directional translation.
Abstract:
A method of determining the consistency of training data for a machine translation system is disclosed. The method includes receiving a signal indicative of a source language corpus and a target language corpus. A textual string is extracted from the source language corpus. The textual string is aligned with the target language corpus to identify a translation for the textual string from the target language corpus. A consistency index is calculated based on a relationship between the textual string from the source language corpus and the translation. An indication of the consistency index is stored on a tangible medium.
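The abstract does not say how the consistency index is computed. As one possible illustration only, the sketch below assumes the index is the share of a source string's aligned occurrences that use its most frequent translation. The function name `consistency_index` and the sample aligned pairs are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def consistency_index(
    aligned_pairs: List[Tuple[str, str]], source_string: str
) -> Optional[float]:
    """Assumed definition: share of occurrences using the most common translation."""
    translations = Counter(t for s, t in aligned_pairs if s == source_string)
    total = sum(translations.values())
    if total == 0:
        return None  # the string never occurs in the aligned data
    return translations.most_common(1)[0][1] / total


# Hypothetical aligned (source, target) string pairs.
pairs = [
    ("Save", "Speichern"),
    ("Save", "Speichern"),
    ("Save", "Sichern"),
    ("Open", "Öffnen"),
]
save_index = consistency_index(pairs, "Save")  # 2 of 3 occurrences agree
open_index = consistency_index(pairs, "Open")  # a single, consistent translation
```

A value near 1.0 would suggest the string is translated consistently in the training data; lower values would flag strings worth reviewing.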
Abstract:
Improved methods, apparatus and computer program products implementing an index-based search algorithm for use with a translation program are disclosed herein. In the method, a group of words (such as menu items) is selected for translation from one language to another. Then, an index for use in conducting search-and-match operations during language translation operations is constructed for the group of words. In constructing the index, the frequency of appearance of characters used in the words to be translated is calculated. Then, coverage calculations are made based on various selections of key characters. Next, key characters are chosen based on the coverage calculations. Then, a key character index is constructed using the key characters. In one embodiment, the key character index comprises, for each word, an identification of which key characters appear in the word and where the key characters appear in the word. After the key character index is constructed, words not containing key characters are identified in a specific set. The key character index and specific set index are then used during search-and-match language translation operations.
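The index construction steps above can be sketched roughly as follows. This is an illustration under assumptions, not the disclosed algorithm: the greedy coverage strategy, the function names `choose_key_chars` and `build_indexes`, and the sample menu items are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def choose_key_chars(words: List[str], max_keys: int) -> List[str]:
    """Greedy assumption: pick the character covering the most uncovered words."""
    uncovered = set(words)
    keys: List[str] = []
    while uncovered and len(keys) < max_keys:
        # Count, for each character, how many uncovered words contain it.
        freq = Counter(ch for w in uncovered for ch in set(w))
        if not freq:
            break
        # Sorting first makes ties resolve deterministically by character.
        ch, _ = max(sorted(freq.items()), key=lambda kv: kv[1])
        keys.append(ch)
        uncovered = {w for w in uncovered if ch not in w}
    return keys


def build_indexes(
    words: List[str], keys: List[str]
) -> Tuple[Dict[str, Dict[str, List[int]]], List[str]]:
    """Return (key character index, specific set of words with no key character)."""
    index: Dict[str, Dict[str, List[int]]] = {}
    specific: List[str] = []
    for w in words:
        hits = {k: [i for i, ch in enumerate(w) if ch == k] for k in keys if k in w}
        if hits:
            index[w] = hits  # which key characters appear, and where
        else:
            specific.append(w)
    return index, specific


menu_items = ["File", "Edit", "View", "Help", "Run"]
keys = choose_key_chars(menu_items, max_keys=2)
key_index, specific_set = build_indexes(menu_items, keys)
```

With two key characters, four of the five sample words land in the key character index and the remaining word falls into the specific set, which a search-and-match step would consult separately.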
Abstract:
A statistical machine translation (MT) system may use a large monolingual corpus to improve the accuracy of translated phrases/sentences. The MT system may produce alternative translations and use the large monolingual corpus to (re)rank the alternative translations.
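One way to picture the reranking step is to score each alternative with statistics gathered from the monolingual corpus. The sketch below assumes an add-one smoothed bigram model, which the abstract does not specify. The function names and the tiny corpus are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def bigram_counts(corpus: List[str]) -> Tuple[Counter, Counter]:
    """Collect context (unigram) and bigram counts from a monolingual corpus."""
    contexts, bigrams = Counter(), Counter()
    for sentence in corpus:
        toks = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
    return contexts, bigrams


def log_score(sentence: str, contexts: Counter, bigrams: Counter) -> float:
    """Add-one smoothed bigram log-probability (an assumed scoring function)."""
    toks = ["<s>"] + sentence.lower().split() + ["</s>"]
    vocab = len(contexts) + 1
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab))
        for a, b in zip(toks, toks[1:])
    )


def rerank(alternatives: List[str], corpus: List[str]) -> List[str]:
    """Order alternative translations from most to least fluent under the model."""
    contexts, bigrams = bigram_counts(corpus)
    return sorted(
        alternatives, key=lambda s: log_score(s, contexts, bigrams), reverse=True
    )


corpus = ["the house is small", "the house is big", "a small house"]
alternatives = ["house the is small", "the home is small", "the house is small"]
ranked = rerank(alternatives, corpus)
```

A practical system would likely combine such a monolingual score with the translation model's own score and normalize for sentence length; this sketch ranks on the monolingual score alone.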
Abstract:
A method includes detecting a syntactic chunk in a source string in a first language, assigning a syntactic label to the detected syntactic chunk in the source string, mapping the detected syntactic chunk in the source string to a syntactic chunk in a target string in a second language, said mapping based on the assigned syntactic label, and translating the source string into a possible translation in the second language.
Abstract:
A method and system for reducing lexical ambiguity in an input stream (302) are described. In one embodiment, the input stream (302) is broken into tokens (308). The tokens are used to create a connection graph comprising a number of paths. Each of the paths is assigned a cost (318). At least one best path is defined based upon a corresponding cost to generate an output graph (322). The generated output graph is provided to reduce lexical ambiguity.
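The connection graph and best-path selection can be illustrated with a small dynamic program. The sketch assumes each token has a few candidate readings with a cost, plus a cost for moving between adjacent readings. The labels, costs, and the names `best_path` and `transition_cost` are hypothetical and do not correspond to the reference numerals in the abstract.

```python
from typing import Callable, Dict, List, Tuple


def best_path(
    candidates: List[Dict[str, int]],
    transition_cost: Callable[[str, str], int],
) -> Tuple[int, List[str]]:
    """Return (total cost, labels) of the cheapest path through the graph."""
    if not candidates:
        return 0, []
    # best[label] = (cost of the cheapest path ending in label, that path)
    best = {label: (cost, [label]) for label, cost in candidates[0].items()}
    for options in candidates[1:]:
        step = {}
        for label, cost in options.items():
            prev_label, (prev_cost, prev_path) = min(
                best.items(),
                key=lambda kv: kv[1][0] + transition_cost(kv[0], label),
            )
            total = prev_cost + transition_cost(prev_label, label) + cost
            step[label] = (total, prev_path + [label])
        best = step
    return min(best.values(), key=lambda v: v[0])


# Hypothetical readings and costs for the tokens "time", "flies", "fast".
candidates = [
    {"NOUN": 1, "VERB": 2},
    {"NOUN": 2, "VERB": 1},
    {"ADV": 1, "ADJ": 2},
]
cheap_transitions = {("NOUN", "VERB"), ("VERB", "ADV")}


def transition_cost(prev: str, cur: str) -> int:
    return 0 if (prev, cur) in cheap_transitions else 1


cost, labels = best_path(candidates, transition_cost)
```

Keeping only the cheapest path into each reading at every token avoids enumerating all paths, which grow exponentially with the number of tokens.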
Abstract:
In one aspect, the invention relates to a method of allowing a user to view and modify a weighting associated with a translation of a source language string. The method includes displaying to the user the weighting associated with the translation of the source language string, the weighting for use by a translation engine in selecting the translation, and allowing the user to modify the weighting associated with the translation. In one embodiment, the method further includes allowing the user to reset the weighting back to a default value subsequent to user modification of the weighting.
Abstract:
A discourse structure for an input text segment is determined by generating a set of one or more discourse parsing decision rules based on a training set, and determining a discourse structure for the input text segment by applying the generated set of discourse parsing decision rules to the input text segment. A tree structure is summarized by generating a set of one or more summarization decision rules based on a training set, and compressing the tree structure by applying the generated set of summarization decision rules to the tree structure. Alternatively, summarization is accomplished by parsing an input text segment to generate a parse tree for the input segment, generating a plurality of potential solutions, applying a statistical model to determine a probability of correctness for each potential solution, and extracting one or more high-probability solutions based on the solutions' respective determined probabilities of correctness.