Abstract:
Some examples may include generating a browsing history language model based on browsing history information. Further, some implementations may include predicting and presenting a non-Latin character string based at least in part on the browsing history language model, such as in response to receiving a Latin character string via an input method editor interface.
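
As a minimal sketch of the idea above, and not the disclosed implementation, the following Python builds a simple word-frequency model from browsing history text and uses it to order candidate non-Latin strings for a Latin input. The function names, the pre-segmented page texts, and the hypothetical `candidate_table` mapping Latin input to candidates are all assumptions made for illustration.

```python
from collections import Counter

def build_history_model(page_texts):
    """Count how often each word appears in pre-segmented page texts (assumed input)."""
    counts = Counter()
    for text in page_texts:
        counts.update(text.split())
    return counts

def rank_candidates(latin_input, candidate_table, model):
    """Order the candidates for latin_input by frequency in the history model."""
    candidates = candidate_table.get(latin_input, [])
    # sorted() is stable, so ties keep the order of the candidate table.
    return sorted(candidates, key=lambda c: -model[c])

# Hypothetical browsing history and candidate table.
history = ["北京 天气 预报", "北京 地铁 线路", "背景 音乐"]
table = {"beijing": ["背景", "北京"]}
model = build_history_model(history)
ranked = rank_candidates("beijing", table, model)
```

Here the candidate seen more often in the history is promoted ahead of the table's default order.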
Abstract:
Some examples include generating a personal language model based on linguistic characteristics of one or more files stored at one or more locations in a file system. Further, some implementations include predicting and presenting a non-Latin character string based at least in part on the personal language model, such as in response to receiving a Latin character string via an input method editor interface.
Abstract:
Disclosed is a method for displaying an advertisement. The method displays a present advertisement, determines whether the present advertisement has been displayed completely, and adds an identifier of the present advertisement to a priority advertisement list if the present advertisement has not been displayed completely. The method sends the priority advertisement list to the advertisement engine when requesting that the advertisement engine display a next advertisement. Using the priority advertisement list, the advertisement engine may give priority to the present advertisement in the next advertisement assignment. By using an optimized advertisement display strategy, the disclosed method may increase coverage rates of advertisement contents to audiences, thereby improving advertisement effectiveness for advertisers and increasing cash flow return for website owners.
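
The bookkeeping described above might be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the advertisement engine is reduced to a single selection function, and removing an identifier once its advertisement has been displayed completely is an assumption the abstract does not state.

```python
def update_priority_list(priority_list, ad_id, displayed_completely):
    """Return a new priority list after the present advertisement was shown."""
    if displayed_completely:
        # Assumption: a completely displayed advertisement no longer needs priority.
        return [a for a in priority_list if a != ad_id]
    if ad_id not in priority_list:
        return priority_list + [ad_id]
    return list(priority_list)

def choose_next_ad(priority_list, inventory):
    """Stand-in for the advertisement engine: prefer prioritized advertisements."""
    for ad_id in priority_list:
        if ad_id in inventory:
            return ad_id
    return inventory[0] if inventory else None

# "ad-1" was interrupted, so it is prioritized on the next request.
pending = update_priority_list([], "ad-1", displayed_completely=False)
next_ad = choose_next_ad(pending, ["ad-2", "ad-1"])
```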
Abstract:
A method and apparatus for segmenting text is provided that identifies a sequence of entity types from a sequence of characters and thereby identifies a segmentation for the sequence of characters. Under the invention, the sequence of entity types is identified using probabilistic models that describe the likelihood of a sequence of entities and the likelihood of sequences of characters given particular entities. Under one aspect of the invention, organization name entities are identified from a first sequence of identified entities to form a final sequence of identified entities.
Abstract:
Candidate suggestions for correcting misspelled query terms input into a search application are automatically generated. A score for each candidate suggestion can be generated using a first decoding pass and paths through the suggestions can be ranked in a second decoding pass. Candidate suggestions can be generated based on typographical errors, phonetic mistakes and/or compounding mistakes. Furthermore, a ranking model can be developed to rank candidate suggestions to be presented to a user.
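
To illustrate only the typographical-error case, the sketch below generates candidates within one edit of a query term and orders them by a vocabulary frequency count. It does not reproduce the two decoding passes or the ranking model mentioned above; the single-edit generator, the frequency-based ordering, and all names are assumptions.

```python
import string

def edits1(word):
    """All strings one deletion, transposition, replacement, or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def suggest(term, vocab_counts):
    """Return known words near term, most frequent first (ties broken alphabetically)."""
    if term in vocab_counts:
        return [term]
    candidates = [c for c in edits1(term) if c in vocab_counts]
    return sorted(candidates, key=lambda c: (-vocab_counts[c], c))

# Hypothetical vocabulary with frequency counts.
vocab = {"search": 50, "starch": 5, "query": 30}
suggestions = suggest("serach", vocab)
```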
Abstract:
A method for resolving overlapping ambiguity strings in unsegmented languages such as Chinese. The methodology includes segmenting sentences into two possible segmentations and recognizing overlapping ambiguity strings in the sentences. One of the two possible segmentations is selected as a function of probability information. The probability information is derived from unsupervised training data. A method of constructing a knowledge base containing the probability information needed to select one of the segmentations is also provided.
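
One way to picture the selection step is the sketch below. It assumes that the two segmentations come from forward and backward maximum matching against a lexicon and that the probability information is a table of word log-probabilities; the abstract specifies neither, and every name and value here is hypothetical.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation, taking the longest lexicon word each time."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:  # single characters always match
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text, lexicon, max_len=4):
    """Greedy right-to-left segmentation, taking the longest lexicon word each time."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]

def choose_segmentation(text, lexicon, log_prob, unseen=-20.0):
    """When the two segmentations disagree, keep the one with the higher total score."""
    fmm = forward_max_match(text, lexicon)
    bmm = backward_max_match(text, lexicon)
    if fmm == bmm:
        return fmm
    def score(seg):
        return sum(log_prob.get(w, unseen) for w in seg)
    return max((fmm, bmm), key=score)

# Hypothetical lexicon and log-probabilities.
lexicon = {"研究", "研究生", "生命"}
log_prob = {"研究": -3.0, "生命": -4.0, "研究生": -5.0, "命": -8.0}
chosen = choose_segmentation("研究生命", lexicon, log_prob)
```

In this example the forward pass yields 研究生 / 命 and the backward pass yields 研究 / 生命, so the string is an overlapping ambiguity and the scores decide between the two.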
Abstract:
A distributional similarity between a word of a search query and a term of a candidate word sequence is used to determine an error model probability that describes the probability of the search query given the candidate word sequence. The error model probability is used to determine a probability of the candidate word sequence given the search query. The probability of the candidate word sequence given the search query is used to select a candidate word sequence as a corrected word sequence for the search query. Distributional similarity is also used to build features that are applied in a maximum entropy model to compute the probability of the candidate word sequence given the search query.
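
A rough, single-word sketch of this scoring follows. It assumes that distributional similarity is the cosine between context-count vectors and that the similarity is used directly as an unnormalized stand-in for the error model probability, which is then multiplied by a prior over candidates. The abstract does not give these details, and the names and data are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse context-count vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def best_correction(query_word, candidates, contexts, prior):
    """Pick the candidate maximizing similarity(query, candidate) * prior(candidate)."""
    def score(candidate):
        # Assumption: similarity serves as an unnormalized error model score.
        return cosine(contexts[query_word], contexts[candidate]) * prior[candidate]
    return max(candidates, key=score)

# Hypothetical context counts and candidate priors.
contexts = {
    "amzon": {"books": 3, "prime": 2},
    "amazon": {"books": 30, "prime": 25, "river": 2},
    "amazing": {"grace": 10, "books": 1},
}
prior = {"amazon": 0.6, "amazing": 0.4}
corrected = best_correction("amzon", ["amazon", "amazing"], contexts, prior)
```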
Abstract:
An ensemble of random feature clusters is built from training data using a clustering algorithm where some randomness has been introduced. For each clustered feature space, a classifier, such as a Naïve Bayesian Classifier, is trained, realizing a classifier ensemble. The final classification decision is made by the resulting classifier ensemble.
Abstract:
A method of post-processing character data from an optical character recognition (OCR) engine and apparatus to perform the method. This exemplary method includes segmenting the character data into a set of initial words. The set of initial words is word level processed to determine at least one candidate word corresponding to each initial word. The set of initial words is segmented into a set of sentences. Each sentence in the set of sentences includes a plurality of initial words and candidate words corresponding to the initial words. A sentence is selected from the set of sentences. The selected sentence is word disambiguity processed to determine a plurality of final words. A final word is selected from the at least one candidate word corresponding to a matching initial word. The plurality of final words is then assembled as post-processed OCR data.