摘要:
Systems, methods, and apparatuses, including computer program products, are provided for representing language models. In some implementations, a computer-implemented method is provided. The method includes generating a compact language model including receiving a collection of n-grams from the corpus, each n-gram of the collection having a corresponding first probability of occurring in the corpus and generating a trie representing the collection of n-grams. The method also includes using the language model to identify a second probability of a particular string of words occurring.
摘要:
Systems, methods, and apparatuses, including computer program products, are provided for representing language models. In some implementations, a computer-implemented method is provided. The method includes generating a compact language model including receiving a collection of n-grams from the corpus, each n-gram of the collection having a corresponding first probability of occurring in the corpus and generating a trie representing the collection of n-grams. The method also includes using the language model to identify a second probability of a particular string of words occurring.
摘要:
This specification describes technologies relating to providing search results. One aspect of the subject matter described in this specification can be embodied in methods that include the actions of receiving a network resource, the network resource including text content; generating a language model score for the resource including applying a language model to the text content of the resource; generating a query stuffing score for the reference, the query stuffing score being a function of term frequency in the resource content and a query index; calculating a gibberish score for the resource using the language model score and the query stuffing score; and using the calculated gibberish score to determine whether to modify a ranking score of the resource.
摘要:
Systems, methods, and apparatuses, including computer program products, are provided for machine translation using information retrieval techniques. In general, in one implementation, a method is provided. The method includes providing a received input segment as a query to a search engine, the search engine searching an index of one or more collections of documents, receiving one or more candidate segments in response to the query, determining a similarity of each candidate segment to the received input segment, and for one or more candidate segments having a determined similarity that exceeds a threshold similarity, providing a translated target segment corresponding to the respective candidate segment.
摘要:
A server is configured to receive, from a client, a query and context information associated with a document; obtain search results, based on the query, that identify documents relevant to the query; analyze the context information to identify content; generate first scores for a hierarchy of topics, that correspond to measures of relevance of the topics to the content; select a topic that is most relevant to the context information when the topic is associated with a greatest first score; generate second scores for the search results that correspond to measures of relevance, of the search results, to the topic; select one or more of the search results as being most relevant to the topic when the search results are associated with one or more greatest second scores; generate a search result document that includes the selected search results; and send, to a client, the search result document.
摘要:
Systems, methods, and computer program products for machine translation are provided. In some implementations a system is provided. The system includes a language model including a collection of n-grams from a corpus, each n-gram having a corresponding relative frequency in the corpus and an order n corresponding to a number of tokens in the n-gram, each n-gram corresponding to a backoff n-gram having an order of n-1 and a collection of backoff scores, each backoff score associated with an n-gram, the backoff score determined as a function of a backoff factor and a relative frequency of a corresponding backoff n-gram in the corpus.
摘要:
Systems, methods, and apparatuses including computer program products are provided for encoding and using a language model. In one implementation, a method is provided. The method includes generating a compact language model, including receiving a collection of n-grams, each n-gram having one or more associated parameter values, determining a fingerprint for each n-gram of the collection of n-grams, identifying locations in an array for each n-gram using a plurality of hash functions, and encoding the one or more parameter values associated with each n-gram in the identified array locations as a function of corresponding array values and the fingerprint for the n-gram.
摘要:
A semantic locator determines whether input sequences form semantically meaningful units. The semantic locator includes a coherence component that calculates a coherence of the terms in the sequence and a variation component that calculates the variation in terms that surround the sequence. A heuristics component may additionally refine results of the coherence component and the variation component. A decision component may make the determination of whether the sequence is a semantic unit based on the results of the coherence component, variation component, and heuristics component.
摘要:
A semantic locator determines whether input sequences form semantically meaningful units. The semantic locator includes a coherence component that calculates a coherence of the terms in the sequence and a variation component that calculates the variation in terms that surround the sequence. A heuristics component may additionally refine results of the coherence component and the variation component. A decision component may make the determination of whether the sequence is a semantic unit based on the results of the coherence component, variation component, and heuristics component.
摘要:
Techniques for new event detection are provided. For a new story and a corpus of stories, story-pairs based on the new story and each corpus story are determined. Adjustments to the importance of terms are determined based on story characteristics associated with each story. Story characteristics are based on direct or indirect characteristics. Direct story characteristics include authorship, language associated with a story and the like. Indirect story characteristics may include derived characteristics such as an ROI category characteristic, a same ROI characteristic, a same event-same source characteristic, an average story similarity characteristic or any other known or later developed characteristic associated with a story. Adjustments to the inter-story similarity metrics are then determined based on story characteristics and/or a weighting function. New event scores and/or new event categorizations for stories are determined based on the inter-story similarity metrics and the adjustments based on the story characteristics. Optionally new events are selected based on new event scores and a threshold value.