Abstract:
Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints matches.
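The three steps (i) to (iii) can be illustrated with a minimal Python sketch. The abstract does not say what a "part" is, how parts are assigned to lists, or which hash is used, so everything here is an assumption made for illustration: parts are lowercase words, each word goes to exactly one list chosen by a hash of the word, and SHA-1 serves as the fingerprint function. The names `fingerprints` and `near_duplicates` are hypothetical.

```python
import hashlib


def fingerprints(text, num_lists=4):
    """Return a set of (list index, fingerprint) pairs for one document."""
    lists = [[] for _ in range(num_lists)]
    # (i) Extract parts: here, lowercase whitespace-separated words (assumption).
    for word in text.lower().split():
        # (ii) Assign each part to one of the predetermined lists (assumption:
        # the list is chosen from the first byte of the word's SHA-1 digest).
        digest = hashlib.sha1(word.encode("utf-8")).digest()
        lists[digest[0] % num_lists].append(word)
    # (iii) Generate one fingerprint per populated list; empty lists are skipped.
    result = set()
    for index, parts in enumerate(lists):
        if parts:
            joined = " ".join(parts).encode("utf-8")
            result.add((index, hashlib.sha1(joined).hexdigest()))
    return result


def near_duplicates(doc_a, doc_b, num_lists=4):
    """Two documents count as near-duplicates if any fingerprint matches."""
    return not fingerprints(doc_a, num_lists).isdisjoint(
        fingerprints(doc_b, num_lists)
    )
```

Because a small edit only disturbs the lists that the changed parts fall into, the fingerprints of the remaining lists can still match, which is what makes the "any fingerprint matches" test tolerant of minor differences.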
Abstract:
The present disclosure includes, among other things, systems, methods and program products for selecting subsequences (shingles or tuples) generated from sequences of tokens.
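As a rough illustration of what generating and selecting shingles from a token sequence can look like, the Python sketch below builds all contiguous k-token tuples and then keeps a small subset. The selection rule (keep the distinct shingles with the smallest MD5 hash values) is an assumption chosen for illustration; the abstract does not state the selection criterion. The function names are hypothetical.

```python
import hashlib


def shingles(tokens, k=3):
    """Return every contiguous k-token subsequence as a tuple, in order."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def select_shingles(tokens, k=3, keep=2):
    """Keep the `keep` distinct shingles with the smallest hash (assumption)."""
    def hash_key(shingle):
        return hashlib.md5(" ".join(shingle).encode("utf-8")).hexdigest()

    return sorted(set(shingles(tokens, k)), key=hash_key)[:keep]
```

For example, `shingles("a b c d".split(), 3)` yields `[("a", "b", "c"), ("b", "c", "d")]`, and a sequence shorter than `k` yields an empty list.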
Abstract:
A system performs cross-language query translations. The system receives a search query that includes terms in a first language and determines possible translations of the terms of the search query into a second language. The system also locates documents for use as parallel corpora to aid in the translation by: (1) locating documents in the first language that contain references that match the terms of the search query and identify documents in the second language; (2) locating documents in the first language that contain references that match the terms of the query and refer to other documents in the first language and identify documents in the second language that contain references to the other documents; or (3) locating documents in the first language that match the terms of the query and identify documents in the second language that contain references to the documents in the first language. The system may use the second language documents as parallel corpora to disambiguate among the possible translations of the terms of the search query and identify one of the possible translations as a likely translation of the search query into the second language.
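The disambiguation step at the end of this abstract can be sketched in Python as follows. This is only one plausible reading: it assumes the second-language documents have already been located, that each query term comes with a list of candidate translations, and that the likely translation is the combination of candidates that co-occurs in the most documents. The scoring rule and the name `likely_translation` are assumptions, not details given in the abstract.

```python
from itertools import product


def likely_translation(candidates, second_language_docs):
    """Pick the candidate combination that co-occurs in the most documents.

    candidates: one list of possible translations per query term.
    second_language_docs: document texts used as parallel corpora.
    """
    tokenized = [set(doc.lower().split()) for doc in second_language_docs]

    def score(combination):
        # Count documents that contain every term of this combination.
        return sum(all(term in words for term in combination) for words in tokenized)

    return max(product(*candidates), key=score)


# Usage: "bank money" with two candidate translations for "bank".
docs = ["el banco guarda dinero", "dinero en el banco", "la orilla del rio"]
best = likely_translation([["banco", "orilla"], ["dinero"]], docs)
```

In this usage example `("banco", "dinero")` appears together in two documents and `("orilla", "dinero")` in none, so the first combination is chosen.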
Abstract:
A system provides search results from a voice search query. The system receives a voice search query from a user, derives one or more recognition hypotheses, each being associated with a weight, from the voice search query, and constructs a weighted boolean query using the recognition hypotheses. The system then provides the weighted boolean query to a search system and provides the results of the search system to a user.
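One way to picture the construction of a weighted boolean query from recognition hypotheses is the Python sketch below. The query syntax (`AND` within a hypothesis, `OR` between hypotheses, `^weight` as a boost suffix) is an assumption for illustration and is not taken from the abstract or from any particular search system; `weighted_boolean_query` is a hypothetical name.

```python
def weighted_boolean_query(hypotheses):
    """Build a query string from (text, weight) recognition hypotheses.

    Each hypothesis becomes a conjunction of its terms, boosted by its
    weight; hypotheses are joined with OR, highest weight first.
    """
    clauses = []
    for text, weight in sorted(hypotheses, key=lambda h: h[1], reverse=True):
        terms = " AND ".join(text.split())
        clauses.append(f"({terms})^{weight:.2f}")
    return " OR ".join(clauses)


# Usage: two hypotheses for the same utterance.
query = weighted_boolean_query([("pizza hat", 0.3), ("pizza hut", 0.7)])
# query == "(pizza AND hut)^0.70 OR (pizza AND hat)^0.30"
```

Keeping the lower-weighted hypotheses in the query, rather than searching only for the top one, lets the search system recover when the best recognition guess is wrong.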
Abstract:
A server computer is provided for representing and navigating the connectivity of Web pages. The Web pages include links to other Web pages. The links and Web pages have associated names (URLs). The names of the Web pages are sorted in a memory of the connectivity server. The sorted names are delta encoded while periodically storing full names as checkpoints in the memory. Each delta encoded name and checkpoint has a unique identification. A list of pairs of identifications representing existent links is sorted twice, first according to the first identification of each pair to produce an inlist, and second according to the second identification of each pair to produce an outlist. An array of elements is stored in the memory, with one array element for each Web page. Each element includes a first pointer to one of the checkpoints, a second pointer to an associated inlist of the Web page, and a third pointer to an associated outlist of the Web page. The array is indexed by a particular identification to locate connected Web pages.
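The delta encoding with periodic checkpoints can be sketched in Python as below. This covers only the name storage, not the inlist, outlist, or pointer array. It assumes a simple prefix-sharing delta (length of the prefix shared with the previous name, plus the remaining suffix) and a fixed checkpoint interval; both are illustrative choices, and `delta_encode` and `decode` are hypothetical names.

```python
import os


def delta_encode(names, checkpoint_every=3):
    """Sort names and store each as (shared_prefix_length, suffix).

    Every `checkpoint_every`-th entry is a checkpoint holding the full name,
    encoded as (0, full_name). A name's position in the result is its ID.
    """
    encoded = []
    previous = ""
    for i, name in enumerate(sorted(names)):
        if i % checkpoint_every == 0:
            encoded.append((0, name))  # checkpoint: full name
        else:
            shared = len(os.path.commonprefix([previous, name]))
            encoded.append((shared, name[shared:]))
        previous = name
    return encoded


def decode(encoded, index, checkpoint_every=3):
    """Rebuild the name at `index` by replaying deltas from its checkpoint."""
    start = index - index % checkpoint_every
    name = ""
    for shared, suffix in encoded[start:index + 1]:
        name = name[:shared] + suffix
    return name


# Usage
urls = ["http://b.com/", "http://a.com/y", "http://a.com/x"]
table = delta_encode(urls, checkpoint_every=2)
```

The checkpoints bound the work needed to recover any single name: decoding replays at most `checkpoint_every` entries instead of the whole sorted list.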
Abstract:
A system limits search results based on context information. The system obtains the context information and a search query, and obtains a set of references to documents in response to the search query. The system then filters the set of references based on the context information and presents the filtered set of references to a user.
Abstract:
A media stream, such as a news broadcast, is supplemented with documents that are relevant to the media stream. The documents may be web pages returned from a search engine. A search query generation component generates search queries for the search engine based on the media stream. A post processing component may re-rank and/or filter the documents to enhance the viewing experience for the user.