Abstract:
Disclosed herein are methods, systems and architectures for normalizing identifiers corresponding to resources using normalization rules that can be generalized for use with different resources. By way of a non-limiting example, an identifier can be a uniform resource locator (URL), and a normalization rule can be used to normalize URLs that correspond to different resources, e.g., content. A normalization rule can be generated by generalizing two or more normalization rules corresponding to different resources, such that a content determinative component is generalized. A normalization rule can be defined to include a context portion used to determine the rule's applicability to an identifier, and a transformation portion that identifies the transformations to be applied to an applicable identifier to yield a normalized form of the identifier. A generalization of two or more normalization rules can include a normalization of one or both of the context and transformation portions.
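The split between a context portion and a transformation portion can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the disclosure: the `NormalizationRule` class, the choice of host plus path segments as the context, the choice of "drop these query parameters" as the only transformation, and the `generalize` helper, which replaces differing path segments (standing in for the content determinative component) with a wildcard and, as a simplification, only merges rules whose transformations are identical.

```python
from dataclasses import dataclass
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


@dataclass(frozen=True)
class NormalizationRule:
    host: str                 # context: host the rule applies to
    path_segments: tuple      # context: path segments, None acts as a wildcard
    drop_params: frozenset    # transformation: query parameters to remove

    def applies_to(self, url):
        """Context check: does this rule apply to the given URL?"""
        parts = urlsplit(url)
        segments = tuple(s for s in parts.path.split("/") if s)
        if parts.netloc != self.host or len(segments) != len(self.path_segments):
            return False
        return all(p is None or p == s for p, s in zip(self.path_segments, segments))

    def normalize(self, url):
        """Apply the transformation if the context matches, else return url unchanged."""
        if not self.applies_to(url):
            return url
        parts = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                if k not in self.drop_params]
        return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def generalize(rule_a, rule_b):
    """Merge two rules by wildcarding the path segments on which they differ.

    Returns None when the rules are too different to merge under this
    simplified scheme (different host, path depth, or transformation).
    """
    if (rule_a.host != rule_b.host
            or len(rule_a.path_segments) != len(rule_b.path_segments)
            or rule_a.drop_params != rule_b.drop_params):
        return None
    merged = tuple(a if a == b else None
                   for a, b in zip(rule_a.path_segments, rule_b.path_segments))
    return NormalizationRule(rule_a.host, merged, rule_a.drop_params)


# Two rules learned for two different items generalize to one rule for any item.
rule_123 = NormalizationRule("example.com", ("item", "123"), frozenset({"sessionid"}))
rule_456 = NormalizationRule("example.com", ("item", "456"), frozenset({"sessionid"}))
general_rule = generalize(rule_123, rule_456)
print(general_rule.normalize("http://example.com/item/789?sessionid=abc&page=2"))
```

In this sketch the generalized rule applies to an item page that neither original rule covered, which is the practical benefit the abstract describes.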
Abstract:
Techniques are disclosed for detecting web pages with duplicate content. In one embodiment, a set of shingles is computed for each page of a group of pages. An aggregate set of shingles is determined based on the sets of shingles computed for the group of pages. A first subset from the aggregate set of shingles is determined by selecting, from the aggregate set, shingles whose frequencies in the aggregate set exceed a specified threshold. A modified set of shingles is generated for each page of the group of pages by removing, from the set of shingles for that page, any shingle included in the first subset. One or more duplicate pages in the group of pages are determined based at least in part on the modified sets of shingles generated for the group of pages.
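The steps in this abstract can be sketched in Python as follows. Several details are assumptions that the abstract does not specify: shingles are taken to be word 3-grams, the "frequency" of a shingle is taken to be the number of pages containing it, and the final duplicate decision uses Jaccard similarity with a hypothetical `min_similarity` cutoff. The function names are illustrative only.

```python
from collections import Counter


def shingles(text, k=3):
    """Return the set of word k-grams (shingles) of a page's text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def modified_shingle_sets(pages, k=3, threshold=2):
    """Remove from each page the shingles that occur on more than `threshold` pages."""
    per_page = {name: shingles(text, k) for name, text in pages.items()}
    # Aggregate set with frequencies: number of pages containing each shingle.
    frequency = Counter(s for page_set in per_page.values() for s in page_set)
    frequent = {s for s, count in frequency.items() if count > threshold}
    return {name: page_set - frequent for name, page_set in per_page.items()}


def jaccard(a, b):
    """Jaccard similarity; two empty sets are treated as dissimilar (0.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_duplicates(pages, k=3, threshold=2, min_similarity=0.8):
    """Return pairs of page names whose modified shingle sets are near-identical."""
    modified = modified_shingle_sets(pages, k, threshold)
    names = sorted(modified)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if jaccard(modified[a], modified[b]) >= min_similarity]


boilerplate = "acme corp home about contact"
pages = {
    "a": boilerplate + " red widgets on sale today",
    "b": boilerplate + " red widgets on sale today",
    "c": boilerplate + " blue gadgets shipping next week",
    "d": boilerplate + " green gizmos are out of stock",
}
print(find_duplicates(pages))
```

Removing the frequent shingles keeps site-wide boilerplate such as navigation text from inflating the similarity between pages whose actual content differs.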
Abstract:
A method and apparatus are provided for determining whether two websites are co-owned. In one example, the method includes obtaining redirect URL (uniform resource locator) pairs from the Internet, constructing a training set using the redirect URL pairs, constructing a feature set based on the training set, and learning co-ownership decisions based on the feature set and the training set.
Abstract:
A geographic region is automatically determined for an Internet resource based on information that has been gathered over time through the automatic monitoring of certain “click” activities of Internet search engine-using users. Over time, the search engine collects information for each click. Using this click-related data, the search engine estimates the geographic region with which the resource ought to be associated. The fact that a significant proportion of clicks on a resource's hyperlink are clicks that “came through” a search engine portal that is associated with a geographic region tends to suggest that the resource ought to be associated with that geographic region. Similarly, the fact that a significant proportion of clicks on a resource's hyperlink are clicks that were made by users whose computers have IP addresses that are associated with a geographic region tends to suggest that the resource ought to be associated with that geographic region.
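One way to combine the two signals described here (the region of the search portal a click came through, and the region of the clicking user's IP address) is a simple vote count. The Python sketch below is only an illustration under stated assumptions: the abstract does not say how the signals are weighted or what counts as a "significant proportion", so the equal weighting, the `min_share` cutoff, and the `estimate_region` function name are all hypothetical.

```python
from collections import Counter


def estimate_region(clicks, min_share=0.5):
    """Estimate a resource's region from click records.

    `clicks` is a list of (portal_region, ip_region) tuples; either entry may
    be None when that signal is unknown. Each known signal casts one vote.
    Returns the leading region if it holds at least `min_share` of all votes,
    otherwise None.
    """
    votes = Counter()
    for portal_region, ip_region in clicks:
        for region in (portal_region, ip_region):
            if region is not None:
                votes[region] += 1
    total = sum(votes.values())
    if total == 0:
        return None
    region, count = votes.most_common(1)[0]
    return region if count / total >= min_share else None


clicks = [("uk", "uk"), ("uk", "uk"), ("uk", "fr"), ("us", "us")]
print(estimate_region(clicks))
```

Here "uk" receives 5 of the 8 votes, so it clears the assumed 50% cutoff; a resource whose clicks are spread evenly across regions would be left unassigned.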
Abstract:
A system is provided for automatically correcting a mistaken geocoded location input. A wireless device such as a cell phone ranks possible location matches by edit distance, which serves as a ‘confidence factor’. If there is no perfect match, a list of geocode options is returned, preferably sorted by score. The ‘closeness’ is derived from the edit distance between the input and the matched address. Edit distance is defined herein as the number of insertion/deletion/replacement operations needed to go from the input location to the possible matched location. In one embodiment, an option list, or ‘pick list’, may be provided based on an edit distance scoring system. The edit distance scoring system is preferably based on the number of keystrokes by which the input location name differs from the possible matched location name.
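The edit distance defined here (insertions, deletions and replacements) is the standard Levenshtein distance, and the pick list can be sketched in Python as below. The function names, the case-insensitive comparison, the `max_options` limit, and the alphabetical tie-breaking are assumptions made for illustration; the abstract does not specify them.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and replacements from a to b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, char_a in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, char_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                        # delete char_a
                           cur[j - 1] + 1,                     # insert char_b
                           prev[j - 1] + (char_a != char_b)))  # replace or match
        prev = cur
    return prev[-1]


def pick_list(user_input, candidates, max_options=5):
    """Return the exact match alone, else the closest candidates sorted by score."""
    query = user_input.strip().lower()
    scored = sorted((edit_distance(query, c.lower()), c) for c in candidates)
    if scored and scored[0][0] == 0:
        return [scored[0][1]]               # perfect match, no list needed
    return [candidate for _, candidate in scored[:max_options]]


places = ["Seattle", "Seatac", "Settle", "Portland"]
print(pick_list("Seatle", places, max_options=3))
```

A lower score means fewer keystrokes separate the input from the candidate, so the most plausible corrections appear first in the list.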
Abstract:
Generally, the present invention provides systems, methods and computer program products for detecting different content items with similar content by examining the anchor text of links. A method of the present invention comprises selecting one of a plurality of websites, crawling the selected website to identify one or more content items, and downloading one or more content items of the selected website. One or more linking relationships among the one or more content items of the selected website are then determined, and one or more linking rules are learned based upon association rule mining of the one or more content items. The one or more linking rules are then applied to one or more content items of one or more websites in order to determine, based upon the linking rules, how the content items are stored on a search provider's central server.