Abstract:
Disclosed herein is a technique for providing an interface that allows a user to navigate backwards through linked webpages. Initially, a request is received to display inlinks of linking webpages that contain a link to a particular webpage. In response to the request, a new page is provided that contains a set of inlinks corresponding to a set of linking webpages that each contain a link to the particular webpage. Each of the inlinks may be associated with a particular clickable item. An indication of a selection of a clickable item associated with a particular inlink is received. In response, a second new page is provided that contains a second set of inlinks corresponding to a second set of linking webpages that each contain a link to the webpage that corresponds to the particular inlink. Some of the displayed inlinks may correspond to webpages that redirect to the particular webpage.
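The backward navigation described above can be illustrated with a small sketch. It assumes the forward link graph and the redirects are already available as in-memory dictionaries; the names `build_inlink_index` and `inlink_page` are hypothetical and do not come from the abstract, which does not say how inlinks are stored or rendered.

```python
from collections import defaultdict


def build_inlink_index(outlinks, redirects=None):
    """Invert a 'page -> pages it links to' mapping into 'page -> pages linking to it'.

    Both `outlinks` and `redirects` are assumed inputs (hypothetical shapes):
    outlinks maps a URL to an iterable of target URLs, redirects maps a URL
    to the single URL it redirects to.
    """
    index = defaultdict(set)
    for source, targets in outlinks.items():
        for target in targets:
            index[target].add(source)
    # Pages that redirect to a target are listed among its inlinks as well.
    for source, target in (redirects or {}).items():
        index[target].add(source)
    return index


def inlink_page(index, url):
    """Return the inlinks to show on the 'new page' for `url`, in sorted order."""
    return sorted(index.get(url, ()))


outlinks = {"a": ["b"], "c": ["b"], "d": ["a"]}
redirects = {"old-b": "b"}
index = build_inlink_index(outlinks, redirects)

first_page = inlink_page(index, "b")             # inlinks of the starting page
second_page = inlink_page(index, first_page[0])  # user selects the first inlink
```

Selecting an entry on one inlink page simply calls `inlink_page` again with the selected URL, which is what lets the user keep stepping backwards through the link graph.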
Abstract:
Techniques are disclosed for detecting web pages with duplicate content. In one embodiment, a set of shingles is computed for each page of a group of pages. An aggregate set of shingles is determined based on the sets of shingles computed for the group of pages. A first subset from the aggregate set of shingles is determined by selecting, from the aggregate set, shingles whose frequencies in the aggregate set exceed a specified threshold. A modified set of shingles is generated for each page of the group of pages by removing, from the set of shingles for that page, any shingle included in the first subset. One or more duplicate pages in the group of pages are determined based at least in part on the modified sets of shingles generated for the group of pages.
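The steps above can be sketched as follows. This is a minimal illustration under several assumptions that the abstract does not state: shingles are word 3-grams, a shingle's frequency is the number of pages in the group that contain it, and duplicates are decided by Jaccard similarity of the modified shingle sets. All function names and default values are hypothetical.

```python
from collections import Counter
from itertools import combinations


def shingles(text, k=3):
    """Return the set of word k-grams in `text` (assumed shingle definition)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def remove_common_shingles(page_shingles, threshold):
    """Drop shingles that occur in more than `threshold` pages of the group."""
    counts = Counter(s for sh in page_shingles.values() for s in sh)
    common = {s for s, n in counts.items() if n > threshold}
    return {page: sh - common for page, sh in page_shingles.items()}


def find_duplicates(pages, k=3, threshold=2, min_similarity=0.8):
    """Return page pairs whose modified shingle sets are similar enough."""
    modified = remove_common_shingles(
        {name: shingles(text, k) for name, text in pages.items()}, threshold
    )
    duplicates = []
    for a, b in combinations(sorted(modified), 2):
        union = modified[a] | modified[b]
        if not union:
            continue  # assumption: pages left with no shingles are not compared
        if len(modified[a] & modified[b]) / len(union) >= min_similarity:
            duplicates.append((a, b))
    return duplicates


pages = {
    "p1": "site menu home about the quick brown fox jumps",
    "p2": "site menu home about the quick brown fox jumps",
    "p3": "site menu home about an entirely different article body",
}
result = find_duplicates(pages, k=3, threshold=2)
```

In this toy group the navigation text "site menu home about" appears on every page, so its shingles exceed the threshold and are removed before comparison. Only the article bodies then contribute to the similarity score.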
Abstract:
Disclosed herein is the use, by a search engine, of a preview of content from a target document, as provided by a content preview source such as a Really Simple Syndication (RSS) feed. The content preview source includes the preview of the target document's content and a reference to the target document, e.g., a Uniform Resource Locator (URL) or other link. A content preview document is generated using data extracted from the content preview source. The content preview document is made available in a searchable index used by a search engine to respond to a search query. A fetch operation is scheduled to fetch the target document using the reference provided in the content preview source. Once the target document is fetched, the data extracted from the content preview source can be associated with the target document and can be used in presenting the target document in search results.
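A rough sketch of this flow is shown below. It assumes an RSS 2.0 feed without XML namespaces, a plain dictionary standing in for the searchable index, and a queue standing in for the fetch scheduler. The names `ingest_feed`, `search`, and `attach_fetched` are hypothetical, and the substring match is only a placeholder for a real search engine's query handling.

```python
import xml.etree.ElementTree as ET
from collections import deque


def preview_documents(rss_xml):
    """Build one content preview document per <item> in the feed."""
    root = ET.fromstring(rss_xml)
    return [
        {
            "url": item.findtext("link", default="").strip(),
            "title": item.findtext("title", default="").strip(),
            "preview": item.findtext("description", default="").strip(),
        }
        for item in root.iter("item")
    ]


def ingest_feed(rss_xml, index, fetch_queue):
    """Index each preview document and schedule a fetch of its target."""
    for doc in preview_documents(rss_xml):
        if not doc["url"]:
            continue  # no reference to the target document, nothing to fetch
        index[doc["url"]] = doc
        fetch_queue.append(doc["url"])


def search(index, query):
    """Placeholder search: case-insensitive substring match on title and preview."""
    q = query.lower()
    return [d["url"] for d in index.values()
            if q in (d["title"] + " " + d["preview"]).lower()]


def attach_fetched(index, url, body):
    """Associate the fetched target document with its earlier preview data."""
    index.setdefault(url, {"url": url})["body"] = body


feed = """<rss version="2.0"><channel>
<item><title>Crawler update</title>
<link>http://example.com/post</link>
<description>Notes on feed-driven indexing.</description></item>
</channel></rss>"""

index, fetch_queue = {}, deque()
ingest_feed(feed, index, fetch_queue)
hits_before_fetch = search(index, "indexing")
attach_fetched(index, fetch_queue.popleft(), "<html>full article</html>")
```

The point of the sketch is the ordering: the preview document becomes searchable as soon as the feed is read, while the fetch of the target document happens later and enriches the same index entry.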
Abstract:
An approach is provided for identifying transient links on a Web page by crawling the page twice in succession, separated by a brief interval, and comparing the links from each crawl. The approach ensures that transient links are not crawled and archived, thereby saving resources for crawling valid links that lead to useful information.
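One way to picture the comparison is the sketch below. The abstract does not specify the comparison rule, so this example assumes that a link present in only one of the two crawls is transient and a link present in both is stable. The `fetch` callable, the function names, and the default interval are hypothetical stand-ins for a real crawler.

```python
import time
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def split_links(fetch, url, interval_seconds=1.0):
    """Crawl `url` twice and return (stable_links, transient_links).

    `fetch` is an assumed callable that takes a URL and returns the page HTML.
    """
    first = extract_links(fetch(url))
    time.sleep(interval_seconds)
    second = extract_links(fetch(url))
    # Assumption: links seen in only one crawl are transient.
    return first & second, first ^ second


# Simulated crawls: the session-id link changes between the two requests.
snapshots = iter([
    '<a href="/about">About</a><a href="/s?sid=111">Session</a>',
    '<a href="/about">About</a><a href="/s?sid=222">Session</a>',
])
stable, transient = split_links(lambda url: next(snapshots),
                                "http://example.com/", interval_seconds=0)
```

A crawler following this idea would enqueue only the `stable` set for crawling and archiving, and skip everything in `transient`.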