Abstract:
Techniques for correcting miscategorized features excerpted from web pages are provided. For each of several categories and several pages on a particular web site, a separate feature may be excerpted from that page and associated with that page in relation to that category. Often, many of the “high confidence” features that have been associated with the same category are found to be associated with similar characteristics regardless of the pages from which those features were excerpted. Thus, a set of category characteristics, which are often found associated with the “high confidence” features in a particular category, may be determined. For each page, a candidate feature that is associated with the set of category characteristics may be identified in that page. If, in relation to the particular category, a feature other than the candidate feature is associated with that page, then that other feature may be replaced by the candidate feature.
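A minimal sketch of the correction step described above, assuming each feature is a dictionary carrying a set of characteristics (for example, the enclosing tag and CSS class) and a confidence value. The data layout, the thresholds `min_confidence` and `min_share`, and all function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

def category_characteristics(pages, category, min_confidence=0.9, min_share=0.6):
    """Return characteristics shared by most high-confidence features of a category."""
    trusted = [
        page["assigned"][category]
        for page in pages
        if page["assigned"][category]["confidence"] >= min_confidence
    ]
    counts = Counter(c for feature in trusted for c in feature["characteristics"])
    return {c for c, n in counts.items() if n / len(trusted) >= min_share}

def correct_page(page, category, traits):
    """Swap in the first feature on the page that carries all category traits."""
    if not traits:
        return page
    for feature in page["features"]:
        if traits <= feature["characteristics"]:
            if feature is not page["assigned"][category]:
                page["assigned"][category] = feature  # replace the miscategorized feature
            break
    return page

def feature(text, traits, confidence):
    # Hypothetical feature record used only in this example.
    return {"text": text, "characteristics": set(traits), "confidence": confidence}

title_a = feature("Acme Widget", {"tag:h1", "class:title"}, 0.95)
title_b = feature("Bolt Pack", {"tag:h1", "class:title"}, 0.92)
crumbs_c = feature("Home > Tools", {"tag:div", "class:crumbs"}, 0.40)
title_c = feature("Claw Hammer", {"tag:h1", "class:title"}, 0.35)

pages = [
    {"features": [title_a], "assigned": {"title": title_a}},
    {"features": [title_b], "assigned": {"title": title_b}},
    {"features": [crumbs_c, title_c], "assigned": {"title": crumbs_c}},
]

traits = category_characteristics(pages, "title")
for page in pages:
    correct_page(page, "title", traits)
```

In this example the third page initially has a breadcrumb string assigned as its title; because the two high-confidence titles share the same tag and class, the candidate feature with those characteristics replaces it.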
Abstract:
Automated crawling of page links associated with a previously crawled site domain involves computing the dynamicity of the site based on totals of continuous dead links, live links, and/or prerequisite pages encountered while crawling page links corresponding to the site. The degree to which links are crawled is optimized based on the dynamicity of the site. Some pages require that another particular page (i.e., a prerequisite page) be retrieved from the host before a given page is retrieved, e.g., so that the prerequisite page can set a cookie. Prerequisite pages are determined based on stored information about pages that were retrieved, during a previous crawl, prior to retrieving a page. Prerequisite pages are identified to a search system so that when a user clicks on the URL for the page, the request is redirected to the prerequisite page to set the cookie appropriately.
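The abstract does not give a formula for dynamicity, so the following is only an illustrative sketch under assumed definitions: the score is taken to be the longest run of consecutive dead links divided by the number of links checked, and the crawl budget is assumed to grow linearly with that score. Both `dynamicity` and `crawl_budget`, and their parameters, are hypothetical.

```python
def dynamicity(link_statuses):
    """Hypothetical score in [0, 1]: longest run of dead links over links checked."""
    longest = run = 0
    for status in link_statuses:
        run = run + 1 if status == "dead" else 0
        longest = max(longest, run)
    return longest / len(link_statuses) if link_statuses else 0.0

def crawl_budget(score, max_links=1000, min_links=50):
    """Assumed policy: a more dynamic site is recrawled more deeply."""
    return int(min_links + score * (max_links - min_links))

# Outcome of re-checking links stored from a previous crawl (made-up data).
statuses = ["live", "dead", "dead", "dead", "live", "live", "dead", "live"]
score = dynamicity(statuses)
budget = crawl_budget(score)
```

Here the longest dead run is three links out of eight, which yields a score of 0.375 and a budget of 406 links under the assumed linear policy.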
Abstract:
A system and method for assessing excessive accessory listings in search results includes a processor-implemented textual mining module that parses a data field of a document and generates at least one token from the data field. A processor-implemented scoring module calculates a score for the at least one token, with the at least one token score representing a likelihood that the at least one token belongs to one of two binary classifications. The processor-implemented scoring module also calculates a score for the document based on the at least one token score, with the document score representing a probability of the document being in one of the two binary classifications. A processor-implemented decision tree module inputs the document score and document attribute values into a decision tree and generates an output representing a refined score based on the document score and at least one of the document attribute values.
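The three modules can be sketched as follows. This is a simplified illustration, not the claimed implementation: token scores are assumed to be smoothed log-odds estimated from labeled listing titles, the document score is assumed to be a logistic function of the mean token score, and `refine` is a hand-written two-level rule standing in for a trained decision tree. The `price` attribute and its threshold are hypothetical.

```python
import math

def token_scores(labeled_titles, smoothing=1.0):
    """Log-odds that a token appears in accessory titles rather than product titles."""
    accessory, product = {}, {}
    for title, is_accessory in labeled_titles:
        target = accessory if is_accessory else product
        for token in title.lower().split():
            target[token] = target.get(token, 0) + 1
    vocab = set(accessory) | set(product)
    return {
        t: math.log((accessory.get(t, 0) + smoothing) / (product.get(t, 0) + smoothing))
        for t in vocab
    }

def document_score(title, scores):
    """Probability-like score that the document is an accessory listing."""
    tokens = [t for t in title.lower().split() if t in scores]
    if not tokens:
        return 0.5  # no evidence either way
    mean = sum(scores[t] for t in tokens) / len(tokens)
    return 1.0 / (1.0 + math.exp(-mean))

def refine(doc_score, price, price_threshold=50.0):
    """Stand-in for a decision tree: expensive items are less likely accessories."""
    if doc_score >= 0.5 and price >= price_threshold:
        return doc_score * 0.5
    return doc_score

training = [
    ("phone case", True),
    ("phone charger cable", True),
    ("smart phone 64gb", False),
    ("phone 128gb unlocked", False),
]
scores = token_scores(training)
accessory_like = document_score("leather phone case", scores)
product_like = document_score("phone 64gb", scores)
```

A token such as "phone" appears equally often in both classes and so contributes a score of zero, while "case" pushes the document toward the accessory class.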
Abstract:
Techniques for identifying discrete records within a multi-record document are provided. According to one technique, a document is encoded based on some combination of visual tag encoding, text category encoding, and text content encoding that produces hash values based on the contents of portions of the document. According to one technique, repeating candidate patterns are identified in a document so encoded. The candidate patterns may be identified in a “fuzzy” manner that allows for some inconsistencies in the individual pattern instances. According to one technique, the identified candidate patterns are validated based on specified factors to determine a “best” pattern. According to one technique, the boundaries of discrete records in a multi-record document are marked based on the portions of the document that correspond to an identified repeating pattern.
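As a rough illustration of the pattern search, the sketch below assumes the document has already been encoded as a flat sequence of tag tokens (a simplified form of visual tag encoding). It uses exact matching rather than the "fuzzy" matching the abstract mentions, and it validates candidates by a single assumed factor, the number of tokens covered. The function names and the `max_len` limit are hypothetical.

```python
def find_occurrences(tokens, pattern):
    """Return non-overlapping start indices of pattern, scanning left to right."""
    starts, i, k = [], 0, len(pattern)
    while i + k <= len(tokens):
        if tuple(tokens[i:i + k]) == pattern:
            starts.append(i)
            i += k
        else:
            i += 1
    return starts

def best_repeating_pattern(tokens, max_len=6):
    """Pick the repeating pattern whose instances cover the most tokens."""
    best, best_cover = None, 0
    for k in range(1, max_len + 1):
        for start in range(len(tokens) - k + 1):
            pattern = tuple(tokens[start:start + k])
            occurrences = find_occurrences(tokens, pattern)
            cover = len(occurrences) * k
            if len(occurrences) >= 2 and cover > best_cover:
                best, best_cover = pattern, cover
    return best

# Encoded page: a heading, three product records, and a footer (made-up data).
encoded = ["h1", "div", "img", "a", "div", "img", "a", "div", "img", "a", "footer"]
pattern = best_repeating_pattern(encoded)
record_starts = find_occurrences(encoded, pattern)
```

The start indices in `record_starts` mark the record boundaries: each record spans from one start index up to the next.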
Abstract:
Various embodiments include systems and methods for search result ranking using machine learning. A goal model can be created using machine learning. Responsive to a search query, a plurality of data factors can be input into the goal model to create a model output. Search results can be presented to a user based on the model output.
Abstract:
A system and method for determining a ranking function for a search engine. A training data processor receives training data, the training data including at least a first page, a first label, a second page, and a second label. A feature extraction processor receives the first page, identifies first features in the first page, and calculates first values relating to the first features. The feature extraction processor receives the second page, identifies second features, and calculates second values relating to the second features. A machine learning processor receives the first features, the first values, the first label, the second features, the second values, and the second label. The machine learning processor generates a ranking function based on the first features, the first values, the first label, the second features, the second values, and the second label.
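The pipeline of training data, feature extraction, and learning can be sketched with a pairwise perceptron. The abstract does not name a learning algorithm or specific features, so both are assumptions here: the two features (word count and number of URL-like tokens) and the function names are hypothetical, and the perceptron is simply one small method that turns labeled page pairs into a linear ranking function.

```python
def extract_features(page_text):
    """Assumed features: word count and count of URL-like tokens."""
    words = page_text.lower().split()
    return [float(len(words)), float(sum(w.startswith("http") for w in words))]

def learn_ranking_function(training_data, epochs=10):
    """Learn linear weights so higher-labeled pages score above lower-labeled ones."""
    examples = [(extract_features(page), label) for page, label in training_data]
    weights = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for xi, yi in examples:
            for xj, yj in examples:
                if yi > yj:
                    diff = [a - b for a, b in zip(xi, xj)]
                    # Update only when the pair is currently ranked incorrectly.
                    if sum(w * d for w, d in zip(weights, diff)) <= 0:
                        weights = [w + d for w, d in zip(weights, diff)]
    return lambda page: sum(w * v for w, v in zip(weights, extract_features(page)))

first_page = "a detailed guide to crawling the web with examples"
second_page = "click http://spam.example http://ads.example now"
rank = learn_ranking_function([(first_page, 1), (second_page, 0)])
```

After training on the two labeled pages, `rank(first_page)` exceeds `rank(second_page)`, so sorting candidate pages by `rank` in descending order places pages resembling the first page higher.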
Abstract:
Unsupervised crawling of the hidden Web utilizes a query engine, coupled to a crawler system, that automatically and intelligently inserts keywords into text input controls in Web page forms so that the filled form can be submitted to a server to retrieve dynamically generated Web content for indexing. The keywords used to fill form controls are based on the content of corresponding Web pages, which is automatically discovered to generate a set of keywords for filling the controls. The set of keywords can be expanded to include related keywords from other Web pages and Web sites and, therefore, to provide more effective coverage for crawling the Web content. The expanded set of keywords can be continuously expanded by recursively performing similarity analyses based on results from crawling the same and other Web sites.
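A minimal sketch of the keyword-filling idea, assuming keywords are simply the most frequent non-stopword terms on the page that hosts the form and that the form is submitted with GET. The stopword list, the function names, and the `https://example.com/search` endpoint are hypothetical; the recursive keyword expansion described above is not shown.

```python
import re
from collections import Counter
from urllib.parse import urlencode

# Tiny illustrative stopword list; a real crawler would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "for", "to", "in", "on", "our", "by"}

def candidate_keywords(page_text, k=3):
    """Return the k most frequent content words found on the page."""
    words = re.findall(r"[a-z]+", page_text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(k)]

def form_submissions(action_url, field_name, keywords):
    """Build one GET request URL per keyword for a single text input control."""
    return [f"{action_url}?{urlencode({field_name: kw})}" for kw in keywords]

page_text = (
    "Search our catalog of used cars. Cars by make: sedan cars, "
    "trucks, and vans. Trucks for sale."
)
keywords = candidate_keywords(page_text, k=2)
urls = form_submissions("https://example.com/search", "q", keywords)
```

Each generated URL corresponds to one filled form; the pages returned for those requests could in turn be mined for further keywords.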