Abstract:
Various embodiments include systems and methods for search result ranking using machine learning. A goal model can be created using machine learning. Responsive to a search query, a plurality of data factors can be inputted into the goal model to create a model output. Search results can be presented to a user based on the model output.
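The flow in this abstract (data factors go into a goal model, and the model output orders the results) can be illustrated with a minimal sketch. It is not the claimed method. It assumes the goal model has already been trained and stands in for it with hand-picked logistic weights. The factor names (`text_match`, `seller_rating`, `price_competitiveness`), the weights, the bias, and the `Listing` type are all hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Listing:
    title: str
    # Hypothetical data factors computed for this listing and the query.
    factors: dict = field(default_factory=dict)


# Assumption: these weights stand in for a goal model learned offline.
GOAL_MODEL_WEIGHTS = {
    "text_match": 2.0,
    "seller_rating": 1.0,
    "price_competitiveness": 0.5,
}
GOAL_MODEL_BIAS = -1.0


def goal_model_output(factors):
    """Map a listing's data factors to a score in (0, 1)."""
    z = GOAL_MODEL_BIAS + sum(
        weight * factors.get(name, 0.0)
        for name, weight in GOAL_MODEL_WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def rank_results(listings):
    """Order search results by descending model output."""
    return sorted(
        listings,
        key=lambda listing: goal_model_output(listing.factors),
        reverse=True,
    )
```

Calling `rank_results` on the candidate listings for a query returns them in the order they would be presented to the user under these assumed weights.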
Abstract:
A method and a system generate a reputation value for a user in a network-based community. A processor-implemented transaction data collector module collects transaction data of users of a network-based community. A processor-implemented transaction graph generator module generates a transaction graph based on the collected transaction data. The transaction graph has a transaction relationship between two users, and a weight corresponding to the transaction relationship. The weight is representative of a mutually reinforcing relationship between two users. A processor-implemented reputation generator module generates a reputation value for a user from the transaction graph.
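The abstract does not say how reputation is derived from the weighted transaction graph. As one possible reading of a "mutually reinforcing" relationship, the sketch below substitutes a weighted PageRank-style power iteration. The edge format `(buyer, seller, weight)`, the `damping` value, and the function name are assumptions made for illustration, and edge weights are assumed to be positive.

```python
def reputation_values(edges, damping=0.85, iterations=50):
    """Compute a reputation value per user from weighted transaction edges.

    edges: list of (buyer, seller, weight) tuples with weight > 0.
    Reputation flows from buyer to seller in proportion to the edge weight.
    """
    users = sorted({user for buyer, seller, _ in edges for user in (buyer, seller)})
    n = len(users)

    # Total outgoing weight per user, used to normalize each user's edges.
    out_weight = {user: 0.0 for user in users}
    for buyer, _, weight in edges:
        out_weight[buyer] += weight

    reputation = {user: 1.0 / n for user in users}
    for _ in range(iterations):
        updated = {user: (1.0 - damping) / n for user in users}
        for buyer, seller, weight in edges:
            updated[seller] += damping * reputation[buyer] * weight / out_weight[buyer]
        # Users with no outgoing edges spread their reputation uniformly.
        dangling = sum(reputation[u] for u in users if out_weight[u] == 0.0)
        for user in users:
            updated[user] += damping * dangling / n
        reputation = updated
    return reputation
```

In this sketch a user gains reputation by transacting with users who themselves have high reputation, and the values always sum to 1.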
Abstract:
A system and method for assessing excessive accessory listings in search results includes a processor-implemented textual mining module that parses a data field of a document and generates at least one token from the data field. A processor-implemented scoring module calculates a score for the at least one token, with the at least one token score representing a likelihood that the at least one token belongs to one of two binary classifications. The processor-implemented scoring module also calculates a score for the document based on the at least one token score, with the document score representing a probability of the document being in one of the two binary classifications. A processor-implemented decision tree module inputs the document score and document attribute values into a decision tree and generates an output representing a refined score based on the document score and at least one of the document attribute values.
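The three stages named in this abstract (token scores, a document score, and a decision tree that refines it) can be sketched as follows. This is a rough illustration under stated assumptions. Token scores are smoothed log-odds estimated from two small labeled title sets, the document score is a logistic function of their sum, and the "decision tree" is a hand-written two-level rule on price rather than a learned tree. The thresholds and the `price` and `category_median_price` attributes are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(title):
    """Parse a title field into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def token_scores(accessory_titles, product_titles):
    """Log-odds that a token belongs to the 'accessory' class (Laplace smoothed)."""
    acc = Counter(t for title in accessory_titles for t in tokenize(title))
    prod = Counter(t for title in product_titles for t in tokenize(title))
    vocab = set(acc) | set(prod)
    acc_total, prod_total, size = sum(acc.values()), sum(prod.values()), len(vocab)
    return {
        t: math.log((acc[t] + 1) / (acc_total + size))
        - math.log((prod[t] + 1) / (prod_total + size))
        for t in vocab
    }


def document_score(title, scores):
    """Probability-like score that the document is an accessory listing."""
    z = sum(scores.get(t, 0.0) for t in tokenize(title))  # unseen tokens add 0
    return 1.0 / (1.0 + math.exp(-z))


def refined_score(doc_score, price, category_median_price):
    """Hand-written stand-in for a decision tree over score and attributes."""
    if doc_score >= 0.5:
        if price < 0.25 * category_median_price:
            return min(1.0, doc_score + 0.2)  # cheap and accessory-like
        return doc_score
    if price < 0.10 * category_median_price:
        return max(doc_score, 0.5)  # very cheap: treat as borderline
    return doc_score
```

A production system would presumably learn the tree from labeled listings. The fixed thresholds here only show how a document attribute can adjust the text-based score.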
Abstract:
Techniques for identifying discrete records within a multi-record document are provided. According to one technique, a document is encoded based on some combination of visual tag encoding, text category encoding, and text content encoding that produces hash values based on the contents of portions of the document. According to one technique, repeating candidate patterns are identified in a document so encoded. The candidate patterns may be identified in a “fuzzy” manner that allows for some inconsistencies in the individual pattern instances. According to one technique, the identified candidate patterns are validated based on specified factors to determine a “best” pattern. According to one technique, the boundaries of discrete records in a multi-record document are marked based on the portions of the document that correspond to an identified repeating pattern.
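The encode, find-pattern, and mark-boundaries steps can be illustrated with a simplified sketch. It is not the technique as claimed. Each block is encoded as a readable `tag:category` symbol in place of a hash value, patterns are matched exactly rather than in a "fuzzy" manner, and the only validation factor is coverage (occurrences times pattern length). All function names and the block format `(tag, text)` are assumptions.

```python
def text_category(text):
    """Coarse text category: numeric-looking content versus other text."""
    stripped = text.strip().lstrip("$")
    return "NUM" if stripped.replace(".", "", 1).isdigit() else "TEXT"


def encode(blocks):
    """Encode (tag, text) blocks as symbols combining tag and text category."""
    return [f"{tag}:{text_category(text)}" for tag, text in blocks]


def occurrences(symbols, pattern):
    """Start indexes of non-overlapping exact matches, scanning left to right."""
    k, starts, i = len(pattern), [], 0
    while i + k <= len(symbols):
        if tuple(symbols[i:i + k]) == pattern:
            starts.append(i)
            i += k
        else:
            i += 1
    return starts


def best_repeating_pattern(symbols, max_len=6):
    """Return (pattern, starts) for the repeating pattern with most coverage.

    Shorter patterns win ties because lengths are tried in ascending order
    and only a strictly larger coverage replaces the current best.
    """
    best_coverage, best_pattern, best_starts = 0, None, []
    for k in range(1, min(max_len, len(symbols) // 2) + 1):
        candidates = {tuple(symbols[i:i + k]) for i in range(len(symbols) - k + 1)}
        for pattern in sorted(candidates):
            starts = occurrences(symbols, pattern)
            coverage = len(starts) * k
            if len(starts) >= 2 and coverage > best_coverage:
                best_coverage, best_pattern, best_starts = coverage, pattern, starts
    return best_pattern, best_starts
```

The returned start indexes mark where each discrete record begins in the encoded document. A fuzzy variant would replace the exact tuple comparison in `occurrences` with a similarity threshold.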
Abstract:
Techniques for correcting miscategorized features excerpted from web pages are provided. For each of several categories and several pages on a particular web site, a separate feature may be excerpted from that page and associated with that page in relation to that category. Often, many of the “high confidence” features that have been associated with the same category are found to be associated with similar characteristics regardless of the pages from which those features were excerpted. Thus, a set of category characteristics, which are often found associated with the “high confidence” features in a particular category, may be determined. For each page, a candidate feature that is associated with the set of category characteristics may be identified in that page. If, in relation to the particular category, a feature other than the candidate feature is associated with that page, then that other feature may be replaced by the candidate feature.
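The correction step can be sketched for a single category. The sketch assumes that the only "category characteristic" is the element's position on the page, given as a path string, and that each page carries one assigned feature with a confidence value. The page structure, the `min_confidence` threshold, and the function names are hypothetical.

```python
from collections import Counter


def typical_path(pages, min_confidence=0.9):
    """Most common path among the high-confidence features of the category."""
    counts = Counter(
        page["assigned"][0]
        for page in pages
        if page["assigned"][2] >= min_confidence
    )
    return counts.most_common(1)[0][0] if counts else None


def correct_category(pages, min_confidence=0.9):
    """Replace assigned features that disagree with the typical path.

    Each page is a dict with:
      "elements": list of (path, text) pairs found on the page
      "assigned": (path, text, confidence) currently tied to the category
    """
    path = typical_path(pages, min_confidence)
    if path is None:
        return pages
    for page in pages:
        # Candidate feature: the page element that has the typical path.
        candidate = next(((p, t) for p, t in page["elements"] if p == path), None)
        if candidate is not None and page["assigned"][:2] != candidate:
            page["assigned"] = (candidate[0], candidate[1], page["assigned"][2])
    return pages
```

Pages whose assigned feature already matches the candidate are left unchanged, so the correction touches only the outliers.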
Abstract:
A system and method for determining a ranking function for a search engine. A training data processor receives training data, the training data including at least a first page, a first label, a second page and a second label. A feature extraction processor receives the first page, identifies first features in the first page and calculates first values relating to the first features. The feature extraction processor receives the second page and identifies second features and calculates second values relating to the second features. A machine learning processor receives the first features, the first values, the first label, the second features, the second values, and the second label. The machine learning processor generates a ranking function based on the first features, the first values, the first label, the second features, the second values, and the second label.
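The pipeline in this abstract (labeled pages, feature extraction, a learned ranking function) can be illustrated with a small sketch. It assumes two hand-chosen features per page (query-term density and an all-terms-present flag) and fits a linear ranking function by stochastic gradient descent on squared error. The abstract does not name a learning algorithm, so that choice, the features, and the function names are illustrative only.

```python
def extract_features(page_text, query):
    """Identify features in a page and calculate their values for a query."""
    words = page_text.lower().split()
    terms = query.lower().split()
    matches = sum(words.count(term) for term in terms)
    density = matches / max(len(words), 1)
    all_present = 1.0 if all(term in words for term in terms) else 0.0
    return [density, all_present]


def learn_ranking_function(examples, learning_rate=0.1, epochs=200):
    """Fit a linear ranking function to (feature_values, label) examples."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for values, label in examples:
            predicted = bias + sum(w * v for w, v in zip(weights, values))
            error = predicted - label
            # Gradient step on squared error for this example.
            weights = [w - learning_rate * error * v for w, v in zip(weights, values)]
            bias -= learning_rate * error

    def ranking_function(values):
        return bias + sum(w * v for w, v in zip(weights, values))

    return ranking_function
```

Given a first page labeled relevant and a second page labeled not relevant, the learned function assigns the first page a higher score, and it can then be applied to the features of unseen pages.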
Abstract:
Unsupervised crawling of the hidden Web utilizes a query engine, coupled to a crawler system, that automatically and intelligently inserts keywords into text input controls in Web page forms so that the filled form can be submitted to a server to retrieve dynamically generated Web content for indexing. The keywords used to fill form controls are based on the content of corresponding Web pages, which is automatically discovered to generate a set of keywords for filling the controls. The set of keywords can be expanded to include related keywords from other Web pages and Web sites and, therefore, to provide more effective coverage for crawling the Web content. The expanded set of keywords can be continuously expanded by recursively performing similarity analyses based on results from crawling the same and other Web sites.
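The keyword discovery and expansion loop can be sketched without any network access. The sketch assumes that keywords are simply the most frequent non-stopword terms on a page, that the form uses a GET request with a single text field, and that expansion means adding keywords found in result pages. The stopword list, the function names, and the `example.com` URL in the usage note are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlencode

# Assumption: a tiny stopword list, for illustration only.
STOPWORDS = {"and", "the", "our", "for", "with", "from", "that", "this"}


def extract_keywords(page_text, limit=5):
    """Pick the most frequent non-stopword terms from a page's text."""
    words = re.findall(r"[a-z]{3,}", page_text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return [word for word, _ in counts.most_common(limit)]


def form_query_urls(action_url, field_name, keywords):
    """Build one GET submission URL per keyword for a text input control."""
    return [f"{action_url}?{urlencode({field_name: kw})}" for kw in keywords]


def expand_keywords(keywords, result_pages, limit=5):
    """Add keywords discovered in result pages, keeping first-seen order."""
    expanded = list(keywords)
    for text in result_pages:
        for keyword in extract_keywords(text, limit):
            if keyword not in expanded:
                expanded.append(keyword)
    return expanded
```

For example, `form_query_urls("https://example.com/search", "q", extract_keywords(page_text))` yields the URLs a crawler would fetch. Feeding the fetched pages to `expand_keywords` and repeating gives the recursive expansion the abstract describes.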