Abstract:
This disclosure describes systems and methods for identifying and correcting anomalies in web graphs. A web graph is transformed into a sequence of tokens via a walk algorithm. The sequence is fingerprinted to form a set of shingles. The shingles are compared to shingles for other web graphs in order to determine similarity between web graphs. Actions are then carried out to remove anomalous web graphs and modify parameters governing web mapping in order to decrease the likelihood of future anomalous web graphs being built.
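The walk, fingerprint, and compare steps in the abstract above can be illustrated with a minimal sketch. The abstract does not name the walk algorithm, the fingerprint function, or the similarity measure, so everything here is an assumption: a breadth-first walk, SHA-1 fingerprints over windows of `k` tokens, and Jaccard similarity, with hypothetical function names `walk_tokens`, `shingles`, and `jaccard`.

```python
import hashlib
from collections import deque


def walk_tokens(graph, start):
    """Breadth-first walk that emits one token per visited node.

    Assumption: the graph is a dict mapping a node name (str) to its neighbors.
    """
    seen = {start}
    queue = deque([start])
    tokens = []
    while queue:
        node = queue.popleft()
        tokens.append(node)
        # Sort neighbors so the walk order is deterministic.
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return tokens


def shingles(tokens, k=3):
    """Fingerprint every window of k consecutive tokens into a set of shingles."""
    if len(tokens) < k:
        windows = [tuple(tokens)]
    else:
        windows = [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    return {
        hashlib.sha1(" ".join(w).encode("utf-8")).hexdigest()[:16]
        for w in windows
    }


def jaccard(a, b):
    """Share of shingles common to both sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


graph_a = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
graph_b = {"a": ["b", "c"], "b": ["e"], "c": [], "e": []}
similarity = jaccard(
    shingles(walk_tokens(graph_a, "a")),
    shingles(walk_tokens(graph_b, "a")),
)
```

A web graph whose similarity to earlier graphs of the same site falls below some chosen cutoff could then be flagged as anomalous; the cutoff itself is not specified in the abstract.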
Abstract:
As provided herein, objects from a source catalog, such as a provider's catalog, can be added to a target catalog, such as an enterprise master catalog, in a scalable manner utilizing catalog taxonomies. A baseline classifier determines the probabilities of source objects belonging to target catalog classes. Source objects can be assigned to those classes whose probabilities meet a desired threshold and a desired rate. A classification cost for target classes can be determined for respective unassigned source objects, which can comprise determining an assignment cost and a separation cost for the source objects for respective desired target classes. The separation and assignment costs can be combined to determine the classification cost, and the unassigned source objects can be assigned to those classes having a desired classification cost.
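The two-stage assignment described above can be sketched as follows. The abstract gives no formulas, so the cost definitions here are assumptions made only for illustration: the assignment cost is taken as one minus the baseline probability, the separation cost as the share of already-assigned objects from the same source taxonomy class that landed in a different target class, and the two are combined with a hypothetical `weight`. The "desired rate" condition is not modeled.

```python
def assign_objects(probs, source_class, threshold=0.8, weight=0.5):
    """Assign source objects to target classes in two stages.

    probs:        {object: {target_class: baseline probability}}
    source_class: {object: class of the object in the source taxonomy}
    """
    assigned = {}

    # Stage 1: accept the baseline classifier where it is confident enough.
    for obj, dist in probs.items():
        best = max(dist, key=dist.get)
        if dist[best] >= threshold:
            assigned[obj] = best

    # Stage 2: for the rest, pick the target class with the lowest combined cost.
    confident = dict(assigned)
    for obj, dist in probs.items():
        if obj in confident:
            continue
        # Target classes of confident objects that share this source class.
        siblings = [
            target for other, target in confident.items()
            if source_class[other] == source_class[obj]
        ]

        def cost(target):
            assignment = 1.0 - dist[target]
            if siblings:
                separation = sum(t != target for t in siblings) / len(siblings)
            else:
                separation = 0.0
            return (1.0 - weight) * assignment + weight * separation

        assigned[obj] = min(dist, key=cost)
    return assigned


probs = {
    "tv1": {"Electronics": 0.9, "Home": 0.1},
    "tv2": {"Electronics": 0.45, "Home": 0.55},
    "lamp": {"Electronics": 0.2, "Home": 0.8},
}
source_class = {"tv1": "Televisions", "tv2": "Televisions", "lamp": "Lighting"}
result = assign_objects(probs, source_class)
```

In this example the baseline classifier slightly prefers "Home" for `tv2`, but the separation cost pulls it toward "Electronics", the class of its sibling `tv1` in the source taxonomy.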
Abstract:
Techniques are provided for identifying topics that are unassociated with a dominant URL. A set of keywords associated with a topic is identified. A search log is scanned to identify search queries associated with the set of keywords. The identified search queries are grouped into clusters. Clusters associated with similar URLs are merged to generate an extended seed query string. The extended seed query string is analyzed to determine whether it relates to an existing dominant URL. If the extended seed query string is determined to be unassociated with an existing dominant URL, a web page associated with the topic may be generated.
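The pipeline above (keyword match, clustering, merging by URL similarity, dominant-URL check) can be sketched as follows. The abstract does not define the log format, the clustering method, or what makes a URL dominant, so this sketch assumes a log of `(query, clicked_url)` pairs, one initial cluster per distinct query, greedy single-pass merging when clicked-URL sets overlap enough, and a click-share cutoff. All function names and thresholds are hypothetical.

```python
from collections import Counter, defaultdict


def matching_queries(log, keywords):
    """Keep log entries whose query contains at least one topic keyword."""
    wanted = {k.lower() for k in keywords}
    return [(q, url) for q, url in log if wanted & set(q.lower().split())]


def cluster_by_query(entries):
    """Start with one cluster per distinct query, counting its clicked URLs."""
    clusters = defaultdict(Counter)
    for query, url in entries:
        clusters[query.lower()][url] += 1
    return [({query}, clicks) for query, clicks in clusters.items()]


def merge_similar(clusters, min_overlap=0.5):
    """Greedily merge clusters whose clicked-URL sets overlap (Jaccard)."""
    merged = []
    for queries, clicks in clusters:
        for m_queries, m_clicks in merged:
            a, b = set(clicks), set(m_clicks)
            if len(a & b) / len(a | b) >= min_overlap:
                m_queries |= queries      # in-place set union
                m_clicks.update(clicks)   # adds click counts
                break
        else:
            merged.append((set(queries), Counter(clicks)))
    return merged


def extended_seed(queries):
    """Join the distinct terms of a merged cluster into one query string."""
    return " ".join(sorted({term for q in queries for term in q.split()}))


def dominant_url(clicks, min_share=0.6):
    """Return the URL holding at least min_share of clicks, else None."""
    url, count = clicks.most_common(1)[0]
    return url if count / sum(clicks.values()) >= min_share else None


log = [
    ("python tutorial", "docs.example/py"),
    ("python tutorial", "docs.example/py"),
    ("learn python", "docs.example/py"),
    ("python snake care", "pets.example/a"),
    ("python snake care", "zoo.example/b"),
    ("weather today", "weather.example"),
]
merged = merge_similar(cluster_by_query(matching_queries(log, ["python"])))
```

A merged cluster for which `dominant_url` returns `None` corresponds to a topic with no dominant URL, which is the case where the abstract says a web page for the topic may be generated.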
Abstract:
This disclosure describes systems and methods for identifying and correcting anomalies in web graphs. A web graph is transformed into a set of weighted features. The set of weighted features is then transformed into a signature via a SimHash algorithm. The signature is compared to the signature of one or more other web graphs in order to determine similarity between web graphs. Actions are then carried out to remove anomalous web graphs and modify parameters governing web mapping in order to decrease the likelihood of future anomalous web graphs being built.
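The signature step above can be illustrated with a minimal SimHash sketch. The abstract does not say which features are extracted from a web graph or how signatures are compared, so the feature names below are hypothetical, the per-feature hash is assumed to be the first 64 bits of SHA-1, and similarity is assumed to be judged by Hamming distance between signatures.

```python
import hashlib


def simhash(weighted_features, bits=64):
    """Collapse {feature: weight} into a single bits-wide integer signature."""
    totals = [0.0] * bits
    for feature, weight in weighted_features.items():
        digest = hashlib.sha1(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each bit of the feature hash votes +weight or -weight.
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    # Signature bit i is set when the weighted vote for it is positive.
    return sum(1 << i for i in range(bits) if totals[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


# Hypothetical features for two crawls of the same site.
crawl_1 = {"host:example.com": 5.0, "outdegree:12": 1.0, "depth:3": 1.0}
crawl_2 = {"host:example.com": 5.0, "outdegree:13": 1.0, "depth:3": 1.0}
distance = hamming(simhash(crawl_1), simhash(crawl_2))
```

Graphs with mostly shared, heavily weighted features tend to produce signatures that differ in few bits, so a large Hamming distance from earlier signatures could be treated as a sign of an anomalous web graph.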