摘要:
Methods, systems, and computer programs for categorizing the sensitivity of web pages are presented. In one method, a space of sensitive pages is identified based on the sensitivity categorization of a first plurality of web pages and a second plurality of web pages. The first plurality of web pages is obtained by performing search queries using known sensitive words, and the second plurality of web pages includes randomly selected web pages. Additionally, the method identifies a third plurality of web pages that includes web pages on or near the boundary between the space of sensitive pages and the space of non-sensitive pages. The space of sensitive pages is then redefined based on the sensitivity categorization of the first, second, and third pluralities of web pages. Once the space of sensitive pages is defined, the method is used to determine that a given web page is sensitive when the given web page is in the space of sensitive pages. Web pages are included in a marketing operation when the web pages are not sensitive.
摘要:
Methods and apparatus are described for measuring the topical coherence of a keyword set while simultaneously partitioning the set into contextually related clusters.
摘要:
A method and apparatus is described for assigning functional labels to segments of web pages in an application-independent way. In the approach described herein, one of a generic set functional labels are automatically assigned to each segment of a web page, where the generic functional labels may be topic-independent and application-independent. Applications with different needs can determine which segments of the web page to process based on which functional labels correspond to the types of information needed by each application. Thus, the work of classifying the function of each segment of a web page is separated from the work of selecting which segments satisfy the need of a particular application. The work of classification can be performed in an application-independent way, relieving the burden from every application developer from having to create their own classifiers.
摘要:
Methods, systems, and computer programs for categorizing the sensitivity of web pages are presented. In one method, a space of sensitive pages is identified based on the sensitivity categorization of a first plurality of web pages and a second plurality of web pages. The first plurality of web pages is obtained by performing search queries using known sensitive words, and the second plurality of web pages includes randomly selected web pages. Additionally, the method identifies a third plurality of web pages that includes web pages on or near the boundary between the space of sensitive pages and the space of non-sensitive pages. The space of sensitive pages is then redefined based on the sensitivity categorization of the first, second, and third pluralities of web pages. Once the space of sensitive pages is defined, the method is used to determine that a given web page is sensitive when the given web page is in the space of sensitive pages. Web pages are included in a marketing operation when the web pages are not sensitive.
摘要:
Methods and apparatus are described for measuring the topical coherence of a keyword set while simultaneously partitioning the set into contextually related clusters.
摘要:
A method and apparatus is described for assigning functional labels to segments of web pages in an application-independent way. In the approach described herein, one of a generic set functional labels are automatically assigned to each segment of a web page, where the generic functional labels may be topic-independent and application-independent. Applications with different needs can determine which segments of the web page to process based on which functional labels correspond to the types of information needed by each application. Thus, the work of classifying the function of each segment of a web page is separated from the work of selecting which segments satisfy the need of a particular application. The work of classification can be performed in an application-independent way, relieving the burden from every application developer from having to create their own classifiers.
摘要:
A method of classifying documents includes: specifying multiple documents and classes, wherein each document includes a plurality of features and each document corresponds to one of the classes; determining reduced document vectors for the classes from the documents, wherein the reduced document vectors include features that satisfy threshold conditions corresponding to the classes; determining reduced weight vectors for relating the documents to the classes by comparing combinations of the reduced weight vectors and the reduced document vectors and separating the corresponding classes; and saving one or more values for the reduced weight vectors and the classes. Specific embodiments are directed to formulations for determining the reduced weight vectors including one-versus-rest classifiers, maximum entropy classifiers, and direct multiclass Support Vector Machines.
摘要:
Methods and apparatus are described for measuring the topical coherence of a keyword set while simultaneously partitioning the set into contextually related clusters.