摘要:
Methods and apparatus are described for measuring the topical coherence of a keyword set while simultaneously partitioning the set into contextually related clusters.
摘要:
Methods, systems, and computer programs for categorizing the sensitivity of web pages are presented. In one method, a space of sensitive pages is identified based on the sensitivity categorization of a first plurality of web pages and a second plurality of web pages. The first plurality of web pages is obtained by performing search queries using known sensitive words, and the second plurality of web pages includes randomly selected web pages. Additionally, the method identifies a third plurality of web pages that includes web pages on or near the boundary between the space of sensitive pages and the space of non-sensitive pages. The space of sensitive pages is then redefined based on the sensitivity categorization of the first, second, and third pluralities of web pages. Once the space of sensitive pages is defined, the method is used to determine that a given web page is sensitive when the given web page is in the space of sensitive pages. Web pages are included in a marketing operation when the web pages are not sensitive.
摘要:
Methods and apparatus are described for measuring the topical coherence of a keyword set while simultaneously partitioning the set into contextually related clusters.
摘要:
A system for training classifiers in multiple categories through an active learning system, including a computer having a memory and a processor, the processor programmed to: train an initial set of m binary one-versus-all classifiers, one for each category in a taxonomy, on a labeled dataset of examples stored in a database coupled with the computer; uniformly sample up to a predetermined large number of examples from a second, larger dataset of unlabeled examples stored in a database coupled with the computer; order the sampled unlabeled examples in order of informativeness for each classifier; determine a minimum subset of the unlabeled examples that are most informative for a maximum number of the classifiers to form an active set for learning; and use editorially-labeled versions of the examples of the active set to re-train the classifiers, thereby improving the accuracy of at least some of the classifiers.