摘要:
A method and system for classifying display pages based on automatically generated summaries of display pages. A web page classification system uses a web page summarization system to generate summaries of web pages. The summary of a web page may include the sentences of the web page that are most closely related to the primary topic of the web page. The summarization system may combine the benefits of multiple summarization techniques to identify the sentences of a web page that represent the primary topic of the web page. Once the summary is generated, the classification system may apply conventional classification techniques to the summary to classify the web page. The classification system may use conventional classification techniques such as a Naïve Bayesian classifier or a support vector machine to identify the classifications of a web page based on the summary generated by the summarization system.
摘要:
A method and system for calculating the significance of a sentence within a document is provided. The summarization system calculates the significance of the sentences of a document and selects the most significant sentences as the summary of the document. The summarization system calculates the significance of a sentence based on the “important” words of the document that are contained within the sentence. The summarization system calculates the importance of words of the document using various scoring techniques and then combines the scores to classify a word as important or not important. The summarization system can then be used to identify significant sentences of the document based on the important words that a sentence contains and select significant sentences as a summary of the document.
摘要:
A method and system for calculating the significance of a sentence within a document is provided. The summarization system calculates the significance of the sentences of a document and selects the most significant sentences as the summary of the document. The summarization system calculates the significance of a sentence based on the “important” words of the document that are contained within the sentence. The summarization system calculates the importance of words of the document using various scoring techniques and then combines the scores to classify a word as important or not important. The summarization system can then be used to identify significant sentences of the document based on the important words that a sentence contains and select significant sentences as a summary of the document.
摘要:
A method and system for classifying display pages based on automatically generated summaries of display pages. A web page classification system uses a web page summarization system to generate summaries of web pages. The summary of a web page may include the sentences of the web page that are most closely related to the primary topic of the web page. The summarization system may combine the benefits of multiple summarization techniques to identify the sentences of a web page that represent the primary topic of the web page. Once the summary is generated, the classification system may apply conventional classification techniques to the summary to classify the web page. The classification system may use conventional classification techniques such as a Naïve Bayesian classifier or a support vector machine to identify the classifications of a web page based on the summary generated by the summarization system.
摘要:
A method and system for classifying display pages based on automatically generated summaries of display pages. A web page classification system uses a web page summarization system to generate summaries of web pages. The summary of a web page may include the sentences of the web page that are most closely related to the primary topic of the web page. The summarization system may combine the benefits of multiple summarization techniques to identify the sentences of a web page that represent the primary topic of the web page. Once the summary is generated, the classification system may apply conventional classification techniques to the summary to classify the web page. The classification system may use conventional classification techniques such as a Naïve Bayesian classifier or a support vector machine to identify the classifications of a web page based on the summary generated by the summarization system.
摘要:
A method and system for adapting search results of a query to the information needs of the user submitting the query is provided. A search system analyzes click-through triplets indicating that a user submitted a query and that the user selected a document from the results of the query. To overcome the large size and sparseness of the click-through data, the search system when presented with an input triplet comprising a user, a query, and a document determines a probability that the user will find the input document important by smoothing the click-through triplets. The search system then orders documents of the result based on the probability of their importance to the input user.
摘要:
A method and system for adapting search results of a query to the information needs of the user submitting the query is provided. A search system analyzes click-through triplets indicating that a user submitted a query and that the user selected a document from the results of the query. To overcome the large size and sparseness of the click-through data, the search system when presented with an input triplet comprising a user, a query, and a document determines a probability that the user will find the input document important by smoothing the click-through triplets. The search system then orders documents of the result based on the probability of their importance to the input user.
摘要:
A method and system for detecting whether an outgoing communication contains confidential information or other target information is provided. The detection system is provided with a collection of documents that contain confidential information, referred to as “confidential documents.” When the detection system is provided with an outgoing communication, it compares the content of the outgoing communication to the content of the confidential documents. If the outgoing communication contains confidential information, then the detection system may prevent the outgoing communication from being sent outside the organization. The detection system detects confidential information based on the similarity between the content of an outgoing communication and the content of confidential documents that are known to contain confidential information.
摘要:
A method and system for classifying messages of a discussion thread as questions is provided. A classification system generates a classifier to classify messages of discussion threads as question messages or non-question messages. The system trains the classifier using the feature vectors and input classifications derived from a training set of discussion threads. After the classifier is trained, the classification system uses the classifier to classify messages within a corpus of discussion threads as question or non-question messages. To classify a message, the classification system generates a feature vector for the messages and submits that feature vector to the classifier. The classifier generates a score for the message indicating a likelihood that the message is a question message.
摘要:
Systems and methods for clustering-based text classification are described. In one aspect text is clustered as a function of labeled data to generate cluster(s). The text includes the labeled data and unlabeled data. Expanded labeled data is then generated as a function of the cluster(s). The expanded label data includes the labeled data and at least a portion of unlabeled data. Discriminative classifier(s) are then trained based on the expanded labeled data and remaining ones of the unlabeled data.