Abstract:
Among other disclosed subject matter, a computer-implemented method includes receiving a plurality of documents at a server and adding meta-data to each of the plurality of documents. The meta-data added to a particular document comprises at least one of task flow features of the particular document or data associated with an author of the particular document. The method also includes selecting a plurality of features for use in clustering the plurality of documents. The plurality of features includes a subset of the meta-data and a subset of content associated with one or more of the plurality of documents. The method also includes clustering the plurality of documents based on the plurality of features, including identifying a topic associated with each cluster, and preparing a report based on the clusters and metric information associated with each cluster. The method also includes displaying the report to a user.
Abstract:
A method and system of classifying documents is provided. The method includes receiving a stream of documents from at least one user, wherein each document includes a topic of information relating to a customer support issue or sentiment. The method includes classifying each of the received documents using a plurality of trained classifiers, the classification being based on a vote by the trained classifiers, with each document labeled according to a similar topic. A drift of the topic of one or more of the classifications is determined, wherein the drift is related to the received documents that include information relating to an unclassified customer support issue or sentiment. If the determined drift exceeds a predetermined threshold range, the plurality of classifiers is rebuilt to include a second set of classifiers trained to recognize the unclassified customer support issue or sentiment.
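The voting and drift check described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the claimed method: `keyword_classifier` is a hypothetical stand-in for a trained classifier, drift is assumed to be the share of recent documents that no classifier recognizes, and the 0.3 threshold is arbitrary.

```python
from collections import Counter

UNCLASSIFIED = "unclassified"

def keyword_classifier(keyword, label):
    """Hypothetical stand-in for a trained classifier: votes `label` on a keyword hit."""
    def classify(document):
        return label if keyword in document.lower() else UNCLASSIFIED
    return classify

def vote(classifiers, document):
    """Label a document by majority vote, ignoring classifiers that abstain."""
    votes = Counter(clf(document) for clf in classifiers)
    votes.pop(UNCLASSIFIED, None)
    return votes.most_common(1)[0][0] if votes else UNCLASSIFIED

def drift(labels):
    """Assumed drift measure: share of documents that no classifier recognized."""
    return labels.count(UNCLASSIFIED) / len(labels) if labels else 0.0

def needs_rebuild(labels, threshold=0.3):
    """True when the drift exceeds the (illustrative) threshold."""
    return drift(labels) > threshold

classifiers = [
    keyword_classifier("refund", "billing"),
    keyword_classifier("charge", "billing"),
    keyword_classifier("password", "login"),
]
stream = [
    "Refund this charge please",
    "I forgot my password",
    "The app crashes on startup",
    "App crashes when I open it",
]
labels = [vote(classifiers, doc) for doc in stream]
print(labels, drift(labels), needs_rebuild(labels))
```

In this toy stream, the two crash reports match no existing classifier, so they count toward drift and would trigger a rebuild with classifiers trained on the new issue.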
Abstract:
Methods and systems for constructing a taxonomy based on hierarchical clustering are provided. The taxonomy is generated by first constructing a hierarchy of clusters using a clustering algorithm. A first level of the hierarchy of clusters is generated by providing a plurality of content files to a clustering algorithm. Subsequent levels of the hierarchy are generated by providing the clusters of the preceding levels to the clustering algorithm. Labels that characterize each cluster within the hierarchy are assigned to corresponding clusters. Labels and clusters are combined to form the taxonomy.
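One way to picture the level-by-level construction is the sketch below. It assumes a simple greedy clustering over token sets with Jaccard similarity and labels each cluster with its most frequent token; the abstract does not name a clustering algorithm or a labeling rule, so `cluster`, `label`, and the thresholds are all illustrative choices.

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(items, threshold):
    """Greedy clustering: each item joins the first group that is similar enough."""
    groups = []
    for tokens, member in items:
        for group in groups:
            if jaccard(tokens, group["tokens"]) >= threshold:
                group["tokens"] |= tokens
                group["members"].append(member)
                break
        else:
            groups.append({"tokens": set(tokens), "members": [member]})
    return groups

def build_hierarchy(documents, thresholds):
    """Level 1 clusters the documents; each later level clusters the previous level's clusters."""
    items = [(set(doc.split()), doc) for doc in documents]
    levels = []
    for threshold in thresholds:
        groups = cluster(items, threshold)
        levels.append(groups)
        items = [(group["tokens"], group) for group in groups]
    return levels

def leaves(node):
    """Collect the documents under a cluster at any level."""
    if isinstance(node, str):
        return [node]
    return [doc for member in node["members"] for doc in leaves(member)]

def label(group):
    """Assumed labeling rule: most frequent token, ties broken alphabetically."""
    counts = Counter(tok for doc in leaves(group) for tok in set(doc.split()))
    return min(counts, key=lambda tok: (-counts[tok], tok))

documents = ["printer jam", "printer tray", "wifi drop", "wifi password"]
levels = build_hierarchy(documents, thresholds=[0.3, 0.0])
# Combine labels and clusters into a taxonomy, one list per level.
taxonomy = [[(label(g), leaves(g)) for g in level] for level in levels]
print(taxonomy)
```

The loosening thresholds make higher levels coarser, so the first level separates printer and wifi issues and the second level collapses them under one root.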
Abstract:
Among other disclosed subject matter, a computer-implemented method includes receiving a plurality of electronic documents associated with a domain at a server. Each of the plurality of electronic documents includes meta-data and textual content. The method includes identifying one or more text strings in the textual content that are to be processed differently than an identical or similar text string in other electronic documents, and associating, with the electronic document, data indicating that each of the identified text strings is to be processed differently than an identical or similar text string in other electronic documents. The method also includes performing an analysis of the electronic documents to identify one or more subsets of the electronic documents that include related subject matter. A plurality of degrees of relatedness can be associated with text strings associated with data indicating that each of the text strings is to be processed differently.
Abstract:
A method and system for classifying documents is provided. A set of document classifiers is generated by applying a classification algorithm to a trusted corpus that includes a set of training documents representing a taxonomy. One or more of the generated document classifiers are executed against a plurality of input documents to create a plurality of classified documents. Each classified document is associated with a classification within the taxonomy and a classification confidence level. One or more classified documents that are associated with a classification confidence level below a predetermined threshold value are selected to create a set of low-confidence documents. The low-confidence documents are disassociated from each of the associated classifications. A user is prompted to enter a classification within the taxonomy for at least one low-confidence document. The low-confidence document is associated with the entered classification and with a predetermined confidence level to create a newly classified document.
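The low-confidence review loop might look like the following sketch. The `ClassifiedDocument` type, the 0.7 threshold, and the confidence of 1.0 assigned to user-entered labels are assumptions made for illustration; the abstract only says the values are predetermined.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClassifiedDocument:
    text: str
    label: Optional[str]
    confidence: float

def select_low_confidence(documents, threshold):
    """Return documents whose classification confidence is below the threshold."""
    return [doc for doc in documents if doc.confidence < threshold]

def disassociate(documents):
    """Remove the machine-assigned classification from each document."""
    for doc in documents:
        doc.label = None

def apply_user_label(doc, label, taxonomy, confidence=1.0):
    """Attach a user-entered classification with a predetermined confidence."""
    if label not in taxonomy:
        raise ValueError(f"{label!r} is not in the taxonomy")
    doc.label = label
    doc.confidence = confidence

taxonomy = {"billing", "login", "shipping"}
classified = [
    ClassifiedDocument("Charged twice this month", "billing", 0.92),
    ClassifiedDocument("Package never arrived", "login", 0.41),
]
low = select_low_confidence(classified, threshold=0.7)
disassociate(low)
# In a real system the label would come from a user prompt; it is hard-coded here.
apply_user_label(low[0], "shipping", taxonomy)
print(classified)
```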
Abstract:
Among other disclosed subject matter, a computer-implemented method includes receiving one or more keywords and identifying a plurality of content items. The content items comprise network content that includes the one or more keywords. The method also includes clustering the plurality of content items and identifying a topic associated with each cluster. The method also includes determining a relative importance of a particular topic and analyzing clusters associated with the particular topic to determine opinion data associated with the particular topic. The method includes preparing a report based on the clusters, the relative importance, and the opinion data, and displaying the report to a user.
Abstract:
Methods and systems for use in partitioning documents having customer feedback and support content are provided. One exemplary computer-implemented method, which includes executing instructions stored on a computer-readable medium, includes: receiving a plurality of documents, at least a portion of the plurality of documents including customer feedback related to an issue and support content responsive to the customer feedback; filtering the plurality of documents to retain one of the customer feedback and the support content within a plurality of filtered documents; partitioning the plurality of filtered documents into multiple clusters; receiving a new document; and partitioning the new document based on at least one keyword included in one of the multiple clusters of filtered documents.
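A heavily simplified sketch of the filter, partition, and assign steps follows. It assumes each document marks customer lines with a hypothetical `CUSTOMER:` prefix and that each cluster is keyed by a single known keyword; the abstract does not specify either, and a real system would derive the clusters and their keywords from the data.

```python
CUSTOMER_PREFIX = "CUSTOMER:"  # hypothetical marker for customer feedback lines

def filter_feedback(document):
    """Retain only the customer feedback, dropping the support content."""
    kept = [
        line[len(CUSTOMER_PREFIX):].strip()
        for line in document.splitlines()
        if line.startswith(CUSTOMER_PREFIX)
    ]
    return " ".join(kept)

def partition(filtered_documents, keywords):
    """Place each filtered document in the cluster of the first keyword it contains."""
    clusters = {keyword: [] for keyword in keywords}
    for doc in filtered_documents:
        for keyword in keywords:
            if keyword in doc.lower():
                clusters[keyword].append(doc)
                break
    return clusters

def assign_new(document, clusters):
    """Return the cluster keyword matching a new document, or None if nothing matches."""
    text = document.lower()
    for keyword in clusters:
        if keyword in text:
            return keyword
    return None

documents = [
    "CUSTOMER: My invoice is wrong\nAGENT: We have corrected it",
    "CUSTOMER: Login fails every time\nAGENT: Please reset your password",
]
filtered = [filter_feedback(doc) for doc in documents]
clusters = partition(filtered, ["invoice", "login"])
print(clusters, assign_new("The invoice total looks off", clusters))
```

Filtering before partitioning keeps agent vocabulary (such as "password" in the second ticket) from influencing which cluster a document lands in.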
Abstract:
A computer-implemented method executes instructions stored on a computer-readable medium. The method includes accessing a hierarchy of clusters, wherein each cluster includes at least one content file, and a label is associated with each cluster. The method further includes calculating a topic purity score for each cluster, and selecting a first cluster and a second cluster from the hierarchy of clusters, wherein the topic purity score of each of the first cluster and the second cluster is less than a purity threshold. The method also includes creating a third cluster by combining the content files included within the first cluster and the second cluster, determining a parent category of the first cluster and the second cluster, wherein the parent category is at a level within the hierarchy higher than a level of the first cluster and the second cluster, and associating a label of the parent category with the third cluster.
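The merge step can be illustrated with the sketch below. The abstract does not define the topic purity score, so this example assumes it is the fraction of a cluster's files that carry the cluster's dominant topic, and it assumes the two selected clusters share the same parent. The dictionary layout and file names are hypothetical.

```python
from collections import Counter

def topic_purity(topics):
    """Assumed purity score: fraction of files that carry the dominant topic."""
    counts = Counter(topics)
    return counts.most_common(1)[0][1] / len(topics)

def merge_impure(clusters, threshold):
    """Merge the first two same-parent clusters whose purity is below the threshold.

    Returns the new (third) cluster labeled with the parent category, or None.
    """
    impure = [
        c for c in clusters
        if topic_purity([topic for _, topic in c["files"]]) < threshold
    ]
    for i, first in enumerate(impure):
        for second in impure[i + 1:]:
            if first["parent"] == second["parent"]:
                return {
                    "label": first["parent"],
                    "files": first["files"] + second["files"],
                }
    return None

clusters = [
    {"label": "Printers", "parent": "Hardware",
     "files": [("a.txt", "printer"), ("b.txt", "scanner")]},
    {"label": "Scanners", "parent": "Hardware",
     "files": [("c.txt", "scanner"), ("d.txt", "printer"), ("e.txt", "fax")]},
    {"label": "Routers", "parent": "Hardware",
     "files": [("f.txt", "router"), ("g.txt", "router")]},
]
merged = merge_impure(clusters, threshold=0.6)
print(merged)
```

Here "Printers" and "Scanners" both fall below the threshold, so their files are pooled under the broader "Hardware" label, while the pure "Routers" cluster is left alone.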
Abstract:
A computer-implemented method includes receiving, by one or more computer systems, first information from a first channel and second information from a second channel; merging the first information with the second information; applying an unsupervised clustering model to the merged information; and generating, based on results of the applying, a cross-channel cluster, the cross-channel cluster including (i) a portion of the first information associated with a subject matter, and (ii) a portion of the second information associated with the subject matter.
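As a rough sketch of the merge-then-cluster flow, the example below tags each record with its channel, clusters the merged records with a simple greedy Jaccard grouping, and keeps only clusters that contain records from more than one channel. The channel names "chat" and "email", the clustering rule, and the 0.3 threshold are assumptions; the abstract only requires an unsupervised clustering model.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def merge_channels(first, second):
    """Merge two channels into one list of (channel, text) records."""
    return [("chat", text) for text in first] + [("email", text) for text in second]

def cluster(records, threshold=0.3):
    """Greedy unsupervised clustering over the merged records."""
    groups = []
    for channel, text in records:
        tokens = set(text.lower().split())
        for group in groups:
            if jaccard(tokens, group["tokens"]) >= threshold:
                group["tokens"] |= tokens
                group["records"].append((channel, text))
                break
        else:
            groups.append({"tokens": tokens, "records": [(channel, text)]})
    return groups

def cross_channel(groups):
    """Keep clusters that include records from more than one channel."""
    return [g for g in groups if len({channel for channel, _ in g["records"]}) > 1]

chat = ["checkout page error", "love the new design"]
email = ["error on checkout page", "shipping was slow"]
groups = cluster(merge_channels(chat, email))
shared = cross_channel(groups)
print([g["records"] for g in shared])
```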
Abstract:
Among other disclosed subject matter, a computer-implemented method includes receiving a set of clusters of documents and calculating a similarity score for each cluster, wherein the similarity score is based at least in part on features included in the documents in the cluster and indicates a measure of similarity of the documents in the cluster. Each cluster associated with a respective similarity score greater than a first threshold is identified as satisfying a quality assurance requirement. Each cluster associated with a respective similarity score less than a second threshold is identified as failing the quality assurance requirement. For each cluster associated with a similarity score less than or equal to the first threshold and greater than or equal to the second threshold, at least a subset of documents in the cluster is reviewed to determine whether the cluster satisfies the quality assurance requirement.
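The two-threshold triage can be sketched as below. The similarity score is assumed here to be the mean pairwise Jaccard similarity of the documents' feature sets, and the threshold values 0.6 and 0.3 are illustrative; the abstract fixes neither.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_score(feature_sets):
    """Assumed score: mean pairwise Jaccard similarity within a cluster."""
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        return 1.0  # a single-document cluster is trivially self-similar
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def triage(score, first_threshold=0.6, second_threshold=0.3):
    """Map a similarity score to a quality assurance outcome."""
    if score > first_threshold:
        return "pass"
    if score < second_threshold:
        return "fail"
    return "review"  # between the thresholds, inclusive: needs manual review

tight = [{"refund", "card"}, {"refund", "card"}]
loose = [{"refund"}, {"password"}]
print(triage(similarity_score(tight)), triage(similarity_score(loose)))
```

Scores exactly equal to either threshold fall into the review band, matching the inclusive bounds in the abstract.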