Abstract:
A computer product including a data structure for organizing a plurality of documents, capable of being utilized by a processor for manipulating data of the data structure and capable of displaying selected data on a display unit. The data structure includes a plurality of directionally interlinked nodes, each node being associated with one or more documents having a header and body text. All documents associated with a given node have identical normalized body text. All documents that have identical normalized body text are associated with the same node. One or more of the nodes is associated with more than one document. For any node that is a descendant of another node, the normalized body text of each document associated with the node is inclusive of the normalized body text of a document that is associated with the other node.
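The node structure described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the `normalize` rule (lowercasing and collapsing whitespace), the class names, and the choice of the longest contained text as the parent are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Document:
    header: str
    body: str


@dataclass
class Node:
    text: str                                   # normalized body text shared by all documents
    documents: list = field(default_factory=list)
    parent: Optional["Node"] = None             # directional link to an ancestor node


def normalize(body: str) -> str:
    # Hypothetical normalization: lowercase and collapse whitespace.
    return " ".join(body.lower().split())


def build_nodes(docs):
    # Documents with identical normalized body text share one node.
    nodes = {}
    for doc in docs:
        key = normalize(doc.body)
        nodes.setdefault(key, Node(key)).documents.append(doc)
    # A node becomes a descendant of the node whose text it includes;
    # the longest included text is taken as the direct parent (an assumption).
    for node in nodes.values():
        contained = [n for n in nodes.values() if n is not node and n.text in node.text]
        if contained:
            node.parent = max(contained, key=lambda n: len(n.text))
    return nodes


docs = [
    Document("From: a", "Lunch on Friday?"),
    Document("From: a (copy)", "lunch  on friday?"),
    Document("From: b", "Sure. Lunch on Friday?"),
]
nodes = build_nodes(docs)
root = nodes["lunch on friday?"]
reply = nodes["sure. lunch on friday?"]
```

Here the two copies of the first message land on one node, and the reply, whose text includes the original, becomes its descendant.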
Abstract:
A method for computerized batching of huge populations of electronic documents, including computerized assignment of electronic documents into at least one sequence of electronic document batches such that each document is assigned to a batch in the sequence of batches and such that there is no conflict between batching requirements, the following batching requirements being maintained by a suitably programmed processor: a. pre-defined subsets of documents are always kept together in the same batch, b. batches are equal in size, c. the population is partitioned into clusters, and all documents in any given batch belong to a single cluster rather than to two or more clusters.
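As a rough illustration of how the three batching requirements interact, the following Python sketch packs keep-together groups into batches cluster by cluster. It is only a greedy sketch under stated assumptions: the function name `make_batches` and its input shape are hypothetical, and the "equal in size" requirement is relaxed to a maximum batch size, since exact equality is not always achievable once groups must stay together.

```python
from collections import defaultdict


def make_batches(groups, batch_size):
    """groups: list of (cluster_id, [doc_ids]); each list is a keep-together subset."""
    by_cluster = defaultdict(list)
    for cluster, members in groups:
        if len(members) > batch_size:
            raise ValueError("a keep-together subset exceeds the batch size")
        by_cluster[cluster].append(list(members))

    batches = []
    for cluster, cluster_groups in by_cluster.items():
        current = []
        # Place larger subsets first; start a new batch when the next one will not fit.
        for members in sorted(cluster_groups, key=len, reverse=True):
            if len(current) + len(members) > batch_size:
                batches.append((cluster, current))
                current = []
            current = current + members
        if current:
            batches.append((cluster, current))
    return batches


groups = [("A", [1, 2, 3]), ("A", [4]), ("A", [5, 6]), ("B", [7, 8])]
batches = make_batches(groups, batch_size=4)
```

Each resulting batch draws from a single cluster, and no keep-together subset is split across batches.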
Abstract:
An electronic document analysis method receiving N electronic documents pertaining to a case encompassing a set of issues including at least one issue and establishing relevance of at least the N documents to at least one individual issue in the set of issues, the method comprising, for at least one individual issue from among the set of issues, receiving an output of a categorization process applied to each document in training and control subsets of the at least N documents, the output including, for each document in the subsets, one of a relevant-to-the-individual issue indication and a non-relevant-to-the-individual issue indication; building a text classifier simulating the categorization process using the output for all documents in the training subset of documents; and running the text classifier on the at least N documents thereby to obtain a ranking of the extent of relevance of each of the at least N documents to the individual issue. The method may also comprise evaluating the text classifier's quality using the output for all documents in the control subset.
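The train, rank, and evaluate flow described above can be sketched in a few lines of Python. This is a toy stand-in, not the classifier the abstract has in mind: the smoothed log-ratio word weights, the zero decision threshold, and all sample texts are assumptions chosen only to keep the example self-contained.

```python
import math
from collections import Counter


def train(labeled):
    # labeled: list of (text, relevant) pairs from the training subset.
    rel, non = Counter(), Counter()
    for text, relevant in labeled:
        (rel if relevant else non).update(text.lower().split())
    vocab = set(rel) | set(non)
    # Smoothed log-ratio weight per word: positive favors "relevant".
    return {w: math.log((rel[w] + 1) / (non[w] + 1)) for w in vocab}


def score(weights, text):
    return sum(weights.get(w, 0.0) for w in text.lower().split())


training = [
    ("merger price negotiation", True),
    ("merger agreement draft", True),
    ("lunch menu friday", False),
    ("office party friday", False),
]
control = [("merger draft price", True), ("friday lunch party", False)]
population = ["friday office menu", "price of the merger", "quarterly report"]

weights = train(training)

# Rank the whole population by estimated relevance to the issue.
ranking = sorted(population, key=lambda t: score(weights, t), reverse=True)

# Evaluate quality on the control subset, which was not used for training.
control_accuracy = sum(
    (score(weights, text) > 0) == label for text, label in control
) / len(control)
```

Keeping the control subset out of training is what makes `control_accuracy` a usable quality estimate.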
Abstract:
System and method for computerized identification of themes in a large data set, the method comprising reducing the number of data set members in the large data set, using at least one computerized data set member pruning technique other than random selection; and using a computerized theme identification technique for identifying a plurality of themes in the reduced data set.
Abstract:
An information governance system comprising a plurality of classifiers which employ cutoffs for classifying at least a portion of a population of incoming documents as documents to be retained and documents to be discarded in accordance with a corresponding plurality of pre-defined retention schedules; training apparatus for training said classifiers based on relevance inputs provided by a human information governance expert regarding a training set of documents within a universe of documents to be governed; and apparatus operative to automatically cause any classified document to be retained and subsequently discarded in accordance with its pre-defined retention schedule including discarding only documents that (a) have been classified as documents to be discarded and (b) have not been classified as documents to be retained, and to automatically cause any document which could not be classified, to be retained as gray area data until further notice.
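The retain, discard, and gray-area logic can be pictured with a short Python sketch. It assumes each classifier emits a score and uses two cutoffs per retention schedule; the schedule names, cutoff values, and function names below are hypothetical and not taken from the source.

```python
def classify(score, discard_below, retain_above):
    # One classifier with two cutoffs; scores in between stay unclassified.
    if score >= retain_above:
        return "retain"
    if score <= discard_below:
        return "discard"
    return None


def disposition(scores, cutoffs):
    labels = [classify(scores[name], *cutoffs[name]) for name in cutoffs]
    if "retain" in labels:
        return "retain"      # retained under its schedule
    if "discard" in labels:
        return "discard"     # classified to discard and never classified to retain
    return "gray"            # could not be classified: keep until further notice


# Hypothetical schedules with (discard_below, retain_above) cutoffs.
cutoffs = {"tax": (0.2, 0.8), "hr": (0.3, 0.7)}

results = [
    disposition({"tax": 0.9, "hr": 0.1}, cutoffs),
    disposition({"tax": 0.1, "hr": 0.1}, cutoffs),
    disposition({"tax": 0.5, "hr": 0.5}, cutoffs),
]
```

Checking for a retain label before a discard label is what ensures a document is discarded only when no classifier has marked it for retention.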
Abstract:
An electronic document analysis method receiving N electronic documents pertaining to a case encompassing a set of issues including at least one issue and establishing relevance of at least the N documents to at least one individual issue in the set of issues, the method comprising, for at least one individual issue from among the set of issues, receiving an output of a categorization process applied to each document in training and control subsets of the at least N documents, the output including, for each document in the subsets, one of a relevant-to-the-individual issue indication and a non-relevant-to-the-individual issue indication; building a text classifier simulating the categorization process using the output for all documents in the training subset of documents; and running the text classifier on the at least N documents thereby to obtain a ranking of the extent of relevance of each of the at least N documents to the individual issue. The method may also comprise evaluating the text classifier's quality using the output for all documents in the control subset.
Abstract:
An electronic document analysis method receiving N electronic documents pertaining to a case encompassing a set of issues including at least one issue and establishing relevance of at least the N documents to at least one individual issue in the set of issues, the method comprising, for at least one individual issue from among the set of issues, receiving an output of a categorization process applied to each document in training and control subsets of the at least N documents, the output including, for each document in the subsets, one of a relevant-to-the-individual issue indication and a non-relevant-to-the-individual issue indication; building a text classifier simulating the categorization process using the output for all documents in the training subset of documents; and running the text classifier on the at least N documents thereby to obtain a ranking of the extent of relevance of each of the at least N documents to the individual issue. The method may also comprise evaluating the text classifier's quality using the output for all documents in the control subset.
Abstract:
System and method for computerized identification and presentation of semantic themes occurring in a set of electronic documents, comprising performing topic modeling on the set of documents thereby to yield a set of topics and for each topic, a topic-modeling output list of words; and using a processor performing a matching algorithm to match only a subset of each topic-modeling output list of words, to the output list's corresponding topic, such that each word appears in no more than a predetermined number of subsets from among said subsets.
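One simple way to picture the word-to-topic matching constraint is a greedy pass over weighted candidates, sketched below in Python. The abstract does not specify the matching algorithm, so this greedy rule is a swapped-in stand-in; the function name, the weights, and the sample topics are all assumptions.

```python
def pick_topic_words(topic_words, words_per_topic, max_topics_per_word):
    # topic_words: {topic: [(word, weight), ...]} as produced by some topic model.
    candidates = sorted(
        (
            (weight, topic, word)
            for topic, pairs in topic_words.items()
            for word, weight in pairs
        ),
        key=lambda c: -c[0],
    )
    chosen = {topic: [] for topic in topic_words}
    uses = {}
    # Strongest word/topic pairs first; skip a word once it has reached its cap.
    for weight, topic, word in candidates:
        if len(chosen[topic]) < words_per_topic and uses.get(word, 0) < max_topics_per_word:
            chosen[topic].append(word)
            uses[word] = uses.get(word, 0) + 1
    return chosen


topic_words = {
    "t1": [("price", 0.9), ("market", 0.5), ("stock", 0.4)],
    "t2": [("price", 0.8), ("contract", 0.6), ("market", 0.3), ("lease", 0.2)],
}
labels = pick_topic_words(topic_words, words_per_topic=2, max_topics_per_word=1)
```

With a cap of one topic per word, "price" is shown only for the topic where it weighs most, so the two topics end up with disjoint, more distinctive word lists.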