Abstract:
An electronic document analysis method receiving N electronic documents pertaining to a case encompassing a set of issues including at least one issue, and establishing relevance of at least the N documents to at least one individual issue in the set of issues, the method comprising, for at least one individual issue from among the set of issues: receiving an output of a categorization process applied to each document in training and control subsets of the at least N documents, the output including, for each document in the subsets, one of a relevant-to-the-individual-issue indication and a non-relevant-to-the-individual-issue indication; building a text classifier simulating the categorization process using the output for all documents in the training subset of documents; and running the text classifier on the at least N documents, thereby obtaining a ranking of the extent of relevance of each of the at least N documents to the individual issue. The method may also comprise evaluating the text classifier's quality using the output for all documents in the control subset.
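The train, rank, and evaluate flow described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name a classifier type, so a small naive Bayes model is swapped in here, and all function names, the whitespace tokenizer, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a sketch.
    return text.lower().split()


def train_classifier(training_docs):
    """Build a scorer from (text, is_relevant) pairs of the training subset."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, is_relevant in training_docs:
        doc_counts[is_relevant] += 1
        word_counts[is_relevant].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    totals = {c: sum(word_counts[c].values()) for c in (True, False)}
    n_docs = sum(doc_counts.values())

    def score(text):
        # Log-odds of relevant vs. non-relevant, with add-one smoothing.
        value = math.log((doc_counts[True] + 1) / (n_docs + 2))
        value -= math.log((doc_counts[False] + 1) / (n_docs + 2))
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            p_rel = (word_counts[True][word] + 1) / (totals[True] + len(vocab))
            p_non = (word_counts[False][word] + 1) / (totals[False] + len(vocab))
            value += math.log(p_rel / p_non)
        return value

    return score


def rank_documents(score, documents):
    # Most relevant first.
    return sorted(documents, key=score, reverse=True)


def control_accuracy(score, control_docs):
    # Fraction of control documents where the classifier matches the reviewer.
    hits = sum((score(text) > 0) == label for text, label in control_docs)
    return hits / len(control_docs)


# Hypothetical categorization output for one issue.
training = [
    ("merger agreement draft price", True),
    ("merger negotiation board price", True),
    ("lunch menu friday", False),
    ("parking lot friday notice", False),
]
control = [("board merger price", True), ("friday lunch notice", False)]

score = train_classifier(training)
ranking = rank_documents(score, [text for text, _ in training + control])
accuracy = control_accuracy(score, control)
```

Keeping the control subset out of training is what makes `accuracy` a usable quality estimate; any other quality measure (precision, recall) could be computed from the same held-out labels.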
Abstract:
A system configured to find near-duplicate documents. For each two (or more) documents that are similar to each other, the system is configured to identify which of the differences are likely to have been introduced by Optical Character Recognition (OCR) software and which are due to differences between the original documents. As a result, the process of identifying similarity between documents is improved, either by identifying documents that were originally exact duplicates and differ from one another only because of OCR errors, or by correcting the similarity level between the documents to account for errors introduced by the OCR tool.
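One possible way to separate OCR-like differences from genuine ones is sketched below. The abstract does not say how the system decides, so this is an assumption: tokens are aligned with Python's `difflib.SequenceMatcher`, and a difference is treated as OCR-like if it disappears once a small, hypothetical table of commonly confused glyphs is normalized. The table, function names, and sample texts are illustrative only.

```python
import difflib

# Hypothetical confusion table: (misread form, canonical form).
OCR_CONFUSIONS = [("rn", "m"), ("vv", "w"), ("1", "l"), ("I", "l"), ("0", "O"), ("5", "S")]


def canonicalize(token):
    # Collapse glyphs that OCR tools commonly confuse into one canonical form.
    for misread, canonical in OCR_CONFUSIONS:
        token = token.replace(misread, canonical)
    return token


def classify_differences(doc_a, doc_b):
    """Split token-level differences into OCR-like and genuine ones."""
    a, b = doc_a.split(), doc_b.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    ocr_like, genuine = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        left, right = a[i1:i2], b[j1:j2]
        if [canonicalize(t) for t in left] == [canonicalize(t) for t in right]:
            ocr_like.append((left, right))
        else:
            genuine.append((left, right))
    return ocr_like, genuine


def corrected_similarity(doc_a, doc_b):
    # Similarity after discounting differences explained by the confusion table.
    a = [canonicalize(t) for t in doc_a.split()]
    b = [canonicalize(t) for t in doc_b.split()]
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


original = "The modern contract was signed on 10 May"
scanned = "The modem contract was signed on lO May"   # OCR misreads only
revised = "The modern contract was signed on 12 May"  # a real change of date

ocr_like, genuine = classify_differences(original, scanned)
_, genuine_revised = classify_differences(original, revised)
similarity_scanned = corrected_similarity(original, scanned)
```

In this sketch `original` and `scanned` come out as exact duplicates after correction, while the changed date in `revised` is still reported as a genuine difference. A normalization table this coarse will also hide some real edits (for example "modern" versus "modem"), so a practical system would need a more careful model of OCR errors.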
Abstract:
A computerized system for enhancing expert-based processes, the system comprising: a computerized expert-based data analyzer receiving input from a plurality of experts by operating a corresponding plurality of expert-based processes on a body of data, the input including a discrepancy set including at least one point of discrepancy regarding which fewer than all of the plurality of experts agree and an agreement set including at least one point of agreement regarding which all of the plurality of experts agree; and an oracle from which oracle input is received resolving at least the point of discrepancy and not resolving any point of agreement in the agreement set; wherein the computerized analyzer is operative to select, and subsequently to actuate for purposes of receiving input regarding the body of data, a subset of better experts from among the plurality of experts based on the oracle input.
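The selection logic in this abstract can be illustrated with a short sketch. It assumes, beyond what the abstract states, that each expert labels the same items, that the oracle is a callable returning the correct label for one item, and that "better" means agreeing with the oracle on more discrepancy points. All names and sample data are hypothetical.

```python
def split_by_agreement(expert_labels):
    """expert_labels maps expert name -> {item: label}; all experts label the same items."""
    items = list(next(iter(expert_labels.values())))
    agreement, discrepancy = [], []
    for item in items:
        votes = {labels[item] for labels in expert_labels.values()}
        (agreement if len(votes) == 1 else discrepancy).append(item)
    return agreement, discrepancy


def select_better_experts(expert_labels, oracle, keep):
    """Consult the oracle only on discrepancies and keep the best-scoring experts."""
    _, discrepancy = split_by_agreement(expert_labels)
    rulings = {item: oracle(item) for item in discrepancy}
    scores = {
        expert: sum(labels[item] == rulings[item] for item in discrepancy)
        for expert, labels in expert_labels.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep], scores


# Hypothetical input: three experts label four documents as relevant (R) or not (N).
expert_labels = {
    "A": {"d1": "R", "d2": "N", "d3": "R", "d4": "N"},
    "B": {"d1": "R", "d2": "R", "d3": "R", "d4": "N"},
    "C": {"d1": "R", "d2": "R", "d3": "N", "d4": "N"},
}
truth = {"d1": "R", "d2": "N", "d3": "R", "d4": "N"}
asked = []  # records which items the oracle was actually consulted on


def oracle(item):
    asked.append(item)
    return truth[item]


agreement, discrepancy = split_by_agreement(expert_labels)
better, scores = select_better_experts(expert_labels, oracle, keep=2)
```

Here the oracle is consulted only on `d2` and `d3`, the two points the experts disagree on, which keeps oracle effort proportional to the size of the discrepancy set rather than the whole body of data.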