Abstract:
A system, method, data processing apparatus, and article of manufacture are provided for classifying data. The approach includes: receiving labeled data points, each having at least one label indicating whether the data point is a training example for data points to be included in a designated category or a training example for data points to be excluded from a designated category; receiving unlabeled data points; receiving at least one predetermined cost factor of the labeled data points and unlabeled data points; training a transductive classifier using Maximum Entropy Discrimination (MED) through iterative calculation using the at least one cost factor and the labeled data points and the unlabeled data points as training examples; applying the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points; and outputting a classification of the classified data points, or a derivative thereof.
Abstract:
A method is provided for organizing data sets. In use, an automatic decision system is created or updated for determining whether data elements fit a predefined organization or not, where the decision system is based on a set of preorganized data elements. A plurality of data elements is organized using the decision system. At least one organized data element is selected for output to a user based on a score or confidence from the decision system for the at least one organized data element. Additionally, at least a portion of the at least one organized data element is output to the user. A response is received from the user comprising at least one of a confirmation, modification, and a negation of the organization of the at least one organized data element. The automatic decision system is recreated or updated based on the user response. Other embodiments are also presented.
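The feedback loop described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented method: data elements are reduced to a single numeric feature, the "decision system" is a simple midpoint rule between class means, and the element shown to the user is the one with the lowest confidence (the abstract only says selection is based on a score or confidence). The names `build_decision_system`, `least_confident`, `feedback_round`, and `ask_user` are hypothetical.

```python
from statistics import mean

def build_decision_system(labeled):
    """labeled: list of (value, fits) pairs. Returns a scoring function.

    Assumption: a positive score means "fits the organization" and the
    magnitude of the score is treated as a confidence.
    """
    pos = [v for v, fits in labeled if fits]
    neg = [v for v, fits in labeled if not fits]
    midpoint = (mean(pos) + mean(neg)) / 2
    sign = 1.0 if mean(pos) >= mean(neg) else -1.0
    return lambda v: sign * (v - midpoint)

def least_confident(score, elements):
    # Assumed selection policy: show the user the element closest to the boundary.
    return min(elements, key=lambda v: abs(score(v)))

def feedback_round(labeled, elements, ask_user):
    """One round: organize, select, ask the user, rebuild the decision system."""
    score = build_decision_system(labeled)
    candidate = least_confident(score, elements)
    proposed = score(candidate) > 0
    confirmed = ask_user(candidate, proposed)  # True = confirmation, False = negation
    final = proposed if confirmed else not proposed
    updated = labeled + [(candidate, final)]
    return build_decision_system(updated), updated

# Usage: the user negates the proposed organization of the borderline element.
new_score, updated = feedback_round(
    [(1.0, False), (9.0, True)],
    [2.0, 5.5, 8.0],
    ask_user=lambda element, proposed: False,
)
```

After this round the borderline element 5.5 is stored as a negative example, and the rebuilt decision system scores it accordingly.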
Abstract:
A method for adapting to a shift in document content according to one embodiment of the present invention includes receiving at least one labeled seed document; receiving unlabeled documents; receiving at least one predetermined cost factor; training a transductive classifier using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents; classifying the unlabeled documents having a confidence level above a predefined threshold into a plurality of categories using the classifier; reclassifying at least some of the categorized documents into the categories using the classifier; and outputting identifiers of the categorized documents to at least one of a user, another system, and another process. Methods for separating documents are also presented. Methods for document searching are also presented.
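The classify-then-reclassify step can be illustrated with a small sketch. It rests on several assumptions: a nearest-centroid rule over a single numeric feature stands in for the transductive classifier, the cost factor is omitted entirely, and the confidence measure is a hypothetical distance ratio. The function names `classify_confident` and `adapt` are likewise illustrative only.

```python
from statistics import mean

def classify_confident(categories, unlabeled, threshold):
    """Assign documents to the nearest category centroid if confident enough.

    categories: dict name -> list of feature values (at least two categories).
    unlabeled:  dict document id -> feature value.
    """
    centroids = {name: mean(values) for name, values in categories.items()}
    assigned = {}
    for doc_id, x in unlabeled.items():
        ranked = sorted((abs(x - c), name) for name, c in centroids.items())
        (d1, best), (d2, _) = ranked[0], ranked[1]
        # Assumed confidence: near 1.0 when far closer to the best centroid.
        confidence = d2 / (d1 + d2) if d1 + d2 > 0 else 0.5
        if confidence > threshold:
            assigned[doc_id] = best
    return assigned

def adapt(seeds, unlabeled, threshold):
    """Classify confidently, fold those documents in, then reclassify."""
    first = classify_confident(seeds, unlabeled, threshold)
    grown = {name: list(values) for name, values in seeds.items()}
    for doc_id, name in first.items():
        grown[name].append(unlabeled[doc_id])
    # Centroids now reflect the unlabeled documents, not just the seeds.
    return classify_confident(grown, unlabeled, threshold)

seeds = {"a": [0.0], "b": [10.0]}
unlabeled = {"d1": 2.0, "d2": 3.0, "d3": 4.2, "d4": 9.0}
first_pass = classify_confident(seeds, unlabeled, 0.6)
second_pass = adapt(seeds, unlabeled, 0.6)
```

In this toy data, document `d3` is too ambiguous to categorize from the seeds alone but becomes categorizable once the centroids have moved toward the unlabeled documents.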
Abstract:
A method according to one embodiment includes extracting an identifier from an electronic first document, and identifying a complementary document associated with the first document using the identifier. A validity of the first document is determined by simultaneously considering: textual information from the first document; textual information from the complementary document; and predefined business rules. An indication of the determined validity is output. Systems and computer program products for providing, performing, and/or enabling the methodology presented above are also presented.
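As a rough sketch of this flow, assume the first document is an invoice, the complementary document is a purchase order, and the identifier is a purchase order number. The identifier format, the `Total:` field, and the business rule (the invoice total may not exceed the purchase order total by more than a tolerance) are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import re

PO_PATTERN = re.compile(r"PO[- ]?(\d+)")                  # hypothetical identifier format
TOTAL_PATTERN = re.compile(r"Total:\s*(\d+(?:\.\d+)?)")   # hypothetical total field

def extract_identifier(text):
    match = PO_PATTERN.search(text)
    return match.group(1) if match else None

def extract_total(text):
    match = TOTAL_PATTERN.search(text)
    return float(match.group(1)) if match else None

def validate(invoice_text, purchase_orders, tolerance=0.01):
    """Return True if the invoice is consistent with its purchase order.

    purchase_orders: dict mapping identifier -> purchase order text.
    """
    po_text = purchase_orders.get(extract_identifier(invoice_text))
    if po_text is None:
        return False  # no complementary document found
    invoice_total = extract_total(invoice_text)
    po_total = extract_total(po_text)
    if invoice_total is None or po_total is None:
        return False
    # Hypothetical business rule, checked against text from both documents.
    return invoice_total <= po_total * (1 + tolerance)

orders = {"1042": "Purchase order 1042\nTotal: 500.00"}
ok = validate("Invoice 77 for PO-1042\nTotal: 500.00", orders)
too_high = validate("Invoice 78 for PO-1042\nTotal: 650.00", orders)
```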
Abstract:
An improved method of classifying examples into multiple categories using a binary support vector machine (SVM) algorithm is provided. In one preferred embodiment, the method includes the following steps: storing a plurality of user-defined categories in a memory of a computer; analyzing a plurality of training examples for each category so as to identify one or more features associated with each category; calculating at least one feature vector for each of the examples; transforming each of the at least one feature vectors so as to reflect information about all of the training examples; and building an SVM classifier for each one of the plurality of categories. The process of building an SVM classifier further includes: assigning each of the examples in a first category to a first class and all other examples belonging to other categories to a second class, wherein if any one of the examples belongs to another category as well as the first category, such examples are assigned to the first class only; optimizing at least one tunable parameter of an SVM classifier for the first category, wherein the SVM classifier is trained using the first and second classes; and optimizing a function that converts the output of the binary SVM classifier into a probability of category membership.
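Two of these steps lend themselves to a short sketch: the class assignment for each per-category binary classifier, and the conversion of a classifier output into a probability. The abstract does not name the conversion function; the sigmoid form below (in the style of Platt scaling) is an assumption, and its parameters `a` and `b` would have to be fitted on held-out data rather than chosen by hand. The function names are hypothetical.

```python
import math

def one_vs_rest_labels(examples, category):
    """Build binary labels for one category's classifier.

    examples: list of (features, set_of_categories) pairs.
    An example that belongs to `category` and also to other categories
    is assigned to the first class (+1) only.
    """
    return [1 if category in cats else -1 for _, cats in examples]

def sigmoid_probability(svm_output, a, b):
    """Map a raw binary classifier output to a membership probability.

    Assumed form: p = 1 / (1 + exp(a * f + b)), with a < 0 so that larger
    outputs give higher probabilities.
    """
    return 1.0 / (1.0 + math.exp(a * svm_output + b))

examples = [
    ("doc1", {"invoice"}),
    ("doc2", {"invoice", "receipt"}),
    ("doc3", {"letter"}),
]
invoice_labels = one_vs_rest_labels(examples, "invoice")
receipt_labels = one_vs_rest_labels(examples, "receipt")
```

Here `doc2` is a positive example for both the `invoice` and the `receipt` classifier, and is never used as a negative example for either.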
Abstract:
A method and system for delineating document and/or subdocument boundaries and identifying document and/or subdocument types, the method comprising: automatically generating at least one identifier for identifying which of a plurality of document and/or subdocument images belongs to which of a plurality of categories. The method and/or system optionally may include automatically categorizing a plurality of document and/or subdocument images into a plurality of predetermined categories in accordance with classification rules for said categories.
Abstract:
A method is provided for organizing data sets. In use, an automatic decision system is created or updated for determining whether data elements fit a predefined organization or not, where the decision system is based on a set of preorganized data elements. A plurality of data elements is organized using the decision system. At least one organized data element is selected for output to a user based on a score or confidence from the decision system for the at least one organized data element. Additionally, at least a portion of the at least one organized data element is output to the user. A response is received from the user comprising at least one of a confirmation, modification, and a negation of the organization of the at least one organized data element. The automatic decision system is recreated or updated based on the user response. Other embodiments are also presented.
Abstract:
Systems, methods and computer program products for classifying documents are presented. Systems, methods and computer program products for analyzing documents, e.g., associated with legal discovery are also presented. Systems, methods and computer program products for cleaning up data are also presented. Systems, methods and computer program products for verifying an association of an invoice with an entity are also presented. Systems, methods and computer program products for managing medical records are presented. Systems, methods and computer program products for face recognition are presented.
Abstract:
A system and article of manufacture for adapting to a shift in document content according to one embodiment of the present invention include instructions for: receiving at least one labeled seed document; receiving unlabeled documents; receiving at least one predetermined cost factor; training a transductive classifier using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents; classifying the unlabeled documents having a confidence level above a predefined threshold into a plurality of categories using the classifier; reclassifying at least some of the categorized documents into the categories using the classifier; and outputting identifiers of the categorized documents to at least one of a user, another system, and another process. Systems and articles of manufacture for separating documents are also presented. Systems and articles of manufacture for document searching are also presented.
Abstract:
An efficient method and system to enhance digital acquisition devices for analog data is presented. The enhancements offered by the method and system are available to the user in local as well as in remote deployments, yielding efficiency gains for a large variety of business processes. The quality enhancements of the acquired digital data are achieved efficiently by employing virtual reacquisition. Virtual reacquisition renders physical reacquisition of the analog data unnecessary when the digital data obtained by the acquisition device are of insufficient quality. The method and system allow multiple users to access the same acquisition device for analog data. In some embodiments, one or more users can virtually reacquire data provided by multiple analog or digital sources. The acquired raw data can be processed by each user according to that user's personal preferences and/or requirements. The preferred processing settings and attributes are determined interactively in real time or non-real time, automatically, or by a combination thereof.
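The idea of virtual reacquisition can be sketched in a few lines, assuming the raw data is a stored grayscale image and the per-user processing step is a simple binarization threshold. Both are illustrative assumptions; the abstract does not specify the data format or the processing applied. The name `virtually_reacquire` is hypothetical.

```python
def virtually_reacquire(raw_pixels, threshold):
    """Derive a new binarized image from stored raw grayscale data.

    The raw data is never modified, so different users can reprocess the
    same acquisition with their own settings without rescanning.
    """
    return [[1 if p >= threshold else 0 for p in row] for row in raw_pixels]

raw = [[10, 120], [200, 250]]              # raw data kept from a single physical scan
user_a = virtually_reacquire(raw, 100)     # user A's preferred setting
user_b = virtually_reacquire(raw, 210)     # user B's preferred setting
```

Both users obtain different outputs from one physical acquisition, and the stored raw data remains available for further reprocessing.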