METHODS AND SYSTEMS FOR AUTOMATED DOCUMENT CLASSIFICATION WITH PARTIALLY LABELED DATA USING SEMI-SUPERVISED LEARNING

    Publication No.: US20220036134A1

    Publication Date: 2022-02-03

    Application No.: US16945420

    Filing Date: 2020-07-31

    Applicant: NetApp, Inc.

    Abstract: A method, a computing device, and a non-transitory machine-readable medium for classifying documents. A document collection is sorted into a plurality of categories. A classifier corresponding to a category of the plurality of categories is trained to output a probability that a document associated with the category is of a selected type (e.g., confidential). The training includes determining, by the processor, that a cardinality of a set of negative samples in a train set is not above a pipeline threshold but is at least one, and training the classifier via a first pipeline and a second pipeline using a training group that includes a first portion of a group of positive samples in the train set, a second portion of the set of negative samples in the train set, and a third portion of a group of unlabeled samples in the train set.
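
    The condition and training group described in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation: the names `TrainSet`, `uses_both_pipelines`, and `build_training_group` are hypothetical, the portions are taken as simple leading slices, and the abstract does not say what each pipeline does or how the case outside this range is handled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TrainSet:
    """Hypothetical container for one category's train set."""
    positives: List[str]   # documents labeled as the selected type
    negatives: List[str]   # documents labeled as not the selected type
    unlabeled: List[str]   # documents with no label


def uses_both_pipelines(train_set: TrainSet, pipeline_threshold: int) -> bool:
    """True when the negative count is at least one and not above the threshold."""
    n_negative = len(train_set.negatives)
    return 1 <= n_negative <= pipeline_threshold


def build_training_group(
    train_set: TrainSet, n_pos: int, n_neg: int, n_unl: int
) -> List[Tuple[str, Optional[int]]]:
    """Combine a portion of each sample group into one training group.

    Labels: 1 = positive, 0 = negative, None = unlabeled.
    Taking the first n items of each list is an assumption made for brevity.
    """
    group: List[Tuple[str, Optional[int]]] = []
    group += [(doc, 1) for doc in train_set.positives[:n_pos]]
    group += [(doc, 0) for doc in train_set.negatives[:n_neg]]
    group += [(doc, None) for doc in train_set.unlabeled[:n_unl]]
    return group
```

    In this sketch, a train set with a single negative sample and a threshold of 5 satisfies the condition, while a train set with zero or six negative samples does not.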
