SELECTING CONDITIONALLY INDEPENDENT INPUT SIGNALS FOR UNSUPERVISED CLASSIFIER TRAINING
Abstract:
Methods, systems, and computer program products for content management systems. An unlabeled dataset comprising documents that at least potentially contain personally identifiable information (PII) is used to train a PII content classifier. The classifier is trained by (1) determining, based on applying a PII rule to a first portion of a document selected from the unlabeled dataset, a confidence value that the first portion of the document contains PII; (2) selecting a second portion of the same document such that the second portion does not include the first portion; and (3) assigning, based on the confidence value, a likelihood value that corresponds to whether characteristics of the second portion indicate that the document contains PII. The trained PII content classifier is then applied to selected portions of subject content objects to determine whether those portions contain PII.
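The three training steps in the abstract can be sketched roughly as follows. This is a minimal illustration, not the claimed implementation: the Social Security number pattern used as the PII rule, the two confidence constants, and the names `SSN_RULE`, `WeakExample`, and `make_weak_example` are all assumptions introduced here, and the abstract does not specify how a rule produces its confidence value.

```python
import re
from dataclasses import dataclass

# Hypothetical PII rule (an assumption): a US Social Security number pattern.
SSN_RULE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Assumed confidence values; the abstract does not say how they are derived.
MATCH_CONFIDENCE = 0.9      # rule fired on the first portion
NO_MATCH_CONFIDENCE = 0.1   # rule did not fire anywhere in the document


@dataclass
class WeakExample:
    """One weakly labeled training example for the PII classifier."""
    second_portion: str   # text that excludes the rule-matched span(s)
    likelihood: float     # soft label derived from the rule's confidence


def make_weak_example(document: str) -> WeakExample:
    """Turn one unlabeled document into a weakly labeled example."""
    # Step 1: apply the PII rule; the matched span(s) form the first portion.
    if SSN_RULE.search(document) is None:
        return WeakExample(document, NO_MATCH_CONFIDENCE)

    # Step 2: the second portion is the document with the first portion
    # removed, so the two portions do not overlap.
    remainder = SSN_RULE.sub(" ", document)
    second_portion = " ".join(remainder.split())

    # Step 3: the confidence from the first portion becomes the likelihood
    # value attached to the second portion.
    return WeakExample(second_portion, MATCH_CONFIDENCE)


if __name__ == "__main__":
    docs = [
        "Employee SSN: 123-45-6789 on file",
        "Lunch menu for Friday",
    ]
    for doc in docs:
        print(make_weak_example(doc))
```

In this sketch the classifier would be trained on `second_portion` with `likelihood` as a soft target, so it sees only the surrounding context (for example, the words "Employee SSN:") and never the span the rule matched. That separation appears to be the intent behind selecting a second portion that excludes the first.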