Invention Application
- Patent Title: SELECTING CONDITIONALLY INDEPENDENT INPUT SIGNALS FOR UNSUPERVISED CLASSIFIER TRAINING
- Application No.: US17163243
- Application Date: 2021-01-29
- Publication No.: US20220245477A1
- Publication Date: 2022-08-04
- Inventor: Kave Eshghi, Victor De Vansa Vikramaratne
- Applicant: Box, Inc.
- Applicant Address: US CA Redwood City
- Assignee: Box, Inc.
- Current Assignee: Box, Inc.
- Current Assignee Address: US CA Redwood City
- Main IPC: G06N5/04
- IPC: G06N5/04; G06N20/00; G06F21/62

Abstract:
Methods, systems, and computer program products for content management systems. An unlabeled dataset comprising documents that at least potentially comprise personally identifiable information (PII) is used when training a PII content classifier. Such a classifier is trained by (1) determining, based on applying a PII rule to a first portion of a document selected from the unlabeled dataset, a confidence value that the first portion of the document does contain personally identifiable information; (2) selecting a second portion of the document selected from the unlabeled dataset such that the second portion does not include the first portion; and (3) assigning, based on the confidence value, a likelihood value that corresponds to whether characteristics of the second portion are indicative that the document does contain personally identifiable information. Such a PII content classifier is used over selected portions of subject content objects to determine whether the selected portions contain PII.
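
The three training steps in the abstract can be illustrated with a small sketch. This is not the claimed implementation: the SSN-style regular expression, the confidence values 0.9 and 0.1, the bag-of-words "characteristics", the averaging scheme, and every function name below are assumptions chosen only to make the flow concrete.

```python
import re
from collections import defaultdict

# Hypothetical PII rule (assumption): an SSN-like pattern.
SSN_RULE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def apply_rule(text):
    """Step 1: apply the rule and return (confidence, span of first portion).

    The confidence values 0.9 / 0.1 are illustrative, not from the source.
    """
    match = SSN_RULE.search(text)
    if match:
        return 0.9, match.span()
    return 0.1, None


def second_portion(text, span):
    """Step 2: select the part of the document that excludes the first portion."""
    if span is None:
        return text  # assumption: no rule hit, so nothing needs excluding
    start, end = span
    return text[:start] + " " + text[end:]


def tokens(text):
    """Characteristics of a portion, modeled here as a set of lowercase words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def train(docs):
    """Step 3: assign each characteristic a likelihood from rule confidences.

    Each token's likelihood is the mean confidence of the documents whose
    second portion contains it (an assumed, deliberately simple scheme).
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for doc in docs:
        conf, span = apply_rule(doc)
        for tok in tokens(second_portion(doc, span)):
            total[tok] += conf
            count[tok] += 1
    return {tok: total[tok] / count[tok] for tok in total}


def score(model, portion, default=0.5):
    """Score a selected portion by averaging its token likelihoods."""
    toks = tokens(portion)
    if not toks:
        return default
    return sum(model.get(t, default) for t in toks) / len(toks)


docs = [
    "Employee SSN: 123-45-6789 on file",
    "Lunch menu for Friday",
    "SSN 987-65-4321 recorded",
]
model = train(docs)
```

Because the likelihoods are learned only from text outside the rule match, `score(model, "ssn on file")` can come out higher than `score(model, "lunch menu")` even though neither string contains a number the rule would fire on. Keeping the two portions disjoint is what lets the rule's output act as a training signal for features it did not itself inspect.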