Invention Publication
- Patent Title: Context-Aware Text Sanitization
- Application No.: US18060921
- Application Date: 2022-12-01
- Publication No.: US20240184912A1
- Publication Date: 2024-06-06
- Inventor: Yanfei Dong, Yuan Deng, Soujanya Poria
- Applicant: PayPal, Inc.
- Applicant Address: US CA San Jose
- Assignee: PayPal, Inc.
- Current Assignee: PayPal, Inc.
- Current Assignee Address: US CA San Jose
- Main IPC: G06F21/62
- IPC: G06F21/62; G06F40/284; G06F40/295

Abstract:
Techniques are disclosed relating to text sanitization. Given textual data, a computer system identifies tokens predicted to constitute sensitive information. For the identified tokens, it generates multi-field data structures (e.g., triplets) that include questions, answers, and corresponding context. These data structures are supplied to a pre-trained multiple-choice question (MCQ) reading comprehension model. For each data structure, the model outputs a probability that the answer to the question is accurate given the context. A post-processing module can then rank the resulting set of probabilities and select the multi-field data structure with the highest probability (in some cases, a programmable threshold must also be met). The selected multi-field data structure is then used to select category information to be used in sanitizing the textual data. In this manner, a piece of sensitive data may be replaced by a label that helps retain interpretability of the sanitized text.
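
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the category list, the question wording, the `[REDACTED]` fallback, and every function name are assumptions made for illustration, and `toy_scorer` is a hand-written heuristic standing in for the pre-trained MCQ reading comprehension model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Triplet:
    """One multi-field data structure: question, candidate answer, context."""
    question: str
    answer: str   # candidate category label
    context: str  # the surrounding text


# Hypothetical category set; the abstract does not enumerate categories.
CATEGORIES = ["EMAIL", "PHONE_NUMBER", "PERSON_NAME"]


def build_triplets(token: str, context: str) -> List[Triplet]:
    """Create one triplet per candidate category for a sensitive token."""
    question = f'What type of information is "{token}"?'
    return [Triplet(question, category, context) for category in CATEGORIES]


def toy_scorer(triplet: Triplet) -> float:
    """Stand-in for the MCQ model: returns a made-up probability.

    Assumes the token is the quoted span in the question and contains
    no double quote itself.
    """
    token = triplet.question.split('"')[1]
    if triplet.answer == "EMAIL":
        return 0.95 if "@" in token else 0.02
    if triplet.answer == "PHONE_NUMBER":
        return 0.90 if sum(ch.isdigit() for ch in token) >= 7 else 0.03
    return 0.40 if token.istitle() else 0.05


def select_category(triplets: List[Triplet],
                    scorer: Callable[[Triplet], float],
                    threshold: float) -> Optional[str]:
    """Rank triplets by score; keep the best only if it meets the threshold."""
    best = max(triplets, key=scorer)
    return best.answer if scorer(best) >= threshold else None


def sanitize(text: str,
             sensitive_tokens: List[str],
             scorer: Callable[[Triplet], float] = toy_scorer,
             threshold: float = 0.5) -> str:
    """Replace each sensitive token with a category label.

    Falling back to "[REDACTED]" when no category clears the threshold
    is an assumption of this sketch, not something the abstract states.
    """
    sanitized = text
    for token in sensitive_tokens:
        category = select_category(build_triplets(token, text), scorer, threshold)
        label = f"[{category}]" if category else "[REDACTED]"
        sanitized = sanitized.replace(token, label)
    return sanitized
```

In this sketch the sensitive tokens are passed in directly; in the described system they would come from the upstream step that predicts which tokens are sensitive. Calling `sanitize("Contact alice@example.com or call 4155550123.", ["alice@example.com", "4155550123"])` replaces both tokens with category labels, so the sanitized sentence still conveys what kind of information was removed.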