Publication No.: US20240221407A1
Publication Date: 2024-07-04
Application No.: US18149795
Filing Date: 2023-01-04
Applicant: Oracle International Corporation
Inventor: Yazhe Hu, Jeaff Wang, Mengqing Guo, Tao Sheng, Jun Qian
IPC: G06V30/19, G06F40/284, G06F40/30, G06N3/08, G06V30/14
CPC classification number: G06V30/19147, G06F40/284, G06F40/30, G06N3/08, G06V30/1448
Abstract: Techniques for multi-stage training of a machine learning model to extract key-value pairs from documents are disclosed. In an initial training stage, a system trains the machine learning model using a set of training data that includes unlabeled documents of various document categories. This initial stage identifies relationships among the tokens (words, numbers, and punctuation) in the documents and sets the parameters of the machine learning model to an initial state. In a second training stage, the system re-trains the machine learning model using a set of training data that includes documents of a particular category while excluding documents of other categories. The second stage is a supervised machine learning stage in which the training data is labeled to identify key-value pairs in the documents. In this stage, the system modifies the parameters of the machine learning model based on the characteristics of the training data set that includes the documents of the particular category.
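
The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. The abstract does not describe the model architecture, so everything here is an assumption made for illustration: `ToyKeyScorer`, `train_two_stage`, the per-token "key-likeness" score, the colon-based pretext statistic, and the sample documents are all hypothetical stand-ins, not the patented method. The sketch only shows the data flow: stage one uses unlabeled documents of all categories to set initial parameters, and stage two adjusts those same parameters using labeled documents of a single category.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


class ToyKeyScorer:
    """Hypothetical stand-in model: one 'key-likeness' score per token."""

    def __init__(self):
        self.score = {}  # the model parameters

    def pretrain(self, unlabeled_docs):
        """Stage 1 (no labels): learn a token relationship, here how often
        each token is immediately followed by ':' (an assumed pretext task),
        and use it as the initial parameter state."""
        seen, before_colon = Counter(), Counter()
        for doc in unlabeled_docs:
            tokens = [t.lower() for t in tokenize(doc["text"])]
            for tok, nxt in zip(tokens, tokens[1:] + [""]):
                seen[tok] += 1
                if nxt == ":":
                    before_colon[tok] += 1
        self.score = {t: before_colon[t] / seen[t] for t in seen}

    def fine_tune(self, labeled_docs, lr=0.5, epochs=3):
        """Stage 2 (supervised): perceptron-style corrections to the
        stage-1 scores wherever they disagree with the token labels."""
        for _ in range(epochs):
            for doc in labeled_docs:
                for tok, label in doc["labels"]:
                    tok = tok.lower()
                    s = self.score.get(tok, 0.0)
                    predicted_key = s >= 0.5
                    if label == "KEY" and not predicted_key:
                        self.score[tok] = s + lr
                    elif label != "KEY" and predicted_key:
                        self.score[tok] = s - lr

    def extract(self, text):
        """Treat high-scoring tokens as keys; following tokens form the value."""
        pairs, key, value = {}, None, []
        for tok in tokenize(text):
            if self.score.get(tok.lower(), 0.0) >= 0.5:
                if key is not None:
                    pairs[key] = " ".join(value)
                key, value = tok, []
            elif key is not None and tok != ":":
                value.append(tok)
        if key is not None:
            pairs[key] = " ".join(value)
        return pairs


def train_two_stage(unlabeled_docs, labeled_docs, target_category):
    """Stage 1 on every category, then stage 2 on one category only."""
    model = ToyKeyScorer()
    model.pretrain(unlabeled_docs)
    in_category = [d for d in labeled_docs if d["category"] == target_category]
    model.fine_tune(in_category)
    return model


# Made-up sample data: unlabeled documents of mixed categories ...
unlabeled = [
    {"category": "invoice", "text": "Total: 42 Due: 30"},
    {"category": "receipt", "text": "Total: 7 Cashier: Ann"},
    {"category": "invoice", "text": "Vendor Acme Total: 10"},
]
# ... and token-labeled documents, only some of which are invoices.
labeled = [
    {"category": "invoice",
     "labels": [("Vendor", "KEY"), ("Acme", "VALUE"),
                ("Total", "KEY"), (":", "OTHER"), ("10", "VALUE")]},
    {"category": "receipt",
     "labels": [("Ann", "KEY")]},  # excluded by the category filter
]

stage1_only = ToyKeyScorer()
stage1_only.pretrain(unlabeled)
model = train_two_stage(unlabeled, labeled, target_category="invoice")
print(stage1_only.extract("Vendor Acme Total: 10"))  # {'Total': '10'}
print(model.extract("Vendor Acme Total: 10"))  # {'Vendor': 'Acme', 'Total': '10'}
```

In this toy setup, stage one alone misses `Vendor` because it never appears before a colon in the unlabeled text; the category-specific labeled pass corrects that score without touching the parameters that stage one already got right. A real system would presumably replace the score table with a learned neural model, but the abstract leaves that unspecified.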