Publication No.: US20230334309A1
Publication Date: 2023-10-19
Application No.: US17720658
Filing Date: 2022-04-14
Applicant: SAP SE
Inventor: Alexey Streltsov, Monit Shah Singh, Dhananjay Tomar, Christian Reisswig, Minh Duc Bui
IPC: G06N3/08
CPC classification number: G06N3/08
Abstract: Systems, methods, and computer-readable media for generating a synthetic training data set from an original unstructured electronic document are disclosed. The synthetic training data set may be used to train a deep learning model to extract data from the original electronic document. The original electronic document may comprise annotated data fields, each of which may comprise a bounding box and a label. The original electronic document may comprise a header, a table, and a footer. Macro augmentation operations may be applied to the original electronic document to create sub-templates representative of distinct page layouts in the original electronic document. The synthetic training data set may be generated by applying geometric and semantic data augmentations to the sub-templates and the original electronic document. The synthetic training data set may then be provided to the deep learning model for training.
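The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Every name in it (`Field`, `macro_sub_templates`, `geometric_augment`, `semantic_augment`, `generate_synthetic_set`) is hypothetical, and so are the specific choices: the four page layouts, the shared shift-and-scale transform, and the per-label value vocabulary are assumptions made only to show how the three augmentation stages might fit together.

```python
import random
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Field:
    """One annotated data field: a label, its text, and a bounding box."""
    label: str
    text: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)
    region: str  # assumed to be "header", "table", or "footer"


def macro_sub_templates(fields: List[Field]) -> Dict[str, List[Field]]:
    """Macro augmentation (assumed form): keep only the regions that a
    given page layout would contain, yielding one sub-template per layout."""
    layouts = {
        "single_page": ("header", "table", "footer"),
        "first_page": ("header", "table"),
        "middle_page": ("table",),
        "last_page": ("table", "footer"),
    }
    return {
        name: [f for f in fields if f.region in regions]
        for name, regions in layouts.items()
    }


def geometric_augment(fields: List[Field], rng: random.Random,
                      max_shift: float = 10.0,
                      max_scale: float = 0.05) -> List[Field]:
    """Geometric augmentation (assumed form): apply one random shift and
    one random uniform scale to every bounding box on the page."""
    dx = rng.uniform(-max_shift, max_shift)
    dy = rng.uniform(-max_shift, max_shift)
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    return [
        replace(f, bbox=(f.bbox[0] * s + dx, f.bbox[1] * s + dy,
                         f.bbox[2] * s + dx, f.bbox[3] * s + dy))
        for f in fields
    ]


def semantic_augment(fields: List[Field], rng: random.Random,
                     vocab: Dict[str, List[str]]) -> List[Field]:
    """Semantic augmentation (assumed form): swap a field's text for
    another plausible value of the same label, leaving the label intact."""
    return [
        replace(f, text=rng.choice(vocab[f.label])) if f.label in vocab else f
        for f in fields
    ]


def generate_synthetic_set(fields: List[Field], vocab: Dict[str, List[str]],
                           n_per_template: int,
                           seed: int = 0) -> List[List[Field]]:
    """Build synthetic pages from every sub-template of one document."""
    rng = random.Random(seed)
    samples = []
    for template in macro_sub_templates(fields).values():
        for _ in range(n_per_template):
            page = geometric_augment(template, rng)
            samples.append(semantic_augment(page, rng, vocab))
    return samples
```

In this sketch the labels are never altered, so each synthetic page stays consistent with the original annotations while its geometry and text vary. A real system would also have to render the augmented fields back into a document image or token sequence before training, which is omitted here.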