AUGMENTING ELECTRONIC DOCUMENTS TO GENERATE SYNTHETIC TRAINING DATA SETS

    Publication number: US20230334309A1

    Publication date: 2023-10-19

    Application number: US17720658

    Filing date: 2022-04-14

    Applicant: SAP SE

    CPC classification number: G06N3/08

    Abstract: Systems, methods, and computer-readable media for generating a synthetic training data set from an original unstructured electronic document are disclosed. The synthetic training data set may be used to train a deep learning model to extract data from the original electronic document. The original electronic document may comprise annotated data fields. Each annotated data field may comprise a bounding box and a label. The original electronic document may comprise a header, a table, and a footer. Macro augmentation operations may be applied to the original electronic document to create sub-templates representative of distinct page layouts in the original electronic document. The synthetic training data set may be generated by applying geometric and semantic data augmentations to the sub-templates and the original electronic document. The synthetic training data set may then be provided to the deep learning model for training.
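
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation; every name here (`Field`, `make_sub_templates`, `geometric_augment`, `semantic_augment`, `generate_synthetic`) is hypothetical. It assumes that the macro step combines header, table, and footer fields into page-layout variants, that geometric augmentation is a uniform scale plus a shift of each bounding box, and that semantic augmentation replaces a field's text with a generated value for the same label.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple

@dataclass(frozen=True)
class Field:
    """An annotated data field: a label, its text, and a bounding box."""
    label: str
    text: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)

Template = List[Field]
Generators = Dict[str, Callable[[random.Random], str]]

def make_sub_templates(header: Template, table: Template,
                       footer: Template) -> List[Template]:
    # Assumed macro augmentation: one sub-template per page layout
    # (single page, first page, middle page, last page).
    return [header + table + footer, header + table, table, table + footer]

def geometric_augment(fields: Template, scale: float,
                      dx: float, dy: float) -> Template:
    # Scale every bounding box about the origin, then shift it by (dx, dy).
    out = []
    for f in fields:
        x0, y0, x1, y1 = f.bbox
        box = (x0 * scale + dx, y0 * scale + dy,
               x1 * scale + dx, y1 * scale + dy)
        out.append(replace(f, bbox=box))
    return out

def semantic_augment(fields: Template, generators: Generators,
                     rng: random.Random) -> Template:
    # Replace the text of each field whose label has a generator;
    # labels and bounding boxes are left unchanged.
    return [replace(f, text=generators[f.label](rng))
            if f.label in generators else f
            for f in fields]

def generate_synthetic(templates: List[Template], generators: Generators,
                       n: int, seed: int = 0) -> List[Template]:
    # Draw n samples, each from a randomly chosen template with a
    # random geometric perturbation and freshly generated field text.
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        template = rng.choice(templates)
        scale = rng.uniform(0.9, 1.1)
        dx, dy = rng.uniform(-5, 5), rng.uniform(-5, 5)
        fields = geometric_augment(template, scale, dx, dy)
        samples.append(semantic_augment(fields, generators, rng))
    return samples
```

A caller would pass the original document's fields together with the sub-templates, for example `generate_synthetic([original] + make_sub_templates(header, table, footer), generators, n=1000)`, so that both sources contribute to the synthetic set, as the abstract describes. The perturbation ranges above are arbitrary placeholders.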
