Automated generation of structured training data from unstructured documents

    Publication No.: US11244203B2

    Publication date: 2022-02-08

    Application No.: US16784726

    Filing date: 2020-02-07

    Abstract: Methods, systems and computer program products for automatically generating structured training data based on an unstructured document are provided. Aspects include receiving an unstructured document and a corresponding structured document that includes labeled portions. Aspects also include generating a parsed document that has one or more extracted objects by applying a parsing tool to the unstructured document. Aspects also include identifying one or more matching extracted objects by applying a matching algorithm to the structured document and the parsed document. Each matching extracted object is an extracted object of the parsed document that corresponds to a labeled portion of the structured document. Aspects also include annotating a region of the unstructured document that corresponds to the bounding box of the respective matching extracted object with a respective label of the corresponding labeled portion of the unstructured document.
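The matching and annotation steps described in this abstract can be illustrated with a minimal sketch. The abstract does not name a parsing tool or a matching algorithm, so everything here is an assumption made for illustration: the `ExtractedObject` type, the `annotate` function, the 0.8 threshold, and the use of Python's `difflib.SequenceMatcher` as a stand-in fuzzy matcher.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed page coordinates


@dataclass
class ExtractedObject:
    """Hypothetical output of a parsing tool: a text span and its bounding box."""
    text: str
    bbox: BBox


def _normalize(s: str) -> str:
    # Lowercase and collapse whitespace so layout differences do not block a match.
    return " ".join(s.lower().split())


def similarity(a: str, b: str) -> float:
    # Stand-in matching algorithm: character-level similarity ratio in [0, 1].
    return SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()


def annotate(labeled: Dict[str, str],
             extracted: List[ExtractedObject],
             threshold: float = 0.8) -> List[dict]:
    """For each labeled portion of the structured document, find the best-matching
    extracted object and attach the label to that object's bounding box."""
    annotations = []
    for label, text in labeled.items():
        best = max(extracted, key=lambda o: similarity(text, o.text), default=None)
        if best is not None and similarity(text, best.text) >= threshold:
            annotations.append({"label": label, "bbox": best.bbox})
    return annotations


# Usage: the structured document supplies labels, the parsed document supplies boxes.
labeled_portions = {"title": "Quarterly Report", "author": "Jane Doe"}
parsed_objects = [
    ExtractedObject("Quarterly  report", (50, 40, 400, 70)),
    ExtractedObject("Page 1", (500, 780, 560, 800)),
]
result = annotate(labeled_portions, parsed_objects)
```

In this sketch a labeled portion with no sufficiently similar extracted object (here, `author`) simply produces no annotation, which keeps low-quality matches out of the generated training data.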

    Automatic delineation and extraction of tabular data using machine learning

    Publication No.: US11380116B2

    Publication date: 2022-07-05

    Application No.: US16659977

    Filing date: 2019-10-22

    Abstract: A computer-implemented method for using a machine learning model to automatically extract tabular data from an image includes receiving a set of images of tabular data and a set of markup data corresponding respectively to the images of tabular data. The method further includes training a first neural network to delineate the tabular data into cells using the markup data, and training a second neural network to determine content of the cells in the tabular data using the markup data. The method further includes, upon receiving an input image containing a first tabular data without any markup data, generating an electronic output corresponding to the first tabular data by determining the structure of the first tabular data using the first neural network and extracting content of the first tabular data using the second neural network.
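The two-stage inference flow in this abstract (one model delineates cells, a second reads their content) can be sketched as below. This is only an outline under stated assumptions: the two neural networks are replaced by plain callables, and the names `extract_table`, `delineate`, and `read_cell`, the `(row, col, bbox)` cell format, and the toy character-grid "image" are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

BBox = Tuple[int, int, int, int]   # (x0, y0, x1, y1), end-exclusive
Cell = Tuple[int, int, BBox]       # (row index, column index, bounding box)


def extract_table(image,
                  delineate: Callable[[object], Sequence[Cell]],
                  read_cell: Callable[[object, BBox], str]) -> List[List[str]]:
    """Run the structure stage, then the content stage, and assemble a grid."""
    cells = delineate(image)               # stage 1: table structure
    if not cells:
        return []
    n_rows = max(r for r, _, _ in cells) + 1
    n_cols = max(c for _, c, _ in cells) + 1
    table = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for row, col, bbox in cells:
        table[row][col] = read_cell(image, bbox)   # stage 2: cell content
    return table


# Stubs standing in for the trained networks, using a character grid as the "image".
toy_image = ["id name",
             "1  Ann "]


def stub_delineate(image) -> Sequence[Cell]:
    # A real model would predict these boxes; here they are hard-coded.
    return [(0, 0, (0, 0, 2, 1)), (0, 1, (3, 0, 7, 1)),
            (1, 0, (0, 1, 2, 2)), (1, 1, (3, 1, 7, 2))]


def stub_read_cell(image, bbox: BBox) -> str:
    # A real model would recognize text from pixels; here we crop characters.
    x0, y0, x1, y1 = bbox
    return "".join(line[x0:x1] for line in image[y0:y1]).strip()


table = extract_table(toy_image, stub_delineate, stub_read_cell)
```

Passing the two stages in as callables mirrors the separation described in the abstract, so either stage could be swapped for a trained model without changing the assembly step.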

    MULTI-MODEL, MULTI-TASK TRAINED NEURAL NETWORK FOR ANALYZING UNSTRUCTURED AND SEMI-STRUCTURED ELECTRONIC DOCUMENTS

    Publication No.: US20210286989A1

    Publication date: 2021-09-16

    Application No.: US16815391

    Filing date: 2020-03-11

    Abstract: Embodiments of the invention describe a computer-implemented method of analyzing an electronic version of a document. The computer-implemented method can include an architecture of machine learning sub-models that performs the global task of translating unstructured and semi-structured inputs into numerical representations that can be recognized and manipulated by a content-analysis (CA) sub-model without relying on brute force analysis. Embodiments of the invention achieve these results by separating the global task into auxiliary tasks and assigning each sub-model to at least one of the auxiliary tasks. The auxiliary tasks can include parsing the unstructured or semi-structured inputs into format types (e.g., lists, tables, figures, text, etc. of a PDF document), extracting features of the parsed document, and performing a computer-based CA on the extracted features. The sub-models are trained in stages and in groups, wherein both the stages and the groupings are based on the complexity of the sub-model's assigned task.

    AUTOMATED GENERATION OF STRUCTURED TRAINING DATA FROM UNSTRUCTURED DOCUMENTS

    Publication No.: US20210248420A1

    Publication date: 2021-08-12

    Application No.: US16784726

    Filing date: 2020-02-07

    Abstract: Methods, systems and computer program products for automatically generating structured training data based on an unstructured document are provided. Aspects include receiving an unstructured document and a corresponding structured document that includes labeled portions. Aspects also include generating a parsed document that has one or more extracted objects by applying a parsing tool to the unstructured document. Aspects also include identifying one or more matching extracted objects by applying a matching algorithm to the structured document and the parsed document. Each matching extracted object is an extracted object of the parsed document that corresponds to a labeled portion of the structured document. Aspects also include annotating a region of the unstructured document that corresponds to the bounding box of the respective matching extracted object with a respective label of the corresponding labeled portion of the unstructured document.
