EFFICIENT DOCUMENT INFORMATION EXTRACTION SYSTEM USING OPTICAL CHARACTER RECOGNITION (OCR) INFORMATION

    Publication No.: US20240177515A1

    Publication date: 2024-05-30

    Application No.: US17994977

    Filing date: 2022-11-28

    Applicant: SAP SE

    CPC classification number: G06V30/414 G06V30/19007

    Abstract: Embodiments are described for a system comprising a memory and at least one processor coupled to the memory. The at least one processor is configured to receive optical character recognition (OCR) information of a document and to determine beginning, inside, and outside (BIO) tags and labels of one or more word boxes based on the OCR information. The at least one processor is further configured to group a first word box and a second word box based on the BIO tags of the first and second word boxes, and to merge the first and second word boxes into a combined word box based on a label of the first word box matching a label of the second word box. Finally, the at least one processor is configured to output the combined word box and the label of the first word box.
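
The grouping and merging step described in this abstract can be illustrated with a minimal sketch. It is not the patented implementation: the `WordBox` class, the `merge_word_boxes` function, the `(x0, y0, x1, y1)` box layout, and the rule that an "I" tag with no matching open group starts a new group are all assumptions made for illustration. The BIO tags and labels are taken as given inputs rather than predicted by a model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class WordBox:
    """Hypothetical word box: text, (x0, y0, x1, y1) box, BIO tag, label."""
    text: str
    box: Tuple[float, float, float, float]
    tag: str    # "B" (beginning), "I" (inside), or "O" (outside)
    label: str  # field name such as "vendor"; ignored when tag is "O"


def merge_word_boxes(words: List[WordBox]) -> List[WordBox]:
    """Group consecutive B/I word boxes that share a label and merge each group."""
    merged: List[WordBox] = []
    current: Optional[WordBox] = None
    for w in words:
        # An "I" box with the same label extends the currently open group.
        if w.tag == "I" and current is not None and w.label == current.label:
            x0, y0, x1, y1 = current.box
            current = WordBox(
                text=current.text + " " + w.text,
                box=(min(x0, w.box[0]), min(y0, w.box[1]),
                     max(x1, w.box[2]), max(y1, w.box[3])),
                tag="B",
                label=current.label,
            )
            continue
        # Any other box closes the open group, if there is one.
        if current is not None:
            merged.append(current)
            current = None
        # "B" opens a new group. Assumption: an "I" box that cannot extend
        # a group is also treated as the start of a new one.
        if w.tag in ("B", "I"):
            current = WordBox(w.text, w.box, "B", w.label)
        # "O" boxes carry no label and are dropped from the output.
    if current is not None:
        merged.append(current)
    return merged


words = [
    WordBox("Invoice", (0, 0, 50, 10), "O", ""),
    WordBox("ACME", (60, 0, 100, 10), "B", "vendor"),
    WordBox("Corp", (105, 0, 140, 12), "I", "vendor"),
    WordBox("42", (0, 20, 20, 30), "B", "total"),
]
for combined in merge_word_boxes(words):
    print(combined.label, combined.text, combined.box)
```

The combined box is taken as the smallest rectangle enclosing every box in the group, which is one simple choice; the abstract does not say how the combined geometry is computed.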

NEURAL NETWORK WORD CLUSTERING SYSTEM

    Document type: Invention publication

    Publication No.: US20240177011A1

    Publication date: 2024-05-30

    Application No.: US18071231

    Filing date: 2022-11-29

    Applicant: SAP SE

    CPC classification number: G06N3/09

    Abstract: Various embodiments for a neural network clustering system are described herein. An embodiment operates by detecting a plurality of bounding boxes, each enclosing a word on a first document, and identifying coordinates for each of the bounding boxes. An adjacency matrix is generated based on combining a key matrix and a query matrix. The words are clustered into a plurality of clusters, each cluster corresponding to a different line on the first document. A second document is generated in which the words of each cluster are arranged on the same line. The second document is provided for display.
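
The pipeline in this abstract (query and key matrices combined into an adjacency matrix, words clustered into lines, lines re-rendered) can be sketched as follows. This is an illustration under stated assumptions, not the patented method: the abstract describes a neural network, whereas here the query and key rows are hand-built from each box's vertical centre so that their dot product equals the negative squared vertical distance between two boxes. The function names, the `(text, x0, y0, x1, y1)` tuple layout, and the `tol` threshold are all hypothetical.

```python
from typing import List, Tuple

Box = Tuple[str, float, float, float, float]  # (text, x0, y0, x1, y1), assumed layout


def adjacency_from_query_key(boxes: List[Box], tol: float) -> List[List[bool]]:
    """Build an adjacency matrix by thresholding the product Q @ K^T.

    Stand-in for learned projections: with centre c, query = [2c, -c^2, -1]
    and key = [c, 1, c^2], so query_i . key_j = -(c_i - c_j)^2.
    """
    centres = [(b[2] + b[4]) / 2 for b in boxes]
    queries = [[2 * c, -c * c, -1.0] for c in centres]
    keys = [[c, 1.0, c * c] for c in centres]
    scores = [[sum(q * k for q, k in zip(qi, kj)) for kj in keys] for qi in queries]
    # Two words are adjacent when their centres differ by at most tol.
    return [[s >= -tol * tol for s in row] for row in scores]


def cluster_lines(adj: List[List[bool]]) -> List[List[int]]:
    """Return the connected components of the adjacency matrix as index lists."""
    n = len(adj)
    seen = [False] * n
    clusters: List[List[int]] = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if adj[i][j] and not seen[j]:
                    seen[j] = True
                    stack.append(j)
        clusters.append(component)
    return clusters


def render_lines(boxes: List[Box], tol: float = 5.0) -> List[str]:
    """Arrange each cluster on one output line, top to bottom, left to right."""
    clusters = cluster_lines(adjacency_from_query_key(boxes, tol))
    clusters.sort(key=lambda c: min(boxes[i][2] for i in c))
    return [
        " ".join(boxes[i][0] for i in sorted(c, key=lambda i: boxes[i][1]))
        for c in clusters
    ]


boxes = [
    ("world", 60, 11, 100, 21),
    ("Hello", 10, 10, 50, 20),
    ("line", 70, 40, 100, 50),
    ("Second", 10, 41, 60, 51),
]
for line in render_lines(boxes):
    print(line)
```

Connected components are used here as the clustering rule because they follow directly from a boolean adjacency matrix; the abstract does not specify how clusters are derived from the matrix.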
