Systems and methods for automatic data extraction from document images

    Publication No.: US11328524B2

    Publication date: 2022-05-10

    Application No.: US16504838

    Filing date: 2019-07-08

    Applicant: UiPath Inc.

    Abstract: Described systems and methods allow the automatic extraction of structured information from images of structured text documents such as invoices and receipts. Some embodiments employ optical character recognition (OCR) technology to extract individual text tokens (e.g., words) and token bounding boxes from a document image. A feature vector of each text token comprises a first part determined according to the character content of the text token, and a second part determined according to the image content of the token's bounding box. A neural network classifier produces, for each text token, a label indicative of the type of information (e.g., “billing address”, “due date”, etc.) carried by that token. In some embodiments, documents are linearized by ordering text tokens in a sequence according to the reading order of the natural language (e.g., English, Arabic) in which the respective document is written. Token feature vectors are fed to the classifier in the order indicated by the token sequence.
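The pipeline described in the abstract (linearize the OCR tokens by reading order, then build a two-part feature vector per token) can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in and not taken from the patent: the `Token` type, the `linearize`, `text_features`, `image_features`, and `feature_vector` helpers, the line-grouping tolerance, and the hand-crafted features themselves. The OCR step and the neural network classifier are omitted; the image is assumed to be a grayscale grid of values from 0 to 255.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Token:
    """A text token with its bounding box (hypothetical structure)."""
    text: str
    x: int  # left edge of the bounding box, in pixels
    y: int  # top edge of the bounding box, in pixels
    w: int  # box width
    h: int  # box height


def linearize(tokens: Sequence[Token], rtl: bool = False,
              line_tol: float = 0.5) -> List[Token]:
    """Order tokens top to bottom by line, then along the reading direction.

    Assumption: two tokens share a line when their vertical centers differ
    by at most line_tol * token height. rtl=True models right-to-left scripts.
    """
    lines: List[dict] = []
    for tok in sorted(tokens, key=lambda t: t.y + t.h / 2):
        cy = tok.y + tok.h / 2
        if lines and abs(cy - lines[-1]["cy"]) <= line_tol * tok.h:
            lines[-1]["toks"].append(tok)
        else:
            lines.append({"cy": cy, "toks": [tok]})
    ordered: List[Token] = []
    for line in lines:
        ordered.extend(sorted(line["toks"], key=lambda t: t.x, reverse=rtl))
    return ordered


def text_features(text: str) -> List[float]:
    """First part: features of the character content (illustrative choice)."""
    n = max(len(text), 1)  # avoid division by zero on empty tokens
    return [
        float(len(text)),
        sum(c.isdigit() for c in text) / n,
        sum(c.isalpha() for c in text) / n,
        sum(c.isupper() for c in text) / n,
    ]


def image_features(image: Sequence[Sequence[int]], tok: Token) -> List[float]:
    """Second part: features of the pixels inside the bounding box."""
    pixels = [p for row in image[tok.y:tok.y + tok.h]
              for p in row[tok.x:tok.x + tok.w]]
    if not pixels:
        return [0.0, 0.0]
    mean_intensity = sum(pixels) / (255 * len(pixels))
    dark_fraction = sum(p < 128 for p in pixels) / len(pixels)
    return [mean_intensity, dark_fraction]


def feature_vector(tok: Token, image: Sequence[Sequence[int]]) -> List[float]:
    """Concatenate the text-derived part and the image-derived part."""
    return text_features(tok.text) + image_features(image, tok)
```

In use, the vectors would be computed in the order returned by `linearize` and passed to a sequence classifier in that same order, for example `[feature_vector(t, image) for t in linearize(tokens)]`. A real system would likely replace the hand-crafted features with learned embeddings; the sketch only shows the data flow.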