MAPPING ENTITIES IN UNSTRUCTURED TEXT DOCUMENTS VIA ENTITY CORRECTION AND ENTITY RESOLUTION

    Publication No.: US20230267274A1

    Publication Date: 2023-08-24

    Application No.: US17813384

    Filing Date: 2022-07-19

    Applicant: OneTrust LLC

    Abstract: Methods, systems, and non-transitory computer-readable storage media are disclosed for correcting entity detection errors, via entity correction and entity resolution, in optical character recognition used to digitize physical documents. Specifically, the disclosed system utilizes named entity recognition to extract entities from character strings (e.g., words) in a digital text document. The disclosed system also tokenizes the character strings in the digital text document based on attributes of the character strings. Furthermore, the disclosed system compares the extracted entities with the tokenized character strings to determine similarity metrics between them. The disclosed system also compares the extracted entities with character strings that include special or numerical characters to determine similarity metrics indicating correlation probabilities between entities and character strings. The disclosed system generates mappings between the tokens and the entities based on the similarity metrics, resolving each entity to its likely corresponding character string while correcting for errors introduced during entity extraction.
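
The tokenize, compare, and map flow described in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation. The abstract does not name a specific similarity metric, so the ratio from the standard library's `difflib.SequenceMatcher` is used here as a stand-in. The function names (`tokenize`, `similarity`, `map_entities`), the 0.6 threshold, and the sample text are all hypothetical, and the named entity recognition step is replaced by a hard-coded list of entities.

```python
from difflib import SequenceMatcher


def tokenize(text):
    """Split on whitespace and trim surrounding punctuation.

    Digits and special characters inside a token are kept, since an OCR
    error such as '0' for 'O' is exactly what the mapping step must handle.
    """
    trimmed = (raw.strip(".,;:()") for raw in text.split())
    return [tok for tok in trimmed if tok]


def similarity(entity, token):
    """Stand-in similarity metric in [0, 1] (assumption: difflib ratio)."""
    return SequenceMatcher(None, entity.lower(), token.lower()).ratio()


def map_entities(entities, tokens, threshold=0.6):
    """Map each entity to its most similar token, if similar enough."""
    mapping = {}
    for entity in entities:
        best = max(tokens, key=lambda tok: similarity(entity, tok), default=None)
        if best is not None and similarity(entity, best) >= threshold:
            mapping[entity] = best
    return mapping


# Hypothetical OCR output containing digit-for-letter errors.
ocr_text = "Contact J0hn Smith at 0neTrust LLC."
# Hypothetical output of an NER model (hard-coded for this sketch).
extracted_entities = ["John", "OneTrust"]

tokens = tokenize(ocr_text)
mapping = map_entities(extracted_entities, tokens)
print(mapping)
```

In this sketch each entity is resolved independently to its single best-scoring token. A fuller treatment would also need to handle multi-word entities and ties between candidate tokens, which the abstract does not detail.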