Publication (Announcement) No.: US20240289551A1
Publication (Announcement) Date: 2024-08-29
Application No.: US18240480
Application Date: 2023-08-31
Applicant: Oracle International Corporation
Inventors: Amit Agarwal, Srikant Panda, Deepak Karmakar, Kulbhushan Pachauri
IPC: G06F40/284
CPC classification number: G06F40/284
Abstract: In some implementations, techniques described herein may include identifying text in a visually rich document and determining a sequence for the identified text. The techniques may include selecting a language model based at least in part on the identified text and the determined sequence. Moreover, the techniques may include assigning each word of the identified text to a respective token to generate textual features corresponding to the identified text. The techniques may include extracting visual features corresponding to the identified text. The techniques may include determining positional features for each word of the identified text. The techniques may include generating a graph representing the visually rich document, each node in the graph representing each of the visual features, textual features, and positional features of a respective word of the identified text. The techniques may include training a classifier on the graph to classify each respective word of the identified text.
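The graph construction outlined in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation. Everything named here is a hypothetical choice made for illustration: the `Word` type, the `reading_order` and `build_graph` helpers, the top-to-bottom/left-to-right ordering, the toy vocabulary used as token IDs, relative box height as a stand-in "visual" feature, and k-nearest-neighbor edges. The abstract does not say how nodes are connected, and language-model selection and classifier training are omitted.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Word:
    text: str
    box: tuple  # (x0, y0, x1, y1) in pixels; hypothetical OCR output


def reading_order(words):
    # Assumed sequence: top-to-bottom, then left-to-right.
    return sorted(words, key=lambda w: (w.box[1], w.box[0]))


def build_graph(words, page_w, page_h, k=2):
    """Return (nodes, edges): one node per word, edges to k nearest words."""
    vocab = {}
    nodes = []
    for w in reading_order(words):
        # Textual feature: a token ID from a toy vocabulary (assumption).
        token_id = vocab.setdefault(w.text.lower(), len(vocab))
        x0, y0, x1, y1 = w.box
        # Positional feature: bounding box normalized by page size.
        positional = (x0 / page_w, y0 / page_h, x1 / page_w, y1 / page_h)
        # Visual feature: placeholder only (relative box height).
        visual = ((y1 - y0) / page_h,)
        nodes.append({
            "text": w.text,
            "textual": token_id,
            "visual": visual,
            "positional": positional,
        })

    # Box centers in normalized coordinates.
    centers = [
        ((n["positional"][0] + n["positional"][2]) / 2,
         (n["positional"][1] + n["positional"][3]) / 2)
        for n in nodes
    ]
    edges = set()
    for i, (cx, cy) in enumerate(centers):
        dists = sorted(
            (hypot(cx - ox, cy - oy), j)
            for j, (ox, oy) in enumerate(centers)
            if j != i
        )
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))  # undirected edge
    return nodes, sorted(edges)
```

A node classifier (for example, a graph neural network) would then be trained on the node features and edges to label each word. That step is left out because the abstract does not specify the model.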