-
1.
Publication No.: US20180373952A1
Publication Date: 2018-12-27
Application No.: US15630779
Filing Date: 2017-06-22
Applicant: ADOBE SYSTEMS INCORPORATED
Inventor: Trung Huu Bui, Hung Hai Bui, Shawn Alan Gaither, Walter Wei-Tuh Chang, Michael Frank Kraley, Pranjal Daga
Abstract: The present invention is directed towards providing automated workflows for identifying a reading order from text segments extracted from a document. Ordering the text segments is based on trained natural language models. In some embodiments, the workflows are enabled to perform a method for identifying a sequence associated with a portable document. The method includes iteratively generating a probabilistic language model, receiving the portable document, and selectively extracting features (such as, but not limited to, text segments) from the document. The method may generate pairs of features (feature pairs) from the extracted features. The method may further generate a score for each of the pairs based on the probabilistic language model and determine an order of the features based on the scores. The method may provide the extracted features in the determined order.
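The pair-scoring workflow described in this abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: `toy_score` and `PLAUSIBLE_JOINS` are hypothetical stand-ins for the trained probabilistic language model, and both the greedy chaining and the rule for choosing the starting segment are assumptions made for illustration only.

```python
from itertools import permutations
from typing import Callable, List

# Hypothetical stand-in for the trained language model: word pairs that
# are treated as plausible joins across a segment boundary.
PLAUSIBLE_JOINS = {("brown", "fox"), ("jumps", "over")}


def toy_score(first: str, second: str) -> float:
    """Return 1.0 if the last word of `first` plausibly precedes the first word of `second`."""
    junction = (first.split()[-1].lower(), second.split()[0].lower())
    return 1.0 if junction in PLAUSIBLE_JOINS else 0.0


def order_segments(segments: List[str], score: Callable[[str, str], float]) -> List[str]:
    """Score every ordered pair of segments, then chain them greedily."""
    n = len(segments)
    if n == 0:
        return []
    # Score each ordered pair (i, j), read as "segment j directly follows segment i".
    pair_scores = {
        (i, j): score(segments[i], segments[j])
        for i, j in permutations(range(n), 2)
    }

    # Assumed heuristic: start with the segment that is the least
    # plausible continuation of any other segment.
    def best_incoming(j: int) -> float:
        return max((pair_scores[(i, j)] for i in range(n) if i != j), default=0.0)

    current = min(range(n), key=best_incoming)
    order = [current]
    remaining = set(range(n)) - {current}
    while remaining:
        # Greedily append the unplaced segment with the highest pair score.
        nxt = max(sorted(remaining), key=lambda j: pair_scores[(current, j)])
        order.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return [segments[i] for i in order]


segments = ["over the lazy dog.", "The quick brown", "fox jumps"]
print(order_segments(segments, toy_score))
```

Because the scorer is passed in as a callable, the toy lookup could be replaced by any model that returns a likelihood for a pair of segments. Greedy chaining is the simplest way to turn pair scores into an order; it can fail when pair scores are ambiguous, and the abstract does not state which ordering procedure is actually used.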
-
Publication No.: US20180267956A1
Publication Date: 2018-09-20
Application No.: US15462684
Filing Date: 2017-03-17
Applicant: Adobe Systems Incorporated
Inventor: Walter Chang, Trung Bui, Pranjal Daga, Michael Kraley, Hung Bui
IPC: G06F17/27
CPC classification number: G06F17/2775, G06F17/217, G06F17/218, G06F17/2229, G06F17/2264, G06F17/2715, G06F17/277, G06K9/00469, G06N3/0445, G06N3/084, G06N7/005
Abstract: A computer-implemented method and system identifies the correct structured reading-order sequence of text segments that are extracted from a file structured in a portable document format. A probabilistic language model is generated from a large text corpus to comprise observed word sequence patterns for a given language. The language model measures whether splicing together a first text segment with another continuation text segment results in a phrase that is more likely than a phrase resulting from splicing together the first text segment with other continuation text segments. Sets of text segments are provided to the probabilistic model, where the sets of text segments comprise a first set including the first text segment and a first continuation text segment, and a second set including the first text segment and a second continuation text segment. A score is obtained for each set of text segments. The score is indicative of a likelihood of the set providing a correct structured reading-order sequence. The probabilistic language model may be generated in accordance with a Recurrent Neural Network or an n-gram model.
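The splice comparison described in this abstract can be sketched with a small n-gram model in Python. The sketch rests on stated assumptions and is not the patented method: it uses a bigram model with add-one smoothing, a two-sentence toy corpus in place of a large text corpus, and it scores only the two words that meet at the splice point. The names `train_bigram_model` and `splice_score` are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Tuple

Model = Tuple[Counter, Counter, int]


def train_bigram_model(corpus: Iterable[str]) -> Model:
    """Count bigrams and their left-hand context words in a whitespace-tokenized corpus."""
    contexts: Counter = Counter()
    bigrams: Counter = Counter()
    vocab = set()
    for sentence in corpus:
        tokens = sentence.lower().split()
        vocab.update(tokens)
        contexts.update(tokens[:-1])             # words that start a bigram
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return contexts, bigrams, len(vocab)


def splice_score(first: str, continuation: str, model: Model) -> float:
    """Log-probability of the word pair formed where the two segments join."""
    contexts, bigrams, vocab_size = model
    w1 = first.lower().split()[-1]
    w2 = continuation.lower().split()[0]
    # Add-one (Laplace) smoothing keeps unseen pairs from scoring log(0).
    prob = (bigrams[(w1, w2)] + 1) / (contexts[w1] + vocab_size)
    return math.log(prob)


# Toy corpus standing in for the large text corpus named in the abstract.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "a brown fox ran across the road",
]
model = train_bigram_model(corpus)

first_segment = "The quick brown"
candidates = ["the lazy dog", "fox jumps over"]
best = max(candidates, key=lambda c: splice_score(first_segment, c, model))
print(best)
```

Scoring only the junction keeps continuations of different lengths comparable. A fuller model could score a window of several words on each side of the splice, or use a recurrent neural network as the abstract also allows.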
-