IDENTIFICATION OF READING ORDER TEXT SEGMENTS WITH A PROBABILISTIC LANGUAGE MODEL

    公开(公告)号:US20180267956A1

    公开(公告)日:2018-09-20

    申请号:US15462684

    申请日:2017-03-17

    Abstract: A computer implemented method and system identifies correct structured reading-order sequence of text segments that are extracted from a file structured in a portable document format. A probabilistic language model is generated from a large text corpus to comprise observed word sequence patterns for a given language. The language model measures whether splicing together a first text segment with another continuation text segment results in a phrase that is more likely than a phrase resulting from splicing together the first text segment with other continuation text segments. Sets of text segments are provided to the probabilistic model, where the sets of text segments comprise a first set including the first text segment and a first continuation text segment. A second set includes the first text segment and a second continuation text segment. A score is obtained for each set of text segments. The score is indicative of a likelihood of the set providing a correct structured reading-order sequence. The probabilistic language model may be generated in accordance with a Recurrent Neural Network or an n-gram model.

Patent Agency Ranking