Detecting the bounds of borderless tables in fixed-format structured documents using machine learning
Abstract:
Techniques are disclosed for detecting the bounds of borderless open tables in fixed-format structured documents, such as PDF documents, and grouping text lines into predicted borderless tables. The target document comprises a set of text lines each having a respective vertical and horizontal position in the target document. A sorted list of the text lines is generated based upon a vertical and horizontal position of each text line in the target document. For each text line in the sorted list, a respective probability that the text line in the sorted list belongs to a borderless table is then determined. According to one embodiment, the probability may be determined using a classifier that may employ a logistic regression algorithm.
Information query
Patent Agency Ranking
0/0