Publication No.: US20210141781A1
Publication Date: 2021-05-13
Application No.: US16680302
Filing Date: 2019-11-11
Applicant: salesforce.com, inc.
Inventor: Ankit CHADHA, Zeyuan CHEN, Caiming XIONG, Ran XU, Richard SOCHER
Abstract: Embodiments described herein provide unsupervised density-based clustering to infer table structure from a document. Specifically, a number of words are identified from a block of text within a rectangular region of a non-editable document, and the spatial coordinates of each word relative to the rectangular region are identified. Based on the word density of the rectangular region, the words are grouped into clusters using a heuristic radius search method. Words grouped into the same cluster are determined to be elements of the same cell; in this way, the cells of the table structure are identified. The identified cells can then be expanded horizontally or grouped vertically to identify the rows or columns of the table structure.
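
The pipeline in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the abstract does not give the radius heuristic or the row-grouping rule, so the density-based radius formula, the `scale` and `tolerance` parameters, and the names `Word`, `search_radius`, `cluster_words`, and `group_rows` are all assumptions made for illustration. A simple union-find over word centers stands in for the heuristic radius search.

```python
from dataclasses import dataclass
from math import hypot, sqrt


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # center x, relative to the region's top-left corner
    y: float  # center y, relative to the region's top-left corner


def search_radius(words, width, height, scale=0.6):
    """Assumed heuristic: radius proportional to the mean word spacing
    implied by the word density (words per unit area) of the region."""
    density = len(words) / (width * height)
    return scale / sqrt(density)


def cluster_words(words, radius):
    """Group words whose centers are linked by gaps of at most `radius`.
    Each resulting cluster is treated as one candidate table cell."""
    parent = list(range(len(words)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            gap = hypot(words[i].x - words[j].x, words[i].y - words[j].y)
            if gap <= radius:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, word in enumerate(words):
        clusters.setdefault(find(i), []).append(word)
    return list(clusters.values())


def group_rows(cells, tolerance):
    """Assumed rule: cells whose mean y lies within `tolerance` of the
    first cell in the current row are placed in that row."""
    def mean_y(cell):
        return sum(w.y for w in cell) / len(cell)

    rows = []
    for cell in sorted(cells, key=mean_y):
        if rows and abs(mean_y(cell) - rows[-1][0]) <= tolerance:
            rows[-1][1].append(cell)
        else:
            rows.append((mean_y(cell), [cell]))
    return [row_cells for _, row_cells in rows]


# Toy input: a 300 x 100 region holding a 2 x 2 table (made-up coordinates).
words = [
    Word("Unit", 20, 20), Word("price", 50, 20), Word("Qty", 220, 20),
    Word("4.50", 30, 70), Word("12", 220, 70),
]
radius = search_radius(words, width=300, height=100)
cells = cluster_words(words, radius)
rows = group_rows(cells, tolerance=10)

for row in rows:
    print([" ".join(w.text for w in cell) for cell in row])
```

In this toy input, "Unit" and "price" sit close enough to fall within one radius and merge into a single cell, while the remaining words stay in cells of their own. Columns could be recovered the same way by grouping cells on their mean x coordinate instead of mean y.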