Method for learning to infer the topical content of documents based upon
their lexical content
    1.
    Type: Granted invention patent
    Status: Expired

    Publication No.: US5687364A

    Publication date: 1997-11-11

    Application No.: US308037

    Filing date: 1994-09-16

    IPC classification: G06F17/30

    CPC classification: G06F17/3071 Y10S707/99936

    Abstract: An unsupervised method of learning the relationships between words and unspecified topics in documents using a computer is described. The computer represents the relationships between words and unspecified topics via word clusters and association strength values, which can be used later during topical characterization of documents. The computer learns the relationships between words and unspecified topics in an iterative fashion from a set of learning documents. The computer preprocesses the training documents by generating an observed feature vector for each document of the set of training documents and by setting association strengths to initial values. The computer then determines how well the current association strength values predict the topical content of all of the learning documents by generating a cost for each document and summing the individual costs together to generate a total cost. If the total cost is excessive, the association strength values are modified and the total cost is recalculated. The computer continues calculating total cost and modifying association strength values until a set of association strength values is discovered that adequately predicts the topical content of the entire set of learning documents.

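
The loop described in this abstract (score every document, sum the costs, adjust the association strengths, repeat) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the cost function or the update rule, so the sketch swaps in a k-means-style scheme in which each document is scored against its single best-matching topic with a squared-error cost. The function names, the toy lexicon, and the initial values are all hypothetical.

```python
def observed_vector(words, lexicon):
    """Binary feature vector: 1.0 where a lexicon word occurs in the document."""
    present = set(words)
    return [1.0 if w in present else 0.0 for w in lexicon]


def document_cost(observed, strengths):
    """Return (cost, topic) for the topic row that best predicts the document.

    Assumption: squared error against one best-matching topic.
    """
    costs = [sum((o - s) ** 2 for o, s in zip(observed, row)) for row in strengths]
    topic = costs.index(min(costs))
    return costs[topic], topic


def total_cost(vectors, strengths):
    """Sum of the individual document costs."""
    return sum(document_cost(v, strengths)[0] for v in vectors)


def learn(vectors, initial_strengths, max_total_cost, max_iters=100):
    """Modify association strengths until the total cost is acceptable
    or stops improving."""
    strengths = [row[:] for row in initial_strengths]
    previous = float("inf")
    for _ in range(max_iters):
        scored = [document_cost(v, strengths) for v in vectors]
        total = sum(cost for cost, _ in scored)
        if total <= max_total_cost or total >= previous:
            break
        previous = total
        # Assumed update rule: move each topic's strengths to the mean
        # feature vector of the documents it currently explains best.
        for k in range(len(strengths)):
            members = [v for v, (_, topic) in zip(vectors, scored) if topic == k]
            if members:
                strengths[k] = [sum(col) / len(members) for col in zip(*members)]
    return strengths, total_cost(vectors, strengths)


# Hypothetical toy data: three sports documents, two politics documents.
lexicon = ["ball", "goal", "team", "vote", "law", "senate"]
docs = [["ball", "goal"], ["goal", "team"], ["ball", "team", "goal"],
        ["vote", "law"], ["law", "senate", "vote"]]
vectors = [observed_vector(d, lexicon) for d in docs]
initial = [[0.6, 0.6, 0.6, 0.4, 0.4, 0.4],
           [0.4, 0.4, 0.4, 0.6, 0.6, 0.6]]
strengths, final_cost = learn(vectors, initial, max_total_cost=0.5)
```

In this toy run the total cost falls from its initial value and the loop stops once a further update no longer lowers it, which stands in for the abstract's "adequately predict" criterion.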

    Method and apparatus for inferring the topical content of a document
based upon its lexical content without supervision
    2.
    Type: Granted invention patent
    Status: Expired

    Publication No.: US5659766A

    Publication date: 1997-08-19

    Application No.: US307221

    Filing date: 1994-09-16

    IPC classification: G06F17/30 G06F17/28

    CPC classification: G06F17/3071

    Abstract: An iterative method of determining the topical content of a document using a computer. The processing unit of the computer determines the topical content of documents presented to it in machine readable form using information stored in computer memory. That information includes word-clusters, a lexicon, and association strength values. The processing unit begins by generating an observed feature vector for the document being characterized, which indicates which of the words of the lexicon appear in the document. Afterward, the processing unit makes an initial prediction of the topical content of the document in the form of a topic belief vector. The processing unit uses the topic belief vector and the association strength values to predict which words of the lexicon should appear in the document. This prediction is represented via a predicted feature vector. The predicted feature vector is then compared to the observed feature vector to measure how well the topic belief vector models the topical content of the document. If the topic belief vector adequately models the topical content of the document, then the processing unit's task is complete. On the other hand, if the topic belief vector does not adequately model the topical content of the document, then the processing unit determines how the topic belief vector should be modified to improve the modeling of the topical content.

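
The predict, compare, and adjust cycle in this abstract can be sketched as below. The abstract does not say how the predicted feature vector is computed or how the topic belief vector is modified, so this sketch makes its own assumptions: a noisy-OR combination of topic beliefs and association strengths for the prediction, a squared-error comparison, and a clipped gradient step for the modification. All names, constants, and the toy strengths are hypothetical.

```python
def predict(beliefs, strengths):
    """Predicted feature vector (assumed noisy-OR): a word is expected
    unless every topic fails to produce it."""
    predicted = []
    for j in range(len(strengths[0])):
        miss = 1.0
        for belief, row in zip(beliefs, strengths):
            miss *= 1.0 - belief * row[j]
        predicted.append(1.0 - miss)
    return predicted


def mismatch(predicted, observed):
    """Squared-error comparison of predicted and observed feature vectors."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def infer_topics(observed, strengths, tolerance=0.05, rate=0.1, max_iters=500):
    """Refine a topic belief vector until it models the document adequately."""
    beliefs = [0.5] * len(strengths)  # initial prediction of topical content
    for _ in range(max_iters):
        predicted = predict(beliefs, strengths)
        if mismatch(predicted, observed) <= tolerance:
            break  # the belief vector is adequate
        gradients = []
        for k, row in enumerate(strengths):
            grad = 0.0
            for j, (p, o) in enumerate(zip(predicted, observed)):
                # Probability that all topics other than k miss word j.
                others = 1.0
                for l, other_row in enumerate(strengths):
                    if l != k:
                        others *= 1.0 - beliefs[l] * other_row[j]
                grad += 2.0 * (p - o) * row[j] * others
            gradients.append(grad)
        # Gradient step, clipped so each belief stays in [0, 1].
        beliefs = [min(1.0, max(0.0, b - rate * g))
                   for b, g in zip(beliefs, gradients)]
    return beliefs


# Hypothetical association strengths for two topics over a six-word lexicon.
strengths = [[0.9, 0.9, 0.9, 0.0, 0.0, 0.0],
             [0.0, 0.0, 0.0, 0.9, 0.9, 0.9]]
observed = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]  # only first-topic words appear
beliefs = infer_topics(observed, strengths)
```

With these toy values the belief in the first topic rises toward 1 and the belief in the second falls toward 0 before the mismatch drops under the tolerance.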

    System and method for forms recognition by synthesizing corrected localization of data fields
    3.
    Type: Granted invention patent
    Status: In force

    Publication No.: US09536141B2

    Publication date: 2017-01-03

    Application No.: US13537729

    Filing date: 2012-06-29

    Applicant: Eric Saund

    Inventor: Eric Saund

    IPC classification: G06F17/00 G06K9/00 G06F17/24

    Abstract: A method and system generate an idealized image of a form. An image of a form and a template model of the form are received. The form includes data fields. Word boxes of the image are identified. The word boxes are assigned to corresponding data fields of the form. An idealized image of the form is generated based on the assignments and the template model.


    Method for generating a graph lattice from a corpus of one or more data graphs

    Publication No.: US08872828B2

    Publication date: 2014-10-28

    Application No.: US12883464

    Filing date: 2010-09-16

    Applicant: Eric Saund

    Inventor: Eric Saund

    IPC classification: G06T17/20 G06T11/20

    CPC classification: G06T11/206

    Abstract: A document recognition system and method, where images are represented as a collection of primitive features whose spatial relations are represented as a graph. Useful subsets of all the possible subgraphs representing different portions of images are represented over a corpus of many images. The data structure is a lattice of subgraphs, and algorithms are provided to build and use the graph lattice efficiently and effectively.

    System and method for forms classification by line-art alignment
    6.
    Type: Granted invention patent
    Status: In force

    Publication No.: US08792715B2

    Publication date: 2014-07-29

    Application No.: US13539941

    Filing date: 2012-07-02

    IPC classification: G06K9/00

    CPC classification: G06K9/00449

    Abstract: A system and method to classify forms. An image representing a form of an unknown document type is received. The image includes line-art. Further, a plurality of template models corresponding to a plurality of different document types is received. The plurality of different document types is intended to include the correct document type of the unknown document. A subset of the plurality of template models is selected as candidate template models. The candidate template models include line-art junctions best matching line-art junctions of the received image. One of the candidate template models is selected as a best candidate template model. The best candidate template model includes horizontal and vertical lines best matching horizontal and vertical lines of the received image, respectively, aligned to the best candidate template model.


    System and method for localizing data fields on structured and semi-structured forms
    7.
    Type: Granted invention patent
    Status: In force

    Publication No.: US08781229B2

    Publication date: 2014-07-15

    Application No.: US13537630

    Filing date: 2012-06-29

    Applicant: Eric Saund

    Inventor: Eric Saund

    IPC classification: G06K9/34

    Abstract: A method and system to localize data fields of a form. An image of a form is received, where the form includes data fields. Word boxes of the image are identified. The word boxes are grouped into candidate zones, where each of the candidate zones includes one or more of the word boxes. Hypotheses are formed from the data fields and the candidate zones, where each hypothesis assigns one of the candidate zones to one of the data fields or a null data field. A constrained optimization search of the hypotheses is performed for an optimal set of hypotheses. The optimal set of hypotheses assigns word box groups to corresponding data fields.

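
The hypothesis search in this abstract can be illustrated with a small brute-force sketch. The abstract names neither the scoring function, the constraints, nor the search strategy, so everything here is an assumption: each candidate zone is assigned to one data field or to a null field (`None`), the only constraint is that no data field receives more than one zone, and the optimal set of hypotheses is the one with the highest summed score. The zone names, field names, and scores are invented for the example, and a real search over many zones would need something cheaper than exhaustive enumeration.

```python
from itertools import product


def best_assignment(zones, fields, score, null_score=0.0):
    """Exhaustively search zone -> field (or None) assignments.

    Assumed constraint: no data field may receive more than one zone.
    Returns the highest-scoring assignment and its total score.
    """
    options = list(fields) + [None]  # None stands for the null data field
    best, best_total = None, float("-inf")
    for choice in product(options, repeat=len(zones)):
        used = [f for f in choice if f is not None]
        if len(used) != len(set(used)):
            continue  # violates the one-zone-per-field constraint
        total = sum(null_score if f is None else score(z, f)
                    for z, f in zip(zones, choice))
        if total > best_total:
            best, best_total = dict(zip(zones, choice)), total
    return best, best_total


# Hypothetical compatibility scores between candidate zones and data fields.
scores = {
    ("zone_a", "name"): 0.9, ("zone_a", "date"): 0.2,
    ("zone_b", "name"): 0.3, ("zone_b", "date"): 0.8,
    ("zone_c", "name"): 0.4, ("zone_c", "date"): 0.1,
}
assignment, total = best_assignment(
    zones=["zone_a", "zone_b", "zone_c"],
    fields=["name", "date"],
    score=lambda zone, field: scores[(zone, field)],
    null_score=0.3,
)
```

Here `zone_c` matches neither field well, so the best set of hypotheses sends it to the null field while `zone_a` and `zone_b` take `name` and `date`.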

    Graph lattice method for image clustering, classification, and repeated structure finding
    8.
    Type: Granted invention patent
    Status: In force

    Publication No.: US08724911B2

    Publication date: 2014-05-13

    Application No.: US12883503

    Filing date: 2010-09-16

    Applicant: Eric Saund

    Inventor: Eric Saund

    CPC classification: G06K9/6892 G06K9/00449

    Abstract: A document recognition system and method, where images are represented as a collection of primitive features whose spatial relations are represented as a graph. Useful subsets of all the possible subgraphs representing different portions of images are represented over a corpus of many images. The data structure is a lattice of subgraphs, and algorithms are provided to build and use the graph lattice efficiently and effectively.


    SELECTIVE LEARNING FOR GROWING A GRAPH LATTICE
    9.
    Type: Invention patent application
    Status: In force

    Publication No.: US20130335422A1

    Publication date: 2013-12-19

    Application No.: US13527071

    Filing date: 2012-06-19

    Applicant: Eric Saund

    Inventor: Eric Saund

    IPC classification: G06T11/20

    CPC classification: G06T11/206 G06K9/00

    Abstract: A system and method generate a graph lattice from exemplary images. At least one processor receives exemplary data graphs of the exemplary images and generates graph lattice nodes of size one from primitives. Until a termination condition is met, the at least one processor repeatedly: 1) generates candidate graph lattice nodes from accepted graph lattice nodes; 2) selects one or more candidate graph lattice nodes preferentially discriminating exemplary data graphs which are less discriminable than other exemplary data graphs using the accepted graph lattice nodes; and 3) promotes the selected graph lattice nodes to accepted status. The graph lattice is formed from the accepted graph lattice nodes and relations between the accepted graph lattice nodes.


    METHOD FOR GENERATING A GRAPH LATTICE FROM A CORPUS OF ONE OR MORE DATA GRAPHS
    10.
    Type: Invention patent application
    Status: In force

    Publication No.: US20120069024A1

    Publication date: 2012-03-22

    Application No.: US12883464

    Filing date: 2010-09-16

    Applicant: Eric Saund

    Inventor: Eric Saund

    IPC classification: G06T11/20

    CPC classification: G06T11/206

    Abstract: A document recognition system and method, where images are represented as a collection of primitive features whose spatial relations are represented as a graph. Useful subsets of all the possible subgraphs representing different portions of images are represented over a corpus of many images. The data structure is a lattice of subgraphs, and algorithms are provided to build and use the graph lattice efficiently and effectively.
