TECHNIQUES FOR INDUCING HIGH QUALITY STRUCTURAL TEMPLATES FOR ELECTRONIC DOCUMENTS
    Document type: Invention patent application
    Legal status: Granted (in force)

    Publication No.: US20080072140A1

    Publication date: 2008-03-20

    Application No.: US11945749

    Filing date: 2007-11-27

    IPC classification: G06F15/00

    Abstract: Techniques are disclosed herein to automatically learn a template that describes a common structure present in documents in a training set. The structure of the template is compared to the structure of the documents (or at least a part of each document) in the training set, one-by-one, and generalized in response to differences between the template and the document to which the template is currently being compared. If the structure of any particular document is considered too dissimilar from the structure of the template, then the template is not modified. Various generalization operators are added to the template to generalize the template. One such generalization operator is an “OR”, which indicates that only one of “n” sub-trees below the “OR” operator in the template is allowed at the corresponding position in a document.
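
The loop described in the abstract can be illustrated with a toy sketch. This is not the patented method: the tree encoding, the names `node`, `matches`, `distance`, `generalize`, and `induce`, the node-count dissimilarity measure, and the `max_distance_ratio` threshold are all assumptions made here for illustration. The sketch only shows the general shape of the idea: compare the template to each document in turn, skip documents that are too dissimilar, and otherwise insert an "OR" operator where the structures differ.

```python
# Toy sketch only; every name and the distance measure are hypothetical.
OR = "OR"  # assumed label for the "OR" generalization operator


def node(tag, *children):
    """A tree node is a (tag, children) tuple."""
    return (tag, tuple(children))


def size(tree):
    """Number of nodes in a tree."""
    return 1 + sum(size(child) for child in tree[1])


def matches(template, doc):
    """True if the document subtree conforms to the template subtree."""
    tag, kids = template
    if tag == OR:
        # The document subtree must be one of the alternatives under OR.
        return any(matches(alt, doc) for alt in kids)
    dtag, dkids = doc
    return (tag == dtag and len(kids) == len(dkids)
            and all(matches(t, d) for t, d in zip(kids, dkids)))


def distance(template, doc):
    """Assumed measure: document nodes lying in subtrees the template rejects."""
    if matches(template, doc):
        return 0
    tag, kids = template
    dtag, dkids = doc
    if tag == OR or tag != dtag or len(kids) != len(dkids):
        return size(doc)
    return sum(distance(t, d) for t, d in zip(kids, dkids))


def generalize(template, doc):
    """Return a template that also accepts doc, adding OR where they differ."""
    if matches(template, doc):
        return template
    tag, kids = template
    dtag, dkids = doc
    if tag == OR:
        return (OR, kids + (doc,))      # add one more alternative
    if tag != dtag or len(kids) != len(dkids):
        return (OR, (template, doc))    # introduce a new OR operator
    return (tag, tuple(generalize(t, d) for t, d in zip(kids, dkids)))


def induce(docs, max_distance_ratio=0.5):
    """Compare the template to the documents one by one, generalizing as needed."""
    template = docs[0]
    for doc in docs[1:]:
        if distance(template, doc) / size(doc) > max_distance_ratio:
            continue  # too dissimilar: the template is not modified
        template = generalize(template, doc)
    return template


d1 = node("html", node("body", node("h1"), node("p")))
d2 = node("html", node("body", node("h1"), node("table")))
d3 = node("rss", node("channel"))  # structurally unrelated document

template = induce([d1, d2, d3])
print(template)
```

In this example the second document differs from the first only in its last sub-tree, so the template gains an "OR" over `p` and `table` at that position. The third document shares nothing with the template, exceeds the assumed threshold, and leaves the template unchanged.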
