SITE-SPECIFIC INFORMATION-TYPE DETECTION METHODS AND SYSTEMS
    1.
    Invention application
    Status: Under examination (published)

    Publication No.: US20090248707A1

    Publication date: 2009-10-01

    Application No.: US12055222

    Filing date: 2008-03-25

    IPC classification: G06F17/30

    Abstract: Methods and systems are provided herein that may allow for pertinent information-type(s) of data to be located or otherwise identified within one or more documents, such as, for example, web page documents associated with one or more websites. For example, exemplary methods and systems are provided that may be used to determine if information may be more likely to be of an “informative” type of information or possibly more likely to be of a “noise” type of information.

    METHOD AND SYSTEM FOR WEB INFORMATION EXTRACTION
    2.
    Invention application
    Status: In force

    Publication No.: US20120084636A1

    Publication date: 2012-04-05

    Application No.: US12896942

    Filing date: 2010-10-04

    IPC classification: G06F17/00

    Abstract: An example of a method includes determining features of a first type for a web page of a plurality of web pages. The method also includes electronically determining a plurality of rules for an attribute of the first web page, wherein the plurality of rules are determined based on features of the first type. The method also includes electronically identifying a first rule, from the plurality of rules, which satisfies a first predefined criterion. The first predefined criterion includes at least one of a first threshold for a precision parameter, a second threshold for a support parameter, a third threshold for a distance parameter and a fourth threshold for a recall parameter. The method further includes storing the first rule to enable extraction of a value of the attribute from a second web page.
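
The rule-selection step in this abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Rule` fields, the function names, the sample values, and the direction of each comparison (higher precision, support, and recall are better; lower distance is better) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Rule:
    """A candidate extraction rule with measured statistics (hypothetical fields)."""
    xpath: str
    precision: float
    support: int
    distance: int
    recall: float


def satisfies(rule: Rule,
              min_precision: Optional[float] = None,
              min_support: Optional[int] = None,
              max_distance: Optional[int] = None,
              min_recall: Optional[float] = None) -> bool:
    """Check a rule against whichever thresholds were supplied.

    The abstract requires "at least one of" the four thresholds, so a call
    with no thresholds at all is treated as not satisfied.
    """
    checks = []
    if min_precision is not None:
        checks.append(rule.precision >= min_precision)
    if min_support is not None:
        checks.append(rule.support >= min_support)
    if max_distance is not None:
        # Assumption: a smaller distance is better.
        checks.append(rule.distance <= max_distance)
    if min_recall is not None:
        checks.append(rule.recall >= min_recall)
    return bool(checks) and all(checks)


def select_rule(rules: Iterable[Rule], **thresholds) -> Optional[Rule]:
    """Return the first rule meeting the criterion, or None if none does."""
    return next((r for r in rules if satisfies(r, **thresholds)), None)


# Made-up candidate rules for a "price" attribute.
candidates = [
    Rule("//div[1]/span", precision=0.60, support=10, distance=2, recall=0.90),
    Rule("//span[@class='price']", precision=0.95, support=8, distance=1, recall=0.80),
]
best = select_rule(candidates, min_precision=0.9, min_support=5)
print(best.xpath if best else "no rule stored")
```

The selected rule would then be stored and applied to a second web page; storage and extraction are outside the scope of this sketch.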

    HIGH PRECISION MULTI ENTITY EXTRACTION
    3.
    Invention application
    Status: Under examination (published)

    Publication No.: US20100185684A1

    Publication date: 2010-07-22

    Application No.: US12351676

    Filing date: 2009-01-09

    IPC classification: G06F17/30

    CPC classification: G06F16/986

    Abstract: Techniques for high precision multi entity extraction are provided. A wrapper that represents a generalized structure of a set of training web pages is accessed. The wrapper includes one or more annotations that indicate a set of attributes that are included in each of a plurality of records. Record boundaries are determined based on nodes included in the wrapper, where the record boundaries delimit the plurality of records within any training page of the set of training web pages. The wrapper is modified to include one or more boundary nodes, where the one or more boundary nodes indicate the record boundaries of the plurality of records within the set of training web pages. Multiple records are extracted from a web page, where extracting the multiple records comprises detecting record completions based at least on the wrapper and on a document object model (DOM) representation of the web page.

    ROBUST XPATHS FOR WEB INFORMATION EXTRACTION
    6.
    Invention application
    Status: Under examination (published)

    Publication No.: US20110040770A1

    Publication date: 2011-02-17

    Application No.: US12540384

    Filing date: 2009-08-13

    IPC classification: G06F17/30

    CPC classification: G06F16/95

    Abstract: An example of a method includes generating an attributed extensible markup language path (XPath) for an annotated entity in a web page. The method further includes determining a first node that satisfies the attributed XPath in the web page and is annotated. The method also includes identifying an attribute property that satisfies predefined criteria in the web page while traversing from the first node to a root node, the attribute property comprising an attribute value and an attribute name. Moreover, the method includes populating the attributed XPath with the attribute property that satisfies predefined criteria. The method also includes filtering the attributed XPath to generate a robust XPath, and extracting content from multiple web pages based on the robust XPath.
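
The traversal from the annotated node up to the root can be sketched in Python with the standard library's `xml.etree.ElementTree`. This is only an illustration under stated assumptions: the "predefined criteria" are reduced to a hypothetical allowlist of attribute names (`STABLE_ATTRS`), the filtering step that produces the final robust XPath is omitted, and the sample pages are invented. Attribute values containing a single quote are not handled.

```python
import xml.etree.ElementTree as ET

# Assumed criterion: only these attribute names are considered stable
# enough to be written into the path.
STABLE_ATTRS = ("id", "class")


def attributed_xpath(root: ET.Element, target: ET.Element) -> str:
    """Build a path from root to target, annotating steps with attributes."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        step = node.tag
        for name in STABLE_ATTRS:
            value = node.get(name)
            if value is not None:
                # Populate the step with the attribute name and value.
                step += f"[@{name}='{value}']"
                break
        steps.append(step)
        node = parent_of[node]
    return "./" + "/".join(reversed(steps)) if steps else "."


# Annotated training page: the price span is the annotated entity.
page1 = ET.fromstring(
    "<html><body><div id='main'>"
    "<span class='price'>10</span><span class='note'>x</span>"
    "</div></body></html>"
)
annotated = page1.find(".//span[@class='price']")
xpath = attributed_xpath(page1, annotated)

# A second page with a different layout: extra paragraph, spans reordered.
page2 = ET.fromstring(
    "<html><body><p>ad</p><div id='main'>"
    "<span class='note'>y</span><span class='price'>12</span>"
    "</div></body></html>"
)
extracted = [e.text for e in page2.findall(xpath)]
print(xpath, extracted)
```

Because each step is anchored on an attribute rather than a child position, the same path still selects the price on the second page even though the surrounding elements have moved.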

    Techniques for inducing high quality structural templates for electronic documents
    7.
    Granted invention patent
    Status: In force

    Publication No.: US08046681B2

    Publication date: 2011-10-25

    Application No.: US11945749

    Filing date: 2007-11-27

    IPC classification: G06F17/00

    Abstract: Techniques are disclosed herein to automatically learn a template that describes a common structure present in documents in a training set. The structure of the template is compared to the structure of the documents (or at least a part of each document) in the training set, one-by-one, and generalized in response to differences between the template and the document to which the template is currently being compared. If the structure of any particular document is considered too dissimilar from the structure of the template, then the template is not modified. Various generalization operators are added to the template to generalize the template. One such generalization operator is an “OR”, which indicates that only one of “n” sub-trees below the “OR” operator in the template is allowed at the corresponding position in a document.
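
The "OR" generalization operator can be illustrated with a small Python sketch. The `Node` and `Or` classes and the `matches` and `generalize` functions are hypothetical names chosen for this example, not the patent's terminology, and the sketch omits both the other generalization operators and the dissimilarity check that leaves the template unmodified.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """A tag with ordered children; used for documents and templates."""
    tag: str
    children: List["Tree"] = field(default_factory=list)


@dataclass
class Or:
    """Generalization operator: one of the alternatives may appear here."""
    alternatives: List["Tree"]


Tree = Union[Node, Or]


def matches(template: Tree, doc: Node) -> bool:
    """True if the document sub-tree conforms to the template sub-tree."""
    if isinstance(template, Or):
        return any(matches(alt, doc) for alt in template.alternatives)
    return (template.tag == doc.tag
            and len(template.children) == len(doc.children)
            and all(matches(t, d)
                    for t, d in zip(template.children, doc.children)))


def generalize(template: Tree, doc: Node) -> Tree:
    """Return a template that accepts doc, adding OR nodes where they differ."""
    if matches(template, doc):
        return template
    if isinstance(template, Or):
        # Extend an existing OR with the new sub-tree as another alternative.
        return Or(template.alternatives + [doc])
    if template.tag == doc.tag and len(template.children) == len(doc.children):
        # Same shape at this level: push the difference down to the children.
        return Node(template.tag,
                    [generalize(t, d)
                     for t, d in zip(template.children, doc.children)])
    return Or([template, doc])


# Learn from two training documents, one at a time.
template = Node("div", [Node("h1"), Node("p")])
second_doc = Node("div", [Node("h1"), Node("ul")])
template = generalize(template, second_doc)
print(template)
```

After the second document is seen, the template's second child becomes an `Or` over `p` and `ul`, so either sub-tree is accepted at that position while any other tag is still rejected.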

    TECHNIQUES FOR INDUCING HIGH QUALITY STRUCTURAL TEMPLATES FOR ELECTRONIC DOCUMENTS
    9.
    Invention application
    Status: In force

    Publication No.: US20080072140A1

    Publication date: 2008-03-20

    Application No.: US11945749

    Filing date: 2007-11-27

    IPC classification: G06F15/00

    Abstract: Techniques are disclosed herein to automatically learn a template that describes a common structure present in documents in a training set. The structure of the template is compared to the structure of the documents (or at least a part of each document) in the training set, one-by-one, and generalized in response to differences between the template and the document to which the template is currently being compared. If the structure of any particular document is considered too dissimilar from the structure of the template, then the template is not modified. Various generalization operators are added to the template to generalize the template. One such generalization operator is an “OR”, which indicates that only one of “n” sub-trees below the “OR” operator in the template is allowed at the corresponding position in a document.