HIGH PRECISION WEB EXTRACTION USING SITE KNOWLEDGE
    1.
    Patent application
    HIGH PRECISION WEB EXTRACTION USING SITE KNOWLEDGE (under examination, published)

    Publication No.: US20100257440A1

    Publication date: 2010-10-07

    Application No.: US12416381

    Filing date: 2009-04-01

    IPC classification: G06F17/21 G06F17/00

    CPC classification: G06F16/986

    Abstract: Techniques for high precision web extraction using site knowledge are provided. Portions of repeating text are identified in unlabeled web pages from a particular web site. Based on the portions of repeating text, the unlabeled web pages are partitioned into a set of segments. Multiple labels are assigned to respectively corresponding multiple attributes in the set of segments, where assigning the multiple labels comprises applying a classification model to each separate segment in the set of segments. First one or more labels are identified that were erroneously assigned to one or more attributes in the set of segments. Second one or more correct labels for the one or more attributes are determined. The first one or more labels in the set of segments are corrected by assigning the second one or more labels to the one or more attributes.
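
    The pipeline in this abstract (find repeating text, segment, classify each segment, then correct labels) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the application: pages are modeled as lists of text lines, `toy_classifier` is a hypothetical stand-in for the classification model, and the correction step uses a simple majority vote across pages of the same site, which is only one possible way to use site knowledge.

```python
from collections import Counter

def repeating_lines(pages):
    """Return the lines that occur on every page (assumed to be template text)."""
    counts = Counter()
    for page in pages:
        counts.update(set(page))
    return {line for line, n in counts.items() if n == len(pages)}

def segment(page, repeated):
    """Split a page into segments, using repeating lines as boundaries."""
    segments, current = [], []
    for line in page:
        if line in repeated:
            if current:
                segments.append(current)
                current = []
        else:
            current.append(line)
    if current:
        segments.append(current)
    return segments

def toy_classifier(segment_lines):
    """Hypothetical stand-in for a trained classification model."""
    text = " ".join(segment_lines)
    if "$" in text:
        return "price"
    if any(ch.isdigit() for ch in text):
        return "phone"
    return "name"

def label_and_correct(pages, classify):
    """Label each segment independently, then overwrite each label with the
    majority label seen at the same segment position across the site
    (an assumed correction rule)."""
    repeated = repeating_lines(pages)
    labeled = [[classify(seg) for seg in segment(p, repeated)] for p in pages]
    width = max(len(labels) for labels in labeled)
    majority = []
    for i in range(width):
        votes = Counter(labels[i] for labels in labeled if i < len(labels))
        majority.append(votes.most_common(1)[0][0])
    corrected = [[majority[i] for i in range(len(labels))] for labels in labeled]
    return labeled, corrected

pages = [
    ["Name:", "Acme Diner", "Phone:", "555-0100", "Price:", "$12"],
    ["Name:", "Route 66 Cafe", "Phone:", "555-0101", "Price:", "$9"],
    ["Name:", "Blue Door", "Phone:", "555-0102", "Price:", "$15"],
]
labeled, corrected = label_and_correct(pages, toy_classifier)
print(labeled[1])
print(corrected[1])
```

    In this toy data the second page's name contains digits, so the per-segment classifier mislabels it as a phone number; the site-level vote restores the label used at that position on the other pages.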

    AUTOMATIC EXTRACTION USING MACHINE LEARNING BASED ROBUST STRUCTURAL EXTRACTORS
    2.
    Patent application
    AUTOMATIC EXTRACTION USING MACHINE LEARNING BASED ROBUST STRUCTURAL EXTRACTORS (under examination, published)

    Publication No.: US20100223214A1

    Publication date: 2010-09-02

    Application No.: US12395586

    Filing date: 2009-02-27

    IPC classification: G06F15/18

    CPC classification: G06F16/86

    Abstract: A method and apparatus for automatically extracting information from a large number of documents through applying machine learning techniques and exploiting structural similarities among documents. A machine learning model is trained to have at least 50% accuracy. The trained machine learning model is used to identify information attributes in a sample of pages from a cluster of structurally similar documents. A structure-specific model of the cluster is created by compiling a list of top-K locations for each attribute identified by the trained machine learning model in the sample. These top-K lists are used to extract information from the pages of the cluster from which the sample of pages was taken.
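
    The top-K idea in this abstract can be sketched roughly as follows. The representation is assumed for illustration only: a "location" is modeled as an XPath-like string, the model's output on each sample page is given as a list of (attribute, location) pairs, and a cluster page is a dictionary from location to text. The function names `build_top_k` and `extract` are hypothetical.

```python
from collections import Counter, defaultdict

def build_top_k(sample_predictions, k=2):
    """Compile, for each attribute, the k locations where the model most
    often found it in the sample pages."""
    counts = defaultdict(Counter)
    for page in sample_predictions:
        for attribute, location in page:
            counts[attribute][location] += 1
    return {
        attribute: [loc for loc, _ in counter.most_common(k)]
        for attribute, counter in counts.items()
    }

def extract(page, top_k):
    """Read each attribute from the first of its top-K locations that
    exists on the page; no model is applied at this stage."""
    result = {}
    for attribute, locations in top_k.items():
        for loc in locations:
            if loc in page:
                result[attribute] = page[loc]
                break
    return result

# Hypothetical model output on three sample pages of one cluster.
sample_predictions = [
    [("title", "/html/body/h1"), ("price", "/html/body/div[2]/span")],
    [("title", "/html/body/h1"), ("price", "/html/body/div[3]/span")],
    [("title", "/html/body/div[1]"), ("price", "/html/body/div[2]/span")],
]
top_k = build_top_k(sample_predictions, k=2)

# A page from the same cluster that was not in the sample.
new_page = {"/html/body/h1": "Widget", "/html/body/div[3]/span": "$5"}
print(extract(new_page, top_k))
```

    Because the per-attribute lists are built from aggregate counts, an imperfect model can still yield usable locations as long as its correct predictions concentrate on a few positions, which is consistent with the modest accuracy threshold the abstract mentions.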

    BOOSTING EXTRACTION ACCURACY BY HANDLING TRAINING DATA BIAS
    3.
    Patent application
    BOOSTING EXTRACTION ACCURACY BY HANDLING TRAINING DATA BIAS (under examination, published)

    Publication No.: US20090216739A1

    Publication date: 2009-08-27

    Application No.: US12036079

    Filing date: 2008-02-22

    IPC classification: G06F7/10 G06F17/30

    CPC classification: G06F16/313

    Abstract: Methods and apparatus are described for use with information extraction techniques based on sequential models. Additional statistics are maintained during inference and employed to boost the accuracy of the extraction algorithm and mitigate the effects of training bias.
