Training set construction for taxonomic classification
    1.
    Granted invention patent
    Status: in force

    Publication No.: US08122005B1

    Publication date: 2012-02-21

    Application No.: US12604025

    Application date: 2009-10-22

    IPC classification: G06F17/30

    CPC classification: G06F17/30707

    Abstract: A training set generator may be configured to input a taxonomy including a hierarchy of categories and a plurality of top-level sites, and to output a training set of categorized data. The training set generator may include a crawler configured to crawl each of the top-level sites to determine at least one lower-level site associated therewith and to store the top-level sites and associated lower-level sites as crawl data. The training set generator also may include an extractor configured to determine, for each of the top-level sites, a corresponding site-specific extraction template associating at least one portion of the corresponding top-level site with at least one category of the hierarchy of categories, and further configured to apply each site-specific extraction template to corresponding crawl data to thereby associate the crawl data with the categories of the hierarchical categories and obtain categorized data of the training set.
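
The crawler-plus-extractor pipeline described in the abstract above can be illustrated with a minimal Python sketch. It is not the patented implementation. Everything named here is an assumption made for illustration: the in-memory `FAKE_WEB` that stands in for real crawling, the `news.example/` site, the representation of the taxonomy as slash-separated category paths, and the choice of "first URL path segment" as the portion of a site that a template maps to a category.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# Hypothetical in-memory stand-in for the web: URL -> (page text, outgoing links).
FAKE_WEB: Dict[str, Tuple[str, List[str]]] = {
    "news.example/": ("front page", ["news.example/sports/a1", "news.example/tech/b1"]),
    "news.example/sports/a1": ("match report", []),
    "news.example/tech/b1": ("chip review", ["news.example/tech/b2", "other.example/x"]),
    "news.example/tech/b2": ("phone review", []),
}

# Assumed taxonomy encoding: a set of category paths, "/" separates levels.
TAXONOMY: Set[str] = {"Recreation", "Recreation/Sports", "Computers", "Computers/Hardware"}

# Assumed site-specific templates: top-level site -> {URL section -> category}.
TEMPLATES: Dict[str, Dict[str, str]] = {
    "news.example/": {"sports": "Recreation/Sports", "tech": "Computers/Hardware"},
}


def crawl(top_level_site: str, web: Dict[str, Tuple[str, List[str]]]) -> Dict[str, str]:
    """Breadth-first crawl that stays within the top-level site; returns URL -> text."""
    seen = {top_level_site}
    queue = deque([top_level_site])
    crawl_data: Dict[str, str] = {}
    while queue:
        url = queue.popleft()
        if url not in web:
            continue  # unreachable page, nothing to store
        text, links = web[url]
        crawl_data[url] = text
        for link in links:
            # Only follow lower-level pages of the same site, each at most once.
            if link.startswith(top_level_site) and link not in seen:
                seen.add(link)
                queue.append(link)
    return crawl_data


def apply_template(site: str, template: Dict[str, str],
                   crawl_data: Dict[str, str], taxonomy: Set[str]) -> List[Tuple[str, str]]:
    """Label each crawled page whose URL section maps to a known taxonomy category."""
    examples = []
    for url, text in crawl_data.items():
        section = url[len(site):].split("/")[0]
        category = template.get(section)
        if category in taxonomy:
            examples.append((text, category))
    return examples


def build_training_set(taxonomy, templates, web) -> List[Tuple[str, str]]:
    """Crawl every top-level site and apply its own template to the crawl data."""
    training_set = []
    for site, template in templates.items():
        training_set.extend(apply_template(site, template, crawl(site, web), taxonomy))
    return training_set


training_set = build_training_set(TAXONOMY, TEMPLATES, FAKE_WEB)
```

In this sketch the front page is crawled but left unlabeled, because its URL has no section that the template maps to a category. Only the lower-level pages end up in `training_set` as (text, category) pairs.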

    Training set construction for taxonomic classification
    2.
    Granted invention patent
    Status: in force

    Publication No.: US08484194B1

    Publication date: 2013-07-09

    Application No.: US13350213

    Application date: 2012-01-13

    IPC classification: G06F17/30

    CPC classification: G06F17/30707

    Abstract: A training set generator may be configured to input a taxonomy including a hierarchy of categories and a plurality of top-level sites, and to output a training set of categorized data. The training set generator may include a crawler configured to crawl each of the top-level sites to determine at least one lower-level site associated therewith and to store the top-level sites and associated lower-level sites as crawl data. The training set generator also may include an extractor configured to determine, for each of the top-level sites, a corresponding site-specific extraction template associating at least one portion of the corresponding top-level site with at least one category of the hierarchy of categories, and further configured to apply each site-specific extraction template to corresponding crawl data to thereby associate the crawl data with the categories of the hierarchical categories and obtain categorized data of the training set.

    Updating taxonomy based on webpage
    3.
    Granted invention patent
    Status: in force

    Publication No.: US08645384B1

    Publication date: 2014-02-04

    Application No.: US12774448

    Application date: 2010-05-05

    IPC classification: G06F7/00 G06F17/30

    Abstract: According to an example implementation, a computer-implemented method may include extracting, by a computing device, structured content from a website, determining a recent taxonomy by applying category rules to the structured content, the recent taxonomy including multiple categories and a new category, and updating a stored taxonomy based on the determined recent taxonomy by adding the new category to the stored taxonomy.
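
The three steps named in the abstract above (extract structured content, apply category rules to obtain a recent taxonomy, add new categories to the stored taxonomy) might look roughly like the following Python sketch. The abstract does not say what the structured content or the category rules are, so the choices here are assumptions for illustration only: breadcrumb trails as the structured content, a rule that drops a leading "Home" crumb and title-cases names, and a taxonomy stored as a set of category-path tuples.

```python
from typing import Iterable, List, Set, Tuple

CategoryPath = Tuple[str, ...]


def apply_category_rules(breadcrumbs: Iterable[List[str]]) -> Set[CategoryPath]:
    """Derive a 'recent taxonomy' from breadcrumb trails (assumed structured content).

    Assumed rules: ignore the navigational 'Home' crumb, normalize names to
    title case, and register every prefix of a trail as a category path.
    """
    recent: Set[CategoryPath] = set()
    for trail in breadcrumbs:
        cleaned = tuple(
            part.strip().title() for part in trail if part.strip().lower() != "home"
        )
        for depth in range(1, len(cleaned) + 1):
            recent.add(cleaned[:depth])
    return recent


def update_taxonomy(stored: Set[CategoryPath],
                    recent: Set[CategoryPath]) -> Tuple[Set[CategoryPath], List[CategoryPath]]:
    """Add categories found in the recent taxonomy but missing from the stored one."""
    added = sorted(recent - stored)
    return stored | recent, added


# Hypothetical stored taxonomy and freshly extracted breadcrumb trails.
stored_taxonomy = {("Electronics",), ("Electronics", "Cameras")}
extracted = [
    ["Home", "electronics", "cameras"],
    ["Home", "electronics", "drones"],
]

recent_taxonomy = apply_category_rules(extracted)
updated_taxonomy, new_categories = update_taxonomy(stored_taxonomy, recent_taxonomy)
```

With this input, `new_categories` holds the single path `("Electronics", "Drones")`, and `updated_taxonomy` is the stored taxonomy plus that path. The sketch only ever adds categories, matching the abstract, which describes adding a new category and says nothing about removal.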
