-
Publication No.: US08122005B1
Publication Date: 2012-02-21
Application No.: US12604025
Filing Date: 2009-10-22
Applicant: Philo Juang, Christopher Testa, Nicolaus Mote
Inventor: Philo Juang, Christopher Testa, Nicolaus Mote
IPC Classification: G06F17/30
CPC Classification: G06F17/30707
Abstract: A training set generator may be configured to input a taxonomy including a hierarchy of categories and a plurality of top-level sites, and to output a training set of categorized data. The training set generator may include a crawler configured to crawl each of the top-level sites to determine at least one lower-level site associated therewith and to store the top-level sites and associated lower-level sites as crawl data. The training set generator also may include an extractor configured to determine, for each of the top-level sites, a corresponding site-specific extraction template associating at least one portion of the corresponding top-level site with at least one category of the hierarchy of categories, and further configured to apply each site-specific extraction template to corresponding crawl data to thereby associate the crawl data with the categories of the hierarchical categories and obtain categorized data of the training set.
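The crawler and extractor flow described in this abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the patent: `FAKE_WEB` is an in-memory stand-in for real sites, and the template is assumed to map a URL path segment to a category, which is only one possible reading of "associating at least one portion of the site with a category".

```python
from dataclasses import dataclass

# Hypothetical in-memory stand-in for the web: URL -> (page text, outgoing links).
FAKE_WEB = {
    "shop.example": ("Home", ["shop.example/shoes/1", "shop.example/books/2"]),
    "shop.example/shoes/1": ("Red running shoe", []),
    "shop.example/books/2": ("Python cookbook", []),
}


@dataclass
class Template:
    """Assumed site-specific template: URL path segment -> taxonomy category."""
    segment_to_category: dict


def crawl(top_level_site):
    """Return the top-level site followed by the lower-level pages it links to."""
    _, links = FAKE_WEB[top_level_site]
    return [top_level_site] + [url for url in links if url in FAKE_WEB]


def extract(crawl_data, template):
    """Apply the template to crawl data, producing (text, category) pairs."""
    examples = []
    for url in crawl_data:
        for segment, category in template.segment_to_category.items():
            if f"/{segment}/" in url:
                examples.append((FAKE_WEB[url][0], category))
    return examples


def generate_training_set(taxonomy, templates):
    """Crawl each top-level site and keep pairs whose category is in the taxonomy."""
    training_set = []
    for site, template in templates.items():
        pairs = extract(crawl(site), template)
        training_set.extend((text, cat) for text, cat in pairs if cat in taxonomy)
    return training_set


TAXONOMY = {"Apparel/Shoes", "Media/Books"}
TEMPLATES = {
    "shop.example": Template({"shoes": "Apparel/Shoes", "books": "Media/Books"}),
}
TRAINING_SET = generate_training_set(TAXONOMY, TEMPLATES)
```

In this sketch the top-level page itself matches no template segment, so only the two lower-level pages contribute labeled examples.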
-
Publication No.: US08484194B1
Publication Date: 2013-07-09
Application No.: US13350213
Filing Date: 2012-01-13
Applicant: Philo Juang, Christopher Testa, Nicolaus Mote
Inventor: Philo Juang, Christopher Testa, Nicolaus Mote
IPC Classification: G06F17/30
CPC Classification: G06F17/30707
Abstract: A training set generator may be configured to input a taxonomy including a hierarchy of categories and a plurality of top-level sites, and to output a training set of categorized data. The training set generator may include a crawler configured to crawl each of the top-level sites to determine at least one lower-level site associated therewith and to store the top-level sites and associated lower-level sites as crawl data. The training set generator also may include an extractor configured to determine, for each of the top-level sites, a corresponding site-specific extraction template associating at least one portion of the corresponding top-level site with at least one category of the hierarchy of categories, and further configured to apply each site-specific extraction template to corresponding crawl data to thereby associate the crawl data with the categories of the hierarchical categories and obtain categorized data of the training set.
-
Publication No.: US08645384B1
Publication Date: 2014-02-04
Application No.: US12774448
Filing Date: 2010-05-05
Applicant: Philo Juang, Christopher Testa, Nicolaus Mote
Inventor: Philo Juang, Christopher Testa, Nicolaus Mote
CPC Classification: G06F17/30896, G06F17/30722, G06F17/30867
Abstract: According to an example implementation, a computer-implemented method may include extracting, by a computing device, structured content from a website, determining a recent taxonomy by applying category rules to the structured content, the recent taxonomy including multiple categories and a new category, and updating a stored taxonomy based on the determined recent taxonomy by adding the new category to the stored taxonomy.
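The update step in this abstract can be sketched as a set operation. The code below is an illustration under stated assumptions, not the patented method: structured content is assumed to be a list of dicts with a hypothetical `breadcrumb` field, and a category rule is assumed to be any function returning a category string or `None`.

```python
def breadcrumb_rule(item):
    """Hypothetical category rule: join a page's breadcrumb trail into a category path."""
    trail = item.get("breadcrumb")
    return "/".join(trail) if trail else None


def determine_recent_taxonomy(structured_content, category_rules):
    """Apply every rule to every extracted item and collect the resulting categories."""
    recent = set()
    for item in structured_content:
        for rule in category_rules:
            category = rule(item)
            if category is not None:
                recent.add(category)
    return recent


def update_taxonomy(stored, recent):
    """Add categories seen in the recent taxonomy but missing from the stored one."""
    new_categories = recent - stored
    return stored | new_categories, new_categories


# Assumed example data standing in for content extracted from a website.
STORED = {"Electronics/Phones"}
CONTENT = [
    {"breadcrumb": ["Electronics", "Phones"]},
    {"breadcrumb": ["Electronics", "Tablets"]},
    {"title": "page without a breadcrumb"},
]
RECENT = determine_recent_taxonomy(CONTENT, [breadcrumb_rule])
UPDATED, ADDED = update_taxonomy(STORED, RECENT)
```

Here `ADDED` holds only the category absent from the stored taxonomy, and existing categories are left untouched; the sketch never removes categories, matching the abstract's add-only description.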
-
-