Systems and methods for identifying and extracting data from HTML pages

Patent application

US20050273706A1 Systems and methods for identifying and extracting data from HTML pages (status: pending, published)

Patent title: Systems and methods for identifying and extracting data from HTML pages
Application number: US11122992

Filing date: 2005-05-04
Publication number: US20050273706A1

Publication date: 2005-12-08
Inventors: Udi Manber, Qi Lu
Applicants: Udi Manber, Qi Lu
Applicant address: US CA Sunnyvale
Assignee: Yahoo! Inc.
Current assignee: Yahoo! Inc.
Current assignee address: US CA Sunnyvale
Main classification: G06F17/00
IPC classification: G06F17/00; G06F17/30

Abstract:

Systems and methods for analyzing HTML formatted web pages to automatically identify and extract desired information. A computer algorithm identifies and extracts different pieces of information from different web pages automatically after minimal manual setup. The algorithm automatically analyzes pages with different content if they have the same, or similar, formats. The algorithm is fast and efficient and performs the extraction process quickly in real-time. The systems and methods are useful to build databases from unstructured web information. The algorithm can be used as an agent that captures information about products, and compares prices or other characteristics. It can also be used to populate structured databases that, given the different pieces of information, can analyze products and their characteristics. And it can also be used for data mining applications looking for patterns useful for marketing analyses, or other uses.
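The abstract describes extraction from pages that share a format after minimal manual setup. The following is a minimal sketch of one way such a scheme could work, not the method claimed in the application: a user labels the values of interest on one sample page, the code records the structural path of each value, and the same paths are then read from other pages with the same layout. All names here (`PathTextParser`, `learn_template`, `extract`) and the sample pages are hypothetical, and the sketch assumes well-formed HTML with an identical tag structure across pages.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they are skipped when building paths.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextParser(HTMLParser):
    """Collect non-empty text nodes keyed by their structural path."""

    def __init__(self):
        super().__init__()
        self.stack = []     # open elements with sibling index, e.g. "td[2]"
        self.counts = [{}]  # per-depth counters of child tags seen so far
        self.texts = {}     # path -> first text found at that path

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        seen = self.counts[-1]
        seen[tag] = seen.get(tag, 0) + 1
        self.stack.append(f"{tag}[{seen[tag]}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open element.
        if self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.setdefault("/".join(self.stack), text)


def text_by_path(html):
    """Return a dict mapping each structural path to its text."""
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


def learn_template(sample_html, labeled_values):
    """Manual setup step: map each field name to the path of its sample value."""
    paths = text_by_path(sample_html)
    template = {}
    for field, value in labeled_values.items():
        for path, text in paths.items():
            if text == value:
                template[field] = path
                break
    return template


def extract(template, html):
    """Apply a learned template to another page with the same format."""
    paths = text_by_path(html)
    return {field: paths.get(path) for field, path in template.items()}


# Hypothetical pages: same layout, different content.
sample = ("<html><body><h1>Blue Widget</h1><table><tr>"
          "<td>Price</td><td>$9.99</td></tr></table></body></html>")
other = ("<html><body><h1>Red Gadget</h1><table><tr>"
         "<td>Price</td><td>$24.50</td></tr></table></body></html>")

template = learn_template(sample, {"name": "Blue Widget", "price": "$9.99"})
print(extract(template, other))
```

Exact path matching only covers the "same format" case. Handling merely similar formats, as the abstract mentions, would need some tolerance for inserted or missing elements, which this sketch does not attempt.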

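IPC classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: G06N)
G06F17/00	Digital computing or data processing equipment or methods, specially adapted for specific functions (information retrieval, database structures or file system structures: G06F 16/00)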