Invention application
US20050273706A1 Systems and methods for identifying and extracting data from HTML pages (status: pending, published)

  • Patent title: Systems and methods for identifying and extracting data from HTML pages
  • Application number: US11122992
    Filing date: 2005-05-04
  • Publication number: US20050273706A1
    Publication date: 2005-12-08
  • Inventors: Udi Manber, Qi Lu
  • Applicants: Udi Manber, Qi Lu
  • Applicant address: Sunnyvale, CA, US
  • Assignee: Yahoo! Inc.
  • Current assignee: Yahoo! Inc.
  • Current assignee address: Sunnyvale, CA, US
  • Main classification: G06F17/00
  • IPC classification: G06F17/00, G06F17/30
Abstract:
Systems and methods for analyzing HTML-formatted web pages to automatically identify and extract desired information. A computer algorithm identifies and extracts different pieces of information from different web pages automatically after minimal manual setup. The algorithm automatically analyzes pages with different content if they have the same, or similar, formats. The algorithm is fast and efficient and performs the extraction in real time. The systems and methods are useful for building databases from unstructured web information. The algorithm can be used as an agent that captures information about products and compares prices or other characteristics. It can also be used to populate structured databases that, given the different pieces of information, can analyze products and their characteristics. It can further be used in data mining applications that look for patterns useful for marketing analyses or other purposes.
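
The abstract does not disclose the algorithm itself, so the following is only a minimal sketch of the general idea it describes: a small manual setup on one page, then automatic extraction from other pages that share the same format. It assumes a simple tag-path template approach, which is a stand-in technique and not necessarily the method claimed in the application. The names `PathTextParser`, `learn_template`, and `extract`, as well as the sample pages, are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they are not pushed on the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathTextParser(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.items = []   # list of (path, text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def text_nodes(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.items


def learn_template(sample_html, labeled):
    """Manual setup step: map each field name to the tag path at which
    its example value occurs in the sample page."""
    template = {}
    for path, text in text_nodes(sample_html):
        for field, value in labeled.items():
            if text == value and field not in template:
                template[field] = path
    return template


def extract(html, template):
    """Automatic step: read the text found at each learned path."""
    found = {}
    for path, text in text_nodes(html):
        for field, wanted in template.items():
            if path == wanted and field not in found:
                found[field] = text
    return found


# Hypothetical pages that share one format but differ in content.
SAMPLE = """<html><body><div class="product">
<h1>Blue Widget</h1><span class="price">$19.99</span></div></body></html>"""
OTHER = """<html><body><div class="product">
<h1>Red Gadget</h1><span class="price">$4.50</span></div></body></html>"""

template = learn_template(SAMPLE, {"name": "Blue Widget", "price": "$19.99"})
result = extract(OTHER, template)
print(result)  # {'name': 'Red Gadget', 'price': '$4.50'}
```

In this sketch the only manual input is one labeled example per field. Exact path matching handles pages with the same format; handling merely similar formats, as the abstract mentions, would need some tolerant matching that is not shown here.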