Invention Application
US20100241639A1 APPARATUS AND METHODS FOR CONCEPT-CENTRIC INFORMATION EXTRACTION
Under examination - Published
- Patent title: APPARATUS AND METHODS FOR CONCEPT-CENTRIC INFORMATION EXTRACTION
- Patent title (Chinese): 概念中心信息提取的装置和方法
- Application number: US12408450
- Filing date: 2009-03-20
- Publication number: US20100241639A1
- Publication date: 2010-09-23
- Inventors: Daniel Kifer, Srujana Merugu, Ankur Jain, Sathiya Keerthi Selvaraj, Alok S. Kirpal, Philip L. Bohannon, Raghu Ramakrishnan
- Applicants: Daniel Kifer, Srujana Merugu, Ankur Jain, Sathiya Keerthi Selvaraj, Alok S. Kirpal, Philip L. Bohannon, Raghu Ramakrishnan
- Applicant address: Sunnyvale, CA, US
- Assignee: YAHOO! INC.
- Current assignee: YAHOO! INC.
- Current assignee address: Sunnyvale, CA, US
- Main classification: G06F17/30
- IPC classification: G06F17/30
Abstract:
Disclosed are methods and apparatus for extracting (or annotating) structured information from web content. Web content of interest from a particular domain is represented as one or more tree instances having a plurality of branching nodes that each correspond to a web object such that the tree instances correspond to one or more structured data instances. The particular domain is associated with domain knowledge that includes one or more presentation rulesets that each specifies a particular structure for a set of data instances, a domain-specific concept labeler, one or more specified properties of the web objects in the tree instances, and a concept schema that specifies a representation of the data to be extracted from the web content. A structured data instance that conforms to the concept schema is extracted from the one or more tree instances based on the domain knowledge for the particular domain. Extraction of the structured data instances is accomplished by (i) using the domain-specific concept labeler to annotate a subset of nodes of the tree instances; and (ii) using a locally adaptive concept annotator to extract the structured data instances based on the annotated segments and the local properties associated with such annotated segments. The extracted structured data instance is stored as structured output records in a database.
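The two-step extraction described in the abstract can be illustrated with a small sketch. This is not the patented implementation; it is a toy under stated assumptions. The `Node` class, the `SCHEMA` fields, the regex-based `domain_labeler`, and the use of a `(tag, css_class)` pair as the "local property" are all hypothetical choices made for illustration. Step (i) labels only the nodes the domain labeler recognizes; step (ii) learns, from those labeled nodes, which local property tends to carry which concept on this particular site, and uses that to fill in nodes the labeler missed. Storing the records in a database is left out.

```python
import re
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One web object in a tree instance (hypothetical representation)."""
    tag: str
    text: str = ""
    css_class: str = ""  # assumed stand-in for a "local property"
    children: List["Node"] = field(default_factory=list)
    label: Optional[str] = None

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()


# Hypothetical concept schema: the fields each extracted record should have.
SCHEMA = ("name", "phone")

PHONE_RE = re.compile(r"^\(?\d{3}\)?[ -]?\d{3}-\d{4}$")


def domain_labeler(node):
    """Hypothetical domain-specific labeler: precise but incomplete."""
    text = node.text.strip()
    if PHONE_RE.match(text):
        return "phone"
    if text.endswith(("Cafe", "Grill")):
        return "name"
    return None


def annotate(trees):
    """Step (i): annotate the subset of leaf nodes the labeler recognizes."""
    for tree in trees:
        for node in tree.walk():
            if not node.children:
                node.label = domain_labeler(node)


def learn_local_model(trees):
    """Map each (tag, css_class) to the label it most often carries here."""
    votes = defaultdict(Counter)
    for tree in trees:
        for node in tree.walk():
            if node.label:
                votes[(node.tag, node.css_class)][node.label] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


def extract(trees):
    """Step (ii): build one schema-conforming record per tree instance."""
    annotate(trees)
    model = learn_local_model(trees)
    records = []
    for tree in trees:
        record = dict.fromkeys(SCHEMA)
        for node in tree.walk():
            if node.children:
                continue
            label = node.label or model.get((node.tag, node.css_class))
            if label in record and record[label] is None:
                record[label] = node.text.strip()
        records.append(record)
    return records


trees = [
    Node("div", children=[
        Node("span", "Blue Door Cafe", "title"),
        Node("span", "(408) 555-0100", "tel"),
    ]),
    Node("div", children=[
        Node("span", "Luigi's", "title"),
        Node("span", "408-555-0199", "tel"),
    ]),
]
records = extract(trees)
print(records)
```

In this toy data the labeler does not recognize "Luigi's" as a name, so the second record's `name` field is filled only because the site-local model has learned that `span` nodes with class `title` carry names.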