APPARATUS AND METHODS FOR CONCEPT-CENTRIC INFORMATION EXTRACTION

Patent application

US20100241639A1 APPARATUS AND METHODS FOR CONCEPT-CENTRIC INFORMATION EXTRACTION (status: pending, published)

Patent title: APPARATUS AND METHODS FOR CONCEPT-CENTRIC INFORMATION EXTRACTION
Application number: US12408450

Filing date: 2009-03-20
Publication number: US20100241639A1

Publication date: 2010-09-23
Inventors: Daniel Kifer, Srujana Merugu, Ankur Jain, Sathiya Keerthi Selvaraj, Alok S. Kirpal, Philip L. Bohannon, Raghu Ramakrishnan
Applicants: Daniel Kifer, Srujana Merugu, Ankur Jain, Sathiya Keerthi Selvaraj, Alok S. Kirpal, Philip L. Bohannon, Raghu Ramakrishnan
Applicant address: US CA Sunnyvale
Assignee: YAHOO! INC.
Current assignee: YAHOO! INC.
Current assignee address: US CA Sunnyvale
Main classification: G06F17/30
IPC classification: G06F17/30

Abstract:

Disclosed are methods and apparatus for extracting (or annotating) structured information from web content. Web content of interest from a particular domain is represented as one or more tree instances having a plurality of branching nodes that each correspond to a web object such that the tree instances correspond to one or more structured data instances. The particular domain is associated with domain knowledge that includes one or more presentation rulesets that each specifies a particular structure for a set of data instances, a domain-specific concept labeler, one or more specified properties of the web objects in the tree instances, and a concept schema that specifies a representation of the data to be extracted from the web content. A structured data instance that conforms to the concept schema is extracted from the one or more tree instances based on the domain knowledge for the particular domain. Extraction of the structured data instances is accomplished by (i) using the domain-specific concept labeler to annotate a subset of nodes of the tree instances; and (ii) using a locally adaptive concept annotator to extract the structured data instances based on the annotated segments and the local properties associated with such annotated segments. The extracted structured data instance is stored as structured output records in a database.
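
The two-step extraction described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Everything named here is an assumption made for illustration: the Node class, the "restaurant" schema with name and phone fields, the regular expression and name dictionary standing in for the domain-specific concept labeler, and the use of a CSS class as the only "local property". Step (i) labels the few nodes the labeler recognizes with confidence; step (ii) learns, per page, which local property carries each concept and propagates the labels to the remaining nodes.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Node:
    """One web object in a tree instance (hypothetical, simplified DOM node)."""
    text: str = ""
    css_class: str = ""              # stand-in for a "local property"
    children: List["Node"] = field(default_factory=list)
    label: Optional[str] = None      # concept assigned during extraction


# Hypothetical concept schema for an assumed "restaurant" domain.
SCHEMA = ("name", "phone")

# Hypothetical domain-specific labeler: a dictionary plus one strict pattern.
KNOWN_NAMES = {"Luigi's Pizza"}
PATTERNS = {"phone": re.compile(r"^\(?\d{3}\)?[ -]?\d{3}-\d{4}$")}


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def label_seed_nodes(root: Node) -> None:
    """Step (i): annotate only the nodes the domain labeler recognizes."""
    for node in walk(root):
        if node.text in KNOWN_NAMES:
            node.label = "name"
            continue
        for concept, pattern in PATTERNS.items():
            if pattern.match(node.text):
                node.label = concept


def adapt_locally(root: Node) -> None:
    """Step (ii): map local properties to concepts using the seed labels
    found on this page, then label unlabeled nodes sharing that property."""
    class_to_concept: Dict[str, str] = {}
    for node in walk(root):
        if node.label and node.css_class:
            class_to_concept.setdefault(node.css_class, node.label)
    for node in walk(root):
        if node.label is None and node.css_class in class_to_concept:
            node.label = class_to_concept[node.css_class]


def extract_records(root: Node) -> List[Dict[str, Optional[str]]]:
    """Treat each child of the root as one data instance and build a
    record that conforms to SCHEMA (missing fields stay None)."""
    label_seed_nodes(root)
    adapt_locally(root)
    records = []
    for item in root.children:
        record: Dict[str, Optional[str]] = {concept: None for concept in SCHEMA}
        for node in walk(item):
            if node.label in record and record[node.label] is None:
                record[node.label] = node.text
        records.append(record)
    return records


# Usage: "Thai Garden" is unknown to the labeler, but it shares the
# "title" class with a recognized name, so step (ii) labels it as a name.
page = Node(children=[
    Node(children=[Node("Luigi's Pizza", "title"), Node("(408) 555-0100", "tel")]),
    Node(children=[Node("Thai Garden", "title"), Node("408-555-0199", "tel")]),
])
records = extract_records(page)
```

In this sketch the records are plain dictionaries; the abstract describes storing them as structured output records in a database, which is left out here.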
