APPARATUS AND METHODS FOR CONCEPT-CENTRIC INFORMATION EXTRACTION
    1.
    Patent application
    APPARATUS AND METHODS FOR CONCEPT-CENTRIC INFORMATION EXTRACTION (Status: under examination, published)

    Publication number: US20100241639A1

    Publication date: 2010-09-23

    Application number: US12408450

    Filing date: 2009-03-20

    IPC classification: G06F17/30

    CPC classification: G06F16/345 G06F16/313

    Abstract: Disclosed are methods and apparatus for extracting (or annotating) structured information from web content. Web content of interest from a particular domain is represented as one or more tree instances having a plurality of branching nodes that each correspond to a web object such that the tree instances correspond to one or more structured data instances. The particular domain is associated with domain knowledge that includes one or more presentation rulesets that each specifies a particular structure for a set of data instances, a domain-specific concept labeler, one or more specified properties of the web objects in the tree instances, and a concept schema that specifies a representation of the data to be extracted from the web content. A structured data instance that conforms to the concept schema is extracted from the one or more tree instances based on the domain knowledge for the particular domain. Extraction of the structured data instances is accomplished by (i) using the domain-specific concept labeler to annotate a subset of nodes of the tree instances; and (ii) using a locally adaptive concept annotator to extract the structured data instances based on the annotated segments and the local properties associated with such annotated segments. The extracted structured data instance is stored as structured output records in a database.

    Method and system for form-filling crawl and associating rich keywords
    3.
    Granted patent
    Method and system for form-filling crawl and associating rich keywords (Status: in force)

    Publication number: US08793239B2

    Publication date: 2014-07-29

    Application number: US12576011

    Filing date: 2009-10-08

    IPC classification: G06F17/30 G06F7/00

    CPC classification: G06F17/30864

    Abstract: Techniques are provided for the efficient location, processing, and retrieval of local product information derived from web pages generally locatable through form queries submitted to web pages often referred to as the “deep” or “hidden” web. In an embodiment, information such as product information and dealer-location information is located on a web page form such as a dealer-locator form. After location of a suitable web page form, editorial wrapping is performed to create an automated information extraction process. Using the automated information extractor, deep-web crawling is performed. A grid-based extraction of individual business records is performed, and matching and ingestion are performed in conjunction with a business listing database. Finally, metadata tags are added to entries in the business listing database. Metadata tags also may be added to entries in other databases.

    Method and System for Form-Filling Crawl and Associating Rich Keywords
    4.
    Patent application
    Method and System for Form-Filling Crawl and Associating Rich Keywords (Status: in force)

    Publication number: US20110087646A1

    Publication date: 2011-04-14

    Application number: US12576011

    Filing date: 2009-10-08

    IPC classification: G06F7/10 G06F17/30

    CPC classification: G06F17/30864

    Abstract: Techniques are provided for the efficient location, processing, and retrieval of local product information derived from web pages generally locatable through form queries submitted to web pages often referred to as the “deep” or “hidden” web. In an embodiment, information such as product information and dealer-location information is located on a web page form such as a dealer-locator form. After location of a suitable web page form, editorial wrapping is performed to create an automated information extraction process. Using the automated information extractor, deep-web crawling is performed. A grid-based extraction of individual business records is performed, and matching and ingestion are performed in conjunction with a business listing database. Finally, metadata tags are added to entries in the business listing database. Metadata tags also may be added to entries in other databases.

    TRANSDUCTIVE APPROACH TO CATEGORY-SPECIFIC RECORD ATTRIBUTE EXTRACTION
    5.
    Patent application
    TRANSDUCTIVE APPROACH TO CATEGORY-SPECIFIC RECORD ATTRIBUTE EXTRACTION (Status: under examination, published)

    Publication number: US20100274770A1

    Publication date: 2010-10-28

    Application number: US12429442

    Filing date: 2009-04-24

    IPC classification: G06F17/30

    CPC classification: G06F16/951 G06F16/285

    Abstract: Disclosed are methods and apparatus for segmenting and labeling a collection of token sequences. A plurality of segments of one or more tokens in a token sequence collection are partially labeled with labels from a set of target labels using high precision domain-specific labelers so as to generate a partially labeled sequence collection having a plurality of labeled segments and a plurality of unlabeled segments. Any label conflicts in the partially labeled sequence collection are resolved. One or more of the labeled segments of the partially labeled sequence collection are expanded so as to cover one or more additional tokens of the partially labeled sequence collection. A statistical model, for labeling segments using local token and segment features of the sequence collection, is trained based on the partially labeled sequence collection. This trained model is then used to label the unlabeled segments and the labeled segments of the sequence collection so as to generate a labeled sequence collection. The labeled sequence collection is then stored as structured output records in a database.
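
    The first three steps of this abstract (partial labeling, conflict resolution, segment expansion) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the label set, the regular-expression labelers, the priority-based conflict rule, and the "glue token" expansion rule. The statistical model that is trained on the partially labeled collection is not shown.

```python
import re

# Hypothetical high-precision labelers, one pattern per target label.
# Order encodes an assumed priority used to resolve label conflicts.
LABELERS = [
    ("PRICE", re.compile(r"^\$\d+(\.\d{2})?$")),
    ("ZIP", re.compile(r"^\d{5}$")),
    ("NUMBER", re.compile(r"^\d+$")),
]


def partial_label(tokens):
    """Label only the tokens some labeler matches; leave the rest as None."""
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        candidates = [name for name, pat in LABELERS if pat.match(tok)]
        if candidates:
            # Conflict resolution (assumed rule): highest-priority label wins.
            labels[i] = candidates[0]
    return labels


def expand(tokens, labels, glue=frozenset({"USD"})):
    """Grow a PRICE segment over an unlabeled glue token that follows it."""
    out = list(labels)
    for i in range(1, len(tokens)):
        if out[i] is None and out[i - 1] == "PRICE" and tokens[i] in glue:
            out[i] = "PRICE"
    return out


tokens = ["Widget", "$19.99", "USD", "ships", "to", "94089"]
partial = partial_label(tokens)
expanded = expand(tokens, partial)
```

    In this sketch "94089" matches both the ZIP and NUMBER labelers and is resolved to ZIP, and the PRICE segment grows to cover "USD". The tokens still labeled None are the ones a trained model would be asked to label.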

    Extracting rich temporal context for business entities and events
    6.
    Granted patent
    Extracting rich temporal context for business entities and events (Status: in force)

    Publication number: US08606564B2

    Publication date: 2013-12-10

    Application number: US12917389

    Filing date: 2010-11-01

    IPC classification: G06F17/27 G06F17/30

    Abstract: Methods and apparatus for performing computer-implemented extraction of temporal information for business entities and events are disclosed. In one embodiment, a sequence of text is obtained. A label is assigned to one or more of a plurality of segments of the text such that each of the one or more of the plurality of segments of the text is classified as temporal data in one of a plurality of classes of temporal data. One or more rules are applied to the one or more segments of the text that have been classified as temporal data to generate a structured representation of the temporal data, where the rules include one or more schematic rules. Each of the schematic rules pertains to one or more of the plurality of classes of temporal data and indicates a structure in which temporal data in the corresponding one or more of the plurality of classes is to be stored.
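
    The two stages described in this abstract (classifying text segments into classes of temporal data, then applying a per-class schematic rule that fixes the storage structure) can be sketched as follows. This is a rough illustration only: the two classes, the regular expressions, and the output fields are hypothetical and are not taken from the patent, which does not specify how segments are labeled.

```python
import re

# Hypothetical classes of temporal data and the patterns that label them.
LABELERS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "time_range": re.compile(r"\b(\d{1,2})(am|pm)\s*-\s*(\d{1,2})(am|pm)\b"),
}


def label_segments(text):
    """Return (class, match) pairs for segments classified as temporal data."""
    found = []
    for cls, pattern in LABELERS.items():
        for m in pattern.finditer(text):
            found.append((cls, m))
    return sorted(found, key=lambda pair: pair[1].start())


def to_24h(hour, meridiem):
    """Convert a 12-hour clock value to a 24-hour integer hour."""
    hour = int(hour) % 12
    return hour + 12 if meridiem == "pm" else hour


def date_rule(m):
    year, month, day = m.group(0).split("-")
    return {"year": int(year), "month": int(month), "day": int(day)}


def time_range_rule(m):
    return {
        "open": to_24h(m.group(1), m.group(2)),
        "close": to_24h(m.group(3), m.group(4)),
    }


# Schematic rules: one per class, each defining how that class is stored.
SCHEMATIC_RULES = {"date": date_rule, "time_range": time_range_rule}


def extract(text):
    """Label temporal segments, then structure each one with its class rule."""
    return [
        {"class": cls, **SCHEMATIC_RULES[cls](m)}
        for cls, m in label_segments(text)
    ]


records = extract("Sale on 2010-11-01, open 9am-5pm")
```

    Keeping the rules in a table keyed by class means a new class of temporal data needs only a new pattern and a new rule, without changes to `extract`.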

    EXTRACTING RICH TEMPORAL CONTEXT FOR BUSINESS ENTITIES AND EVENTS
    7.
    Patent application
    EXTRACTING RICH TEMPORAL CONTEXT FOR BUSINESS ENTITIES AND EVENTS (Status: in force)

    Publication number: US20120109637A1

    Publication date: 2012-05-03

    Application number: US12917389

    Filing date: 2010-11-01

    IPC classification: G06F17/27 G06F17/30

    Abstract: Methods and apparatus for performing computer-implemented extraction of temporal information for business entities and events are disclosed. In one embodiment, a sequence of text is obtained. A label is assigned to one or more of a plurality of segments of the text such that each of the one or more of the plurality of segments of the text is classified as temporal data in one of a plurality of classes of temporal data. One or more rules are applied to the one or more segments of the text that have been classified as temporal data to generate a structured representation of the temporal data, where the rules include one or more schematic rules. Each of the schematic rules pertains to one or more of the plurality of classes of temporal data and indicates a structure in which temporal data in the corresponding one or more of the plurality of classes is to be stored.

    Rapid iterative development of classifiers
    8.
    Granted patent
    Rapid iterative development of classifiers (Status: in force)

    Publication number: US08849790B2

    Publication date: 2014-09-30

    Application number: US12344132

    Filing date: 2008-12-24

    IPC classification: G06F17/30

    CPC classification: G06F17/30265 G06F17/3028

    Abstract: A classifier development process seamlessly and intelligently integrates different forms of human feedback on instances and features into the data preparation, learning and evaluation stages. A query utility based active learning approach is applicable to different types of editorial feedback. A bi-clustering based technique may be used to further speed up the active learning process.

    PAIRWISE RANKING-BASED CLASSIFIER
    9.
    Patent application
    PAIRWISE RANKING-BASED CLASSIFIER (Status: in force)

    Publication number: US20110099131A1

    Publication date: 2011-04-28

    Application number: US12603763

    Filing date: 2009-10-22

    IPC classification: G06F15/18 G06N5/02

    CPC classification: G06N99/005 G06F17/30707

    Abstract: The present invention provides methods and systems for binary classification of items. Methods and systems are provided for constructing a machine learning-based and pairwise ranking method-based classification model for binary classification of items as positive or negative with regard to a single class, based on training using a training set of examples including positive examples and unlabelled examples. The model includes only one hyperparameter and only one threshold parameter, which are selected to optimize the model with regard to constraining positive items to be classified as positive while minimizing a number of unlabelled items classified as positive.
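
    The threshold trade-off named in this abstract (keep known positives classified as positive while flagging as few unlabelled items as possible) can be illustrated with a small Python sketch. It assumes that a ranking model has already assigned each item a score; the scores, the function name `choose_threshold`, and the `min_pos_recall` parameter are hypothetical, and the pairwise ranking model and its hyperparameter are not shown.

```python
import math


def choose_threshold(pos_scores, unl_scores, min_pos_recall=1.0):
    """Pick the highest threshold that keeps the required share of positives.

    Items scoring at or above the threshold are classified as positive.
    Raising the threshold can only lower the count of unlabelled items
    classified as positive, so the highest admissible threshold minimizes it.
    """
    ranked = sorted(pos_scores, reverse=True)
    # Number of known positives that must score at or above the threshold.
    need = max(1, math.ceil(min_pos_recall * len(ranked)))
    threshold = ranked[need - 1]
    flagged = sum(1 for s in unl_scores if s >= threshold)
    return threshold, flagged


pos = [0.9, 0.8, 0.6]        # scores of known positive examples (made up)
unl = [0.95, 0.7, 0.5, 0.2]  # scores of unlabelled examples (made up)

strict = choose_threshold(pos, unl)                        # keep every positive
relaxed = choose_threshold(pos, unl, min_pos_recall=0.66)  # allow one miss
```

    With these made-up scores, requiring every positive to pass gives a threshold of 0.6 and flags two unlabelled items; relaxing the constraint raises the threshold to 0.8 and flags only one.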

    System and method for scheduling online keyword auctions over multiple time periods subject to budget and query volume constraints
    10.
    Patent application
    System and method for scheduling online keyword auctions over multiple time periods subject to budget and query volume constraints (Status: under examination, published)

    Publication number: US20090112691A1

    Publication date: 2009-04-30

    Application number: US11981319

    Filing date: 2007-10-30

    IPC classification: G06Q30/00 G06F17/30 G06Q10/00

    Abstract: An improved system and method for scheduling online keyword auctions over multiple time periods subject to budget constraints is provided. A linear programming model of slates of advertisements may be created for predicting the volume and order in which queries may appear throughout multiple time periods for use in allocating bidders to auctions to optimize revenue of an auctioneer. Each slate of advertisements may represent a candidate set of advertisements in order of optimal revenue to an auctioneer. Linear programming using column generation with the keyword as a constraint and a bidder's budget as a constraint may be applied for each time period to generate a column that may be added to a linear programming model of slates of advertisements. Upon receiving a query request, a slate of advertisements for the time period may be output for sending to a web browser for display.
