Domain constraint path based data record extraction
    1.
    Granted patent
    Domain constraint path based data record extraction [In force]

    Publication No.: US09171080B2

    Publication date: 2015-10-27

    Application No.: US13356241

    Filing date: 2012-01-23

    IPC classification: G06F17/30, G06F17/22

    Abstract: Described herein are techniques for extracting data records containing user-generated content from documents. The documents may be processed into document trees in which sub-trees represent the data records of the document. Domain constraints may be used to locate structured portions of the document tree. For example, anchor trees may be located as being sets of sibling sub-trees with similar tag paths that contain the domain constraints. The anchor trees may then be used to determine a record boundary (e.g., the start offset and length) of the data records. Finally, the data records may be extracted based on the anchor trees and the record boundaries.

    Domain Constraint Based Data Record Extraction
    2.
    Patent application
    Domain Constraint Based Data Record Extraction [In force]

    Publication No.: US20120124077A1

    Publication date: 2012-05-17

    Application No.: US12945517

    Filing date: 2010-11-12

    IPC classification: G06F17/30

    CPC classification: G06F17/227

    Abstract: Embodiments for a Mining Data Records based on Anchor Trees (MiBAT) process are disclosed. In accordance with at least one embodiment, the MiBAT process extracts data records containing user-generated content from web documents. The web document is processed into a Document Object Model (DOM) tree in which sub-trees of the DOM tree represent the data records of the web document. Domain constraints are used to locate structured portions of the DOM tree. Anchor trees are then located as being sets of sibling sub-trees which contain the domain constraints. The anchor trees are then used to determine a record boundary (i.e. the start offset and length) of the data records. Finally, the data records are extracted based on the anchor trees and the record boundaries.
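
The anchor-tree step described in this abstract can be illustrated with a minimal sketch. It is not the patented MiBAT process: the `Node` class, the `DATE` pattern (standing in for a domain constraint such as a post date), and the function names are all hypothetical, and the sketch skips tag-path similarity and record-boundary detection entirely. It only shows the idea of finding sibling sub-trees that each contain the constraint.

```python
import re
from dataclasses import dataclass, field

# Hypothetical domain constraint: every record contains a date like 2010-11-12.
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


@dataclass
class Node:
    """A toy DOM node (assumption: real input would come from an HTML parser)."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def contains_constraint(node):
    """True if the node or any descendant has text matching the constraint."""
    return bool(DATE.search(node.text)) or any(
        contains_constraint(child) for child in node.children
    )


def find_anchor_trees(root):
    """Return (parent, child_indexes) for the first node, in pre-order, that has
    at least two child sub-trees containing the constraint; None if there is none."""
    hits = [i for i, child in enumerate(root.children) if contains_constraint(child)]
    if len(hits) >= 2:
        return root, hits
    for child in root.children:
        found = find_anchor_trees(child)
        if found:
            return found
    return None
```

For a page whose body holds a heading followed by two post containers, each with a date, the function returns the body node and the indexes of the two containers, which is the starting point from which a record boundary could then be worked out.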

    Training a ranking component
    3.
    Granted patent
    Training a ranking component [In force]

    Publication No.: US07783629B2

    Publication date: 2010-08-24

    Application No.: US11326283

    Filing date: 2006-01-05

    IPC classification: G06F17/30

    CPC classification: G06F17/30616

    Abstract: A query and a factoid type selection are received from a user. An index of passages, indexed based on factoids, is accessed and passages that are related to the query, and that have the selected factoid type, are retrieved. The retrieved passages are ranked and provided to the user based on a calculated score, in rank order.

    SEARCHING QUESTIONS BASED ON TOPIC AND FOCUS
    4.
    Patent application
    SEARCHING QUESTIONS BASED ON TOPIC AND FOCUS [In force]

    Publication No.: US20100030770A1

    Publication date: 2010-02-04

    Application No.: US12185713

    Filing date: 2008-08-04

    IPC classification: G06F7/06, G06F17/30

    CPC classification: G06F17/30684

    Abstract: A method and system for determining the relevance of questions to a queried question based on topics and focuses of the questions is provided. A question search system provides a collection of questions with topics and focuses. Upon receiving a queried question, the question search system identifies a queried topic and queried focus of the queried question. The question search system generates a score indicating the relevance of a question of the collection to the queried question based on a language model of the topic of the question and a language model of the focus of the question.

    RECOMMENDING QUESTIONS TO USERS OF COMMUNITY QIESTION ANSWERING
    5.
    Patent application
    RECOMMENDING QUESTIONS TO USERS OF COMMUNITY QIESTION ANSWERING [Under examination, published]

    Publication No.: US20090253112A1

    Publication date: 2009-10-08

    Application No.: US12098457

    Filing date: 2008-04-07

    IPC classification: G09B5/00

    CPC classification: G06Q10/10, G06F16/3329

    Abstract: The present system graphs topic terms in stored cQA questions and also converts a submitted question into a graph of topic terms. Topic terms that correspond to a question topic are delineated from topic terms that correspond to question focus. New questions are recommended to the user based on a comparison between the topics of the new questions and the topic of the submitted question as well as the focus of the new questions and the focus of the submitted question.

    Smart Sentiment Classifier for Product Reviews
    6.
    Patent application
    Smart Sentiment Classifier for Product Reviews [Under examination, published]

    Publication No.: US20080249764A1

    Publication date: 2008-10-09

    Application No.: US11950512

    Filing date: 2007-12-05

    IPC classification: G06F17/27

    CPC classification: G06F17/2785

    Abstract: A sentiment classifier is described. In one implementation, a system applies both full text and complex feature analyses to sentences of a product review. Each analysis is weighted prior to linear combination into a final sentiment prediction. A full text model and a complex features model can be trained separately offline to support online full text analysis and complex features analysis. Complex features include opinion indicators, negation patterns, sentiment-specific sections of the product review, user ratings, sequence of text chunks, and sentence types and lengths. A Conditional Random Field (CRF) framework provides enhanced sentiment classification for each segment of a complex sentence to enhance sentiment prediction.
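
The weighted linear combination mentioned in this abstract might look like the sketch below. The score range of [-1, 1], the single `weight` parameter, the zero threshold, and both function names are assumptions made for illustration; the abstract does not say how the weights are chosen or how the combined score is turned into a label.

```python
def combine_sentiment(full_text_score, complex_feature_score, weight=0.5):
    """Linearly combine two per-sentence sentiment scores.

    `weight` is applied to the full-text model and `1 - weight` to the
    complex-features model (hypothetical parameterization).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return weight * full_text_score + (1.0 - weight) * complex_feature_score


def sentiment_label(score, threshold=0.0):
    """Map a combined score to a label (assumed threshold of 0)."""
    return "positive" if score > threshold else "negative"
```

With `weight=0.75`, a full-text score of 0.8 and a complex-features score of -0.2 combine to roughly 0.55, so the sentence is labeled positive even though the two analyses disagree.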

    Uncertainty reduction in collaborative bootstrapping
    7.
    Patent application
    Uncertainty reduction in collaborative bootstrapping [Lapsed]

    Publication No.: US20050131850A1

    Publication date: 2005-06-16

    Application No.: US10732741

    Filing date: 2003-12-10

    Applicants: Yunbo Cao, Hang Li

    Inventors: Yunbo Cao, Hang Li

    CPC classification: G06N7/02

    Abstract: Collaborative bootstrapping with uncertainty reduction for increased classifier performance. One classifier selects a portion of data that is uncertain with respect to the classifier and a second classifier labels the portion. Uncertainty reduction includes parallel processing where the second classifier also selects an uncertain portion for the first classifier to label. Uncertainty reduction can be incorporated into existing or new co-training or bootstrapping, including bilingual bootstrapping.
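
One round of the exchange described in this abstract can be sketched as follows. The classifiers are represented only by hypothetical functions `prob_a` and `prob_b` that return a positive-class probability, and "uncertain" is assumed to mean "closest to 0.5"; neither choice comes from the patent, and retraining on the new labels is left out.

```python
def most_uncertain(items, prob_positive, k):
    """Return the k items whose positive-class probability is closest to 0.5."""
    return sorted(items, key=lambda x: abs(prob_positive(x) - 0.5))[:k]


def bootstrap_round(unlabeled, prob_a, prob_b, k=1):
    """One symmetric round of collaborative bootstrapping.

    Classifier A picks the items it is least sure about and classifier B
    labels them; B does the same in the other direction. Returns the new
    labeled examples for A and for B as lists of (item, label) pairs.
    """
    new_for_a = [(x, prob_b(x) >= 0.5) for x in most_uncertain(unlabeled, prob_a, k)]
    new_for_b = [(x, prob_a(x) >= 0.5) for x in most_uncertain(unlabeled, prob_b, k)]
    return new_for_a, new_for_b
```

In a full loop, each classifier would be retrained on its new examples and the labeled items removed from the unlabeled pool before the next round.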

    Determining utility of a question
    8.
    Granted patent
    Determining utility of a question [In force]

    Publication No.: US08112269B2

    Publication date: 2012-02-07

    Application No.: US12197991

    Filing date: 2008-08-25

    IPC classification: G06F17/27, G06F17/21, G06F17/30

    CPC classification: G06F17/277, G06F17/30654

    Abstract: A question search system provides a collection of questions having words for use in evaluating the utility of the questions based on a language model. The question search system calculates n-gram probabilities for words within the questions of the collection. The n-gram probability of a word for a sequence of n−1 words indicates the probability of that word being next after that sequence in the collection of questions. The n-gram probabilities for the words of the collection represent the language model of the collection. The question search system calculates a language model utility score for each question within a collection that indicates the likelihood that a question is repeatedly asked by users. The question search system derives the language model utility score for a question from the n-gram probabilities of the words within that question.
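
A small sketch of a language-model utility score in the spirit of this abstract is shown below. It fixes n = 2 (bigrams), uses add-one smoothing, and normalizes the log-probability by question length; all three choices, along with the function names and the `<s>` start marker, are assumptions for illustration and are not taken from the patent.

```python
import math
from collections import Counter


def train_bigram_model(questions):
    """Count unigrams and bigrams over whitespace-tokenized questions."""
    unigrams, bigrams = Counter(), Counter()
    for question in questions:
        tokens = ["<s>"] + question.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(word, prev, unigrams, bigrams):
    """P(word | prev) with add-one smoothing (an assumed smoothing scheme)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def utility_score(question, unigrams, bigrams):
    """Average log-probability per word of the question under the bigram model."""
    tokens = ["<s>"] + question.lower().split()
    if len(tokens) < 2:
        return float("-inf")  # an empty question carries no evidence
    log_p = sum(
        math.log(bigram_prob(word, prev, unigrams, bigrams))
        for prev, word in zip(tokens, tokens[1:])
    )
    return log_p / (len(tokens) - 1)
```

A question made of word sequences that recur across the collection scores higher than one made of unseen sequences, which matches the abstract's reading of the score as the likelihood that a question is asked repeatedly.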

    Uncertainty reduction in collaborative bootstrapping
    9.
    Granted patent
    Uncertainty reduction in collaborative bootstrapping [Lapsed]

    Publication No.: US07512582B2

    Publication date: 2009-03-31

    Application No.: US10732741

    Filing date: 2003-12-10

    Applicants: Yunbo Cao, Hang Li

    Inventors: Yunbo Cao, Hang Li

    CPC classification: G06N7/02

    Abstract: Collaborative bootstrapping with uncertainty reduction for increased classifier performance. One classifier selects a portion of data that is uncertain with respect to the classifier and a second classifier labels the portion. Uncertainty reduction includes parallel processing where the second classifier also selects an uncertain portion for the first classifier to label. Uncertainty reduction can be incorporated into existing or new co-training or bootstrapping, including bilingual bootstrapping.

    LEARNING A DOCUMENT RANKING USING A LOSS FUNCTION WITH A RANK PAIR OR A QUERY PARAMETER
    10.
    Patent application
    LEARNING A DOCUMENT RANKING USING A LOSS FUNCTION WITH A RANK PAIR OR A QUERY PARAMETER [In force]

    Publication No.: US20080027925A1

    Publication date: 2008-01-31

    Application No.: US11460838

    Filing date: 2006-07-28

    IPC classification: G06F17/30

    Abstract: A method and system for generating a ranking function to rank the relevance of documents to a query is provided. The ranking system learns a ranking function from training data that includes queries, resultant documents, and relevance of each document to its query. The ranking system learns a ranking function using the training data by weighting incorrect rankings of relevant documents more heavily than the incorrect rankings of not relevant documents so that more emphasis is placed on correctly ranking relevant documents. The ranking system may also learn a ranking function using the training data by normalizing the contribution of each query to the ranking function so that it is independent of the number of relevant documents of each query.
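
The two ideas in this abstract, weighting rank pairs unevenly and normalizing each query's contribution, can be sketched with a pairwise hinge loss. The hinge form, the relevance grades 0 to 2, the weight values in `example_pair_weight`, and normalization by pair count are all assumptions chosen for illustration; the abstract does not specify the actual loss function or parameters.

```python
def pairwise_hinge(score_hi, score_lo, margin=1.0):
    """Penalty when the more relevant document does not outscore the less
    relevant one by at least `margin`."""
    return max(0.0, margin - (score_hi - score_lo))


def example_pair_weight(grade_hi, grade_lo):
    """Hypothetical rank-pair weight: pairs involving the top grade (2) count double."""
    return 2.0 if grade_hi == 2 else 1.0


def query_loss(docs, pair_weight=example_pair_weight):
    """Weighted pairwise loss for one query, divided by its number of pairs so
    that queries with many documents do not dominate.

    `docs` is a list of (model_score, relevance_grade) tuples.
    """
    total, n_pairs = 0.0, 0
    for score_i, grade_i in docs:
        for score_j, grade_j in docs:
            if grade_i > grade_j:
                total += pair_weight(grade_i, grade_j) * pairwise_hinge(score_i, score_j)
                n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


def ranking_loss(queries, pair_weight=example_pair_weight):
    """Mean of the per-query losses over a non-empty list of queries."""
    return sum(query_loss(docs, pair_weight) for docs in queries) / len(queries)
```

Under these assumptions, misordering a grade-2 document below a grade-0 document costs twice as much as the same misordering of a grade-1 document, and each query contributes one averaged term regardless of how many document pairs it has.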
