Hierarchical conditional random fields for web extraction
    31.
    Granted patent
    Status: Lapsed

    Publication No.: US07720830B2

    Publication date: 2010-05-18

    Application No.: US11461400

    Filing date: 2006-07-31

    CPC classification: G06F17/3089 G06F17/30994

    Abstract: A method and system for labeling object information of an information page is provided. A labeling system identifies an object record of an information page based on the labeling of object elements within an object record and labels object elements based on the identification of an object record that contains the object elements. To identify the records and label the elements, the labeling system generates a hierarchical representation of blocks of an information page. The labeling system identifies records and elements within the records by propagating probability-related information of record labels and element labels through the hierarchy of the blocks. The labeling system generates a feature vector for each block to represent the block and calculates a probability of a label for a block being correct based on a score derived from the feature vectors associated with related blocks. The labeling system searches for the labeling of records and elements that has the highest probability of being correct.
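
The joint search over record and element labels can be illustrated with a small max-sum pass over a block tree. This is only a sketch under stated assumptions, not the patented method: the label set, the hand-written `local` scores (standing in for scores derived from feature vectors), and the `edge_score` compatibility function are all hypothetical.

```python
def best_labeling(tree, root, labels, node_score, edge_score):
    """Find the highest-scoring joint labeling of a block tree (max-sum).

    tree: dict mapping a block to the list of its child blocks
    node_score(block, label): local score of a label for a block
    edge_score(parent_label, child_label): parent/child compatibility
    Returns (best_total_score, {block: label}).
    """
    up = {}      # up[block][label]: best score of the subtree given that label
    choice = {}  # choice[(child, parent_label)]: best child label

    def upward(block):
        children = tree.get(block, [])
        for child in children:
            upward(child)
        up[block] = {}
        for label in labels:
            total = node_score(block, label)
            for child in children:
                best = max(labels,
                           key=lambda c: edge_score(label, c) + up[child][c])
                choice[(child, label)] = best
                total += edge_score(label, best) + up[child][best]
            up[block][label] = total

    def downward(block, label, assignment):
        assignment[block] = label
        for child in tree.get(block, []):
            downward(child, choice[(child, label)], assignment)

    upward(root)
    root_label = max(labels, key=lambda l: up[root][l])
    assignment = {}
    downward(root, root_label, assignment)
    return up[root][root_label], assignment


# Hypothetical page: one candidate record block containing two leaf blocks.
LABELS = ["record", "element", "other"]
TREE = {"page": ["block"], "block": ["title", "price"]}
local = {
    "page": {"other": 1.0},
    "block": {"record": 0.5, "other": 0.6},
    "title": {"element": 1.0},
    "price": {"element": 0.2, "other": 0.3},
}

def node_score(block, label):
    return local[block].get(label, 0.0)

def edge_score(parent_label, child_label):
    # Elements are rewarded inside records and penalized elsewhere.
    if child_label == "element":
        return 1.0 if parent_label == "record" else -1.0
    return 0.0

score, labeling = best_labeling(TREE, "page", LABELS, node_score, edge_score)
print(labeling)
```

In this toy input the `price` block slightly prefers `other` on its own, and `block` slightly prefers `other` over `record`, yet the joint search labels them `element` and `record`, because the evidence from the `title` element propagates up the hierarchy and back down.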

    Determining relevance of documents to a query based on identifier distance
    32.
    Granted patent
    Status: In force

    Publication No.: US07630964B2

    Publication date: 2009-12-08

    Application No.: US11273624

    Filing date: 2005-11-14

    IPC classification: G06F7/00 G06F17/30

    Abstract: A method and system for determining relevance of a document to a query based on identifier match distance is provided. The relevance system analyzes a training set of queries and documents to determine the relationship between identifier match distance and relevance of a document to a query. The identifier match distance indicates the distance from the end of an identifier of a document to an identifier term that matches a query term. The relevance system generates a prior relevance probability that a document with a certain identifier match distance is relevant to a query. The relevance system uses the prior relevance probabilities to determine relevance of documents to queries based on identifier match distance.
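
As a rough illustration of identifier match distance, the sketch below treats a URL as the identifier, splits it into terms, and counts terms from the end of the URL back to the nearest term that matches the query. The tokenization, the unit of distance (terms), the choice of the nearest match, and the relative-frequency estimate of the prior are all assumptions made for this example; the abstract does not fix them.

```python
import re
from collections import defaultdict

def identifier_terms(identifier):
    """Split an identifier such as a URL into lowercase alphanumeric terms."""
    return [t for t in re.split(r"[^0-9a-z]+", identifier.lower()) if t]

def match_distance(identifier, query):
    """Number of terms between the end of the identifier and the nearest
    identifier term that matches a query term; None when nothing matches."""
    query_terms = set(identifier_terms(query))
    for distance, term in enumerate(reversed(identifier_terms(identifier))):
        if term in query_terms:
            return distance
    return None

def prior_relevance(training):
    """Estimate P(relevant | match distance) as a relative frequency.

    training: iterable of (identifier, query, is_relevant) triples.
    """
    relevant = defaultdict(int)
    total = defaultdict(int)
    for identifier, query, is_relevant in training:
        d = match_distance(identifier, query)
        total[d] += 1
        relevant[d] += int(is_relevant)
    return {d: relevant[d] / total[d] for d in total}


# Hypothetical training judgments.
training = [
    ("http://a.example/camera", "camera", True),
    ("http://a.example/camera/x/y", "camera", False),
    ("http://b.example/camera", "camera", True),
    ("http://b.example/camera/p/q", "camera", True),
]
priors = prior_relevance(training)
print(priors)
```

At query time, a ranking function could look up `priors[match_distance(url, query)]` and combine it with other relevance evidence.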

    Information classification paradigm
    33.
    Granted patent
    Status: In force

    Publication No.: US07529748B2

    Publication date: 2009-05-05

    Application No.: US11276818

    Filing date: 2006-03-15

    IPC classification: G06F17/30

    Abstract: A mechanism to classify source documents into one of two categories, either likely to contain desired information or unlikely to contain desired information. Generally, some form of rules-based classification is used in conjunction with deeper analysis that applies advanced techniques to difficult cases. The rules-based classification is generally good for eliminating cases from further consideration and for identifying documents of interest based on generally discernible relationships between data or based on the presence or absence of data. The deeper analysis is used to uncover more complex relationships between data that may identify documents of interest. Portions of the process may use the entire document while other portions of the process may use only a portion of the document.
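
The two-stage idea can be sketched as a rule cascade with a fallback analyzer. Everything specific here is hypothetical: the two rules, the invoice-style keywords, and the keyword-count stand-in for the "deeper analysis" are illustrative assumptions, not taken from the patent.

```python
def no_digits_rule(doc):
    """Hypothetical elimination rule: a document without digits is rejected."""
    return None if any(ch.isdigit() for ch in doc) else False

def explicit_marker_rule(doc):
    """Hypothetical acceptance rule based on the presence of a marker."""
    return True if "invoice total:" in doc.lower() else None

def keyword_score_model(doc, head_chars=200):
    """Stand-in for the deeper analysis; it reads only the start of the
    document, mirroring the note that some steps use only a portion."""
    head = doc[:head_chars].lower()
    hits = sum(head.count(word) for word in ("amount", "due", "payment"))
    return hits >= 2

RULES = [no_digits_rule, explicit_marker_rule]

def classify(doc, rules=RULES, deep_model=keyword_score_model):
    """Return True if the document likely contains the desired information.

    Each rule returns True, False, or None (undecided). The first decided
    rule wins; undecided documents go on to the deeper analysis.
    """
    for rule in rules:
        verdict = rule(doc)
        if verdict is not None:
            return verdict
    return deep_model(doc)


print(classify("Payment of the amount 30 is due"))
```

Ordering the cheap rules first means the more expensive analysis only runs on the cases the rules cannot settle.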

    Retrieval of Structured Documents
    34.
    Patent application
    Status: In force

    Publication No.: US20090012956A1

    Publication date: 2009-01-08

    Application No.: US12211793

    Filing date: 2008-09-16

    Applicants: Ji-Rong Wen, Hang Cui

    Inventors: Ji-Rong Wen, Hang Cui

    IPC classification: G06F17/30 G06F7/38

    Abstract: This disclosure relates to performing a query for a search term of a database containing a plurality of structured documents. Those structured documents that do not include the search term are ferreted or filtered out during an initial search. Matched structured documents, which are those structured documents that do contain the search term, are evaluated by ranking the individual elements based on how well each individual element matches the search term, and by indicating to the user the ranking of the individual elements, wherein the individual elements can be accessed by the user.
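
A minimal sketch of the filter-then-rank flow follows. It assumes each structured document is a mapping from element name to text and scores an element by how many times the search term occurs in it; both the data layout and the scoring are assumptions for illustration only.

```python
def search(documents, term):
    """Filter out documents without the term, then rank matching elements.

    documents: dict of doc_id -> {element_name: element_text}
    Returns dict of doc_id -> element names ordered from best to worst match.
    """
    term = term.lower()
    results = {}
    for doc_id, elements in documents.items():
        counts = {name: text.lower().split().count(term)
                  for name, text in elements.items()}
        if not any(counts.values()):
            continue  # initial search: drop documents that lack the term
        matching = [name for name, count in counts.items() if count > 0]
        # Highest count first; ties are broken alphabetically.
        results[doc_id] = sorted(matching, key=lambda n: (-counts[n], n))
    return results


# Hypothetical collection of two structured documents.
docs = {
    "d1": {"title": "Random fields",
           "body": "fields and more fields",
           "footer": "contact"},
    "d2": {"title": "Cooking", "body": "pasta"},
}
print(search(docs, "fields"))
```

Returning element names rather than whole documents lets a caller link the user directly to the best-matching element.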

    Pseudo-Anchor Text Extraction for Vertical Search
    35.
    Patent application
    Status: Lapsed

    Publication No.: US20080215563A1

    Publication date: 2008-09-04

    Application No.: US11681682

    Filing date: 2007-03-02

    IPC classification: G06F17/30

    Abstract: A search method uses pseudo-anchor text associated with search objects to improve search performance. The pseudo-anchor text may be extracted in combination with an identifier of the search objects (such as a pseudo-URL) from a digital corpus such as a collection of documents. Pseudo-anchor texts for each object are preferably extracted from candidate anchor blocks using a machine-learning-based approach. The pseudo-anchor texts are made available for searching and used to help rank the objects in a search result to improve search performance. The method may be used in vertical search of objects such as published articles, products and images that lack explicit URL and anchor text information.

    Method and system for troubleshooting a misconfiguration of a computer system based on product support services information
    36.
    Granted patent
    Status: In force

    Publication No.: US07389444B2

    Publication date: 2008-06-17

    Application No.: US10899939

    Filing date: 2004-07-27

    IPC classification: G06F11/00

    CPC classification: G06Q10/10

    Abstract: A method and system for ranking possible causes of a component exhibiting a certain behavior is provided. In one embodiment, a troubleshooting system ranks candidate configuration parameters that may be causing a software application to exhibit an undesired behavior using support information relating to problems resulting from the settings of configuration parameters. The support information may be collected from problem reports generated by product support services personnel when troubleshooting problems that users encounter with the application. The troubleshooting system ranks the candidate configuration parameters as likely causing the application to exhibit the undesired behavior based on analysis of the support information.
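
One simple way to picture the ranking step is to score each candidate parameter by how strongly the support reports that blame it resemble the observed symptom. The abstract does not say how the support information is analyzed, so the word-overlap score, the report format, and the parameter names below are purely hypothetical.

```python
def rank_parameters(candidates, symptom, reports):
    """Rank candidate configuration parameters for an observed symptom.

    reports: list of (problem_description, blamed_parameter) pairs, as might
    be extracted from support reports.
    A parameter's score is the total number of symptom words that appear in
    the descriptions of the reports blaming it.
    """
    symptom_words = set(symptom.lower().split())
    scores = {parameter: 0 for parameter in candidates}
    for description, parameter in reports:
        if parameter in scores:
            overlap = symptom_words & set(description.lower().split())
            scores[parameter] += len(overlap)
    # Highest score first; ties are broken alphabetically.
    return sorted(candidates, key=lambda p: (-scores[p], p))


# Hypothetical support data.
reports = [
    ("web page fails to load behind proxy", "proxy_port"),
    ("text too small", "font_size"),
    ("page load is slow", "cache_dir"),
    ("cannot load any page", "proxy_port"),
]
ranking = rank_parameters(["proxy_port", "font_size", "cache_dir"],
                          "page fails to load", reports)
print(ranking)
```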

    Method and system for calculating importance of a block within a display page
    37.
    Granted patent
    Status: Lapsed

    Publication No.: US07363279B2

    Publication date: 2008-04-22

    Application No.: US10834639

    Filing date: 2004-04-29

    Abstract: A method and system for identifying the importance of information areas of a display page. An importance system identifies information areas or blocks of a web page. A block of a web page represents an area of the web page that appears to relate to a similar topic. The importance system provides the characteristics or features of a block to an importance function that generates an indication of the importance of that block to its web page. The importance system “learns” the importance function by generating a model based on the features of blocks and the user-specified importance of those blocks. To learn the importance function, the importance system asks users to provide an indication of the importance of blocks of web pages in a collection of web pages.

    Event-based automated diagnosis of known problems
    38.
    Granted patent
    Status: In force

    Publication No.: US07337092B2

    Publication date: 2008-02-26

    Application No.: US11556638

    Filing date: 2006-11-03

    IPC classification: G06F19/00 G06F17/40

    CPC classification: G06F11/079 G06F11/0715

    Abstract: System events preceding occurrence of a problem are likely to be similar to events preceding occurrence of the same problem at other times or on other systems. Thus, the cause of a problem may be identified by comparing a trace of events preceding occurrence of the problem with previously diagnosed traces. Traces of events preceding occurrences of a problem arising from a known cause are reduced to a series of descriptive elements. These elements are aligned to correlate differently timed but otherwise similar traces of events, converted into symbolic representations, and archived. A trace of events leading to an undiagnosed problem is similarly converted to a symbolic representation. The representation of the undiagnosed trace is then compared to the archived representations to identify a similar archived representation. The cause of the similar archived representation is presented as a diagnosis of the problem.
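
The comparison step can be sketched with Python's standard `difflib.SequenceMatcher`, which scores how similar two sequences are. This is a swapped-in similarity measure, not the alignment described in the patent: here timing is simply discarded so that differently timed traces with the same event order look alike, and the event names and archive contents are hypothetical.

```python
from difflib import SequenceMatcher

def to_symbols(trace):
    """Reduce a trace of (timestamp, event) pairs to a list of event symbols,
    discarding the timing information."""
    return [event for _, event in trace]

def diagnose(undiagnosed_trace, archive):
    """Return (cause, similarity) for the most similar archived trace.

    archive: list of (known_cause, symbol_list) pairs built from previously
    diagnosed traces.
    """
    symbols = to_symbols(undiagnosed_trace)
    best_cause, best_score = None, -1.0
    for cause, archived_symbols in archive:
        score = SequenceMatcher(None, symbols, archived_symbols).ratio()
        if score > best_score:
            best_cause, best_score = cause, score
    return best_cause, best_score


# Hypothetical archive of diagnosed traces and one new trace.
archive = [
    ("disk full", ["open", "write", "write", "error"]),
    ("bad credentials", ["connect", "auth", "deny"]),
]
new_trace = [(0.0, "open"), (0.4, "write"), (2.5, "error")]
cause, similarity = diagnose(new_trace, archive)
print(cause, round(similarity, 3))
```

A practical system would probably also apply a minimum similarity threshold, so that a trace unlike anything in the archive is reported as undiagnosed rather than matched to its least-bad neighbor.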

    Determining relevance of documents to a query based on identifier distance
    39.
    Patent application
    Status: In force

    Publication No.: US20070112734A1

    Publication date: 2007-05-17

    Application No.: US11273624

    Filing date: 2005-11-14

    IPC classification: G06F17/30

    Abstract: A method and system for determining relevance of a document to a query based on identifier match distance is provided. The relevance system analyzes a training set of queries and documents to determine the relationship between identifier match distance and relevance of a document to a query. The identifier match distance indicates the distance from the end of an identifier of a document to an identifier term that matches a query term. The relevance system generates a prior relevance probability that a document with a certain identifier match distance is relevant to a query. The relevance system uses the prior relevance probabilities to determine relevance of documents to queries based on identifier match distance.
