Method and system of manipulating XML data in support of data mining
    1.
    Type: invention application
    Status: under examination (published)

    Publication No.: US20050144257A1

    Publication date: 2005-06-30

    Application No.: US10734345

    Filing date: 2003-12-13

    Abstract: The present invention provides a method and system of manipulating XML data in support of data mining. In an exemplary embodiment, the method and system include (1) storing the XML data in a network format to a buffer, thereby resulting in a stored network representation of the XML data and (2) selecting at least one feature of the XML data via a naive selection operating on the stored network representation of the XML data. In an exemplary embodiment, the method and system include (1) storing the XML data in a network format to a buffer, thereby resulting in a stored network representation of the XML data and (2) modifying at least one feature of the XML data via a naive modification operating on the stored network representation of the XML data. In an exemplary embodiment, the network format includes xtalk format.


    Search engine for selecting targeted messages
    2.
    Type: granted invention patent
    Status: in force

    Publication No.: US06778975B1

    Publication date: 2004-08-17

    Application No.: US09799863

    Filing date: 2001-03-05

    IPC classification: G06F17/30

    Abstract: A search engine receives query terms from a client. In response, the search engine executes a search on a web directory to identify zero or more documents that match the query terms. The identified documents are associated with one or more categories. The search engine probabilistically selects one of the categories associated with the identified documents. Each message in a message database is also associated with one or more of the categories. The search engine accesses the message database and selects at least one message associated with the selected category. The search engine returns a web page containing references to the documents matching the query terms and the one or more messages selected from the message database to the client.

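The selection step described in the abstract above can be sketched as follows. This is only a minimal illustration, not the patented implementation: the function name `select_message`, the dictionary inputs, and the choice to weight each category by how many matched documents carry it are all assumptions, since the abstract says only that the category is chosen "probabilistically".

```python
import random
from collections import Counter


def select_message(matched_docs, doc_categories, messages_by_category, rng=random):
    """Pick one (category, message) pair for the documents matched by a query.

    All names here are hypothetical. Weighting categories by how often they
    occur among the matched documents is an assumption of this sketch.
    """
    # Count how many matched documents fall into each category.
    counts = Counter(
        cat for doc in matched_docs for cat in doc_categories.get(doc, ())
    )
    # Assumption: only categories that have at least one message are eligible.
    candidates = {c: n for c, n in counts.items() if messages_by_category.get(c)}
    if not candidates:
        return None
    categories = list(candidates)
    weights = [candidates[c] for c in categories]
    # Probabilistic category choice, then a uniform pick among its messages.
    chosen = rng.choices(categories, weights=weights, k=1)[0]
    return chosen, rng.choice(messages_by_category[chosen])
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which is convenient when testing the selection logic.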

    System, method, and service for collaborative focused crawling of documents on a network
    3.
    Type: granted invention patent
    Status: lapsed

    Publication No.: US07552109B2

    Publication date: 2009-06-23

    Application No.: US10686964

    Filing date: 2003-10-15

    IPC classification: G06F17/30

    Abstract: A collaborative focused crawler crawls documents on a network locating documents that match multiple focus topics. The collaborative crawler comprises a fetcher and a focus engine. The fetcher prioritizes which documents to crawl based on a set of rules, obtains documents from the network, and outputs crawled documents to the focus engine. The focus engine determines whether a fetched document is relevant to any of the multiple focus topics. The focus engine determines whether fetched documents are disallowed. If a fetched document is disallowed, the present system may place the URL for that web document in a blacklist, a list of URLs that may not be crawled. URLs may be disallowed if they match a disallowed topic or if they fail a set of rules designed for a web space focus, for example, domain rules, IP address rules, and prefix rules.

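The disallow check described in the abstract above might look roughly like the sketch below. It is a simplified illustration under stated assumptions: the class `FocusFilter` and its methods are hypothetical, the topic labels of a fetched document are assumed to come from some external classifier, and only domain and prefix rules are shown (IP address rules are left out).

```python
from urllib.parse import urlsplit


class FocusFilter:
    """Hypothetical focus-engine filter: topic check plus web-space rules."""

    def __init__(self, allowed_domains, allowed_prefixes, disallowed_topics):
        self.allowed_domains = tuple(allowed_domains)
        self.allowed_prefixes = tuple(allowed_prefixes)
        self.disallowed_topics = set(disallowed_topics)
        self.blacklist = set()  # URLs that may not be crawled again

    def passes_rules(self, url):
        """Apply the domain rule and the prefix rule to a URL."""
        host = urlsplit(url).hostname or ""
        domain_ok = any(
            host == d or host.endswith("." + d) for d in self.allowed_domains
        )
        prefix_ok = url.startswith(self.allowed_prefixes)
        return domain_ok and prefix_ok

    def check(self, url, topics):
        """Return True if the fetched document is allowed; blacklist it otherwise."""
        if url in self.blacklist:
            return False
        # Disallowed if it matches a disallowed topic or fails the rules.
        if self.disallowed_topics & set(topics) or not self.passes_rules(url):
            self.blacklist.add(url)
            return False
        return True
```

Once a URL lands in `blacklist`, later calls to `check` reject it regardless of the topics passed in, mirroring the "list of URLs that may not be crawled" in the abstract.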

    System, Method, and service for collaborative focused crawling of documents on a network
    4.
    Type: invention application
    Status: lapsed

    Publication No.: US20050086206A1

    Publication date: 2005-04-21

    Application No.: US10686964

    Filing date: 2003-10-15

    IPC classification: G06F17/30

    Abstract: A collaborative focused crawler crawls documents on a network locating documents that match multiple focus topics. The collaborative crawler comprises a fetcher and a focus engine. The fetcher prioritizes which documents to crawl based on a set of rules, obtains documents from the network, and outputs crawled documents to the focus engine. The focus engine determines whether a fetched document is relevant to any of the multiple focus topics. The focus engine determines whether fetched documents are disallowed. If a fetched document is disallowed, the present system may place the URL for that web document in a blacklist, a list of URLs that may not be crawled. URLs may be disallowed if they match a disallowed topic or if they fail a set of rules designed for a web space focus, for example, domain rules, IP address rules, and prefix rules.


    Search engine for selecting targeted messages
    5.
    Type: invention application
    Status: under examination (published)

    Publication No.: US20050065917A1

    Publication date: 2005-03-24

    Application No.: US10840667

    Filing date: 2004-05-06

    IPC classification: G06F17/30 G06F7/00

    Abstract: A search engine receives query terms from a client. In response, the search engine executes a search on a web directory to identify zero or more documents that match the query terms. The identified documents are associated with one or more categories. The search engine probabilistically selects one of the categories associated with the identified documents. Each message in a message database is also associated with one or more of the categories. The search engine accesses the message database and selects at least one message associated with the selected category. The search engine returns a web page containing references to the documents matching the query terms and the one or more messages selected from the message database to the client.


    System and method for generating normalized relevance measure for analysis of search results
    6.
    Type: granted invention patent
    Status: lapsed

    Publication No.: US07725463B2

    Publication date: 2010-05-25

    Application No.: US10879002

    Filing date: 2004-06-30

    IPC classification: G06F17/30

    Abstract: A system and related techniques permit a search service operator to access a variety of disparate relevance measures, and integrate those measures into idealized or unified data sets. A search service operator may employ self-learning networks to generate relevance rankings of Web site hits in response to user queries or searches, such as Boolean text or other searches. To improve the accuracy and quality of the rankings of results, the service provider may accept as inputs relevance measures created from query logs, from human-annotated search records, from independent commercial or other search sites, or from other sources and feed those measures to a normalization engine. That engine may normalize those relevance ratings to a common scale, such as quintiles, percentages or other scales or levels. The provider may then use that idealized or normalized combined measure to train the search algorithms or heuristics to arrive at more accurate results.

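The normalization step described in the abstract above can be illustrated with a short sketch. It assumes that each relevance source is a mapping from result identifier to a raw score, that quintiles are the common scale, and that the combined measure is a plain average over the sources that rated an item. The function names and the averaging rule are assumptions made for illustration; the abstract does not specify how the normalized measures are combined.

```python
def to_quintiles(scores):
    """Map one source's raw scores to quintile ranks 1..5 (5 = most relevant)."""
    values = list(scores.values())
    n = len(values)
    result = {}
    for item, score in scores.items():
        # Number of scores in this source that are at or below this one.
        at_or_below = sum(1 for v in values if v <= score)
        # Integer ceiling of 5 * at_or_below / n, avoiding float rounding.
        result[item] = (5 * at_or_below + n - 1) // n
    return result


def combine(sources):
    """Average the quintile ranks of each item over the sources that rated it."""
    totals, counts = {}, {}
    for scores in sources:
        for item, quintile in to_quintiles(scores).items():
            totals[item] = totals.get(item, 0) + quintile
            counts[item] = counts.get(item, 0) + 1
    return {item: totals[item] / counts[item] for item in totals}
```

Because each source is converted to ranks before combining, sources on very different raw scales (for example click counts and human ratings between 0 and 1) contribute comparably to the combined measure.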

    System and method for generating normalized relevance measure for analysis of search results
    7.
    Type: invention application
    Status: lapsed

    Publication No.: US20060004891A1

    Publication date: 2006-01-05

    Application No.: US10879002

    Filing date: 2004-06-30

    IPC classification: G06F17/30

    Abstract: A system and related techniques permit a search service operator to access a variety of disparate relevance measures, and integrate those measures into idealized or unified data sets. A search service operator may employ self-learning networks to generate relevance rankings of Web site hits in response to user queries or searches, such as Boolean text or other searches. To improve the accuracy and quality of the rankings of results, the service provider may accept as inputs relevance measures created from query logs, from human-annotated search records, from independent commercial or other search sites, or from other sources and feed those measures to a normalization engine. That engine may normalize those relevance ratings to a common scale, such as quintiles, percentages or other scales or levels. The provider may then use that idealized or normalized combined measure to, for example, train the search algorithms or heuristics to arrive at better or more accurate results.
