Abstract:
The present invention provides a method and system for manipulating XML data in support of data mining. In one exemplary embodiment, the method and system include (1) storing the XML data in a network format to a buffer, thereby producing a stored network representation of the XML data, and (2) selecting at least one feature of the XML data via a naive selection operating on the stored network representation. In another exemplary embodiment, the method and system include (1) storing the XML data in a network format to a buffer, thereby producing a stored network representation of the XML data, and (2) modifying at least one feature of the XML data via a naive modification operating on the stored network representation. In an exemplary embodiment, the network format includes xtalk format.
Abstract:
A search engine receives query terms from a client. In response, the search engine executes a search on a web directory to identify zero or more documents that match the query terms. The identified documents are associated with one or more categories. The search engine probabilistically selects one of the categories associated with the identified documents. Each message in a message database is also associated with one or more of the categories. The search engine accesses the message database and selects at least one message associated with the selected category. The search engine returns a web page containing references to the documents matching the query terms and the one or more messages selected from the message database to the client.
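The selection flow described above can be sketched as follows. This is a minimal illustration, not the patented method: the abstract says only that a category is chosen "probabilistically", so weighting each category by the number of matched documents that carry it is an assumption, and the names `pick_category`, `pick_messages`, and `build_results` are hypothetical.

```python
import random
from collections import Counter


def pick_category(doc_categories, rng):
    """Pick one category at random from those attached to the matched documents.

    Assumption: each category is weighted by how many matched documents carry it.
    """
    counts = Counter(c for cats in doc_categories.values() for c in cats)
    if not counts:
        return None  # zero documents matched, so there is nothing to choose from
    categories = sorted(counts)
    weights = [counts[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


def pick_messages(messages, category, rng, k=1):
    """Select up to k messages associated with the chosen category."""
    candidates = [m for m, cats in messages.items() if category in cats]
    return rng.sample(candidates, min(k, len(candidates)))


def build_results(matched_docs, messages, seed=None):
    """Combine the matched documents and the selected messages into one result."""
    rng = random.Random(seed)
    category = pick_category(matched_docs, rng)
    chosen = pick_messages(messages, category, rng) if category else []
    return {"documents": sorted(matched_docs), "category": category, "messages": chosen}
```

Here `matched_docs` and `messages` are plain dictionaries mapping an identifier to its list of categories; a real system would read both from the web directory and the message database.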
Abstract:
A collaborative focused crawler crawls documents on a network, locating documents that match multiple focus topics. The collaborative crawler comprises a fetcher and a focus engine. The fetcher prioritizes which documents to crawl based on a set of rules, obtains documents from the network, and outputs crawled documents to the focus engine. The focus engine determines whether a fetched document is relevant to any of the multiple focus topics. The focus engine also determines whether fetched documents are disallowed. If a fetched document is disallowed, the present system may place the URL for that web document in a blacklist, a list of URLs that may not be crawled. URLs may be disallowed if they match a disallowed topic or if they fail a set of rules designed for a web space focus, for example, domain rules, IP address rules, and prefix rules.
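The disallow check and blacklist described above might look like the sketch below. The abstract names the rule types (domain, IP address, prefix) but not their semantics, so everything here is an assumption: domain and prefix rules are treated as allow-lists, IP rules as blocked networks, and the class `FocusRules` and its methods are hypothetical names.

```python
import ipaddress
from urllib.parse import urlsplit


class FocusRules:
    """Hypothetical rule set for a web space focus, plus the resulting blacklist."""

    def __init__(self, allowed_domains, blocked_networks, allowed_prefixes, disallowed_topics):
        self.allowed_domains = set(allowed_domains)
        self.blocked_networks = [ipaddress.ip_network(n) for n in blocked_networks]
        self.allowed_prefixes = tuple(allowed_prefixes)
        self.disallowed_topics = set(disallowed_topics)
        self.blacklist = set()  # URLs that may not be crawled

    def is_disallowed(self, url, ip, topics):
        """Return True if the document matches a disallowed topic or fails any rule."""
        host = urlsplit(url).hostname or ""
        if self.disallowed_topics & set(topics):
            return True  # matches a disallowed topic
        if not any(host == d or host.endswith("." + d) for d in self.allowed_domains):
            return True  # fails the domain rules
        if any(ipaddress.ip_address(ip) in net for net in self.blocked_networks):
            return True  # fails the IP address rules
        if not url.startswith(self.allowed_prefixes):
            return True  # fails the prefix rules
        return False

    def check(self, url, ip, topics):
        """Return True if the URL may be kept; blacklist it otherwise."""
        if url in self.blacklist:
            return False
        if self.is_disallowed(url, ip, topics):
            self.blacklist.add(url)
            return False
        return True
```

In this sketch the topics for a document are passed in by the caller; in the system described, they would come from the focus engine's relevance determination.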
Abstract:
A system and related techniques permit a search service operator to access a variety of disparate relevance measures, and integrate those measures into idealized or unified data sets. A search service operator may employ self-learning networks to generate relevance rankings of Web site hits in response to user queries or searches, such as Boolean text or other searches. To improve the accuracy and quality of the rankings of results, the service provider may accept as inputs relevance measures created from query logs, from human-annotated search records, from independent commercial or other search sites, or from other sources and feed those measures to a normalization engine. That engine may normalize those relevance ratings to a common scale, such as quintiles, percentages or other scales or levels. The provider may then use that idealized or normalized combined measure to train the search algorithms or heuristics to arrive at more accurate results.
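One way to picture the normalization step is the sketch below. The abstract says only that disparate relevance ratings are brought to a common scale such as quintiles or percentages; min-max rescaling followed by a simple average across sources is an assumed stand-in, and the function names are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale one source's scores to the range [0, 1] (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 0.5 for key in scores}  # no spread, so use the midpoint
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def combine(sources):
    """Average the normalized scores of each item over the sources that rate it."""
    totals, counts = {}, {}
    for source in sources:
        if not source:
            continue  # skip sources that rated nothing
        for key, value in min_max_normalize(source).items():
            totals[key] = totals.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


def to_quintile(value):
    """Map a score in [0, 1] to a quintile from 1 (lowest) to 5 (highest)."""
    return min(int(value * 5) + 1, 5)
```

For example, a query-log source scoring items on a 10 to 30 scale and a human-annotated source scoring on a 1 to 5 scale both end up on [0, 1] before averaging, and the combined values could then serve as training targets for the ranking algorithm.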
Abstract:
A system and related techniques permit a search service operator to access a variety of disparate relevance measures, and integrate those measures into idealized or unified data sets. A search service operator may employ self-learning networks to generate relevance rankings of Web site hits in response to user queries or searches, such as Boolean text or other searches. To improve the accuracy and quality of the rankings of results, the service provider may accept as inputs relevance measures created from query logs, from human-annotated search records, from independent commercial or other search sites, or from other sources and feed those measures to a normalization engine. That engine may normalize those relevance ratings to a common scale, such as quintiles, percentages or other scales or levels. The provider may then use that idealized or normalized combined measure to, for example, train the search algorithms or heuristics to arrive at better or more accurate results.