Abstract:
A system and method for automatically populating an existing concept hierarchy of items with new items, using entropy as a measure of the correctness of a potential classification. User-defined concept hierarchies include, for example, document hierarchies such as directories for the Internet, library catalogues, patent databases and journals, and product hierarchies. These concept hierarchies can be huge and are usually maintained manually. An Internet directory may have, for example, millions of Web sites, thousands of editors and hundreds of thousands of different categories. The method for populating a concept hierarchy includes calculating, for each candidate class, a conditional ‘entropy’ value representing the randomness of the distribution of classification attributes across the hierarchical set of classes if the new item were added to that class, and then selecting the class whose insertion of the new data item yields the minimum randomness of distribution.
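The selection step described in this abstract can be illustrated with a minimal sketch. It is not the patented implementation: it assumes a flat set of classes rather than a full hierarchy, treats each item as a bag of attribute tokens, and uses the attribute entropy within each class, weighted by class size, as the conditional entropy. The names `attribute_entropy`, `conditional_entropy` and `best_class` are hypothetical.

```python
import math
from collections import Counter


def attribute_entropy(counts):
    """Shannon entropy (in bits) of the attribute distribution in one class."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in counts.values() if n > 0)


def conditional_entropy(class_counts):
    """Entropy of attributes given the class: size-weighted mean over classes."""
    grand_total = sum(sum(c.values()) for c in class_counts.values())
    if grand_total == 0:
        return 0.0
    return sum((sum(c.values()) / grand_total) * attribute_entropy(c)
               for c in class_counts.values())


def best_class(class_counts, new_item):
    """Tentatively insert new_item into each class and keep the class
    whose insertion gives the lowest conditional entropy."""
    scores = {}
    for name in class_counts:
        # Copy the counts so each trial insertion starts from the original state.
        trial = {k: Counter(v) for k, v in class_counts.items()}
        trial[name].update(new_item)
        scores[name] = conditional_entropy(trial)
    return min(scores, key=scores.get), scores


# Illustrative data only (assumed, not from the source).
classes = {
    "sports": Counter({"ball": 4, "team": 4}),
    "cooking": Counter({"recipe": 4, "oven": 4}),
}
chosen, scores = best_class(classes, ["ball", "team"])
print(chosen, scores)
```

In this toy data, inserting the new item into the class that already shares its attributes leaves that class's distribution unchanged, whereas inserting it into the other class spreads that class's distribution over more attributes and raises the overall entropy. A real hierarchy would also need to propagate counts to ancestor classes, which this sketch omits.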
Abstract:
The invention describes a method and system to optimize network bandwidth and obtain greater efficiency in the transmission of messages/data in a client-server network. The invention proposes clustering client requests and data items so as to optimize network transmission and to reduce the processing cost involved in sending the data items at the server end and picking/pruning them at the client end.
Abstract:
Provided are methods, apparatus and computer programs for evaluating the resilience, to structural changes in a data source, of a representative label representing a data element within the data source. Also disclosed are applications using a resilient representative label. For example, a representative label may represent a particular data field or other data element within a semi-structured data source—such as within XML or HTML Web pages. An estimate of resilience to changes can be used to determine whether a candidate representative label satisfies a required degree of resilience, or to enable selection of a label with the highest resilience score among a set of representative labels. The validated or selected representative label may then be used for data extraction, remaining usable despite the possibility of future changes to the structure of a Web page, or for template clustering/classification.
Abstract:
A method (400) is disclosed of extracting factoids from text repositories, with the factoids being associated with a given factoid category. The method (400) starts by training a classifier (230) to recognise factoids relevant to that given factoid category. Documents or document summaries relevant to the given factoid category are next collected (410) from the text repositories. Sentences having a predetermined association to the given factoid category are extracted (420) from said documents or document summaries. Those sentences are classified (440), in a noisy environment, using the classifier (230) to extract snippets containing phrases relevant to the given factoid category. The extracted snippets are the factoids associated with the given factoid category.