Abstract:
A system and associated method for evaluating cross-domain clusterability of a target domain and a source domain. The cross-domain clusterability is calculated as a linear combination of a target clusterability and a source-target pair matchability, using a trade-off parameter that determines the relative contribution of the target clusterability and the source-target pair matchability. The target clusterability quantifies how clusterable the target domain is. The source-target pair matchability is calculated as an average of a target-side matchability and a source-side matchability, which quantify how well the target centroids of the target domain are aligned with the source centroids and how well the source centroids of the source domain are aligned with the target centroids, respectively.
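The combination described above can be sketched as follows. This is only an illustration: the abstract does not give the exact form of the linear combination, so a convex combination weighted by `trade_off` is assumed here, and the function names and the idea that all inputs are precomputed scores are hypothetical.

```python
def pair_matchability(target_side: float, source_side: float) -> float:
    """Average of the target-side and source-side matchability scores."""
    return (target_side + source_side) / 2.0


def cross_domain_clusterability(
    target_clusterability: float,
    target_side: float,
    source_side: float,
    trade_off: float = 0.5,
) -> float:
    """Linear combination of target clusterability and pair matchability.

    Assumption: the trade-off parameter lies in [0, 1] and weights the two
    terms as trade_off and (1 - trade_off). The abstract only states that
    the parameter sets their relative contribution.
    """
    if not 0.0 <= trade_off <= 1.0:
        raise ValueError("trade_off is assumed to lie in [0, 1]")
    matchability = pair_matchability(target_side, source_side)
    return trade_off * target_clusterability + (1.0 - trade_off) * matchability
```

Under these assumptions, `cross_domain_clusterability(0.8, 0.6, 0.4, trade_off=0.5)` evaluates to about 0.65, and setting `trade_off` to 1.0 ignores the source domain entirely.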
Abstract:
A system and method for automated populating of an existing concept hierarchy of items with new items, using entropy as a measure of the correctness of a potential classification. User-defined concept hierarchies include, for example, document hierarchies such as directories for the Internet, library catalogues, patent databases and journals, and product hierarchies. These concept hierarchies can be huge and are usually maintained manually. An internet directory may have, for example, millions of Web sites, thousands of editors and hundreds of thousands of different categories. The method for populating a concept hierarchy includes calculating conditional ‘entropy’ values representing the randomness of distribution of classification attributes for the hierarchical set of classes if a new item is added to specific classes of the hierarchy and then selecting whichever class has the minimum randomness of distribution when calculated as a condition of insertion of the new data item.
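A minimal sketch of the minimum-entropy selection step, under simplifying assumptions that are not in the abstract: the hierarchy is flattened to a single level of classes, each class is summarized by a count of its classification attributes (for example, words), and the conditional entropy is the size-weighted average of per-class attribute entropy. All names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def entropy(counts: Counter) -> float:
    """Shannon entropy (in bits) of an attribute-count distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def conditional_entropy(classes: Dict[str, Counter]) -> float:
    """Entropy of attributes given the class, weighted by class size."""
    grand_total = sum(sum(c.values()) for c in classes.values())
    if grand_total == 0:
        return 0.0
    return sum(
        (sum(c.values()) / grand_total) * entropy(c) for c in classes.values()
    )


def best_class(classes: Dict[str, Counter], item_attributes: Iterable[str]) -> str:
    """Tentatively insert the item into each class; keep the lowest-entropy choice."""
    item = Counter(item_attributes)
    scores = {}
    for name in classes:
        trial = dict(classes)                 # shallow copy; other classes unchanged
        trial[name] = classes[name] + item    # class contents if the item were added
        scores[name] = conditional_entropy(trial)
    return min(scores, key=scores.get)
```

For instance, with a class holding only `ball` and `goal` attributes and another holding only `oven` and `salt`, an item with attributes `["ball", "goal"]` is assigned to the first class, because inserting it there leaves the attribute distributions least random.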
Abstract:
Provided are methods, apparatus and computer programs for evaluating the resilience, to structural changes in a data source, of a representative label representing a data element within the data source. Also disclosed are applications using a resilient representative label. For example, a representative label may represent a particular data field or other data element within a semi-structured data source—such as within XML or HTML Web pages. An estimate of resilience to changes can be used to determine whether a candidate representative label satisfies a required degree of resilience, or to enable selection of a label with the highest resilience score among a set of representative labels. The validated or selected representative label may then be used for data extraction, remaining usable despite the possibility of future changes to the structure of a Web page, or for template clustering/classification.
Abstract:
A method (400) is disclosed of extracting factoids from text repositories, with the factoids being associated with a given factoid category. The method (400) starts by training a classifier (230) to recognise factoids relevant to that given factoid category. Documents or document summaries relevant to the given factoid category are next collected (410) from the text repositories. Sentences having a predetermined association to the given factoid category are extracted (420) from said documents or document summaries. Those sentences are classified (440), in a noisy environment, using the classifier (230) to extract snippets containing phrases relevant to the given factoid category. The extracted snippets are the factoids associated with the given factoid category.
Abstract:
Techniques for detecting one or more documents that are duplicate or near-duplicate of a first document are provided. The techniques include obtaining a first document, obtaining one or more additional documents, retrieving a set of one or more document signatures for each document, and detecting one or more documents that are duplicate or near-duplicate of the first document by detecting each of the one or more additional documents that have at least a minimum number of signatures in common with the first document, wherein detecting each of the one or more additional documents that have at least a minimum number of signatures in common with the first document comprises dynamically using at least one of a user-configurable similarity definition and a user-configurable similarity threshold value.
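The detection step can be illustrated with a small sketch. The abstract does not say how signatures are produced, so hashed three-word shingles are assumed here purely for illustration. The `similarity` and `threshold` parameters stand in for the user-configurable similarity definition and threshold value, and all function names are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List, Set


def signatures(text: str, shingle_size: int = 3) -> Set[str]:
    """Assumed signature scheme: SHA-1 hashes of overlapping word shingles."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < shingle_size:
        shingles = [" ".join(words)]
    else:
        shingles = [
            " ".join(words[i:i + shingle_size])
            for i in range(len(words) - shingle_size + 1)
        ]
    return {hashlib.sha1(s.encode("utf-8")).hexdigest() for s in shingles}


def shared_count(a: Set[str], b: Set[str]) -> float:
    """Similarity definition 1: number of signatures in common."""
    return float(len(a & b))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Similarity definition 2: shared signatures over all signatures."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_near_duplicates(
    first: str,
    others: Dict[str, str],
    similarity: Callable[[Set[str], Set[str]], float] = shared_count,
    threshold: float = 2,
) -> List[str]:
    """Return names of documents whose similarity to `first` meets the threshold."""
    first_sigs = signatures(first)
    return [
        name
        for name, text in others.items()
        if similarity(first_sigs, signatures(text)) >= threshold
    ]
```

Passing `similarity=jaccard` with a fractional `threshold` switches the definition of "near-duplicate" at call time without changing the stored signatures, which is one way to read the "dynamically using" language in the abstract.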
Abstract:
Methods and arrangements for enhancing content in discussion forums. Access to an online discussion is provided. A posting by an author participating in the discussion is accepted, and a recommendation is automatically produced for the author for amending the posting to increase the likelihood of response to the posting by other individuals participating in the discussion.
Abstract:
Techniques, an apparatus and an article of manufacture for identifying one or more utterances that are likely to carry the intent of a speaker, from a conversation between two or more parties. A method includes obtaining an input of a set of utterances in chronological order from a conversation between two or more parties, computing an intent confidence value of each utterance by summing intent confidence scores from each of the constituent words of the utterance, wherein intent confidence scores capture each word's influence on the subsequent utterances in the conversation based on (i) the uniqueness of the word in the conversation and (ii) the number of times the word subsequently occurs in the conversation, and generating a ranked order of the utterances from highest to lowest intent confidence value, wherein the highest intent confidence value corresponds to the utterance which is most likely to carry the intent of the speaker.
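A rough sketch of the ranking described above. The abstract names the two factors behind a word's score but not the formula, so the scoring here is an assumption: a word's score is the number of later utterances containing it, divided by one plus the number of earlier utterances containing it (a stand-in for "uniqueness"). Tokenization by whitespace and the function name are also hypothetical.

```python
from typing import List, Tuple


def rank_utterances(utterances: List[str]) -> List[Tuple[str, float]]:
    """Rank utterances (given in chronological order) by intent confidence value."""
    tokenized = [set(u.lower().split()) for u in utterances]
    scored = []
    for i, words in enumerate(tokenized):
        value = 0.0
        for word in words:
            # Assumed uniqueness factor: fewer earlier occurrences -> higher weight.
            earlier = sum(1 for prev in tokenized[:i] if word in prev)
            # Subsequent-occurrence factor: later utterances that reuse the word.
            later = sum(1 for nxt in tokenized[i + 1:] if word in nxt)
            value += later / (1 + earlier)
        scored.append((value, i, utterances[i]))
    # Highest value first; ties keep chronological order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(text, value) for value, _, text in scored]
```

In a short exchange where a caller says "i want to cancel my order" and later turns repeat "order", "want", "to" and "cancel", that utterance ranks first because its words recur most in what follows.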
Abstract:
A clustering-based approach to data standardization is provided. Certain embodiments take as input a plurality of addresses, identify one or more features of the addresses, cluster the addresses based on the one or more features, utilize the cluster(s) to provide a data-based context useful in identifying one or more synonyms for elements contained in the address(es), and standardize the address(es) to an acceptable format, with one or more synonyms and/or other elements being added to or taken away from the input address(es) as part of the standardization process.