Abstract:
[Object] To provide a support system or method that enables efficient generation of candidate synonyms when creating a thesaurus usable in text mining. [Constitution] A candidate synonym acquisition device 130 acquires, from data 110 for each writer, a set of candidate synonyms similar to an input word for each writer, and acquires a set of candidate synonyms similar to the input word from collective data 120. A generated candidate synonym set 140 is input to a candidate synonym determination device 150 to evaluate the candidate synonyms of the collective data 120. In the evaluation, the status "absolute" is given to a word matching a word ranked first in the candidate synonyms for each writer, and the status "negative" is given to words matching words ranked second or lower therein.
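The evaluation step can be illustrated with a minimal sketch. The function name `evaluate_candidates`, the "unknown" label for unmatched words, and the rule that a first-rank match takes precedence over a lower-rank match are assumptions made for illustration; the abstract does not specify them.

```python
from typing import Dict, List


def evaluate_candidates(
    collective: List[str], per_writer: Dict[str, List[str]]
) -> Dict[str, str]:
    """Give each collective candidate a status based on per-writer rankings."""
    # Words ranked first in at least one writer's candidate list.
    first_ranked = {ranked[0] for ranked in per_writer.values() if ranked}
    # Words ranked second or lower in at least one writer's candidate list.
    lower_ranked = {w for ranked in per_writer.values() for w in ranked[1:]}

    status = {}
    for word in collective:
        if word in first_ranked:      # assumption: first rank wins on conflict
            status[word] = "absolute"
        elif word in lower_ranked:
            status[word] = "negative"
        else:                         # hypothetical label, not in the abstract
            status[word] = "unknown"
    return status


collective = ["battery", "cell", "charger", "adapter"]
per_writer = {
    "writer_a": ["battery", "charger"],
    "writer_b": ["cell", "battery"],
}
print(evaluate_candidates(collective, per_writer))
```

Here "battery" is ranked first for one writer and second for another, so the precedence assumption decides its status.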
Abstract:
A statistical thesaurus is built dynamically, from the same text collection that is being searched, allowing improved generation of expanded query terms. The thesaurus is dynamic in that thesaurus records are collected, ranked, accessed, and applied dynamically. Thesaurus "records" are actually formed as indexed documents arranged in "collections". The collections are preferably distinguished based on text source. Each record has terms assembled in indexed groups which inherently reflect a ranking based on relevance to an initial query. After an initial query is received, the appropriate collection(s) of records may be searched by a conventional search and retrieval engine, the searches inherently returning records ranked by degree of relevance due to the record indexing scheme. A record ranking scheme avoids contamination of relevant records by less relevant records. The record selection and the expansion query term generation processes are each divided into parallel threads. The separate threads correspond to respective text sources to enable the improved expansion query term generation to be provided in real time.
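The per-source parallelism can be sketched as follows, assuming each collection is a list of records already ordered by relevance and each record is a list of terms. The names `expand_from_collection` and `expand_query`, the record layout, and the `max_terms` cutoff are hypothetical; the abstract does not describe the record format or the search engine.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def expand_from_collection(
    records: List[List[str]], query: str, max_terms: int = 3
) -> List[str]:
    """Collect expansion terms from records that contain the query term."""
    terms: List[str] = []
    for record in records:  # assumption: records are pre-sorted by relevance
        if query in record:
            for term in record:
                if term != query and term not in terms:
                    terms.append(term)
    return terms[:max_terms]


def expand_query(
    collections: Dict[str, List[List[str]]], query: str, max_terms: int = 3
) -> Dict[str, List[str]]:
    """Run one worker thread per text source and gather the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(collections))) as pool:
        futures = {
            source: pool.submit(expand_from_collection, records, query, max_terms)
            for source, records in collections.items()
        }
        return {source: future.result() for source, future in futures.items()}


collections = {
    "news": [["car", "automobile", "vehicle"], ["car", "truck"]],
    "manuals": [["engine", "motor"], ["car", "sedan"]],
}
print(expand_query(collections, "car", max_terms=2))
```

Keeping one thread per source means a slow source does not delay term generation for the others, which is the property the abstract ties to real-time operation.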
Abstract:
An information retrieval system including a plurality of indices representative of information stored in the information retrieval system and a dynamic lexicon is disclosed. The system includes memory having a database stored therein, the database being logically divided to include the plurality of indices, an information database having information objects stored therein and a dynamic lexicon which includes a plurality of data items and groups of data items that appear in the information database. A predetermined time variable represents the last time the plurality of indices were reindexed. After changes are made to the lexicon, a time stamp is attached to each one of the plurality of changes to the lexicon to indicate when the change was made to the lexicon. At some specified time interval later, the reindexing process is invoked. This process involves selecting a subset of the plurality of changes made to the lexicon after the predetermined time variable, locating all information objects in the information database that are affected by the plurality of changes to the lexicon, reindexing the portions of the plurality of indices representative of the information objects affected by the changes to the lexicon to reflect the changes in the lexicon, and then updating the predetermined time variable to indicate changes to the lexicon have been processed. The foregoing process is repeated until all changes to the lexicon after the predetermined time have been applied to the plurality of indices.
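The reindexing loop can be sketched as below. This is a simplified illustration, not the disclosed implementation: the `Index` and `LexiconChange` types, the `batch_size` parameter, and the use of substring matching to locate affected information objects are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class LexiconChange:
    term: str
    timestamp: float  # when the change was made to the lexicon


@dataclass
class Index:
    documents: Dict[int, str]
    postings: Dict[str, Set[int]] = field(default_factory=dict)
    last_reindexed: float = 0.0  # the "predetermined time variable"


def reindex(
    index: Index, lexicon: Set[str], changes: List[LexiconChange], batch_size: int = 2
) -> int:
    """Apply lexicon changes newer than last_reindexed, one batch at a time."""
    pending = sorted(
        (c for c in changes if c.timestamp > index.last_reindexed),
        key=lambda c: c.timestamp,
    )
    applied = 0
    while pending:
        # Select a subset of the outstanding changes.
        batch, pending = pending[:batch_size], pending[batch_size:]
        for change in batch:
            if change.term in lexicon:
                # Locate affected documents (assumption: substring match).
                index.postings[change.term] = {
                    doc_id
                    for doc_id, text in index.documents.items()
                    if change.term in text.lower()
                }
            else:
                # The term was removed from the lexicon.
                index.postings.pop(change.term, None)
        # Record that changes up to this timestamp have been processed.
        index.last_reindexed = batch[-1].timestamp
        applied += len(batch)
    return applied
```

Because `last_reindexed` advances after every batch, calling `reindex` again with the same change list applies nothing.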
Abstract:
The present disclosure provides a system and method for managing data using semantic tags. The method may include providing a data model corresponding to a first set of tangible objects, where the data model includes a first template class having both properties describing the set of tangible objects and a set of semantic tags corresponding to the properties. The method may include receiving a class definition for a second template class for a second set of tangible objects, where the second template class inherits, by the class definition, the properties and the semantic tags for the second set of tangible objects.
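A minimal sketch of template-class inheritance with semantic tags follows. The `TemplateClass` type, its methods, and the example property and tag names are hypothetical and chosen only to illustrate the idea.

```python
from typing import Dict, List, Optional, Set


class TemplateClass:
    """A template class whose properties each carry a set of semantic tags."""

    def __init__(
        self,
        name: str,
        properties: Optional[Dict[str, Set[str]]] = None,
        parent: Optional["TemplateClass"] = None,
    ) -> None:
        self.name = name
        self.parent = parent
        self.own_properties = dict(properties or {})

    def properties(self) -> Dict[str, Set[str]]:
        """Return inherited properties and tags merged with this class's own."""
        merged = dict(self.parent.properties()) if self.parent else {}
        for prop, tags in self.own_properties.items():
            merged[prop] = set(merged.get(prop, set())) | set(tags)
        return merged

    def find_by_tag(self, tag: str) -> List[str]:
        """List property names that carry the given semantic tag."""
        return sorted(p for p, tags in self.properties().items() if tag in tags)


pump = TemplateClass("Pump", {"flow_rate": {"measurement", "liters_per_minute"}})
centrifugal = TemplateClass(
    "CentrifugalPump", {"impeller_diameter": {"dimension"}}, parent=pump
)
print(centrifugal.properties())
```

The second class declares only its own property, yet a tag lookup such as `centrifugal.find_by_tag("measurement")` still finds the inherited `flow_rate`.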
Abstract:
A method and an apparatus for identifying synonyms and using such synonyms to conduct a search are disclosed. The disclosed method includes: obtaining any two words to be identified; determining whether the shortest edit distance between the two words is less than or equal to an edit distance threshold; determining whether the two words to be identified exist in a preset knowledge database and, if so, searching the knowledge database for the smallest-granularity type with the highest weight value for each word; and, if the two words have the same smallest-granularity type with the highest weight value, determining that the two words are synonyms, and otherwise that they are non-synonyms. The disclosed techniques greatly improve the accuracy of synonym identification and ensure its effectiveness.
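The checks above can be sketched as follows. Several details are assumptions, since the abstract leaves them open: both the edit-distance check and the type check must pass, words missing from the knowledge database are treated as non-synonyms, and the "smallest granularity type with highest weight value" is read as the finest-grained type with ties broken by weight. The knowledge-base layout and the name `are_synonyms` are hypothetical.

```python
from typing import Dict, List, Tuple

# word -> list of (type name, granularity level, weight); a smaller level
# means a finer-grained type (hypothetical layout).
KnowledgeBase = Dict[str, List[Tuple[str, int, float]]]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(
                min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb))
            )
        prev = curr
    return prev[-1]


def best_type(entries: List[Tuple[str, int, float]]) -> str:
    """Pick the finest-grained type, breaking ties by highest weight."""
    return min(entries, key=lambda e: (e[1], -e[2]))[0]


def are_synonyms(w1: str, w2: str, kb: KnowledgeBase, max_distance: int = 2) -> bool:
    if edit_distance(w1, w2) > max_distance:
        return False
    if not kb.get(w1) or not kb.get(w2):
        return False  # assumption: unknown words are not judged synonyms
    return best_type(kb[w1]) == best_type(kb[w2])


kb = {
    "colour": [("color_term", 1, 0.9), ("attribute", 3, 0.95)],
    "color": [("color_term", 1, 0.8), ("brand", 1, 0.2)],
    "collar": [("clothing_part", 1, 0.9)],
}
print(are_synonyms("colour", "color", kb))
print(are_synonyms("color", "collar", kb))
```

The second pair is close in spelling but resolves to different types, which is the case the knowledge-database check is meant to filter out.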
Abstract:
A search query of a search word entered by the user is received, the received search queries are stored in order of reception in a search query storing means (12a), a preceding search query whose reception order is earlier than that of the received search query is extracted from the search query storing means on the basis of a preset search query extracting condition, a preceding search word constituting the extracted preceding search query and a search word constituting the received search query are stored as a character string set in a character string set storing means (12d), character string sets having a search word that is the same as or similar to the preceding search word are extracted from the character string set storing means in accordance with a preset character string set extraction start condition (S51), a character string set is specified as related words from the extracted character string sets on the basis of a preset registration condition (S53), and the specified character string set is registered as related words into a related-word database (S54).
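A rough sketch of the pairing and registration flow is given below. The abstract leaves the extracting, extraction start, and registration conditions unspecified, so this example assumes that the preceding query is the one received immediately before, and that a pair is registered once it has been seen at least `min_count` times. Both rules and the function name are illustrative assumptions.

```python
from collections import Counter
from typing import List, Set, Tuple


def related_words(query_log: List[str], min_count: int = 2) -> Set[Tuple[str, str]]:
    """Pair each query with its predecessor and keep frequently seen pairs."""
    pairs: Counter = Counter()
    # query_log is assumed to be stored in order of reception.
    for previous, current in zip(query_log, query_log[1:]):
        if previous != current:
            pairs[(previous, current)] += 1
    # Assumed registration condition: the pair occurred often enough.
    return {pair for pair, count in pairs.items() if count >= min_count}


log = ["laptop", "notebook", "laptop", "notebook", "phone"]
print(related_words(log))
```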
Abstract:
A related-word candidate group (12b) obtained by extracting candidates of a related word on the basis of a predetermined condition from a search query log (12a) is generated (S1 to S4), a search query of a search word entered by the user is received (S10), a partial character string is generated from a character string of the search word (S13), on the basis of the partial character strings, a candidate character string is extracted from the related-word candidate group (S14), a suitability score of the candidate character string is calculated (S16), the candidate character strings are ranked in order of the scores (S17), a reference line L1 of a suitability score for the ranking is generated on the basis of the suitability score and the ranking (S18), a candidate character string whose suitability score is apart from the reference line by a preset threshold or larger is extracted as a registration character string to be registered as a related word (S19), and the extracted registration character string and the search word are registered as related words into the related-word DB 12c (S20).
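Steps S13 to S19 can be sketched as below. The abstract does not define the suitability score or how the reference line L1 is generated, so this example takes the scores as input, assumes the reference line is a least-squares fit of score against rank, and assumes "apart from" means above the line. All function names and the `min_len` parameter are hypothetical.

```python
from typing import Dict, List


def extract_candidates(
    search_word: str, candidate_group: List[str], min_len: int = 2
) -> List[str]:
    """Keep candidates containing any partial string of the search word."""
    partials = {
        search_word[i:j]
        for i in range(len(search_word))
        for j in range(i + min_len, len(search_word) + 1)
    }
    return [
        c for c in candidate_group
        if c != search_word and any(p in c for p in partials)
    ]


def select_registrations(scores: Dict[str, float], threshold: float) -> List[str]:
    """Rank by score, fit a line to (rank, score), keep large positive residuals."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    n = len(ranked)
    if n == 0:
        return []
    xs = list(range(1, n + 1))
    ys = [score for _, score in ranked]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    # Assumed reading: register candidates scoring well above the line.
    return [
        word
        for (word, score), x in zip(ranked, xs)
        if score - (slope * x + intercept) >= threshold
    ]


print(extract_candidates("notebook", ["notebook pc", "book", "note", "tablet"], 4))
print(select_registrations({"a": 10.0, "b": 4.0, "c": 3.0, "d": 2.0, "e": 1.0}, 1.5))
```

Comparing against a fitted line rather than a fixed cutoff lets the selection adapt to how steeply scores fall off across the ranking.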
Abstract:
The invention makes it possible to detect the characteristics of text data and to infer potential hidden meaning in the text data. A word-cutting unit 3 performs a word-cutting process on the text data input from an input unit 1, a syntax-analysis unit 4 performs syntax analysis, and a thesaurus-creation unit 5 creates thesauruses from the results. After word cutting and syntax analysis are performed again, a thesaurus-sorting unit 7 performs sorting, a frequency-of-appearance unit calculates the frequency of appearance of the thesauruses, a correlation-coefficient-calculation unit 11 calculates correlation coefficients between thesauruses, a correlation-coefficient-total-calculation unit 13 calculates the total of the correlation coefficients for each thesaurus, and a graph-creation-display unit 15 creates a graph based on the frequency of appearance and the total of the correlation coefficients for each thesaurus.
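The frequency and correlation-total calculations can be sketched as follows. The abstract does not name the correlation measure, so this example assumes the Pearson coefficient over per-document occurrence counts and treats a zero-variance series as having correlation 0. The function names and input layout are hypothetical, and graph drawing is omitted.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either series has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def summarize(counts: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Map each thesaurus to (frequency of appearance, total correlation)."""
    summary = {}
    for name, series in counts.items():
        total = sum(
            pearson(series, other)
            for other_name, other in counts.items()
            if other_name != name
        )
        summary[name] = (sum(series), total)
    return summary


# Occurrence counts of each thesaurus in three documents (made-up data).
counts = {"price": [1, 2, 3], "cost": [2, 4, 6], "delay": [3, 2, 1]}
for name, (frequency, total) in summarize(counts).items():
    print(name, frequency, round(total, 3))
```

Each resulting pair gives one point for a plot of frequency of appearance against total correlation.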