Abstract:
Disclosed are an index structure and a method of extending an index. The method comprises: (a) performing indexing operations that generate an inverted index in memory for newly inserted data sources; (b) if the number of data sources involved in the indexing operations reaches a first threshold value k1, sequentially writing the generated inverted index into the first index subfile; (c) if the number of the smallest grids, or index groups, in the first index subfile reaches a second threshold value k2, merging the k2 grids into a larger grid and sequentially writing it into the second index subfile; and (d) if the number of the smallest grids in the second index subfile reaches a third threshold value k3, merging the k3 grids into a larger grid and sequentially writing it into the first index subfile. Because index updating mostly occurs in small grids, the number of I/O operations on large grids is reduced, which increases the speed of index building and updating. In addition, the threshold values k1, k2 and k3 may be automatically adjusted based on the usage of system resources.
Abstract:
A method, system and program storage device are provided for extending an inverted index, which comprises first and second inverted index subfiles to increase the speed of establishing and updating inverted index files. The method includes performing ordered keyword indexing operations of generating an inverted index from data sources, in which a frequency of occurrence of keywords in each of the data sources is calculated, and writing each keyword, the data sources, and the frequency of occurrence of each keyword in the corresponding data sources to the inverted index. If a number of data sources involved in the indexing operations reaches a first threshold, then writing contents of the inverted index as a smallest grid into the first inverted index subfile. If a number of smallest grids in the first inverted index subfile reaches a second threshold, then merging the smallest grids into a merged grid and writing the merged grid into the second inverted index subfile. If the number of merged grids in the second inverted index subfile reaches a third threshold, then further merging the merged grids into a larger merged grid, and writing the larger merged grid back into the first inverted index subfile.
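The threshold-driven flow described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `TieredInvertedIndex`, the helper `merge_grids`, the whitespace tokenizer, and the use of in-memory lists in place of the two on-disk subfiles are all assumptions made for illustration. The sketch also assumes that grids are dropped from a subfile once they have been merged, which the abstract does not state.

```python
from collections import Counter, defaultdict


def merge_grids(grids):
    """Merge grids (keyword -> {doc_id: frequency}) into one keyword-ordered grid."""
    merged = defaultdict(dict)
    for grid in grids:
        for keyword, postings in grid.items():
            merged[keyword].update(postings)
    return dict(sorted(merged.items()))


class TieredInvertedIndex:
    """Hypothetical sketch; lists stand in for the two inverted index subfiles."""

    def __init__(self, k1=2, k2=2, k3=2):
        self.k1, self.k2, self.k3 = k1, k2, k3
        self.memory = defaultdict(dict)  # keyword -> {doc_id: frequency}
        self.docs_in_memory = 0
        self.first = []   # first subfile: (level, grid) entries
        self.second = []  # second subfile: merged grids

    def add(self, doc_id, text):
        # Record the frequency of each keyword in this data source.
        for keyword, freq in Counter(text.lower().split()).items():
            self.memory[keyword][doc_id] = freq
        self.docs_in_memory += 1
        if self.docs_in_memory >= self.k1:  # first threshold reached
            self._flush()

    def _flush(self):
        # Write the in-memory index as a smallest grid into the first subfile.
        self.first.append(("smallest", dict(sorted(self.memory.items()))))
        self.memory = defaultdict(dict)
        self.docs_in_memory = 0
        smallest = [g for level, g in self.first if level == "smallest"]
        if len(smallest) >= self.k2:  # second threshold reached
            # Assumption: merged grids are removed from the first subfile.
            self.first = [e for e in self.first if e[0] != "smallest"]
            self.second.append(merge_grids(smallest))
        if len(self.second) >= self.k3:  # third threshold reached
            # The larger merged grid goes back into the first subfile.
            self.first.append(("larger", merge_grids(self.second)))
            self.second = []

    def lookup(self, keyword):
        """Collect postings from memory and from both subfiles."""
        result = dict(self.memory.get(keyword, {}))
        for _, grid in self.first:
            result.update(grid.get(keyword, {}))
        for grid in self.second:
            result.update(grid.get(keyword, {}))
        return result
```

With all three thresholds set to 2, eight added data sources end up in a single larger grid in the first subfile, and most writes along the way touch only the small grids.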
Abstract:
A method and system for recognizing chemical names in a Chinese document. The method includes: receiving a Chinese document including chemical names; recognizing chemical name segments in the document; recognizing non-chemical name segments in the document; and combining the chemical name segments to get chemical names based on the recognized chemical name segments and non-chemical name segments. Specific embodiments of the present invention can effectively recognize chemical names from a chemical document.
Abstract:
A method, system and computer program product for identifying an advertisement in a web page. The method includes the steps of: receiving a sample page; analyzing a source code of the sample page to obtain a node feature of the sample page; analyzing the node feature using a preset rule to find a sample advertisement in the sample page; analyzing a first link of the sample advertisement to obtain a link mode of the sample advertisement; and utilizing the link mode to identify a second advertisement, where at least one of the steps is carried out using a computer device so that the advertisement in a web page is identified.
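As a rough illustration of the last two steps, the following Python sketch derives a "link mode" from one sample advertisement link and uses it to identify further advertisements. The abstract does not define what a link mode is; here it is assumed to be the link's host plus its path with runs of digits generalized. The function names `link_mode` and `find_ads` and the example URLs are hypothetical.

```python
import re
from urllib.parse import urlparse


def link_mode(sample_ad_url):
    """Build a pattern from a sample ad link (assumed definition of 'link mode')."""
    parts = urlparse(sample_ad_url)
    # Escape the path literally, then let every run of digits match any number.
    path = re.sub(r"\d+", lambda m: r"\d+", re.escape(parts.path))
    return re.compile(re.escape(parts.netloc) + path)


def find_ads(mode, links):
    """Return the links whose host and path fit the link mode."""
    ads = []
    for link in links:
        parts = urlparse(link)
        if mode.fullmatch(parts.netloc + parts.path):
            ads.append(link)
    return ads


# Usage with made-up links: only the first candidate shares the sample's mode.
mode = link_mode("http://ads.example.com/click/123")
candidates = [
    "http://ads.example.com/click/98765",
    "http://news.example.com/story/5",
    "http://ads.example.com/about",
]
matched = find_ads(mode, candidates)
```

A real system would first locate the sample advertisement from node features of the page source, a step this sketch leaves out.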
Abstract:
A method and system for expanding a document set as a search data source in the field of business related search. The present invention provides a method of expanding a seed document in a seed document set. The method includes identifying one or more entity words of the seed document; identifying, based on each entity word, one or more topic words related to that entity word in the seed document where the entity word is located; forming an entity word-topic word pair from each identified topic word and the entity word on the basis of which that topic word is identified; and obtaining one or more expanded documents through the web by using the entity word and topic word in each entity word-topic word pair together as key words. A system for executing the above method is also provided.
Abstract:
A method and system for filtering a candidate document in a candidate document set are provided. The method includes receiving one or more entity word-topic word pairs and identifying one or more entity words and topic words of the candidate document. The method also includes determining whether to add the candidate document into a filtered document set using the entity words and topic words in the given entity word-topic word pairs and the identified entity words and topic words in the candidate document. The method further includes adding the candidate document into the filtered document set in response to determining that the candidate document should be added into the filtered document set.
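The filtering decision can be sketched in Python as follows. The abstract does not say how the determination is made, so the rule here is an assumption: a candidate is kept when at least `min_matches` of the given pairs have both their entity word and their topic word among the words identified in the document. The function name `filter_candidates`, the `min_matches` parameter, and the sample data are hypothetical, and the entity and topic words of each candidate are taken as already identified.

```python
def filter_candidates(candidates, pairs, min_matches=1):
    """Return ids of candidate documents that match enough entity-topic pairs.

    candidates: dict mapping doc_id -> (set of entity words, set of topic words)
    pairs: iterable of (entity_word, topic_word) tuples
    """
    filtered = []
    for doc_id, (entities, topics) in candidates.items():
        # Count the given pairs whose two words both occur in this document.
        matches = sum(1 for e, t in pairs if e in entities and t in topics)
        if matches >= min_matches:  # assumed decision rule
            filtered.append(doc_id)
    return filtered


# Usage with made-up data: only doc "a" contains a complete pair.
pairs = [("acme", "merger"), ("acme", "earnings")]
candidates = {
    "a": ({"acme", "globex"}, {"merger"}),
    "b": ({"acme"}, {"weather"}),
    "c": ({"globex"}, {"merger"}),
}
kept = filter_candidates(candidates, pairs)
```

Raising `min_matches` makes the filter stricter; other rules, such as weighting pairs by frequency, would fit the same interface.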
Abstract:
A method and apparatus for preprocessing a plurality of documents for search and presenting a search result, and a system for searching documents that comprises these apparatuses. The search result, for example, includes at least one candidate document. The candidate document is assigned a tree structure representing its content. The tree structure includes at least one node. The method may include presenting at least a portion of the tree structure corresponding to the candidate document in the search result.
Abstract:
The present invention provides a method and apparatus for preprocessing a plurality of documents for search and presenting a search result, and a system for searching documents that comprises these apparatuses. The search result comprises at least one candidate document, and each candidate document is assigned a tree structure that represents its content and comprises at least one node. The method for presenting the search result comprises presenting at least a portion of the tree structure corresponding to the at least one candidate document in the search result.
Abstract:
A method and system for expanding a document set as a search data source in the field of business related search. The present invention provides a method of expanding a seed document in a seed document set. The method includes identifying one or more entity words of the seed document; identifying, based on each entity word, one or more topic words related to that entity word in the seed document where the entity word is located; forming an entity word-topic word pair from each identified topic word and the entity word on the basis of which that topic word is identified; and obtaining one or more expanded documents by using the entity word and topic word in each entity word-topic word pair together as key words for web searching. A system for executing the above method is also provided.