Index and Method for Extending and Querying Index
    1.
    Invention application
    Status: Lapsed

    Publication (announcement) number: US20070124277A1

    Publication (announcement) date: 2007-05-31

    Application number: US11562495

    Filing date: 2006-11-22

    IPC classification: G06F17/30

    CPC classification: G06F17/30622

    Abstract: Disclosed are an index structure and a method of extending an index, the method comprising: (a) performing indexing operations that generate, in memory, an inverted index for newly inserted data sources; (b) if the number of data sources involved in the indexing operations reaches a first threshold value k1, sequentially writing the generated inverted index into the first index subfile; (c) if the number of the smallest grids, or index groups, in the first index subfile reaches a second threshold value k2, merging the k2 grids into a larger grid and sequentially writing it into the second index subfile; and (d) if the number of the smallest grids in the second index subfile reaches a third threshold value k3, merging the k3 grids into a larger grid and sequentially writing it into the first index subfile. Because index updating mostly occurs in small grids, the number of I/O operations on large grids is reduced, and thus the speed of index building and updating is increased. In addition, the threshold values k1, k2 and k3 may be automatically adjusted based on the usage of system resources.
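Steps (a) to (d) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the class name `TieredIndex`, the use of in-memory lists in place of on-disk subfiles, whitespace tokenization, and the choice to apply the same "smallest grids" rule to larger grids that return to the first subfile are all assumptions made for the example.

```python
from collections import defaultdict


class TieredIndex:
    """Toy model of the two-subfile index; lists stand in for on-disk subfiles."""

    def __init__(self, k1=2, k2=2, k3=2):
        if k1 < 1 or k2 < 2 or k3 < 2:
            raise ValueError("k1 must be >= 1 and k2, k3 must be >= 2")
        self.k1, self.k2, self.k3 = k1, k2, k3
        self.memory = defaultdict(set)   # keyword -> ids of data sources, step (a)
        self.docs_in_memory = 0
        # Each subfile holds (level, grid) pairs; level 0 is the smallest grid size.
        self.subfiles = ([], [])

    def add(self, doc_id, text):
        for word in text.split():        # assumed tokenization: whitespace
            self.memory[word].add(doc_id)
        self.docs_in_memory += 1
        if self.docs_in_memory >= self.k1:            # step (b): flush to subfile 1
            grid = {w: set(ids) for w, ids in self.memory.items()}
            self.memory.clear()
            self.docs_in_memory = 0
            self._append(0, 0, grid)

    def _append(self, which, level, grid):
        subfile = self.subfiles[which]
        subfile.append((level, grid))
        smallest = min(lv for lv, _ in subfile)
        group = [g for lv, g in subfile if lv == smallest]
        threshold = self.k2 if which == 0 else self.k3
        if len(group) >= threshold:                   # steps (c) and (d): merge
            merged = defaultdict(set)
            for g in group:
                for w, ids in g.items():
                    merged[w] |= ids
            subfile[:] = [(lv, g) for lv, g in subfile if lv != smallest]
            self._append(1 - which, smallest + 1, dict(merged))

    def search(self, word):
        """Union the postings held in memory and in every grid of both subfiles."""
        hits = set(self.memory.get(word, ()))
        for subfile in self.subfiles:
            for _, grid in subfile:
                hits |= grid.get(word, set())
        return hits
```

With all three thresholds set to 2, eight inserted data sources end up in a single level-2 grid in the first subfile, and only the small level-0 and level-1 grids are rewritten along the way.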

    Index and method for extending and querying index
    2.
    Granted invention patent
    Status: Lapsed

    Publication (announcement) number: US07689574B2

    Publication (announcement) date: 2010-03-30

    Application number: US11562495

    Filing date: 2006-11-22

    IPC classification: G06F17/00 G06F15/16 G06F3/00

    CPC classification: G06F17/30622

    Abstract: A method, system and program storage device are provided for extending an inverted index, which comprises first and second inverted index subfiles, to increase the speed of establishing and updating inverted index files. The method includes performing ordered keyword indexing operations that generate an inverted index from data sources, in which the frequency of occurrence of keywords in each of the data sources is calculated, and writing each keyword, the data sources, and the frequency of occurrence of each keyword in the corresponding data sources to the inverted index. If the number of data sources involved in the indexing operations reaches a first threshold, the contents of the inverted index are written as a smallest grid into the first inverted index subfile. If the number of smallest grids in the first inverted index subfile reaches a second threshold, the smallest grids are merged into a merged grid, which is written into the second inverted index subfile. If the number of merged grids in the second inverted index subfile reaches a third threshold, the merged grids are further merged into a larger merged grid, which is written back into the first inverted index subfile.
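The inverted index described here records, for each keyword, the data sources it occurs in and how often. A minimal sketch of that data structure follows, assuming whitespace tokenization, lowercase keywords, and grids that cover disjoint sets of data sources; the function names `build_inverted_index` and `merge_grids` are hypothetical.

```python
from collections import Counter


def build_inverted_index(sources):
    """Map each keyword to {source_id: occurrence count}, with keywords in sorted order."""
    postings = {}
    for source_id, text in sources.items():
        for keyword, freq in Counter(text.lower().split()).items():
            postings.setdefault(keyword, {})[source_id] = freq
    return dict(sorted(postings.items()))


def merge_grids(grids):
    """Merge several grids into one; assumes no data source appears in two grids."""
    merged = {}
    for grid in grids:
        for keyword, by_source in grid.items():
            merged.setdefault(keyword, {}).update(by_source)
    return dict(sorted(merged.items()))
```

Keeping the keywords ordered means a merged grid can be produced by walking the input grids in key order, which is what makes sequential writes to a subfile practical.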

    RECOGNIZING CHEMICAL NAMES IN A CHINESE DOCUMENT
    3.
    Invention application
    Status: In force

    Publication (announcement) number: US20130054226A1

    Publication (announcement) date: 2013-02-28

    Application number: US13598692

    Filing date: 2012-08-30

    IPC classification: G06F17/20

    CPC classification: G06F17/2765

    Abstract: A method and system for recognizing chemical names in a Chinese document. The method includes: receiving a Chinese document including chemical names; recognizing chemical name segments in the document; recognizing non-chemical name segments in the document; and combining the chemical name segments to get chemical names based on the recognized chemical name segments and non-chemical name segments. Specific embodiments of the present invention can effectively recognize chemical names from a chemical document.
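The abstract does not say how segments are recognized, so the following is only a sketch of the combining step under a strong assumption: both kinds of segment are found by greedy longest-match lookup in small, hypothetical lexicons (`CHEMICAL_SEGMENTS`, `NON_CHEMICAL_SEGMENTS`). Adjacent chemical name segments are joined into one name, and any non-chemical or unrecognized segment acts as a boundary.

```python
CHEMICAL_SEGMENTS = {"乙酸", "乙酯", "甲苯"}   # hypothetical lexicon of name segments
NON_CHEMICAL_SEGMENTS = {"加入", "和"}         # hypothetical lexicon of ordinary words


def label_segments(text):
    """Greedy longest-match labelling into ('chem' | 'other' | 'unknown', segment) pairs."""
    vocab = [(w, "chem") for w in CHEMICAL_SEGMENTS]
    vocab += [(w, "other") for w in NON_CHEMICAL_SEGMENTS]
    vocab.sort(key=lambda item: len(item[0]), reverse=True)
    labelled, i = [], 0
    while i < len(text):
        for word, label in vocab:
            if text.startswith(word, i):
                labelled.append((label, word))
                i += len(word)
                break
        else:
            # No lexicon entry starts here: emit one unknown character.
            labelled.append(("unknown", text[i]))
            i += 1
    return labelled


def combine_chemical_names(labelled):
    """Join runs of adjacent chemical segments; other segments end the current run."""
    names, current = [], ""
    for label, segment in labelled:
        if label == "chem":
            current += segment
        else:
            if current:
                names.append(current)
            current = ""
    if current:
        names.append(current)
    return names
```

For the input `加入乙酸乙酯和甲苯`, the segments `乙酸` and `乙酯` are adjacent and are combined into `乙酸乙酯`, while `和` separates that name from `甲苯`.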

    METHOD AND SYSTEM FOR IDENTIFYING ADVERTISEMENT IN WEB PAGE
    4.
    Invention application
    Status: In force

    Publication (announcement) number: US20110078558A1

    Publication (announcement) date: 2011-03-31

    Application number: US12893187

    Filing date: 2010-09-29

    IPC classification: G06F17/00

    Abstract: A method, system and computer program product for identifying an advertisement in a web page. The method includes the steps of: receiving a sample page; analyzing a source code of the sample page to obtain a node feature of the sample page; analyzing the node feature using a preset rule to find a sample advertisement in the sample page; analyzing a first link of the sample advertisement to obtain a link mode of the sample advertisement; and utilizing the link mode to identify a second advertisement, where at least one of the steps is carried out using a computer device so that the advertisement in a web page is identified.
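The flow in this abstract (node features, a preset rule, a link mode learned from a sample advertisement, then reuse of that mode) can be illustrated with Python's standard `html.parser`. Everything specific below is an assumption for the sake of the example: the node feature is the size of an image inside a link, the preset rule is a set of banner dimensions (`BANNER_SIZES`), and the link mode is the host plus the first path segment of the link.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

BANNER_SIZES = {("468", "60"), ("728", "90")}  # hypothetical preset rule


class LinkCollector(HTMLParser):
    """Collect one feature dict per <a>: its href and the size of any <img> inside it."""

    def __init__(self):
        super().__init__()
        self.nodes, self._open = [], None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self._open = {"href": attrs["href"], "img_size": None}
            self.nodes.append(self._open)
        elif tag == "img" and self._open is not None:
            self._open["img_size"] = (attrs.get("width"), attrs.get("height"))

    def handle_endtag(self, tag):
        if tag == "a":
            self._open = None


def collect_nodes(page_source):
    collector = LinkCollector()
    collector.feed(page_source)
    return collector.nodes


def link_mode(href):
    """Assumed link mode: (host, first path segment) of the link target."""
    parts = urlparse(href)
    return (parts.netloc, parts.path.strip("/").split("/")[0])


def learn_link_modes(sample_source):
    """Find sample ads with the preset rule and return the link modes they use."""
    return {link_mode(n["href"]) for n in collect_nodes(sample_source)
            if n["img_size"] in BANNER_SIZES}


def identify_ads(page_source, modes):
    """Flag every link whose mode matches one learned from the sample ads."""
    return [n["href"] for n in collect_nodes(page_source) if link_mode(n["href"]) in modes]
```

The point of the second step is that a text-only advertisement, which the size rule alone would miss, is still caught because its link follows the same pattern as the sample advertisement.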

    DOCUMENT PROCESSING METHOD AND SYSTEM
    5.
    Invention application
    Status: In force

    Publication (announcement) number: US20100306248A1

    Publication (announcement) date: 2010-12-02

    Application number: US12786557

    Filing date: 2010-05-25

    IPC classification: G06F17/30

    CPC classification: G06F17/30716 G06F17/30011

    Abstract: A method and system for expanding a document set as a search data source in the field of business related search. The present invention provides a method of expanding a seed document in a seed document set. The method includes identifying one or more entity words of the seed document; identifying, on the basis of each entity word, one or more topic words related to that entity word in the seed document where the entity word is located; forming an entity word-topic word pair from each identified topic word and the entity word on the basis of which that topic word was identified; and obtaining one or more expanded documents through the web by using the entity word and topic word in each entity word-topic word pair together as keywords. A system for executing the above method is also provided.
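As a rough illustration of the pairing step, the sketch below assumes that entity words come from a fixed lexicon (`KNOWN_ENTITIES`) and that a topic word is simply a frequent non-stopword that shares a sentence with the entity word. Neither assumption comes from the abstract, and no web search is performed; `to_queries` only builds the keyword strings that would be submitted.

```python
import re
from collections import Counter

KNOWN_ENTITIES = {"Acme", "Globex"}    # hypothetical entity lexicon
STOPWORDS = {"the", "a", "of", "and", "its", "in", "to", "new"}


def entity_topic_pairs(seed_text, topics_per_entity=1):
    """Pair each entity word with the words that most often share a sentence with it."""
    counts = {}
    for sentence in re.split(r"[.!?]", seed_text):
        words = re.findall(r"[A-Za-z]+", sentence)
        entities = [w for w in words if w in KNOWN_ENTITIES]
        others = [w.lower() for w in words
                  if w not in KNOWN_ENTITIES and w.lower() not in STOPWORDS]
        for entity in entities:
            counts.setdefault(entity, Counter()).update(others)
    return [(entity, topic)
            for entity, counter in counts.items()
            for topic, _ in counter.most_common(topics_per_entity)]


def to_queries(pairs):
    """Build one query per pair that requires both the entity word and the topic word."""
    return [f'"{entity}" "{topic}"' for entity, topic in pairs]
```

Using both words of a pair in the same query is what narrows the expansion: a search for the entity word alone would return documents on unrelated topics.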

    Recognizing chemical names in a chinese document
    6.
    Granted invention patent
    Status: In force

    Publication (announcement) number: US09575957B2

    Publication (announcement) date: 2017-02-21

    Application number: US13598692

    Filing date: 2012-08-30

    IPC classification: G06F17/27

    CPC classification: G06F17/2765

    Abstract: A method and system for recognizing chemical names in a Chinese document. The method includes: receiving a Chinese document including chemical names; recognizing chemical name segments in the document; recognizing non-chemical name segments in the document; and combining the chemical name segments to get chemical names based on the recognized chemical name segments and non-chemical name segments. Specific embodiments of the present invention can effectively recognize chemical names from a chemical document.

    Document processing method and system
    7.
    Granted invention patent
    Status: In force

    Publication (announcement) number: US09058383B2

    Publication (announcement) date: 2015-06-16

    Application number: US13608438

    Filing date: 2012-09-10

    IPC classification: G06F7/00 G06F17/30

    CPC classification: G06F17/30716 G06F17/30011

    Abstract: A method and system for filtering a candidate document in a candidate document set are provided. The method includes receiving one or more entity word-topic word pairs and identifying one or more entity words and topic words of the candidate document. The method also includes determining whether to add the candidate document into a filtered document set using the entity words and topic words in the given entity word-topic word pairs and the identified entity words and topic words in the candidate document. The method further includes adding the candidate document into the filtered document set in response to determining that the candidate document should be added.
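The abstract leaves the decision rule open. One simple possibility, shown here purely as an assumption, is to keep a candidate document when it contains both words of at least one given pair; the function names are hypothetical.

```python
import re


def matches_any_pair(candidate_text, pairs):
    """True if the candidate contains both the entity word and topic word of some pair."""
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", candidate_text)}
    return any(entity.lower() in words and topic.lower() in words
               for entity, topic in pairs)


def filter_candidates(candidates, pairs):
    """Return the filtered document set: candidates that match at least one pair."""
    return [doc for doc in candidates if matches_any_pair(doc, pairs)]
```

A stricter rule, such as requiring the two words to appear in the same sentence, would fit the same interface.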

    Method and apparatus for preprocessing a plurality of documents for search and for presenting search result
    8.
    Granted invention patent
    Status: In force

    Publication (announcement) number: US08838650B2

    Publication (announcement) date: 2014-09-16

    Application number: US11847285

    Filing date: 2007-08-29

    IPC classification: G06F7/00

    CPC classification: G06F17/30864

    Abstract: A method and apparatus for preprocessing a plurality of documents for search and for presenting a search result, and a system for searching documents that comprises these apparatuses. The search result includes, for example, at least one candidate document. The candidate document is assigned a tree structure representing its content. The tree structure includes at least one node. The method may include presenting at least a portion of the tree structure corresponding to the candidate document in the search result.
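To make the idea of presenting a portion of a document's tree concrete, the sketch below models the tree as nested nodes and renders only the levels down to a chosen depth. The `Node` class, the depth cutoff, and the indentation-based output are assumptions for illustration; the abstract does not specify how the portion is selected or displayed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    children: list = field(default_factory=list)


def render(node, max_depth, depth=0):
    """Return indented lines for the part of the tree no deeper than max_depth."""
    lines = ["  " * depth + node.title]
    if depth < max_depth:
        for child in node.children:
            lines.extend(render(child, max_depth, depth + 1))
    return lines
```

For a document tree with a root, two sections, and one subsection, `render(tree, 1)` lists the root and the two sections and leaves the subsection out, which is the kind of partial outline a result page could show beside each candidate document.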

    METHOD AND APPARATUS FOR PREPROCESSING A PLURALITY OF DOCUMENTS FOR SEARCH AND FOR PRESENTING SEARCH RESULT
    9.
    Invention application
    Status: In force

    Publication (announcement) number: US20080086457A1

    Publication (announcement) date: 2008-04-10

    Application number: US11847285

    Filing date: 2007-08-29

    IPC classification: G06F17/30

    CPC classification: G06F17/30864

    Abstract: The present invention provides a method and apparatus for preprocessing a plurality of documents for search and for presenting a search result, and a system for searching documents that comprises these apparatuses. The search result comprises at least one candidate document, and each candidate document is assigned a tree structure representing its content, the tree structure comprising at least one node. The method for presenting the search result comprises presenting, in the search result, at least a portion of the tree structure corresponding to the at least one candidate document.

    Document processing method and system
    10.
    Granted invention patent
    Status: In force

    Publication (announcement) number: US09043356B2

    Publication (announcement) date: 2015-05-26

    Application number: US13608309

    Filing date: 2012-09-10

    IPC classification: G06F7/00 G06F17/30

    CPC classification: G06F17/30716 G06F17/30011

    Abstract: A method and system for expanding a document set as a search data source in the field of business related search. The present invention provides a method of expanding a seed document in a seed document set. The method includes identifying one or more entity words of the seed document; identifying, on the basis of each entity word, one or more topic words related to that entity word in the seed document where the entity word is located; forming an entity word-topic word pair from each identified topic word and the entity word on the basis of which that topic word was identified; and obtaining one or more expanded documents by using the entity word and topic word in each entity word-topic word pair together as keywords for web searching. A system for executing the above method is also provided.
