Document data processing method and apparatus for document retrieval
    1.
    Granted invention patent
    Status: Expired

    Publication number: US5469354A

    Publication date: 1995-11-21

    Application number: US843162

    Filing date: 1992-02-28

    Abstract: A high-speed full document retrieval method and system capable of returning retrieval results within a practically acceptable search time. When documents are registered in a document database, condensed texts are created by decomposing each textual character string of the documents into fragmental character strings according to character species and by checking the mutual inclusion relations among the fragmental character strings. A component character table is created in which the characters occurring in each condensed text are registered without duplication. The condensed texts and the component character table are registered in the database together with the texts of the documents. When retrieving a document containing a search term designated by a user, a component character table search is executed first to extract the documents that contain all species of characters constituting the search term, and a condensed text search is then executed by consulting the condensed texts of the documents. Finally, a text body search is executed on the texts of the documents extracted by the component character table search and the condensed text search, to extract the documents that satisfy the query condition imposed on the search term.
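The three-stage filtering described in this abstract can be illustrated roughly as follows. This is only a sketch under stated assumptions, not the patented implementation: the character classes (letters, digits, other non-space symbols), the function names `condense`, `register` and `search`, and the restriction to search terms that fit inside a single fragment are all hypothetical choices made for the example.

```python
import re

def condense(text):
    """Build a condensed text: split into fragments at character-class
    boundaries, then drop duplicates and any fragment that is contained
    in a longer fragment (the mutual inclusion check)."""
    # Assumption: three character classes; the patent's classes may differ.
    fragments = set(re.findall(r"[a-z]+|[0-9]+|[^a-z0-9\s]+", text.lower()))
    return sorted(f for f in fragments
                  if not any(f != g and f in g for g in fragments))

def register(docs):
    """Store, per document: full text, condensed text, component characters."""
    index = {}
    for doc_id, text in docs.items():
        condensed = condense(text)
        chars = set("".join(condensed))  # each character registered once
        index[doc_id] = (text, condensed, chars)
    return index

def search(index, term):
    """Return ids of documents containing `term` (single-fragment terms only)."""
    term = term.lower()
    hits = []
    for doc_id, (text, condensed, chars) in index.items():
        if not set(term) <= chars:                       # stage 1: character table
            continue
        if not any(term in frag for frag in condensed):  # stage 2: condensed text
            continue
        if term in text.lower():                         # stage 3: text body
            hits.append(doc_id)
    return hits

docs = {1: "High-speed retrieval of documents",
        2: "Text search; texts and context"}
index = register(docs)
print(search(index, "text"))   # document 1 is rejected at stage 1 (no "x")
```

The cheap stages run first so that the expensive text body scan only touches documents that survive both filters. In this sketch stage 3 is a plain substring test, whereas the abstract describes it as the place where the full query condition is checked.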

    Document search method wherein stored documents and search queries comprise segmented text data of spaced, nonconsecutive text elements and words segmented by predetermined symbols
    2.
    Granted invention patent
    Status: Expired

    Publication number: US5748953A

    Publication date: 1998-05-05

    Application number: US444842

    Filing date: 1995-05-18

    IPC classification: G06F17/30 G06K9/62 G06K9/72

    Abstract: A neighboring plural-character occurrence bitmap of practical capacity, capable of eliminating noise by hashing, is realized, and an effectively high-speed full text search is achieved by greatly reducing the number of documents to be searched, even when the search term is a combination of English characters and words. Text data is segmented into words, and n-character strings at every (m+1)-th character position are extracted from each word. A neighboring plural-character occurrence bitmap is created which stores, at a given entry, data representing the presence of each neighboring plural-character string. From a search term, n-character strings at every (m+1)-th character position are likewise extracted, and the neighboring plural-character occurrence bitmap is searched by a search control program. Since the bitmap is searched before the condensed texts, documents not relevant to the search term can be discarded and a high-speed full text search can be realized.
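A hashed occurrence bitmap of this kind might look roughly like the sketch below. Several things here are assumptions rather than details from the abstract: the reading of "n-character strings at every (m+1)-th character position" as n characters sampled m+1 positions apart, the table size `BITMAP_ENTRIES`, the use of CRC-32 as the hash, and all function names.

```python
import zlib

BITMAP_ENTRIES = 1024  # assumed table size; collisions only add false positives

def spaced_ngrams(word, n=2, m=0):
    """n-character strings whose characters lie m+1 positions apart
    (m=0 gives ordinary adjacent n-grams). This reading is an assumption."""
    step = m + 1
    span = (n - 1) * step
    return {word[i:i + span + 1:step] for i in range(len(word) - span)}

def entry(gram):
    """Hash an n-character string to a bitmap entry."""
    return zlib.crc32(gram.encode("utf-8")) % BITMAP_ENTRIES

def build_bitmap(docs, n=2, m=0):
    """bitmap[e] is an int; bit d is set when document d contains some
    n-character string that hashes to entry e."""
    bitmap = [0] * BITMAP_ENTRIES
    for doc_no, text in enumerate(docs):
        for word in text.lower().split():
            for gram in spaced_ngrams(word, n, m):
                bitmap[entry(gram)] |= 1 << doc_no
    return bitmap

def candidates(bitmap, term, num_docs, n=2, m=0):
    """Documents that may contain `term`; others are safely discarded.
    A term too short to yield any n-character string filters nothing."""
    mask = (1 << num_docs) - 1
    for word in term.lower().split():
        for gram in spaced_ngrams(word, n, m):
            mask &= bitmap[entry(gram)]
    return [d for d in range(num_docs) if (mask >> d) & 1]

docs = ["full text search", "finite state automaton"]
bitmap = build_bitmap(docs)
print(candidates(bitmap, "text", len(docs)))
```

The filter never drops a document that really contains the term, but hash collisions can let unrelated documents through, which is why the surviving candidates would still go to the later, more exact search stages.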

    Text search method and apparatus for structured documents
    3.
    Granted invention patent
    Status: Expired

    Publication number: US5745745A

    Publication date: 1998-04-28

    Application number: US495232

    Filing date: 1995-06-27

    IPC classification: G06F17/30

    Abstract: A text search method and apparatus for structured documents, wherein a structured document database to be searched is created by adding logical structure length information to logical structure discriminating information on the basis of a structured document constituted by a plurality of logical structures and logical structure discriminator information. In searching the structured document database in accordance with an entered search key constituted by a logical structure and a search character string, the search of any logical structure other than the entered logical structure is skipped based upon the data length information. In the method and apparatus, a structured document database to be searched is created from a structured document; the database is constituted by a condensed text for each logical structure (a list of the words contained in that logical structure), a character occurrence bitmap for each logical structure (a list of the characters contained in that logical structure), and a source structured document. The structured document database is then searched in accordance with an entered search key constituted by a logical structure and a search character string. The source structured document is searched optionally, depending upon the entered search key.
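The length-based skipping can be sketched as below. The record layout (a 1-byte structure id followed by a 4-byte length and the payload), the ids `TITLE` and `BODY`, and the function names are assumptions made for illustration; the abstract does not specify an encoding.

```python
import struct

# Assumed record header: 1-byte logical structure id, 4-byte payload length.
HEADER = struct.Struct(">BI")

def encode(parts):
    """Serialize (structure_id, text) pairs with a length-prefixed header."""
    out = bytearray()
    for struct_id, text in parts:
        data = text.encode("utf-8")
        out += HEADER.pack(struct_id, len(data)) + data
    return bytes(out)

def search(blob, struct_id, term):
    """Return payloads of the requested logical structure that contain `term`.
    Other structures are jumped over using the stored length, unscanned."""
    needle = term.encode("utf-8")
    pos, hits = 0, []
    while pos < len(blob):
        sid, length = HEADER.unpack_from(blob, pos)
        pos += HEADER.size
        if sid == struct_id and needle in blob[pos:pos + length]:
            hits.append(blob[pos:pos + length].decode("utf-8"))
        pos += length  # skip to the next record header
    return hits

TITLE, BODY = 1, 2  # hypothetical structure ids
blob = encode([(TITLE, "Text search"), (BODY, "structured text"), (TITLE, "Other")])
print(search(blob, TITLE, "Text"))
```

Because each record carries its own length, the scanner can move from header to header and only inspect payloads whose structure id matches the search key.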

    System for character stream search using finite state automaton technique
    4.
    Granted invention patent
    Status: Expired

    Publication number: US5051886A

    Publication date: 1991-09-24

    Application number: US205923

    Filing date: 1988-06-13

    IPC classification: G06F17/21 G06F17/30

    Abstract: A character stream search system using a finite state automaton (FSA) for determining at one time whether or not a plurality of character streams serving as search objects exist in a search character stream which undergoes a search operation and which comprises a plurality of characters expressed with codes. In the system, a collation is conducted between the search character stream and a search object character. When the collation finds a matched search object character, a state transition is carried out of a predetermined state indicated by the FSA. When no matched search object character exists, failure processing is performed to effect a state transition to a transition destination which is determined in association with the configuration of the FSA. The following processing is completed at a count which is a predetermined upper-limit value for each character that has undergone the search operation.
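Multi-pattern matching with success and failure transitions can be illustrated with the classic Aho-Corasick automaton, shown here as a stand-in: it is a well-known FSA of this general kind, and the patent's own construction may differ. In particular, the bounded number of failure steps per character mentioned in the abstract is not modeled. The function names are hypothetical.

```python
from collections import deque

def build(patterns):
    """Build goto, failure and output tables for a set of patterns."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:                      # trie of all patterns
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    queue = deque(goto[0].values())         # depth-1 states fail to the root
    while queue:                            # breadth-first failure links
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]          # inherit matches ending here
    return goto, fail, out

def search(text, automaton):
    """Scan `text` once and report (start_offset, pattern) for every match."""
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]                     # failure processing
        s = goto[s].get(ch, 0)              # state transition on a match
        for p in sorted(out[s]):
            hits.append((i - len(p) + 1, p))
    return hits

automaton = build(["he", "she", "his", "hers"])
print(search("ushers", automaton))
```

All patterns are checked in a single pass over the text, which is the property the abstract describes as determining "at one time" whether several search objects occur.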

System for plural-string search with a parallel collation of a first partition of each string followed by finite automata matching of second partitions
    6.
    Granted invention patent
    Status: Expired

    Publication number: US5452451A

    Publication date: 1995-09-19

    Application number: US349124

    Filing date: 1994-12-01

    IPC classification: G06F17/30

    Abstract: A parallel comparator is provided in a front stage of an automaton executing device. The comparator performs parallel, high-speed collation of partial character strings, each taken from one of a plurality of character strings of interest to be searched out, against a character string to be searched, in which the document data to be searched is arranged sequentially from a leading character. Only when a part of the character string to be searched coincides with a partial character string set in the comparator is the collation of the remaining portion of the character string to be searched performed by the automaton executing device. It is also possible to set a "don't care" condition, in which a character at any position in the partial character string is ignored at the time of comparison by the comparator, and to set a negation condition, in which the comparison is made against the negation of a character at any position in the partial character string.
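The two-stage arrangement can be sketched in software as follows. This is a sequential stand-in under stated assumptions: the loop over `entries` plays the role of the parallel comparator, a plain `startswith` check plays the role of the automaton executing device, and the cell encoding (`None` for "don't care", `("is", c)` and `("not", c)` for match and negation) is invented for the example.

```python
def cell_ok(cell, ch):
    """Compare one comparator cell against one input character."""
    if cell is None:                 # "don't care": position is ignored
        return True
    kind, c = cell
    return ch == c if kind == "is" else ch != c   # "not" is the negation condition

def find(text, entries):
    """entries: list of (prefix_cells, remainder) pairs.
    The remainder is only examined after the prefix cells all match."""
    hits = []
    for pos in range(len(text)):
        for prefix, rest in entries:              # stand-in for parallel comparison
            window = text[pos:pos + len(prefix)]
            if len(window) == len(prefix) and all(map(cell_ok, prefix, window)):
                start = pos + len(prefix)
                if text.startswith(rest, start):  # stand-in for the automaton stage
                    hits.append((pos, text[pos:start + len(rest)]))
    return hits

entries = [
    ([("is", "c"), None, ("is", "t")], "alog"),   # c?t + "alog"
    ([("not", "a"), ("is", "b")], "c"),           # (not a) b + "c"
]
print(find("catalog cutalog abc xbc", entries))
```

Most text positions fail the short prefix comparison, so the more expensive second stage runs only at the few positions where a prefix has already matched.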

    Document retrieval method and system
    7.
    Granted invention patent
    Status: Expired

    Publication number: US5757983A

    Publication date: 1998-05-26

    Application number: US517722

    Filing date: 1995-08-21

    Abstract: A document retrieval method and system for retrieving, from a document database storing document data in the form of character codes, a document which contains given search terms and which meets a given search query condition. The documents loaded from the document database are searched for a document containing terms which match the search terms, and document identification (ID) information is generated that includes a document identifier of the found document, the match terms found to match the search terms, term identifiers of the match terms, and position information of the match terms in the found document. A decision is then made as to whether or not the position information of the match terms satisfies a positional condition, specified in the search query condition, concerning a positional relation between the search terms; when the positional condition is satisfied, match information is generated indicating satisfaction of the search query condition. Through a proximity condition decision, it is ascertained whether the match terms satisfy an inter-term distance condition specified in the search query condition. Through a contextual condition decision, it is determined whether the match terms satisfy a concurrence condition specifying concurrence of the search terms in the same sub-sentence, the same sentence or the same paragraph. Through a logical condition decision, it is decided whether the match terms satisfy a logical condition between the search terms specified in the search query condition.
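The proximity and contextual decisions can be illustrated with a small sketch. The distance measure (characters between the end of one term and the start of the other), the sentence delimiters, and the function names are assumptions chosen for the example, not details taken from the abstract.

```python
import re

def term_positions(text, term):
    """Character offsets at which `term` occurs in `text`."""
    return [m.start() for m in re.finditer(re.escape(term), text)]

def within_distance(text, a, b, max_chars):
    """Proximity decision: some occurrence of `b` starts at most
    `max_chars` characters after the end of some occurrence of `a`."""
    return any(0 <= pb - (pa + len(a)) <= max_chars
               for pa in term_positions(text, a)
               for pb in term_positions(text, b))

def same_sentence(text, a, b):
    """Contextual decision: both terms occur in one sentence.
    Assumption: sentences end at '.', '!' or '?'."""
    return any(a in s and b in s for s in re.split(r"[.!?]", text))

text = "Document retrieval is fast. Search terms match."
print(within_distance(text, "Document", "retrieval", 1))
print(same_sentence(text, "fast", "Search"))
```

Both checks work purely on the recorded positions and boundaries of the match terms, so they can be applied after the term matching stage without rescanning for the terms themselves.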

    Character stream search apparatus using a finite state automation
    10.
    Granted invention patent
    Status: Expired

    Publication number: US5278981A

    Publication date: 1994-01-11

    Application number: US761442

    Filing date: 1991-09-18

    Abstract: A character stream search system using a finite state automaton (FSA) for determining at one time whether or not a plurality of character streams serving as search objects exist in a search character stream which undergoes a search operation and which comprises a plurality of characters expressed with codes. In the system, a collation is conducted between the search character stream and a search object character. When the collation finds a matched search object character, a state transition is carried out to a predetermined state indicated by the FSA. When no matched search object character exists, failure processing is performed to effect a state transition to a transition destination which is determined in association with the configuration of the FSA. The failure processing is completed at a count which is a predetermined upper-limit value for each character that has undergone the search operation.
