Abstract:
A multi-user search system with methodology for personal searching. In one embodiment, for example, a system for personal searching includes a plurality of index servers storing a plurality of index shards. Each index shard of the plurality of index shards indexes a plurality of documents. Each document of the plurality of documents belongs to one of a plurality of document namespaces assigned to the index shard. The system further includes a front-end server computer for receiving a search query from an authenticated user; an access control server for determining an authorized document namespace the authenticated user is authorized to access; and a query processor for answering the search query and restricting, based on an identifier of the authorized document namespace, an answer to the search query to identifying only documents satisfying the search query and belonging to the authorized document namespace.
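The namespace restriction described above can be illustrated with a minimal sketch. It assumes a single in-process shard, a whitespace tokenizer, and a plain dictionary standing in for the access control server; the names `IndexShard`, `ACL`, and `answer` are hypothetical and do not come from the abstract.

```python
from collections import defaultdict


class IndexShard:
    """Toy index shard: maps each term to (namespace_id, doc_id) pairs."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, namespace_id, doc_id, text):
        # Every posting carries the namespace the document belongs to.
        for term in text.lower().split():
            self.postings[term].add((namespace_id, doc_id))

    def search(self, query, authorized_namespaces):
        terms = query.lower().split()
        if not terms:
            return set()
        # Documents must contain every query term (conjunctive query).
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        # Restrict the answer to namespaces the user may access.
        return {(ns, doc) for ns, doc in hits if ns in authorized_namespaces}


# Hypothetical stand-in for the access control server.
ACL = {"alice": {"ns_alice"}, "bob": {"ns_bob"}}


def answer(user, query, shards, acl=ACL):
    """Query-processor sketch: look up authorized namespaces, then query shards."""
    authorized = acl.get(user, set())
    results = set()
    for shard in shards:
        results |= shard.search(query, authorized)
    return results


shard = IndexShard()
shard.add("ns_alice", "d1", "quarterly report")
shard.add("ns_bob", "d2", "quarterly budget")
print(answer("alice", "quarterly", [shard]))  # only documents in ns_alice
```

Filtering on the namespace identifier stored with each posting lets one shard hold documents from many users while still answering each query from a single user's view.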
Abstract:
Disclosed is a method for updating an inverted index of a flash solid state disk (SSD). The method includes: scanning the on-disk inverted index and the in-memory inverted index, storing postings of a term that is present only in the in-memory inverted index in a block of an output buffer, and reading postings of the last block of each posting list to be updated from the on-disk inverted index into a block of an input buffer; moving the postings of the input buffer to the blocks of the output buffer, block by block, and appending the new postings of the in-memory inverted index to the corresponding block of the output buffer; and updating the on-disk inverted index by using the postings of each block of the output buffer.
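A rough sketch of this update follows. It assumes each posting list is stored as a list of fixed-capacity blocks and models the disk as an in-process dictionary; `BLOCK_SIZE`, `pack`, and `update_index` are hypothetical names, and no real flash I/O is performed.

```python
BLOCK_SIZE = 4  # assumed block capacity, in postings


def pack(postings, block_size=BLOCK_SIZE):
    """Split a list of postings into blocks of at most block_size entries."""
    return [postings[i:i + block_size] for i in range(0, len(postings), block_size)]


def update_index(on_disk, in_memory, block_size=BLOCK_SIZE):
    """Merge in-memory postings into the on-disk index, touching only last blocks.

    on_disk:   term -> list of blocks (each block is a list of postings)
    in_memory: term -> list of new postings
    """
    output_buffer = {}  # term -> (index of first rewritten block, new blocks)
    for term in sorted(in_memory):
        new_postings = in_memory[term]
        blocks = on_disk.get(term)
        if not blocks:
            # Term exists only in memory: its postings go straight to the output buffer.
            output_buffer[term] = (0, pack(new_postings, block_size))
        else:
            # Input buffer: only the last on-disk block of the list is read.
            last_block = blocks[-1]
            merged = last_block + new_postings
            output_buffer[term] = (len(blocks) - 1, pack(merged, block_size))
    # Write phase: replace the last block (and append overflow) for each term.
    for term, (start, blocks) in output_buffer.items():
        on_disk[term] = on_disk.get(term, [])[:start] + blocks
    return on_disk


disk = {"flash": [[1, 2, 3, 4], [5, 6]]}
memory = {"flash": [7, 8, 9], "ssd": [3]}
print(update_index(disk, memory))
```

Only the final block of each existing posting list is read and rewritten, which is the property that keeps the amount of data rewritten on flash small in this sketch.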
Abstract:
A system includes circuitry configured to: read a plurality of pieces of character information and a plurality of identifiers that are included in a text file; determine whether a piece of character information among the plurality is included between at least one pair of identifiers among the plurality of identifiers in the text file; and associate the piece of character information with the at least one pair of identifiers when it is determined that the piece of character information is included between the at least one pair of identifiers.
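As an illustration only, the sketch below assumes the identifiers are XML-style opening and closing tags; the abstract does not say what form the identifiers take, so the tag syntax and the function name `associate` are assumptions.

```python
import re

# Assumed identifier form: <name> ... </name>, with matching names.
PAIR = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def associate(text):
    """Return (identifier, character information) for text enclosed by a tag pair."""
    associations = []
    for name, content in PAIR.findall(text):
        # Only associate when some character information lies between the pair.
        if content.strip():
            associations.append((name, content.strip()))
    return associations


sample = "<title>Index</title> stray text <author>Kim</author> <note></note>"
print(associate(sample))
```

Text outside any pair ("stray text") and pairs with nothing between them are left unassociated.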
Abstract:
Methods and systems to build and utilize a search infrastructure are described. The system generates index information components in real-time based on a database that is time-stamped. The system updates index information at a plurality of query node servers based on the index information components. A query engine receives a search query from a client machine and identifies search results based on the query and the index information. The system communicates the search results, over the network, to the client machine.
Abstract:
Entity mappings that produce matching entities for a first data asset having attributes with attribute values and a second data asset having attributes with attribute values are generated by: matching the attribute values of the attributes of the first data asset with the attribute values of the attributes of the second data asset, using the matching attribute values to generate matching attribute pairs, and using the matching attribute pairs to identify entity mappings; computing an entity mapping score for each of the entity mappings based on a combination of factors; ranking the entity mappings based on each entity mapping score; and using some of the ranked entity mappings to determine whether a same real-world entity is described by the first data asset and the second data asset.
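The matching and ranking steps might look like the following simplified sketch. It treats each matching attribute pair as a candidate mapping and scores it with Jaccard overlap of the attribute values; the abstract does not name the scoring factors, so this single factor and the name `rank_mappings` are assumptions.

```python
def rank_mappings(asset_a, asset_b):
    """Rank attribute pairs of two data assets by overlap of their values.

    Each asset is a dict: attribute name -> list of attribute values.
    Returns (attribute_a, attribute_b, score) tuples, best score first.
    """
    candidates = []
    for name_a, values_a in asset_a.items():
        for name_b, values_b in asset_b.items():
            shared = set(values_a) & set(values_b)
            if not shared:
                continue  # no matching values, so no matching attribute pair
            union = set(values_a) | set(values_b)
            # Assumed score: Jaccard similarity of the two value sets.
            candidates.append((name_a, name_b, len(shared) / len(union)))
    return sorted(candidates, key=lambda c: -c[2])


customers = {"email": ["a@x.com", "b@x.com"], "city": ["Oslo", "Rome"]}
vendors = {"contact": ["a@x.com", "b@x.com", "c@x.com"], "town": ["Rome"]}
for attr_a, attr_b, score in rank_mappings(customers, vendors):
    print(attr_a, attr_b, round(score, 2))
```

The top-ranked pairs would then be the ones examined to decide whether both assets describe the same real-world entity.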
Abstract:
Text samples are stored in a manner that allows them to be searched quickly. Each text sample is assigned a text sample identifier and is parsed to extract its text components. Text components that have the same content are assigned the same text component identifier. For each parsed text component, a text component entry is created that includes the assigned text component identifier as well as the text sample identifier of the text sample from which the text component was parsed. A text sample entry group is created for each text sample and contains, in sequence, the text component entries for the text components found within that text sample. The text sample entry groups are stored so as to be scannable during a future search.
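A minimal sketch of this storage scheme, assuming whitespace-separated words are the text components and list positions serve as text sample identifiers; `build_store` and `scan` are hypothetical names.

```python
def build_store(samples):
    """Build one entry group per text sample.

    Returns (component_ids, groups), where component_ids maps component text to
    its identifier and each group is a sequence of
    (text_component_id, text_sample_id) entries.
    """
    component_ids = {}
    groups = []
    for sample_id, text in enumerate(samples):
        group = []
        for component in text.lower().split():
            # Components with the same content share one identifier.
            component_id = component_ids.setdefault(component, len(component_ids))
            group.append((component_id, sample_id))
        groups.append(group)
    return component_ids, groups


def scan(component_ids, groups, word):
    """Scan the stored entry groups for samples containing the given component."""
    component_id = component_ids.get(word.lower())
    if component_id is None:
        return []
    return sorted({sid for group in groups for cid, sid in group if cid == component_id})


ids, groups = build_store(["red apple", "green apple pie"])
print(scan(ids, groups, "apple"))  # identifiers of samples containing "apple"
```

Because a search compares small integer identifiers rather than strings, a linear scan over the entry groups stays cheap in this sketch.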
Abstract:
In embodiments of the disclosed technology, indexes, such as inverted indexes, are updated only as necessary to guarantee answer precision within predefined thresholds which are determined with little cost in comparison to the updates of the indexes themselves. With the present technology, a batch of daily updates can be processed in a matter of minutes, rather than a few hours for rebuilding an index, and a query may be answered with assurances that the results are accurate or within a threshold of accuracy.
Abstract:
In one embodiment, a data structure comprises: a primary index comprising one or more position-block references; and one or more position blocks sequentially following the primary index, wherein: each one of the position-block references corresponds to one of the position blocks; and each one of the position blocks comprises: a secondary index comprising one or more position-data references; and one or more sets of positions sequentially following the secondary index, wherein each one of the position-data references corresponds to one of the sets of positions in the position block. In one embodiment, an instance of the data structure is stored in a computer-readable memory and accessible by an application executed by a process.
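One way to picture this two-level layout is as a flat list of integers. The sketch below is an assumed encoding, not the one claimed: counts precede each index, position-block references are absolute offsets, and position-data references are offsets relative to the start of their block. The names `encode_block`, `encode`, and `lookup` are hypothetical.

```python
def encode_block(position_sets):
    """Encode one position block: [n_sets, refs..., (len, positions...)...]."""
    header_len = 1 + len(position_sets)
    refs, body = [], []
    for positions in position_sets:
        refs.append(header_len + len(body))  # offset relative to block start
        body.append(len(positions))
        body.extend(positions)
    return [len(position_sets)] + refs + body


def encode(blocks):
    """Encode the whole structure: [n_blocks, block refs..., blocks...]."""
    header_len = 1 + len(blocks)
    refs, body = [], []
    for position_sets in blocks:
        refs.append(header_len + len(body))  # absolute offset of the block
        body.extend(encode_block(position_sets))
    return [len(blocks)] + refs + body


def lookup(data, block_no, set_no):
    """Follow the primary index, then the secondary index, to one position set."""
    block_start = data[1 + block_no]
    set_start = block_start + data[block_start + 1 + set_no]
    count = data[set_start]
    return data[set_start + 1:set_start + 1 + count]


data = encode([[[1, 5], [9]], [[2, 3, 4]]])
print(lookup(data, 1, 0))  # positions in block 1, set 0
```

Two reference hops reach any set of positions without decoding the blocks or sets that precede it.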
Abstract:
A computer-readable recording medium stores a program causing a computer to execute an information generating process that includes tabulating an appearance frequency for each designated word in an object file group in which character strings are described; identifying for each designated word and based on the appearance frequency tabulated for the designated word, a rank in descending order up to a target appearance rate for the designated words; detecting in an object file selected from the object file group, specific designated words among the identified ranks; and generating for each of the detected specific designated words, index information that indicates the presence/absence of the specific designated word in each object file among the object file group.
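The process can be sketched as follows, under one reading of the abstract: designated words are ranked by total appearance frequency, and words are selected from the top until their cumulative share of appearances reaches the target appearance rate. That reading, the whitespace tokenizer, and the name `build_presence_index` are assumptions.

```python
from collections import Counter


def build_presence_index(files, designated_words, target_rate):
    """Return word -> list of booleans marking presence in each file."""
    token_sets = []
    counts = Counter()
    for text in files:
        tokens = text.lower().split()
        token_sets.append(set(tokens))
        # Tabulate appearance frequency for designated words only.
        counts.update(t for t in tokens if t in designated_words)

    total = sum(counts.values())
    selected, covered = [], 0
    # Descending frequency; ties broken alphabetically for determinism.
    for word, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if total == 0 or covered / total >= target_rate:
            break
        selected.append(word)
        covered += n

    # Index information: presence or absence of each selected word per file.
    return {w: [w in tokens for tokens in token_sets] for w in selected}


files = ["a a a b", "a b c", "c"]
print(build_presence_index(files, {"a", "b", "c"}, target_rate=0.7))
```

A query for a selected word can then skip every file whose flag is false without opening it.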
Abstract:
Electronic files are selectively assigned to a plurality of different indexing queues by one or more dynamic throughput threshold gates based on characteristics of the different indexing queues as well as the static file characteristics associated with each of the files. The files are then indexed. Upon detecting a change in a dynamic characteristic of one or more indexed files, the throughput threshold gate(s) are then modified to obtain, maintain or modify a desired throughput for one or more of the indexing queues.
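A toy version of such a gate is sketched below. It assumes file size is the static file characteristic, uses two queues, and adjusts the gate from an observed versus desired throughput figure; the abstract does not specify any of these details, and `ThroughputGate` and `assign` are hypothetical names.

```python
from collections import deque


class ThroughputGate:
    """Routes files to queues by comparing file size with an adjustable limit."""

    def __init__(self, size_limit):
        self.size_limit = size_limit

    def route(self, file_size):
        return "fast" if file_size <= self.size_limit else "bulk"

    def adjust(self, observed_rate, desired_rate, step_pct=10):
        # Fast queue too slow: admit fewer large files; too fast: admit more.
        if observed_rate < desired_rate:
            self.size_limit = self.size_limit * (100 - step_pct) // 100
        elif observed_rate > desired_rate:
            self.size_limit = self.size_limit * (100 + step_pct) // 100


def assign(files, gate):
    """Place (name, size) files into indexing queues according to the gate."""
    queues = {"fast": deque(), "bulk": deque()}
    for name, size in files:
        queues[gate.route(size)].append(name)
    return queues


gate = ThroughputGate(size_limit=1000)
files = [("a.txt", 200), ("b.iso", 5000), ("c.pdf", 950)]
print(assign(files, gate))
gate.adjust(observed_rate=40, desired_rate=50)  # fast queue is falling behind
print(assign(files, gate))
```

After the adjustment the limit drops, so a borderline file such as `c.pdf` moves to the bulk queue and the fast queue can recover its throughput.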