Abstract:
Data processing apparatus comprising: a chunk store containing specimen data chunks; a manifest store containing a plurality of manifests, each of which represents at least a part of a data set and each of which comprises at least one reference to at least one of said specimen data chunks; and a sparse chunk index containing information on only some specimen data chunks, the data processing apparatus being operable to: process input data into input data chunks; identify manifests having at least one reference to one of said specimen data chunks that corresponds to one of said input data chunks and on which there is information contained in the sparse chunk index; and prioritize the identified manifests for subsequent operation.
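As a rough illustration of the identify-and-prioritize step, the following minimal sketch ranks manifests by how many indexed input chunks they reference. The ranking criterion (hit count), the SHA-1 chunk identifier, and the names `chunk_hash`, `prioritize_manifests` and `sparse_index` are assumptions made for illustration; the abstract does not specify them.

```python
import hashlib
from collections import Counter


def chunk_hash(chunk: bytes) -> str:
    """Identify a chunk by its SHA-1 digest (an assumed identifier scheme)."""
    return hashlib.sha1(chunk).hexdigest()


def prioritize_manifests(input_chunks, sparse_index):
    """Return manifest ids ordered by number of sparse-index hits.

    sparse_index maps a chunk hash to the set of manifest ids that
    reference that chunk; it holds entries for only some chunks.
    """
    hits = Counter()
    for chunk in input_chunks:
        for manifest_id in sparse_index.get(chunk_hash(chunk), ()):
            hits[manifest_id] += 1
    # Most hits first; ties broken by manifest id so the order is stable.
    return [m for m, _ in sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))]


# Hypothetical index: only "alpha" and "gamma" are indexed, "beta" is not.
sparse_index = {
    chunk_hash(b"alpha"): {"m1", "m2"},
    chunk_hash(b"gamma"): {"m2"},
}
ranking = prioritize_manifests([b"alpha", b"beta", b"gamma"], sparse_index)
```

Here `m2` is ranked ahead of `m1` because it references two of the indexed input chunks rather than one.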
Abstract:
A system and method for data cache management are provided in which a request for access to data is received, and a sample value is assigned to the request, the sample value being randomly selected according to a probability distribution. The sample value is compared to another value such as a previously stored sample value, and the data is selectively stored in the cache based on results of the comparison. If the requested data is not in the cache, the sample value may be compared with an extreme one of a plurality of sampled values such as the lowest sampled value. Each of the sampled values may be stored in a database, and the sampled values or the probability distribution may be changed over time to account for frequency of requests.
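One way to picture the sampled admission decision is the sketch below. It assumes a uniform distribution, a fixed capacity, and a rule that admits a missing item only when its sample exceeds the lowest stored sample; the class name `SampledCache`, the `load` callback, and the policy of keeping the larger sample on a hit are illustrative assumptions, not details taken from the abstract.

```python
import random


class SampledCache:
    """Cache that admits entries by comparing random sample values."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.rng = rng or random.Random()
        self.data = {}     # key -> cached value
        self.samples = {}  # key -> sample value stored for that key

    def access(self, key, load):
        """Return the value for key, calling load(key) on a miss."""
        sample = self.rng.random()  # assumed uniform distribution on [0, 1)
        if key in self.data:
            # Assumed policy: frequently requested keys accumulate high samples.
            self.samples[key] = max(self.samples[key], sample)
            return self.data[key]
        value = load(key)
        if len(self.data) < self.capacity:
            self.data[key] = value
            self.samples[key] = sample
        else:
            # Compare with the extreme (lowest) stored sample value.
            lowest = min(self.samples, key=self.samples.get)
            if sample > self.samples[lowest]:
                del self.data[lowest]
                del self.samples[lowest]
                self.data[key] = value
                self.samples[key] = sample
        return value
```

Because the generator is injectable, the admission behaviour can be exercised deterministically by passing an object whose `random()` method returns scripted values.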
Abstract:
A particular data value is represented as a group of segments stored in corresponding entries of a data structure. Additional data values represented by corresponding groups of segments are written into the data structure. A probability of overwriting segments representing the particular data value increases as the number of additional data values increases. A correct version of the particular data value is retrieved even though one or more segments representing the particular data value have been overwritten.
Abstract:
A method of identifying a fresh document in a document set is provided. The method may include obtaining a query document that is included in a document set comprising a plurality of documents. The method may also include grouping the plurality of documents into a plurality of fine clusters based on a textual similarity between the plurality of documents. The method may also include identifying a target fine cluster within the plurality of fine clusters, the target fine cluster including the query document. The method may also include ordering the documents included in the target fine cluster by time to identify the fresh document. The method may also include generating a query response that includes the fresh document.
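The grouping and ordering steps above can be sketched as follows. This is only an illustration under stated assumptions: textual similarity is approximated by Jaccard overlap of word sets, clustering is a greedy single pass against each cluster's first member, and the "fresh" document is taken to be the most recent one in the target cluster. The function names, the threshold, and the `(text, timestamp)` layout are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts (assumed similarity measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def fine_clusters(docs, threshold=0.5):
    """Greedily group document ids; docs maps id -> (text, timestamp)."""
    clusters = []
    for doc_id, (text, _) in docs.items():
        for cluster in clusters:
            representative_text = docs[cluster[0]][0]
            if jaccard(text, representative_text) >= threshold:
                cluster.append(doc_id)
                break
        else:
            clusters.append([doc_id])
    return clusters


def fresh_document(query_id, docs, threshold=0.5):
    """Return the most recent document in the cluster holding query_id."""
    target = next(c for c in fine_clusters(docs, threshold) if query_id in c)
    return max(target, key=lambda d: docs[d][1])


docs = {
    "a": ("cache eviction policy draft", 1),
    "b": ("cache eviction policy final", 3),
    "c": ("wireless fade duration", 2),
}
fresh = fresh_document("a", docs)
```

In this example documents `a` and `b` share three of five distinct words, so they fall into the same fine cluster and the later one, `b`, is returned for the query `a`.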
Abstract:
A plurality of differential data stores are stored in persistent storage media. In response to receiving a first request to store a particular data object, one of the differential data stores that are stored in the persistent storage media is selected, wherein selecting the one differential data store is according to a criterion relating to compression of data objects in the differential data stores. The selected differential data store is copied into temporary storage media, where the copying is not delayed after receiving the first request to await receipt of more requests. The particular data object is inserted into the copy of the selected differential data store in the temporary storage media, where the inserting is performed without having to retrieve more data from the selected differential store in the persistent storage media. The selected differential data store in the persistent storage media is replaced with the copy of the selected differential data store in the temporary storage media that has been modified.
Abstract:
Data processing apparatus comprising: a chunk store containing specimen data chunks, a manifest store containing at least one manifest that represents at least a part of a data set and that comprises at least one reference to at least one of said specimen data chunks, a sparse chunk index containing information on only those specimen data chunks having a predetermined characteristic, the processing apparatus being operable to process input data into input data chunks and to use the sparse chunk index to identify at least one of said at least one manifest that includes at least one reference to one of said specimen data chunks that corresponds to one of said input data chunks having the predetermined characteristic.
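A minimal sketch of indexing only chunks with a predetermined characteristic is given below. The characteristic used here (the lowest bits of the first digest byte being zero) is purely an assumption for illustration, as are the helper names; the abstract does not say what the characteristic is.

```python
import hashlib


def chunk_hash(chunk: bytes) -> bytes:
    """Identify a chunk by its SHA-1 digest (an assumed identifier scheme)."""
    return hashlib.sha1(chunk).digest()


def has_characteristic(digest: bytes, bits: int = 2) -> bool:
    """Hypothetical characteristic: lowest `bits` bits of byte 0 are zero."""
    return (digest[0] & ((1 << bits) - 1)) == 0


def build_sparse_index(manifests, bits=2):
    """Index only chunks with the characteristic; manifests: id -> chunks."""
    index = {}
    for manifest_id, chunks in manifests.items():
        for chunk in chunks:
            digest = chunk_hash(chunk)
            if has_characteristic(digest, bits):
                index.setdefault(digest, set()).add(manifest_id)
    return index


def find_manifests(input_chunks, index, bits=2):
    """Look up only those input chunks that have the characteristic."""
    found = set()
    for chunk in input_chunks:
        digest = chunk_hash(chunk)
        if has_characteristic(digest, bits):
            found |= index.get(digest, set())
    return found
```

With `bits=2` roughly one chunk in four is indexed, so the index stays small while a long run of matching input chunks is still likely to hit at least one entry. Setting `bits=0` makes every chunk qualify, which turns the structure into a full index.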
Abstract:
Embodiments of the present invention pertain to determining an approximate number of instances of an item for an organization. According to one embodiment, instances of items that reside on computer systems associated with the organization are determined. Instances of the same item can reside on different computers and an identification uniquely identifies an item. Random numbers are associated with identifications of the items. An approximate number of instances of the item is determined based on a highest random number associated with the item. The highest random number is the highest of the random numbers that were generated for the instances of the item.
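To illustrate how a highest random number can stand in for a count, the sketch below assumes each instance draws an independent uniform random number in [0, 1). Under that assumption the expected maximum of n draws is n / (n + 1), which can be inverted to give a rough estimate of n. The distribution, the estimator, and the function names are assumptions for illustration and are not stated in the abstract.

```python
import random


def highest_random(instance_count, rng):
    """Highest of the random numbers generated for one item's instances."""
    return max(rng.random() for _ in range(instance_count))


def merge_highest(per_computer_highs):
    """Combine per-computer results: only the overall maximum is needed."""
    return max(per_computer_highs)


def estimate_instances(highest):
    """Invert E[max] = n / (n + 1) to approximate n from the maximum."""
    return highest / (1.0 - highest)


rng = random.Random(0)
# Hypothetical item with 40 instances on one computer and 60 on another.
overall = merge_highest([highest_random(40, rng), highest_random(60, rng)])
approx = estimate_instances(overall)
```

Only one number per item has to be exchanged between computers, which is the appeal of the approach. An estimate from a single maximum is noisy, so a practical design would probably average over several independent draws per instance.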
Abstract:
A method controls the operation of devices which communicate over a wireless communications channel. The method includes determining a parameter of a received signal communicated over the wireless communications channel and determining a minimum threshold value of the received signal. An average duration of fade is determined using the parameter and the minimum threshold. The method detects whether the received signal is less than the minimum threshold value. At least one of the devices is placed in a sleep mode for approximately the average duration of fade in response to the received signal being detected as less than the minimum threshold value. The determined parameter of the received signal may be the root mean square value of the received signal.
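The relationship between the RMS signal value, the minimum threshold and the average duration of fade can be sketched with the standard Rayleigh-fading expression, AFD = (exp(rho^2) - 1) / (rho * f_d * sqrt(2 * pi)), where rho is the threshold divided by the RMS level and f_d is the maximum Doppler frequency. The Rayleigh model and the Doppler parameter are assumptions here; the abstract states only that the duration is determined from the parameter and the threshold.

```python
import math


def rms(samples):
    """Root mean square of received signal envelope samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def average_fade_duration(threshold, rms_value, max_doppler_hz):
    """Average fade duration in seconds under an assumed Rayleigh model."""
    rho = threshold / rms_value
    return (math.exp(rho ** 2) - 1.0) / (
        rho * max_doppler_hz * math.sqrt(2.0 * math.pi)
    )


def sleep_time(signal_level, threshold, rms_value, max_doppler_hz):
    """Seconds to sleep: the average fade duration if in a fade, else 0."""
    if signal_level < threshold:
        return average_fade_duration(threshold, rms_value, max_doppler_hz)
    return 0.0
```

For example, with a threshold equal to the RMS level and a 1 Hz maximum Doppler frequency, a device detecting a fade would sleep for roughly 0.69 seconds.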
Abstract:
One embodiment is a data processing apparatus that has a chunk store containing specimen data chunks, a manifest store containing a plurality of manifests, each of which represents at least a part of previously processed data and includes at least one reference to at least one of the specimen data chunks, and a sparse chunk index containing information on only some specimen data chunks. Input data is processed into a plurality of input data segments, each composed of input data chunks. A first set of manifests is identified, each manifest of the first set having at least one reference to one of said specimen data chunks that corresponds to one of the input data chunks of a first input data segment. Specimen data chunks corresponding to other input data chunks of the first input data segment are identified by using the identified first set of manifests and at least one manifest identified when processing previous data.
Abstract:
A method (200) of identifying a principal document in a document set is provided. An exemplary method includes obtaining a document set comprising a plurality of documents (202) and grouping the plurality of documents into a plurality of clusters based, at least in part, on a textual similarity between each of the plurality of documents (204). The method also includes obtaining one or more descriptive terms corresponding to the plurality of documents, wherein the descriptive terms are terms within the plurality of documents that have been identified as being useful for discriminating between the clusters (206). The method also includes, for each cluster, identifying a subset of descriptive terms based, at least in part, on a prevalence of the descriptive terms within the documents of the cluster (208) and identifying the principal documents in the cluster based, at least in part, on a prevalence of the subset of descriptive terms within each of the documents in the cluster (210).
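The per-cluster steps (208) and (210) can be sketched as below. The sketch assumes the clusters and the descriptive terms have already been obtained, measures prevalence within a cluster as the number of documents containing a term, and scores each document by how often the selected terms occur in it. The function name, the `top_k` cut-off and the tie-breaking rule are hypothetical.

```python
from collections import Counter


def principal_document(cluster, descriptive_terms, top_k=2):
    """Pick the document richest in the cluster's most prevalent terms.

    cluster maps document id -> text; descriptive_terms is a set of terms.
    """
    tokens = {doc_id: text.lower().split() for doc_id, text in cluster.items()}

    # Prevalence of a term: number of cluster documents that contain it.
    prevalence = Counter()
    for words in tokens.values():
        for term in descriptive_terms & set(words):
            prevalence[term] += 1

    # Subset of descriptive terms: the top_k most prevalent (ties by name).
    ranked = sorted(prevalence, key=lambda t: (-prevalence[t], t))
    subset = set(ranked[:top_k])

    def score(doc_id):
        return sum(1 for word in tokens[doc_id] if word in subset)

    # Iterate in sorted id order so ties resolve to the lowest id.
    return max(sorted(cluster), key=score)


cluster = {
    "d1": "chunk index chunk manifest",
    "d2": "chunk store",
    "d3": "index notes",
}
principal = principal_document(cluster, {"chunk", "index", "manifest"})
```

In this example "chunk" and "index" each appear in two documents and form the subset; `d1` contains three occurrences of them and is therefore selected.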