Abstract:
A method and system that operate as a background process to automatically identify duplicate files and merge them into single-instance files, wherein the duplicate files become independent links to the single-instance file. A groveler maintains a database of information about the files on a volume, including each file's size and a checksum (signature) based on the file contents. The groveler periodically acts in the background to scan the USN log, a log that dynamically records file system activity. New or modified files detected in the USN log are queued as work items, each work item representing a file. The volume itself may also be scanned to add work items to the queue; this takes place initially or when there is a potential problem with the USN log. The groveler periodically removes items from the queue, calculates the signature of the corresponding file contents, and uses the signature and file size to query the database for matching files. The groveler then compares each matching file with the file corresponding to the work item to check for an exact duplicate and, if one is found, calls a single-instance store facility to merge the files and create independent links to the resulting single-instance file.
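The queue-and-lookup loop described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the names `signature`, `grovel`, and `demo` are hypothetical, SHA-256 stands in for the unspecified checksum, an in-memory dictionary stands in for the database, and the merge step is only recorded rather than performed by a real single-instance store facility.

```python
import filecmp
import hashlib
import os
import tempfile
from collections import defaultdict, deque


def signature(path, chunk_size=65536):
    """Checksum of the file contents (SHA-256 is an assumption of this sketch)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def grovel(queue, database):
    """Drain the work queue and return (duplicate, original) path pairs.

    `database` maps (file size, signature) to a list of known file paths.
    """
    merges = []
    while queue:
        path = queue.popleft()
        key = (os.path.getsize(path), signature(path))
        # A matching key is only a candidate; confirm with a byte-for-byte compare.
        match = next(
            (known for known in database[key]
             if known != path and filecmp.cmp(known, path, shallow=False)),
            None,
        )
        if match is not None:
            merges.append((path, match))  # a real system would call the merge facility here
        else:
            database[key].append(path)
    return merges


def demo():
    """Create three small files, two of them identical, and grovel over them."""
    with tempfile.TemporaryDirectory() as root:
        contents = {"a.txt": b"same bytes", "b.txt": b"same bytes", "c.txt": b"different"}
        for name, data in contents.items():
            with open(os.path.join(root, name), "wb") as handle:
                handle.write(data)
        queue = deque(os.path.join(root, name) for name in sorted(contents))
        merges = grovel(queue, defaultdict(list))
        return [(os.path.basename(dup), os.path.basename(orig)) for dup, orig in merges]
```

Keying the lookup on both size and signature keeps the expensive full comparison for the rare case where both already agree.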
Abstract:
A computer method for online mining of quantitative association rules, consisting of two stages: a preprocessing stage followed by an online rule generation stage. The preprocessing stage reduces the required computational effort by organizing the relationships between antecedent attributes into a hierarchically arranged multidimensional indexing structure. The resulting structure facilitates the second stage, online processing, which generates the quantitative association rules. This second stage uses the multidimensional index structure created by the preprocessing stage: it first finds the areas in the data that correspond to the rules, and then uses a merging step to combine the interesting regions into a merged tree that gives a hierarchical representation of the rule set. The merged tree is then used to generate the rules.
Abstract:
A method and apparatus for mining text databases, employing sequential pattern phrase identification and shape queries, to discover trends. The method passes over a desired database using a dynamically generated shape query. Documents within the database are selected based on specific classifications and user-defined partitions. Once a partition is specified, transaction IDs are assigned to the words in the text documents according to their placement within each document. The transaction IDs encode both the position of each word within the document and the sentence, paragraph, and section breaks, and are represented in one embodiment as long integers with the sentence boundaries. The maximum and minimum gap between words in a phrase, and the minimum support that all phrases must meet for the selected time period, may be specified. A generalized sequential pattern method is used to generate those phrases in each partition that meet the minimum support threshold. The shape query engine takes the set of phrases for the partition of interest and selects those that match a given shape query. A query may request a trend such as "recent upwards trend", "recent spikes in usage", "downward trends", or "resurgence of usage". Once the phrases matching the shape query are found, they are presented to the user.
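One way to picture transaction IDs that encode both word position and structural breaks is sketched below. The abstract does not give the encoding, so everything here is an assumption: the constants `SENTENCE_GAP` and `PARAGRAPH_GAP`, the functions `assign_transaction_ids` and `within_gap`, and the naive splitting on periods and blank lines are all hypothetical.

```python
SENTENCE_GAP = 1_000      # assumed jump added at each sentence break
PARAGRAPH_GAP = 100_000   # assumed jump added at each paragraph break


def assign_transaction_ids(text):
    """Return (transaction_id, word) pairs; paragraphs are separated by blank lines."""
    ids = []
    next_id = 0
    for paragraph in text.split("\n\n"):
        for sentence in paragraph.split("."):
            words = sentence.split()
            if not words:
                continue  # skip the empty piece after a trailing period
            for word in words:
                ids.append((next_id, word.lower()))
                next_id += 1
            next_id += SENTENCE_GAP
        next_id += PARAGRAPH_GAP
    return ids


def within_gap(first_id, second_id, min_gap=1, max_gap=3):
    """True if two word IDs are close enough to belong to the same phrase."""
    return min_gap <= second_id - first_id <= max_gap


ids = assign_transaction_ids("Data mining works. Trends emerge.\n\nNew paragraph.")
```

With IDs spaced this way, a maximum-gap constraint smaller than `SENTENCE_GAP` keeps a candidate phrase from spanning a sentence boundary without any separate boundary check.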
Abstract:
A distributed storage system provides a method and apparatus for storing, retrieving, and sharing data items across multiple physical storage devices that may not always be connected with one another. The distributed storage system of the present invention comprises one or more "partitions" on distinct storage devices, with each partition comprising a group of associated data files. Partitions can be of various types. Journal partitions may be written to by a user and contain the user's updates to shared files. In the preferred embodiment, journal partitions reside on a storage device associated with a client computer in a client-server architecture. Other types of partitions, namely library and archive partitions, may reside on storage devices associated with a server computer in a client-server architecture. The files on the journal partitions of the various clients may, at various times, be merged into a file resident within the library partition. If two or more clients attempt to update or alter data related to the same file, the system resolves the conflict between the clients to determine which updates, if any, should be stored in the library partition. The merge operation may occur at various time intervals or be event driven. The archive partition stores files from the library partition.
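The journal-to-library merge might look something like the sketch below. The abstract does not say how conflicts are resolved, so this example assumes a simple "latest timestamp wins" policy purely for illustration. The `Update` record and the `merge_journals` function are hypothetical names, and plain dictionaries and lists stand in for the partitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    client: str
    filename: str
    timestamp: int
    data: str


def merge_journals(library, journals):
    """Merge journal updates into `library` and return the files that had conflicts.

    Assumed policy: when several clients update the same file, the update
    with the latest timestamp is stored in the library.
    """
    latest = {}
    writers = {}
    for journal in journals:
        for update in journal:
            writers.setdefault(update.filename, set()).add(update.client)
            current = latest.get(update.filename)
            if current is None or update.timestamp > current.timestamp:
                latest[update.filename] = update
    for filename, update in latest.items():
        library[filename] = update.data
    return sorted(name for name, clients in writers.items() if len(clients) > 1)


library = {"notes.txt": "v0"}
journals = [
    [Update("alice", "notes.txt", 1, "alice v1")],
    [Update("bob", "notes.txt", 2, "bob v2"), Update("bob", "todo.txt", 1, "buy milk")],
]
conflicts = merge_journals(library, journals)
```

Returning the conflicting filenames separately leaves room for a different policy, such as keeping no update at all, which the abstract also allows.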
Abstract:
A parallel processing approach for use in multiple hypothesis tracking applications that provides partitioning and load balancing to achieve greater processing efficiency. The present invention comprises a plurality of processors that are each coupled to a shared memory and that communicate with a central database stored in the shared memory. The central database is organized as a collection of radar tracks. Radar data is supplied to the processors as an input data stream organized in terms of radar tracks. The parallel processors are configured so that the next available processor retrieves the next successive measurement data point from the input data stream and updates tracks in the database using each retrieved measurement data point, with all processors operating independently without external synchronization. The database is partitioned into noninteracting clusters, the partitioning being executed in parallel by the plurality of processors, which again operate independently without external synchronization. The next available processor then retrieves the next successive cluster, forms and selects hypotheses based on the retrieved cluster, and updates the database based on the selected hypotheses. The present invention achieves an efficient implementation of multiple hypothesis tracking to provide for real-time multiprocessing. Parallelization of non-interactive and interactive multiple hypothesis tracking functions is readily achieved using the present invention. A parallel processing method for use in multiple hypothesis tracking applications is also disclosed.
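The cluster partitioning and per-cluster dispatch can be illustrated with the rough sketch below. It rests on several assumptions not stated in the abstract: two tracks are treated as interacting when they share a measurement, the partitioning here runs serially using union-find (the abstract describes it as running in parallel), and `process_cluster` is only a placeholder for hypothesis formation and selection. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_into_clusters(track_measurements):
    """Group track IDs into clusters; tracks sharing any measurement end up together."""
    parent = {track: track for track in track_measurements}

    def find(track):
        while parent[track] != track:
            parent[track] = parent[parent[track]]  # path halving
            track = parent[track]
        return track

    owner = {}  # measurement -> first track seen using it
    for track, measurements in track_measurements.items():
        for measurement in measurements:
            if measurement in owner:
                parent[find(track)] = find(owner[measurement])
            else:
                owner[measurement] = track

    clusters = {}
    for track in track_measurements:
        clusters.setdefault(find(track), []).append(track)
    return sorted(sorted(members) for members in clusters.values())


def process_cluster(cluster):
    """Placeholder for hypothesis formation and selection on one cluster."""
    return len(cluster)


def process_all(track_measurements, workers=4):
    """Partition the tracks, then let the next free worker take the next cluster."""
    clusters = partition_into_clusters(track_measurements)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return clusters, list(pool.map(process_cluster, clusters))


tracks = {"T1": {"m1", "m2"}, "T2": {"m2", "m3"}, "T3": {"m4"}, "T4": {"m3"}}
clusters, sizes = process_all(tracks)
```

Because the clusters do not interact, each can be handed to whichever worker is free next, which is where the load balancing comes from.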
Abstract:
A facility is provided for normalizing the format of stored data records using a dictionary that is generated from a training set of data records having predefined formats.