Abstract:
Systems and methods manage concurrent ETL processes accessing a database. Exemplary embodiments include a method for concurrency management for ETL processes in a database having database tables and communicatively coupled to a computer, the method including establishing a session lock for the database, determining that a current ETL process is accessing the database at a current time, associating a current expiration time with the session lock, the expiration time being stored in a lock table in the database, sending the session lock to the current ETL process, and performing ETL-level locking for the current ETL process.
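A minimal sketch of the expiration-based session lock described above, using SQLite as a stand-in for the managed database. The table name etl_lock, its columns, and the 60-second lease are illustrative assumptions, not details from the patent.

```python
# Sketch: a session lock whose expiration time lives in a lock table.
# A lock is granted if no one holds it or the prior holder's lease
# has expired; otherwise the caller is refused.
import sqlite3
import time

LEASE_SECONDS = 60  # hypothetical lock lease length

def acquire_session_lock(conn, holder):
    """Grant the lock if it is free or its expiration time has passed."""
    now = time.time()
    conn.execute("""CREATE TABLE IF NOT EXISTS etl_lock (
                        id INTEGER PRIMARY KEY CHECK (id = 1),
                        holder TEXT, expires_at REAL)""")
    row = conn.execute("SELECT holder, expires_at FROM etl_lock WHERE id = 1").fetchone()
    if row is None:
        conn.execute("INSERT INTO etl_lock VALUES (1, ?, ?)",
                     (holder, now + LEASE_SECONDS))
    elif row[1] < now or row[0] == holder:
        # Lock expired or re-entrant: take it over with a fresh expiration.
        conn.execute("UPDATE etl_lock SET holder = ?, expires_at = ? WHERE id = 1",
                     (holder, now + LEASE_SECONDS))
    else:
        return False  # another ETL process holds a live lock
    conn.commit()
    return True

conn = sqlite3.connect(":memory:")
print(acquire_session_lock(conn, "etl-process-A"))  # True: lock granted
print(acquire_session_lock(conn, "etl-process-B"))  # False: A still holds it
```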
Abstract:
A method for use with an information (or data) warehouse comprises managing the information warehouse with instructions in a declarative language. The instructions specify information warehouse-level tasks to be done without specifying certain details of how the tasks are to be implemented, for example, using databases and text indexers. The details are hidden from the user and include, for example, in an information warehouse having a FACT table that joins two or more dimension tables, details of database-level operations when structured data are being handled, including database command-line utilities, database drivers, and structured query language (SQL) statements; and details of text-indexing engines when unstructured data are being handled. The information warehouse is managed in a dynamic way in which different tasks, such as data loading tasks and information warehouse construction tasks, may be interleaved (i.e., there is no particular order in which the different tasks must be completed).
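A minimal sketch of the declarative idea, under the assumption that tasks are stated as plain records and a runner hides the database- and indexer-level details. The task vocabulary (load, build_text_index) and field names are invented for illustration.

```python
# Sketch: the user states *what* warehouse task to run; the runner
# decides *how*, hiding SQL details for structured data and the
# text-indexing engine for unstructured data.
tasks = [
    {"task": "load", "source": "sales.csv", "target": "FACT_SALES"},
    {"task": "build_text_index", "target": "DOC_DIM", "column": "body"},
]

def run(task):
    if task["task"] == "load":
        # Hidden detail: would emit bulk-load SQL or call a DB utility.
        print(f"LOAD DATA INFILE '{task['source']}' INTO TABLE {task['target']}")
    elif task["task"] == "build_text_index":
        # Hidden detail: would drive a text-indexing engine instead of SQL.
        print(f"indexing column {task['column']} of {task['target']}")

for t in tasks:  # tasks may be interleaved; no fixed order is required
    run(t)
```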
Abstract:
A method, system, and computer program product are disclosed. Exemplary embodiments of the method, system, and computer program product may include hardware, process steps, and computer program instructions for supporting versioning in a data warehouse. The data warehouse may include a data warehouse engine for creating a data warehouse including a fact table and temporary tables. Updated or new data records may be transferred into the data warehouse and bulk loaded into the temporary tables. The updated or new data records may be evaluated for attributes matching existing data records. A version number may be assigned to data records and data records may be marked as being the most current version. Updated and new data records may be bulk loaded from the temporary tables into the fact table when a version number or a version status is calculated.
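A minimal sketch of the versioning step, assuming a simple schema in which staged records receive version numbers computed from the existing fact table and the newest record per key is flagged current. All table and column names are illustrative, not from the patent.

```python
# Sketch: new/updated records arrive in a temporary staging table,
# get version = (max existing version for their key) + 1, and are
# bulk-moved into the fact table with the current-version flag set.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact (key TEXT, value TEXT, version INTEGER, is_current INTEGER);
    CREATE TABLE staging (key TEXT, value TEXT);
    INSERT INTO fact VALUES ('k1', 'old', 1, 1);
    INSERT INTO staging VALUES ('k1', 'new'), ('k2', 'first');
""")
# Retire the current flag on records about to be superseded.
conn.execute("""UPDATE fact SET is_current = 0
                WHERE key IN (SELECT key FROM staging)""")
# Bulk-load staged rows into the fact table with computed versions.
conn.execute("""INSERT INTO fact
                SELECT s.key, s.value,
                       COALESCE((SELECT MAX(version) FROM fact f
                                 WHERE f.key = s.key), 0) + 1,
                       1
                FROM staging s""")
for row in conn.execute("SELECT * FROM fact ORDER BY key, version"):
    print(row)  # ('k1','old',1,0) ('k1','new',2,1) ('k2','first',1,1)
```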
Abstract:
A system and method for supporting large and frequent updates to a data warehouse. The process leverages a set of temporary staging tables to track the updates. A set of intermediate steps is performed to accomplish bulk deletions of the outdated changed records and to modify the map tables for models such as snowflake. Finally, bulk load operations load the updates and insert them into the final dimension tables. The process ensures performance comparable to insertion-only schemes, with at most slight performance degradation. Furthermore, a modified process is applied to the newfact data warehouse dimension model. The process can be readily adapted to handle star schemas and other hierarchical data warehouse models.
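A minimal sketch of the staged update flow, assuming a single dimension table and omitting the snowflake map-table maintenance: outdated versions of the changed records are bulk-deleted, then the staged rows are bulk-loaded in one pass. Names are illustrative.

```python
# Sketch: a staging table tracks the updates; outdated rows are
# removed in bulk, then updates and inserts land together.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (cust_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE stage_customer (cust_id INTEGER, city TEXT);
    INSERT INTO dim_customer VALUES (1, 'Austin'), (2, 'Boston');
    INSERT INTO stage_customer VALUES (1, 'Chicago'), (3, 'Denver');
""")
# Step 1: bulk-delete outdated versions of the changed records.
conn.execute("""DELETE FROM dim_customer
                WHERE cust_id IN (SELECT cust_id FROM stage_customer)""")
# Step 2: bulk-load the updates and inserts into the final table.
conn.execute("INSERT INTO dim_customer SELECT * FROM stage_customer")
print(conn.execute("SELECT * FROM dim_customer ORDER BY cust_id").fetchall())
# [(1, 'Chicago'), (2, 'Boston'), (3, 'Denver')]
```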
Abstract:
A method of data loading for large information warehouses includes performing checkpointing concurrently with data loading into an information warehouse, the checkpointing ensuring consistency among multiple tables; and recovering from a failure in the data loading using the checkpointing. A method is also disclosed for performing versioning concurrently with data loading into an information warehouse. The versioning method enables processing undo and redo operations of the data loading between a later version and a previous version. Data load failure recovery is performed not by starting a data load from the beginning but from the latest checkpoint, at an information warehouse level, using a checkpoint process characterized by a state transition diagram having a multiplicity of states; state transitions among the states are tracked using a system state table.
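A minimal sketch of checkpoint-based recovery at the warehouse level, assuming a two-state slice of the state transition diagram and a simple system state table. The state names and table layout are illustrative assumptions, not the patent's.

```python
# Sketch: progress is checkpointed in a system state table so that
# a failed load resumes from the latest checkpoint, not from row 0.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE system_state
                (load_id TEXT, state TEXT, last_row INTEGER)""")

def checkpoint(load_id, rows_done):
    # Record the state transition so recovery can resume here.
    conn.execute("DELETE FROM system_state WHERE load_id = ?", (load_id,))
    conn.execute("INSERT INTO system_state VALUES (?, 'CHECKPOINTED', ?)",
                 (load_id, rows_done))
    conn.commit()

def resume_point(load_id):
    row = conn.execute("""SELECT last_row FROM system_state
                          WHERE load_id = ? AND state = 'CHECKPOINTED'""",
                       (load_id,)).fetchone()
    return row[0] if row else 0  # no checkpoint: start from the beginning

checkpoint("load-42", 10_000)
print(resume_point("load-42"))  # 10000: restart from the checkpoint
```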
Abstract:
Techniques for reducing a number of computations in a data storage process are provided. One or more computational elements are identified in the data storage process. An ordered structure of one or more nodes is generated using the one or more computational elements. Each of the one or more nodes represents one or more computational elements. Further, a weight is assigned to each of the one or more nodes. An ordered structure of one or more reusable nodes is generated by deleting one or more nodes in accordance with the assigned weights. The ordered structure of one or more reusable nodes is utilized to reduce the number of computations in the data storage process. The data storage process converts data from a first format into a second format, and stores the data in the second format on a computer readable medium for data analysis purposes.
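A minimal sketch of the pruning idea, assuming node weights are reuse counts and the ordered structure is a flat mapping: nodes below a weight threshold are deleted, and the surviving reusable nodes serve cached results. The weighting rule and threshold are illustrative assumptions.

```python
# Sketch: computational elements become weighted nodes; low-weight
# nodes are pruned, and only reusable nodes avoid recomputation.
nodes = {
    "parse(src)":       {"weight": 5, "result": "..."},
    "normalize(dates)": {"weight": 4, "result": "..."},
    "tokenize(body)":   {"weight": 1, "result": "..."},
}

THRESHOLD = 2  # hypothetical cut-off for keeping a node
reusable = {name: n for name, n in nodes.items() if n["weight"] >= THRESHOLD}

def compute(name):
    if name in reusable:          # reuse the cached result
        return reusable[name]["result"]
    return f"recomputed {name}"   # recompute the pruned element

print(sorted(reusable))           # ['normalize(dates)', 'parse(src)']
print(compute("tokenize(body)"))  # pruned, so it is recomputed
```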
Abstract:
Data may be modeled as an undirected graph. A set of entities and a set of attributes may be defined. A set of relationships may be defined to represent semantic associations with each association connecting at least two entities. Attributes may be associated with entities rather than with relationships. A hierarchical query language with a set of atomic operations on modeled data may be employed. The modeled data may be displayed on a display unit.
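A minimal sketch of the model, assuming entities are records carrying their attributes and relationships are labeled undirected edges; the neighbors helper stands in for one atomic operation of the hierarchical query language. All names are illustrative.

```python
# Sketch: attributes live on entities, not on relationships; each
# relationship is an undirected semantic association of two entities.
entities = {
    "alice": {"type": "Person", "age": 34},
    "acme":  {"type": "Company", "city": "Austin"},
}
relationships = [("alice", "acme", "works_at")]  # undirected edges

def neighbors(entity):
    """Atomic operation: entities semantically associated with `entity`."""
    out = set()
    for a, b, _label in relationships:
        if a == entity:
            out.add(b)
        elif b == entity:
            out.add(a)
    return out

print(neighbors("acme"))  # {'alice'}: attributes stay on the entities
```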
Abstract:
A method for use with an aggregation operation (e.g., on a relational database table) includes a sorting pass and a merging pass. The sorting pass includes: (a) reading blocks of the table from a storage medium into a memory using an aggregation method until the memory is substantially full or until all the data have been read into the memory; (b) determining a number k of blocks to write back to the storage medium from the memory; (c) selecting k blocks from memory, sorting the k blocks, and then writing the k blocks back to the storage medium as a new sublist; and (d) repeating steps (a), (b), and (c) for any unprocessed tuples in the database table. The merging pass includes: merging all the sublists to form an aggregation result using a merge-sort algorithm.
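A minimal sketch of the two passes, assuming blocks are in-memory lists, the storage medium is a list of sublists, and k is simply every buffered block; real I/O, the choice of k, and the in-memory aggregation method are elided.

```python
# Sketch: sorting pass fills memory, sorts, and writes sorted
# sublists back; merging pass merge-sorts the sublists and
# aggregates tuples with equal keys.
import heapq
from itertools import groupby

MEM = 4                                 # hypothetical memory budget (tuples)
table = [("b", 1), ("a", 2), ("b", 3), ("a", 1), ("c", 5), ("a", 4)]

# Sorting pass: read until memory is full, sort, emit a new sublist.
sublists = []
for i in range(0, len(table), MEM):
    sublists.append(sorted(table[i:i + MEM]))  # here k = all buffered blocks

# Merging pass: merge all sublists, then sum values per key.
merged = heapq.merge(*sublists)
result = {key: sum(v for _, v in grp)
          for key, grp in groupby(merged, key=lambda t: t[0])}
print(result)  # {'a': 7, 'b': 4, 'c': 5}
```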
Abstract:
One embodiment is a computer-implemented method for classifying documents in a collection of documents according to their intended readerships. The method comprises using a computer to select a document in the collection of documents; and using a computer to determine a characteristic of the selected document, the characteristic being: misleading when the document includes one or more features that are determined to be for a purpose other than reading the document; commercial when the document includes features that are presented for a commercial purpose; or personal when the document includes features of a personal opinion. The method further includes using a computer to classify the selected document as misleading, commercial, or personal according to its determined characteristic; and using a computer to repeat the steps of selecting a document, determining a characteristic of the selected document, and classifying the selected document for additional documents in the collection. At least some documents are classified as misleading, at least some documents are classified as commercial, and at least some documents are classified as personal. Other methods and computer program products are also disclosed according to even more embodiments.
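A minimal sketch of the three-way classification, with simple surface tests standing in for the determined characteristics; the specific features checked here (hidden text, price mentions, opinion phrases) are illustrative assumptions, not the claimed features.

```python
# Sketch: determine a characteristic of each document, then classify
# it as misleading, commercial, or personal accordingly.
def classify(doc):
    text = doc.lower()
    if "display:none" in text:
        return "misleading"  # features meant for something other than reading
    if "$" in text or "buy now" in text:
        return "commercial"  # features presented for a commercial purpose
    if "i think" in text or "in my opinion" in text:
        return "personal"    # features of a personal opinion
    return "unclassified"

collection = [
    "<div style='display:none'>cheap cheap cheap</div>",
    "Buy now for only $9.99!",
    "I think this camera is excellent.",
]
for doc in collection:  # repeat select/determine/classify per document
    print(classify(doc))
```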