Abstract:
A system and method for autonomic data storage and movement for big data analytics. Costs, such as a storage cost and a processing cost, are calculated for received data. The processing type associated with the received data is determined in response to the calculated costs. The received data is classified into one of a set of hierarchical storage classes based upon the determined processing type. The hierarchical storage classes include no data store, memory, HDFS, database, disk archive, external clouds, and data removal. The received data is then stored in the storage location associated with that class. In the event that insufficient capacity is available in that location, the priority of the received data and the priority of previously stored data are determined and compared. The priority is calculated based on potential usage, privacy, estimated cost, frequency of use, and the age of the data. The lower-priority data is then moved to the next lower hierarchical class for storage.
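The capacity-driven demotion described above can be sketched as follows. This is a minimal illustration, not the patented method: the `Item` and `TieredStore` names, the numeric capacities, and the single precomputed `priority` score are all assumptions (the abstract derives priority from usage, privacy, cost, frequency, and age but gives no formula), and the "no data store" class is omitted for brevity.

```python
from dataclasses import dataclass

# Storage classes from the abstract, fastest tier first ("no data store" omitted).
TIERS = ["memory", "HDFS", "database", "disk archive", "external clouds", "data removal"]


@dataclass
class Item:
    name: str
    size: int
    priority: float  # hypothetical score; higher means "keep in a faster tier"


class TieredStore:
    def __init__(self, capacity):
        # capacity maps tier name -> maximum total size; "data removal" is unbounded.
        self.capacity = capacity
        self.items = {tier: [] for tier in TIERS}

    def used(self, tier):
        return sum(item.size for item in self.items[tier])

    def store(self, item, tier):
        self.items[tier].append(item)
        if tier == "data removal":
            return  # terminal class: the item is only recorded as discarded
        next_tier = TIERS[TIERS.index(tier) + 1]
        # While the tier is over capacity, compare priorities and demote the
        # lowest-priority item (possibly the new one) to the next lower class.
        while self.used(tier) > self.capacity[tier]:
            victim = min(self.items[tier], key=lambda i: i.priority)
            self.items[tier].remove(victim)
            self.store(victim, next_tier)

    def tier_of(self, name):
        for tier, items in self.items.items():
            if any(i.name == name for i in items):
                return tier
        return None
```

Demotion cascades: an item pushed out of one tier may in turn push a lower-priority item out of the next, which matches the "next lower hierarchical class" wording but is only one plausible reading of it.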
Abstract:
The present invention generally relates to systems and methods for executing scripts (sequences of declarative operations) on large data sets. Some implementations store descriptions of previously executed operations and their associated input and output data sets. When similar operations are subsequently executed on the same data, a subset of it, a superset of it, or any fragment of it, some implementations detect the duplicated operations and access the previously stored output data sets in order to re-use data and reduce the amount of execution, thus avoiding time-consuming duplicative computations.
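A minimal sketch of the re-use idea, assuming the simplest case only: an operation is treated as a duplicate when its name, parameters, and input rows are all identical. The `ResultCache` class and its fingerprinting scheme are hypothetical; the subset, superset, and fragment matching mentioned in the abstract is not attempted here.

```python
import hashlib
import json


class ResultCache:
    """Stores outputs keyed by a fingerprint of (operation, parameters, input)."""

    def __init__(self):
        self._outputs = {}
        self.hits = 0

    @staticmethod
    def _fingerprint(op_name, params, rows):
        # Inputs must be JSON-serializable; sort_keys makes dict ordering irrelevant.
        payload = json.dumps([op_name, params, rows], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, op_name, params, rows, fn):
        key = self._fingerprint(op_name, params, rows)
        if key in self._outputs:
            self.hits += 1  # duplicate operation detected: skip execution
            return self._outputs[key]
        result = fn(rows, **params)
        self._outputs[key] = result
        return result


def filter_min(rows, minimum):
    # Example declarative-style operation: keep rows at or above a threshold.
    return [r for r in rows if r >= minimum]


cache = ResultCache()
first = cache.run("filter_min", {"minimum": 3}, [1, 2, 3, 4], filter_min)
second = cache.run("filter_min", {"minimum": 3}, [1, 2, 3, 4], filter_min)  # served from cache
```

Hashing the full input is itself expensive for large data sets; a real system would more likely fingerprint a data set identifier and version, which is outside what the abstract specifies.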
Abstract:
A method, non-transitory computer readable medium, and apparatus for adapting resources of a cluster of nodes for a real-time streaming workflow are disclosed. For example, the method receives a notification that a node of the cluster of nodes associated with an instance of a process of the real-time streaming workflow is predicted to be a bottleneck, identifies a number of hops over which to send a resource statement when the bottleneck is predicted, the number being chosen to minimize a ripple effect associated with transmitting the resource statement, transmits the resource statement to one or more nodes of the cluster of nodes within the number of hops, receives a response from one of the one or more nodes within the cluster of nodes, and adapts a resource usage to the node from which the response was received.
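The "transmit within a number of hops" step can be illustrated with a breadth-first search over the cluster topology. This sketch assumes the hop count has already been chosen and that the topology is available as an adjacency mapping; the function name and graph representation are hypothetical, and the selection of a hop count that minimizes the ripple effect is not modeled.

```python
from collections import deque


def nodes_within_hops(graph, source, max_hops):
    """Return nodes reachable from source in at most max_hops, in BFS order."""
    seen = {source}
    queue = deque([(source, 0)])
    reached = []
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not propagate the statement beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                reached.append(neighbor)
                queue.append((neighbor, dist + 1))
    return reached


# A four-node chain: a - b - c - d, with "a" predicted to be the bottleneck.
cluster = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
recipients = nodes_within_hops(cluster, "a", 2)  # ["b", "c"]
```

Limiting the hop count bounds how many nodes react to a single predicted bottleneck, which is the intuition behind containing the ripple effect.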
Abstract:
A method, non-transitory computer readable medium, and apparatus for scheduling a job request in a data processing platform are disclosed. The method receives a new job request having a priority selected by a user; submits the new job request to an online job queue comprising a plurality of jobs, wherein each one of the plurality of jobs comprises a respective priority selected by a respective user; and schedules the new job request and the plurality of jobs in the online job queue to one or more available worker nodes in a unit time slot based upon a comparison of the priority of the new job request and the respective priorities of the plurality of jobs in the online job queue, wherein the scheduling algorithm is based on one of: blocks having a variable size and a static processing time, or blocks having a static size and a variable processing time.
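The priority comparison in a unit time slot can be sketched with a heap-backed queue. This is an illustrative simplification, not the claimed algorithm: the `OnlineJobQueue` class is hypothetical, every job is assumed to occupy exactly one worker for one slot, ties are broken by submission order, and the two block models (variable size or variable processing time) are not represented.

```python
import heapq
import itertools


class OnlineJobQueue:
    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: earlier submissions first

    def submit(self, job_id, priority):
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._order), job_id))

    def schedule_slot(self, available_workers):
        """Assign the highest-priority queued jobs to workers for one time slot."""
        assigned = []
        while self._heap and len(assigned) < available_workers:
            _, _, job_id = heapq.heappop(self._heap)
            assigned.append(job_id)
        return assigned


queue = OnlineJobQueue()
for job_id, priority in [("a", 1), ("b", 5), ("c", 3), ("d", 5)]:
    queue.submit(job_id, priority)

slot_1 = queue.schedule_slot(2)  # ["b", "d"]
slot_2 = queue.schedule_slot(2)  # ["c", "a"]
```

A newly submitted job simply joins the heap, so it is compared against all waiting jobs the next time a slot is scheduled.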