MANAGING PROVENANCE INFORMATION FOR DATA PROCESSING PIPELINES
Abstract:
Disclosed is a method for managing provenance information associated with one or more interconnected provenance entities in a provenance system for data processing pipelines in a distributed cloud environment over a network interface, wherein each of the data processing pipelines is configured to read in data, transform the data, and output the transformed data. The method comprises the following steps, performed by a configuration component: obtaining at least one declarative intent representing a configuration indicative of requirements and levels of priority for storage of provenance information for each of the data processing pipelines; deriving the requirements and levels of priority for storage of provenance information for each of the data processing pipelines based on the obtained at least one declarative intent, wherein one of the levels of priority (a first level of priority) is higher than the other levels of priority (second levels of priority); estimating storage capacity for storage of provenance information in the provenance system based on the derived requirements and levels of priority; storing the provenance information for each of the data processing pipelines according to the derived requirements and levels of priority; and, when actual storage consumption for storage of provenance information in the provenance system meets a threshold of storage capacity set based on the estimated storage capacity, reducing a data amount for storage of provenance information of the second levels of priority in the provenance system. A corresponding computer program product, arrangement, configuration component, and system are also disclosed.
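The steps recited in the abstract can be illustrated with a minimal Python sketch. It is not the disclosed implementation. Every name here (Intent, ProvenanceStore, estimate_capacity, threshold_percent) is hypothetical, as are the per-record sizes, the 80 percent threshold, and the choice to reduce second-level data by dropping the oldest records first. The abstract only states that the data amount for the second levels of priority is reduced once the threshold is met.

```python
from dataclasses import dataclass, field

FIRST_LEVEL = 1  # hypothetical encoding: 1 is the first (highest) level of priority


@dataclass(frozen=True)
class Intent:
    """Hypothetical declarative intent for one data processing pipeline."""
    pipeline: str
    priority: int          # FIRST_LEVEL, or a larger number for a second level
    bytes_per_record: int  # assumed size of one provenance record
    expected_records: int  # assumed number of records to retain


def estimate_capacity(intents):
    """Estimate storage capacity from the derived requirements."""
    return sum(i.bytes_per_record * i.expected_records for i in intents)


@dataclass
class ProvenanceStore:
    intents: dict                # pipeline name -> Intent
    threshold_percent: int = 80  # assumed share of the estimated capacity
    records: list = field(default_factory=list)  # (pipeline, priority, size)

    def __post_init__(self):
        estimate = estimate_capacity(self.intents.values())
        self.threshold = estimate * self.threshold_percent // 100

    def used(self):
        """Actual storage consumption in bytes."""
        return sum(size for _, _, size in self.records)

    def store(self, pipeline):
        """Store one provenance record, then check the threshold."""
        intent = self.intents[pipeline]
        self.records.append((pipeline, intent.priority, intent.bytes_per_record))
        if self.used() >= self.threshold:
            self.reduce_second_levels()

    def reduce_second_levels(self):
        """Drop the oldest second-level records until usage is below the threshold.

        First-level records are never dropped in this sketch.
        """
        excess = self.used() - self.threshold
        kept = []
        for record in self.records:
            _, priority, size = record
            if priority != FIRST_LEVEL and excess >= 0:
                excess -= size  # this second-level record is discarded
            else:
                kept.append(record)
        self.records = kept


# Usage: two pipelines, one at the first level and one at a second level.
intents = {
    "billing": Intent("billing", FIRST_LEVEL, 100, 10),
    "logs": Intent("logs", 2, 100, 10),
}
store = ProvenanceStore(intents)
for _ in range(10):
    store.store("logs")
for _ in range(6):
    store.store("billing")
```

In this usage the estimate is 2000 bytes and the threshold 1600 bytes. The sixth "billing" record brings consumption up to the threshold, so the oldest "logs" record is discarded while all "billing" records are retained. A real system might instead compress, sample, or summarise second-level provenance; dropping whole records is only the simplest reduction to show.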