OPTIMIZING REFERENCES TO CHANGING DATA SETS IN DISTRIBUTED DATA LAKES

    Publication No.: US20240248905A1

    Publication date: 2024-07-25

    Application No.: US18159667

    Application date: 2023-01-25

    Applicant: VMware, Inc.

    CPC classification number: G06F16/254

    Abstract: References to changing data sets in distributed data lakes are optimized. As part of a transaction, a first message is received. The first message identifies a table and first data to be written to the table. Based on at least the table, the first message is routed to a first ingestion node of a plurality of ingestion nodes. The first data is persisted in temporary storage. Location information of the persisted first data is determined. A data available message comprising a self-describing reference to the first data is published, by the first ingestion node, to a first reader node of a plurality of reader nodes. The self-describing reference identifies the first ingestion node, the location information of the first data, and a range of the first data.
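The flow in this abstract (route by table, persist to temporary storage, publish a self-describing reference) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `SelfDescribingRef` fields, the CRC32-modulo routing rule, the location string scheme, and the dictionary standing in for temporary storage. The abstract does not say how the range is expressed, so row offsets are assumed.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SelfDescribingRef:
    """Hypothetical shape of the reference carried in a data available message."""
    ingestion_node: str  # node that persisted the data
    location: str        # where the data sits in temporary storage
    start: int           # first row of the range (inclusive, assumed)
    end: int             # end of the range (exclusive, assumed)


def route(table: str, nodes: list[str]) -> str:
    """Pick an ingestion node from the table name (assumed rule: stable hash modulo node count)."""
    return nodes[zlib.crc32(table.encode("utf-8")) % len(nodes)]


def ingest(table: str, rows: list[str], nodes: list[str],
           temp_store: dict[str, list[str]]) -> SelfDescribingRef:
    """Route the write, persist it, and build the reference a reader node would receive."""
    node = route(table, nodes)
    location = f"{node}/{table}/{len(temp_store)}"  # hypothetical location scheme
    temp_store[location] = list(rows)               # stand-in for temporary storage
    return SelfDescribingRef(node, location, 0, len(rows))
```

Because the reference names the node, the location, and the range, a reader holding it could fetch the data without a separate lookup, which appears to be the point of calling it self-describing.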

    OPTIMIZING STORAGE FILE SIZE IN DISTRIBUTED DATA LAKES

    Publication No.: US20240248879A1

    Publication date: 2024-07-25

    Application No.: US18159677

    Application date: 2023-01-25

    Applicant: VMware, Inc.

    CPC classification number: G06F16/172 G06F16/122 G06F16/1724

    Abstract: Storage file size in distributed data lakes is optimized. At a first ingestion node of a plurality of ingestion nodes, a merge advisory is received from a coordinator. The merge advisory indicates a transaction identifier (ID). Received data associated with the transaction ID is persisted, which includes: determining whether the received data, if persisted together in a single file, will exceed a maximum desired file size; based on determining that the maximum desired file size will not be exceeded, persisting the received data in a single file; and based on determining that the maximum desired file size will be exceeded, persisting the received data in a plurality of files, none of which exceeds the maximum desired file size. A location of the persisted received data in permanent storage is identified, by the first ingestion node, to the coordinator.
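The size check described in this abstract can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented method: the function name `plan_files`, the representation of received data as a list of record sizes, and the greedy in-order packing are all hypothetical, and the abstract does not say how data is divided once the limit would be exceeded.

```python
def plan_files(record_sizes: list[int], max_file_size: int) -> list[list[int]]:
    """Group record sizes into files, none larger than max_file_size.

    Returns one group when everything fits in a single file, otherwise
    several groups packed greedily in arrival order (assumed strategy).
    """
    if max_file_size <= 0:
        raise ValueError("max_file_size must be positive")
    if not record_sizes:
        return []

    # Case 1: the data persisted together does not exceed the limit.
    if sum(record_sizes) <= max_file_size:
        return [list(record_sizes)]

    # Case 2: split into multiple files, each within the limit.
    files: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for size in record_sizes:
        if size > max_file_size:
            # Assumption: a single oversized record is treated as an error.
            raise ValueError("record larger than max_file_size")
        if current and current_size + size > max_file_size:
            files.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    files.append(current)
    return files
```

With a limit of 100, `plan_files([40, 30, 20], 100)` yields a single file, while `plan_files([60, 50, 40, 30], 100)` yields three files of sizes 60, 90, and 30. The ingestion node would then report the resulting file locations to the coordinator.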
