OPTIMIZING STORAGE FILE SIZE IN DISTRIBUTED DATA LAKES

    Publication number: US20240248879A1

    Publication date: 2024-07-25

    Application number: US18159677

    Application date: 2023-01-25

    Applicant: VMware, Inc.

    CPC classification number: G06F16/172 G06F16/122 G06F16/1724

    Abstract: Storage file size in distributed data lakes is optimized. At a first ingestion node of a plurality of ingestion nodes, a merge advisory is received from a coordinator. The merge advisory indicates a transaction identifier (ID). Received data associated with the transaction ID is persisted, which includes: determining whether the received data, if persisted together in a single file, will exceed a maximum desired file size; based on determining that the maximum desired file size will not be exceeded, persisting the received data in a single file; and based on determining that the maximum desired file size will be exceeded, persisting the received data in a plurality of files, none of which exceeds the maximum desired file size. A location of the persisted received data in the permanent storage is identified, by the first ingestion node, to the coordinator.
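
The size check described in this abstract can be illustrated with a minimal sketch. The abstract does not say how data is divided among files, so the greedy in-order packing below is an assumption, as are the function name `plan_files`, the representation of received data as a list of record sizes in bytes, and the rule that a single record larger than the limit is an error.

```python
from typing import List


def plan_files(record_sizes: List[int], max_file_size: int) -> List[List[int]]:
    """Group record sizes into files, none larger than max_file_size.

    Hypothetical helper: the packing strategy is an assumption, not
    something the abstract specifies.
    """
    if not record_sizes:
        return []

    # Case 1: everything fits in a single file.
    if sum(record_sizes) <= max_file_size:
        return [list(record_sizes)]

    # Case 2: split into several files, filling each one in arrival order.
    files: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in record_sizes:
        if size > max_file_size:
            # Assumption: an oversized single record cannot be split.
            raise ValueError(f"record of {size} bytes exceeds the file limit")
        if current and current_size + size > max_file_size:
            files.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        files.append(current)
    return files


if __name__ == "__main__":
    print(plan_files([10, 20], 100))
    print(plan_files([40, 40, 40], 100))
```

In this sketch the ingestion node would write one file per inner list and then report the resulting locations to the coordinator; the reporting step is omitted because the abstract gives no message format for it.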

    OPTIMIZING REFERENCES TO CHANGING DATA SETS IN DISTRIBUTED DATA LAKES

    Publication number: US20240248905A1

    Publication date: 2024-07-25

    Application number: US18159667

    Application date: 2023-01-25

    Applicant: VMware, Inc.

    CPC classification number: G06F16/254

    Abstract: References to changing data sets in distributed data lakes are optimized. As part of a transaction, a first message is received. The first message identifies a table and first data to be written to the table. Based on at least the table, the first message is routed to a first ingestion node of a plurality of ingestion nodes. The first data is persisted in temporary storage. Location information of the persisted first data is determined. A data available message comprising a self-describing reference to the first data is published, by the first ingestion node, to a first reader node of a plurality of reader nodes. The self-describing reference identifies the first ingestion node, the location information of the first data, and a range of the first data.
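
A self-describing reference of the kind described in this abstract might look like the following sketch. The abstract states only what the reference identifies (the ingestion node, the location information, and a range of the data); the field names, the JSON encoding, the `data_available` message type, and the CRC32-based table routing are all assumptions made for illustration.

```python
import json
import zlib
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class DataReference:
    """Hypothetical self-describing reference to data in temporary storage."""
    ingestion_node: str  # which ingestion node holds the data
    location: str        # where the data sits in temporary storage
    range_start: int     # first row covered by the reference (inclusive)
    range_end: int       # last row covered by the reference (exclusive)


def route_table(table: str, nodes: List[str]) -> str:
    """Pick an ingestion node from the table name.

    Assumption: a stable hash of the table name; the abstract only says
    routing is based on at least the table.
    """
    return nodes[zlib.crc32(table.encode("utf-8")) % len(nodes)]


def make_data_available_message(table: str, ref: DataReference) -> str:
    """Serialize a data available message carrying the reference."""
    return json.dumps(
        {"type": "data_available", "table": table, "reference": asdict(ref)}
    )


if __name__ == "__main__":
    nodes = ["ingest-0", "ingest-1", "ingest-2"]
    node = route_table("orders", nodes)
    ref = DataReference(node, "tmp/orders/part-0", 0, 100)
    print(make_data_available_message("orders", ref))
```

Because the reference carries the node, location, and range together, a reader node could fetch the data without a separate lookup, which is the sense in which the sketch treats the reference as self-describing.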
