-
Publication number: US20240248879A1
Publication date: 2024-07-25
Application number: US18159677
Filing date: 2023-01-25
Applicant: VMware, Inc.
Inventor: Dimiter DIMITRIEV, Kostadin GEORGIEV, Abhishek GUPTA, Christos KARAMANOLIS, Richard P. SPILLANE
IPC: G06F16/172, G06F16/11, G06F16/17
CPC classification number: G06F16/172, G06F16/122, G06F16/1724
Abstract: Storage file size in distributed data lakes is optimized. At a first ingestion node of a plurality of ingestion nodes, a merge advisory is received from a coordinator. The merge advisory indicates a transaction identifier (ID). Received data associated with the transaction ID is persisted in permanent storage, which includes: determining whether the received data, if persisted together in a single file, will exceed a maximum desired file size; based on determining that the maximum desired file size will not be exceeded, persisting the received data in a single file; and based on determining that the maximum desired file size will be exceeded, persisting the received data in a plurality of files, none of which exceeds the maximum desired file size. A location of the persisted received data in the permanent storage is identified, by the first ingestion node, to the coordinator.
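The size check described in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function name `plan_files`, the representation of received data as a list of record sizes in bytes, and the greedy packing strategy are all assumptions made for illustration.

```python
from typing import List


def plan_files(record_sizes: List[int], max_file_size: int) -> List[List[int]]:
    """Group record sizes into files so that no file exceeds max_file_size.

    Hypothetical helper: each inner list stands for one file to persist.
    """
    if not record_sizes:
        return []
    # Case 1: everything fits in a single file.
    if sum(record_sizes) <= max_file_size:
        return [list(record_sizes)]
    # Case 2: split greedily into several files, each within the limit.
    files: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in record_sizes:
        if size > max_file_size:
            raise ValueError("a single record exceeds the maximum file size")
        if current and current_size + size > max_file_size:
            files.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        files.append(current)
    return files
```

In this sketch, an ingestion node would persist one file per inner list and then report the resulting locations to the coordinator; how records are actually grouped in the patent is not stated in the abstract.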
-
Publication number: US20240248905A1
Publication date: 2024-07-25
Application number: US18159667
Filing date: 2023-01-25
Applicant: VMware, Inc.
Inventor: Dimiter DIMITRIEV, Kostadin GEORGIEV, Abhishek GUPTA, Christos KARAMANOLIS, Richard P. SPILLANE
IPC: G06F16/25
CPC classification number: G06F16/254
Abstract: References to changing data sets in distributed data lakes are optimized. As part of a transaction, a first message is received. The first message identifies a table and first data to be written to the table. Based on at least the table, the first message is routed to a first ingestion node of a plurality of ingestion nodes. The first data is persisted in temporary storage. Location information of the persisted first data is determined. A data available message comprising a self-describing reference to the first data is published, by the first ingestion node, to a first reader node of a plurality of reader nodes. The self-describing reference identifies the first ingestion node, the location information of the first data, and a range of the first data.
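As a rough illustration of the self-describing reference named in the abstract, the following Python sketch models a reference carrying the three stated elements: the ingestion node, the location information, and the range of the data. The class name `DataReference`, the field names, the half-open row range, and the JSON message layout are assumptions, not details taken from the patent.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DataReference:
    """Hypothetical self-describing reference to data held in temporary storage."""

    ingestion_node: str  # identifies the node that persisted the data
    location: str        # where the data sits in temporary storage
    start: int           # first row of the range (inclusive, assumed)
    end: int             # end of the range (exclusive, assumed)

    def to_message(self) -> str:
        """Serialize as a 'data available' message for reader nodes."""
        return json.dumps({"type": "data_available", "ref": asdict(self)})

    @staticmethod
    def from_message(raw: str) -> "DataReference":
        """Rebuild the reference on the reader side from the message alone."""
        return DataReference(**json.loads(raw)["ref"])
```

Because the message carries everything needed to find the data, a reader node in this sketch could resolve it without asking a central catalog, which appears to be the point of making the reference self-describing.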
-