Invention Publication
- Patent Title: OPTIMIZING STORAGE FILE SIZE IN DISTRIBUTED DATA LAKES
-
Application No.: US18159677Application Date: 2023-01-25
-
Publication No.: US20240248879A1Publication Date: 2024-07-25
- Inventor: Dimiter DIMITRIEV , Kostadin GEORGIEV , Abhishek GUPTA , Christos KARAMANOLIS , Richard P. SPILLANE
- Applicant: VMware, Inc.
- Applicant Address: US CA Palo Alto
- Assignee: VMware, Inc.
- Current Assignee: VMware, Inc.
- Current Assignee Address: US CA Palo Alto
- Main IPC: G06F16/172
- IPC: G06F16/172 ; G06F16/11 ; G06F16/17

Abstract:
Storage file size in distributed data lakes is optimized. At a first ingestion node of a plurality of ingestion nodes, a merge advisory is received from a coordinator. The merge advisory indicates a transaction identifier (ID). Received data associated with the transaction ID is persisted, which includes: determining whether the received data, persisted together in a single file will exceed a maximum desired file size; based on determining that the maximum desired file size will not be exceeded, persisting the received data in a single file; and based on determining that the maximum desired file size will be exceeded, persisting the received data in a plurality of files that each does not exceed the maximum desired file size. A location of the persisted received data in the permanent storage is identified, by the first ingestion node, to the coordinator.
Information query