OPTIMIZING STORAGE FILE SIZE IN DISTRIBUTED DATA LAKES

Invention Publication

US20240248879A1 OPTIMIZING STORAGE FILE SIZE IN DISTRIBUTED DATA LAKES (Status: Under examination, published)

Patent Title: OPTIMIZING STORAGE FILE SIZE IN DISTRIBUTED DATA LAKES
Application No.: US18159677

Application Date: 2023-01-25
Publication No.: US20240248879A1

Publication Date: 2024-07-25
Inventor: Dimiter DIMITRIEV, Kostadin GEORGIEV, Abhishek GUPTA, Christos KARAMANOLIS, Richard P. SPILLANE
Applicant: VMware, Inc.
Applicant Address: US CA Palo Alto
Assignee: VMware, Inc.
Current Assignee: VMware, Inc.
Current Assignee Address: US CA Palo Alto
Main IPC: G06F16/172
IPC: G06F16/172; G06F16/11; G06F16/17

Abstract:

Storage file size in distributed data lakes is optimized. At a first ingestion node of a plurality of ingestion nodes, a merge advisory is received from a coordinator. The merge advisory indicates a transaction identifier (ID). Received data associated with the transaction ID is persisted in permanent storage, which includes: determining whether the received data, if persisted together in a single file, will exceed a maximum desired file size; based on determining that the maximum desired file size will not be exceeded, persisting the received data in a single file; and based on determining that the maximum desired file size will be exceeded, persisting the received data in a plurality of files, none of which exceeds the maximum desired file size. The first ingestion node then identifies the location of the persisted received data in the permanent storage to the coordinator.
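
The size check described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names (plan_files, persist), the location naming scheme, the use of a dictionary as a stand-in for permanent storage, and the greedy packing strategy are all assumptions made for illustration. The abstract does not say how data is divided among multiple files, nor how a single record larger than the limit is handled; this sketch rejects such a record.

```python
from typing import Dict, List


def plan_files(records: List[bytes], max_file_size: int) -> List[List[bytes]]:
    """Group records into files, none larger than max_file_size bytes.

    Hypothetical helper: greedy, order-preserving packing is an assumption.
    """
    if max_file_size <= 0:
        raise ValueError("max_file_size must be positive")
    total = sum(len(r) for r in records)
    # Case 1: everything fits, so persist the data together in a single file.
    if total <= max_file_size:
        return [list(records)] if records else []
    # Case 2: the limit would be exceeded, so split across several files.
    files: List[List[bytes]] = []
    current: List[bytes] = []
    current_size = 0
    for record in records:
        if len(record) > max_file_size:
            # Assumption: oversized single records are rejected.
            raise ValueError("a single record exceeds max_file_size")
        if current_size + len(record) > max_file_size:
            files.append(current)
            current, current_size = [], 0
        current.append(record)
        current_size += len(record)
    if current:
        files.append(current)
    return files


def persist(transaction_id: int, records: List[bytes], max_file_size: int,
            storage: Dict[str, bytes]) -> List[str]:
    """Write the planned files to `storage` and return their locations.

    The returned list is what an ingestion node would report back to the
    coordinator; the location format is hypothetical.
    """
    locations = []
    for index, group in enumerate(plan_files(records, max_file_size)):
        location = f"txn-{transaction_id}/part-{index:04d}"
        storage[location] = b"".join(group)
        locations.append(location)
    return locations
```

For example, three 40-byte records with a 100-byte limit would be written as two files (80 bytes and 40 bytes), while the same records with a 200-byte limit would be written as one 120-byte file.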

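IPC Classification:

G	Physics
G06	Computing; Calculating or Counting
G06F	Electric digital data processing (computer systems based on specific computational models: see G06N)
G06F16/00	Information retrieval; Database structures therefor; File system structures therefor
G06F16/10	. File systems; File servers
G06F16/17	.. Details of further file system functions
G06F16/172	... Caching, prefetching or hoarding of files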