-
Publication No.: US20250013606A1
Publication Date: 2025-01-09
Application No.: US18218410
Filing Date: 2023-07-05
Applicant: Databricks, Inc.
Inventor: Prakhar Jain, Frederick Ryan Johnson, Terry Kim, Vijayan Prabhakaran, Bart Samwel
Abstract: A data processing service generates a data classifier tree for managing data files of a data table. The data classifier tree may be configured as a KD-classifier tree and includes a plurality of nodes and edges. A node of the data classifier tree may represent a splitting condition with respect to key-values for a respective key. A node of the data classifier tree may be associated with one or more data files assigned to the node. The data files assigned to the node each include a subset of records having key-values that satisfy the conditions represented by the node and parent nodes of the node. The data processing service may efficiently cluster the data in the data table while reducing the number of data files that are rewritten when data is modified or added to the data table.
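The node-to-file assignment described in this abstract can be sketched as follows. This is a minimal illustration, not the claimed implementation. It assumes each file is summarized by per-key (min, max) ranges and that a file straddling a split stays at the interior node rather than being rewritten. The names `Node`, `assign_file`, and `files_for_point`, and the integer thresholds, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical summary of a data file: per-key (min, max) of its key-values.
KeyRanges = Dict[str, Tuple[int, int]]


@dataclass
class Node:
    # Internal nodes split on `key` at `threshold`: key-values below the
    # threshold belong to `left`, all others to `right`. Leaves have key=None.
    key: Optional[str] = None
    threshold: Optional[int] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    files: List[str] = field(default_factory=list)


def assign_file(root: Node, name: str, ranges: KeyRanges) -> Node:
    """Attach a file to the deepest node whose conditions all its records satisfy."""
    node = root
    while node.key is not None:
        lo, hi = ranges[node.key]
        if hi < node.threshold:
            node = node.left
        elif lo >= node.threshold:
            node = node.right
        else:
            break  # the file straddles this split, so it stays at this node
    node.files.append(name)
    return node


def files_for_point(root: Node, point: Dict[str, int]) -> List[str]:
    """List the files that may hold a record with the given key-values."""
    found: List[str] = []
    node = root
    while node is not None:
        found.extend(node.files)
        if node.key is None:
            break
        node = node.left if point[node.key] < node.threshold else node.right
    return found


# Root splits on key "a" at 50; its left child splits on key "b" at 10.
root = Node("a", 50, left=Node("b", 10, left=Node(), right=Node()), right=Node())
assign_file(root, "f1", {"a": (0, 20), "b": (0, 5)})     # reaches a leaf
assign_file(root, "f2", {"a": (40, 60), "b": (0, 5)})    # straddles the root split
assign_file(root, "f3", {"a": (0, 49), "b": (5, 15)})    # straddles the "b" split
assign_file(root, "f4", {"a": (50, 90), "b": (0, 100)})  # reaches the right leaf
```

In this sketch a point lookup reads only the files on one root-to-leaf path, and a newly added file that does not fit a single leaf is attached to an interior node instead of forcing existing files to be rewritten.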
-
Publication No.: US20250013619A1
Publication Date: 2025-01-09
Application No.: US18218766
Filing Date: 2023-07-06
Applicant: Databricks, Inc.
Inventor: Prakhar Jain, Frederick Ryan Johnson, Bart Samwel
IPC: G06F16/22, G06F16/2453, G06F16/28
Abstract: A data tree for managing data files of a data table and performing one or more transaction operations on the data table is described. The data tree is configured as a KD-epsilon tree and includes a plurality of nodes and edges. A node of the data tree may represent a splitting condition with respect to key-values for a respective key. A leaf node of the data tree may correspond to a data file for a data table that includes a subset of records having key-values that satisfy the condition for the node and conditions associated with parent nodes of the node. A parent node may correspond to a file that includes a buffer storing changes to the data files reachable from that parent node, as well as dedicated storage for pointers to its child nodes. By using the data tree, the data processing system may efficiently cluster the data in the data table while reducing the number of data files that are rewritten.
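The buffering behavior described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: only inserts are modeled, a parent flushes its buffer to its children once it holds more than a fixed number of records, and a leaf "file" counts as rewritten each time a batch reaches it. The names `Leaf`, `Parent`, `push`, `find`, and `BUFFER_CAPACITY` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

Record = Dict[str, int]
BUFFER_CAPACITY = 2  # hypothetical limit on buffered changes per parent node


@dataclass
class Leaf:
    records: List[Record] = field(default_factory=list)
    rewrites: int = 0  # how many times this data file was rewritten


@dataclass
class Parent:
    key: str
    threshold: int
    left: Union["Parent", Leaf]   # pointers to the child nodes
    right: Union["Parent", Leaf]
    buffer: List[Record] = field(default_factory=list)  # pending changes


def push(node: Union[Parent, Leaf], records: List[Record]) -> None:
    """Add records at `node`, flushing a parent's buffer only when it overflows."""
    if not records:
        return
    if isinstance(node, Leaf):
        node.records.extend(records)
        node.rewrites += 1  # one rewrite absorbs the whole batch
        return
    node.buffer.extend(records)
    if len(node.buffer) > BUFFER_CAPACITY:
        pending, node.buffer = node.buffer, []
        push(node.left, [r for r in pending if r[node.key] < node.threshold])
        push(node.right, [r for r in pending if r[node.key] >= node.threshold])


def find(node: Union[Parent, Leaf], point: Record) -> List[Record]:
    """Return records equal to `point`, checking buffers on the path and the leaf."""
    hits: List[Record] = []
    while isinstance(node, Parent):
        hits += [r for r in node.buffer if r == point]
        node = node.left if point[node.key] < node.threshold else node.right
    hits += [r for r in node.records if r == point]
    return hits


tree = Parent("a", 50, Leaf(), Leaf())
for value in (10, 60, 20):
    push(tree, [{"a": value}])   # the third insert overflows the buffer
push(tree, [{"a": 70}])          # stays in the root buffer for now
```

In this sketch the first three inserts rewrite each leaf once rather than once per record, and the fourth insert is visible to `find` from the root buffer without any leaf being rewritten.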
-
Publication No.: US12072863B1
Publication Date: 2024-08-27
Application No.: US18218400
Filing Date: 2023-07-05
Applicant: Databricks, Inc.
Inventor: Prakhar Jain, Frederick Ryan Johnson, Bart Samwel
IPC: G06F16/20, G06F16/22, G06F16/23, G06F16/245, G06F16/28
CPC classification number: G06F16/2246, G06F16/2358, G06F16/245, G06F16/285
Abstract: A data tree for managing data files of a data table and performing one or more transaction operations on the data table is described. The data tree is configured as a KD-epsilon tree and includes a plurality of nodes and edges. A node of the data tree may represent a splitting condition with respect to key-values for a respective key. A leaf node of the data tree may correspond to a data file for a data table that includes a subset of records having key-values that satisfy the condition for the node and conditions associated with parent nodes of the node. A parent node may correspond to a file that includes a buffer storing changes to the data files reachable from that parent node, as well as dedicated storage for pointers to its child nodes. By using the data tree, the data processing system may efficiently cluster the data in the data table while reducing the number of data files that are rewritten.
-