-
1.
Publication No.: US12229169B1
Publication date: 2025-02-18
Application No.: US18501830
Filing date: 2023-11-03
Applicant: Databricks, Inc.
Inventor: Terry Kim, Lin Ma, Rahul Shivu Mahadev, Rahul Potharaju
Abstract: The disclosed configurations provide a method (and/or a computer-readable medium or system) for determining, from a table schema describing keys of a data table, one or more clustering keys that can be used to cluster data files of a data table. The method includes generating features for the data table, generating tokens from the features, generating a prediction for each token by applying to the token a machine-learned transformer model trained to predict a likelihood that the key associated with the token is a clustering key for the data table, determining clustering keys based on the predictions, and clustering data records of the data table into data files based on key-values for the clustering keys.
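The last two steps of the abstract above (determining clustering keys from per-key predictions, then clustering records into data files) can be illustrated with a minimal Python sketch. It is not the patented method: the transformer model is replaced by a plain dictionary of made-up likelihood scores, and the function names, the 0.5 threshold, the two-key limit, and the fixed records-per-file size are all assumptions made for illustration.

```python
from typing import Dict, List

Record = Dict[str, object]


def select_clustering_keys(
    scores: Dict[str, float], threshold: float = 0.5, max_keys: int = 2
) -> List[str]:
    """Return keys whose predicted likelihood meets the threshold, best first.

    `scores` stands in for the per-key model predictions; the threshold and
    key limit are hypothetical, not taken from the patent.
    """
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [key for key, score in ranked if score >= threshold][:max_keys]


def cluster_records(
    records: List[Record], keys: List[str], records_per_file: int
) -> List[List[Record]]:
    """Sort records by their clustering key-values and split them into files."""
    if records_per_file <= 0:
        raise ValueError("records_per_file must be positive")
    ordered = sorted(records, key=lambda r: tuple(r[k] for k in keys))
    return [
        ordered[i : i + records_per_file]
        for i in range(0, len(ordered), records_per_file)
    ]


# Hypothetical predictions for three keys of an example table.
scores = {"event_date": 0.91, "user_id": 0.64, "payload": 0.05}
keys = select_clustering_keys(scores)

records = [
    {"event_date": "2024-01-02", "user_id": 7, "payload": "c"},
    {"event_date": "2024-01-01", "user_id": 9, "payload": "a"},
    {"event_date": "2024-01-01", "user_id": 3, "payload": "b"},
    {"event_date": "2024-01-02", "user_id": 1, "payload": "d"},
]
files = cluster_records(records, keys, records_per_file=2)
```

In this sketch, records sharing the same `event_date` end up in the same file, which is the property that lets a query engine skip files when filtering on a clustering key.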
-
Publication No.: US20240378181A1
Publication date: 2024-11-14
Application No.: US18144647
Filing date: 2023-05-08
Applicant: Databricks, Inc.
Inventor: Vijayan Prabhakaran, Himanshu Raja, Rahul Potharaju, Naga Raju Bhanoori, Lin Ma, Rajesh Parangi Sharabhalingappa, Jintian Liang, Zach Schuermann, Kam Cheung Ting
Abstract: Disclosed is a configuration for managing the organization of data tables in cloud-based storage. The configuration receives metrics for data processing operations on the data table. Metrics include at least one of a size of the data table, a size of each file in the data table, and metadata describing the data table. The configuration automatically executes a cost-benefit analysis based on the one or more metrics for each candidate maintenance operation in a plurality of candidate maintenance operations. The configuration automatically selects a maintenance operation from the candidate maintenance operations to automate based on the cost-benefit analysis of the one or more candidate maintenance operations. The selected maintenance operation is automated and scheduled on the data table.
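The selection step described in the abstract above can be sketched as follows. This is an illustrative Python outline only, under stated assumptions: the `Candidate` type, the operation names, and the cost and benefit formulas (one benefit unit per small file, one cost unit per 100 MB rewritten, a 32 MB small-file cutoff) are invented for the example and do not come from the patent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """A candidate maintenance operation with hypothetical cost/benefit scores."""

    name: str
    estimated_benefit: float
    estimated_cost: float


def compaction_candidate(
    file_sizes_mb: List[float], small_file_mb: float = 32.0
) -> Candidate:
    """Score a small-file compaction from file-size metrics (made-up formula)."""
    small = [size for size in file_sizes_mb if size < small_file_mb]
    benefit = float(len(small))  # assumption: one unit per small file merged
    cost = sum(small) / 100.0    # assumption: one unit per 100 MB rewritten
    return Candidate("compact_small_files", benefit, cost)


def select_operation(candidates: List[Candidate]) -> Optional[Candidate]:
    """Pick the candidate with the largest net benefit, or None if none pays off."""
    worthwhile = [c for c in candidates if c.estimated_benefit > c.estimated_cost]
    if not worthwhile:
        return None
    return max(worthwhile, key=lambda c: c.estimated_benefit - c.estimated_cost)


# Hypothetical metrics: three small files and one large file.
candidates = [
    compaction_candidate([4.0, 8.0, 16.0, 256.0]),
    Candidate("recluster_table", estimated_benefit=1.0, estimated_cost=5.0),
]
chosen = select_operation(candidates)
```

Returning `None` when no candidate has a positive net benefit reflects one possible design choice: skipping maintenance entirely when every operation would cost more than it saves.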
-
Publication No.: US12204510B2
Publication date: 2025-01-21
Application No.: US18144647
Filing date: 2023-05-08
Applicant: Databricks, Inc.
Inventor: Vijayan Prabhakaran, Himanshu Raja, Rahul Potharaju, Naga Raju Bhanoori, Lin Ma, Rajesh Parangi Sharabhalingappa, Jintian Liang, Zachary Vaughn Schuermann, Kam Cheung Ting
Abstract: Disclosed is a configuration for managing the organization of data tables in cloud-based storage. The configuration receives metrics for data processing operations on the data table. Metrics include at least one of a size of the data table, a size of each file in the data table, and metadata describing the data table. The configuration automatically executes a cost-benefit analysis based on the one or more metrics for each candidate maintenance operation in a plurality of candidate maintenance operations. The configuration automatically selects a maintenance operation from the candidate maintenance operations to automate based on the cost-benefit analysis of the one or more candidate maintenance operations. The selected maintenance operation is automated and scheduled on the data table.