-
Publication No.: US09990230B1
Publication date: 2018-06-05
Application No.: US15052204
Filing date: 2016-02-24
Applicant: Databricks Inc.
Inventor: Ion Stoica , Yandong Mao , Eric Liang
CPC classification number: G06F9/4887
Abstract: A system for scheduling a notebook execution includes an interface and a processor. The interface is to receive an indication to schedule a notebook for execution, wherein the indication comprises a scheduled time and a cluster. The processor is to determine whether it is the scheduled time; and in the event that it is the scheduled time: determine whether the cluster is running; and in the event that the cluster is not running, set up the cluster and cause the notebook to execute using the cluster.
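The control flow in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: `Cluster`, `ScheduledRun`, and `tick` are hypothetical names, "it is the scheduled time" is assumed to mean the current time has reached the scheduled time, and an already-running cluster is assumed to execute the notebook directly.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Cluster:
    """Hypothetical stand-in for a compute cluster."""
    name: str
    running: bool = False
    executed: list = field(default_factory=list)

    def set_up(self) -> None:
        # Assumption: setting up a cluster simply marks it as running.
        self.running = True

    def execute(self, notebook: str) -> None:
        self.executed.append(notebook)


@dataclass
class ScheduledRun:
    """The received indication: a notebook, a scheduled time, and a cluster."""
    notebook: str
    scheduled_time: datetime
    cluster: Cluster


def tick(run: ScheduledRun, now: datetime) -> bool:
    """Execute the notebook if its scheduled time has arrived; return True if it ran."""
    if now < run.scheduled_time:
        return False
    if not run.cluster.running:
        run.cluster.set_up()
    run.cluster.execute(run.notebook)
    return True
```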
-
Publication No.: US09836302B1
Publication date: 2017-12-05
Application No.: US15010845
Filing date: 2016-01-29
Applicant: Databricks Inc.
Inventor: Timothee Hunter , Ali Ghodsi , Ion Stoica
CPC classification number: G06F8/71 , G06F8/54 , G06F9/445 , G06F9/45512 , G06F9/5066 , G06F17/30867
Abstract: A system for processing a notebook includes an input interface and a processor. The input interface is to receive a first notebook. The notebook comprises code for interactively querying and viewing data. The processor is to load the first notebook into a shell. The shell receives one or more parameters associated with the first notebook. The shell executes the first notebook using a cluster.
-
Publication No.: US20250021536A1
Publication date: 2025-01-16
Application No.: US18885322
Filing date: 2024-09-13
Applicant: Databricks, Inc.
Inventor: Aaron Daniel Davidson , Clemens Mewald , Tomas Nykodym
IPC: G06F16/21 , G06F16/955 , G06N5/022
Abstract: A system includes an interface, a processor, and a memory. The interface is configured to receive a version of a model from a model registry. The processor is configured to store the version of the model, start a process running the version of the model, and update a proxy with version information associated with the version of the model, wherein the updated proxy indicates to redirect an indication to invoke the version of the model to the process. The memory is coupled to the processor and configured to provide the processor with instructions.
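The store, start, and update-proxy sequence in this abstract might look roughly like the sketch below. Everything here is hypothetical: the registry and store are plain dictionaries, a "process" is modeled as an in-memory callable rather than an operating-system process, and `Proxy` and `deploy` are illustrative names.

```python
class Proxy:
    """Hypothetical proxy mapping (model, version) to the process serving it."""

    def __init__(self):
        self.routes = {}

    def update(self, model, version, process):
        # Record version information so later invocations are redirected.
        self.routes[(model, version)] = process

    def invoke(self, model, version, payload):
        return self.routes[(model, version)](payload)


def deploy(registry, store, proxy, model, version):
    """Receive a model version, store it, start serving it, and update the proxy."""
    artifact = registry[(model, version)]   # receive the version from the registry
    store[(model, version)] = artifact      # store the version

    def process(payload):
        # Assumption: a closure stands in for a started serving process.
        return artifact(payload)

    proxy.update(model, version, process)
```

For example, after deploying two versions of a model, `proxy.invoke("m", 1, x)` and `proxy.invoke("m", 2, x)` are routed to different processes.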
-
Publication No.: US20250013619A1
Publication date: 2025-01-09
Application No.: US18218766
Filing date: 2023-07-06
Applicant: Databricks, Inc.
Inventor: Prakhar Jain , Frederick Ryan Johnson , Bart Samwel
IPC: G06F16/22 , G06F16/2453 , G06F16/28
Abstract: A data tree for managing data files of a data table and performing one or more transaction operations to the data table is described. The data tree is configured as a KD-epsilon tree and includes a plurality of nodes and edges. A node of the data tree may represent a splitting condition with respect to key-values for a respective key. A leaf node of the data tree may correspond to a data file for a data table that includes a subset of records having key-values that satisfy the condition for the node and conditions associated with parent nodes of the node. A parent node may correspond to a file including a buffer that stores changes to data files reachable by this parent node, and also includes dedicated storage to pointers of the child nodes. By using the data tree, the data processing system may efficiently cluster the data in the data table while reducing the number of data files that are rewritten.
-
Publication No.: US12153558B1
Publication date: 2024-11-26
Application No.: US18162093
Filing date: 2023-01-31
Applicant: Databricks, Inc.
Inventor: Alexander Behm , Ankur Dave
IPC: G06F16/00 , G06F16/13 , G06F16/22 , G06F16/242 , G06F16/2455 , G06F16/28
Abstract: A system includes a plurality of computing units. A first computing unit of the plurality of computing units comprises: a communication interface configured to receive an indication to roll up data in a data table; and a processor coupled to the communication interface and configured to: build a preaggregated hash table based at least in part on a set of columns and the data table by aggregating input rows of the data table; for each preaggregated hash table entry of the preaggregated hash table: provide the preaggregated hash table entry to a second computing unit of the plurality of computing units based at least in part on a distribution hash value; receive a set of received entries from computing units of the plurality of computing units; and build an aggregation hash table based at least in part on the set of received entries by aggregating the set of received entries.
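The two-phase pattern in this abstract (local preaggregation, hash-based redistribution, final aggregation) can be sketched in a single process as below. This is a simplified illustration under stated assumptions: the aggregate is a plain SUM over one value column rather than a full roll-up over grouping sets, computing units are simulated as list positions, and the function names are hypothetical.

```python
from collections import defaultdict


def preaggregate(rows, key_cols, value_col):
    """Phase 1: each unit aggregates its own input rows into a hash table."""
    table = defaultdict(int)
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        table[key] += row[value_col]
    return dict(table)


def route(entries, num_units):
    """Assign each preaggregated entry to a unit using a distribution hash value."""
    outboxes = [[] for _ in range(num_units)]
    for key, partial in entries.items():
        outboxes[hash(key) % num_units].append((key, partial))
    return outboxes


def aggregate(received):
    """Phase 2: a unit combines the partial results it received."""
    table = defaultdict(int)
    for key, partial in received:
        table[key] += partial
    return dict(table)


def rollup(partitions, key_cols, value_col):
    """Simulate all units: one input partition and one output table per unit."""
    n = len(partitions)
    inboxes = [[] for _ in range(n)]
    for rows in partitions:
        outboxes = route(preaggregate(rows, key_cols, value_col), n)
        for unit, batch in enumerate(outboxes):
            inboxes[unit].extend(batch)
    return [aggregate(inbox) for inbox in inboxes]
```

Because every entry for a given key is routed by the same hash, each key ends up in exactly one unit's final table, and preaggregation reduces how many entries cross the network.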
-
Publication No.: US12105690B1
Publication date: 2024-10-01
Application No.: US17875176
Filing date: 2022-07-27
Applicant: Databricks Inc.
Inventor: Timothy Armstrong , Arvind Sai Krishnan , Khayyam Guliyev
IPC: G06F16/00 , G06F16/22 , G06F16/2455
CPC classification number: G06F16/2246 , G06F16/24552
Abstract: A system for multipass sort includes a communication interface and a processor. The communication interface is configured to receive from a client device a request to sort a dataset that includes a plurality of rows. The processor is configured to perform a first sort pass on the dataset in part by: extracting prefixes associated with a first schema element associated with the dataset for the plurality of rows; and sorting the extracted prefixes utilizing an integer sort algorithm based on a sort order included in the request to sort the dataset, where sorting the extracted prefixes includes utilizing NULL values to resolve a tied range that includes at least two rows of the plurality of rows having a same extracted prefix.
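The prefix-based first pass can be illustrated with the sketch below. It is only an approximation of the idea: the prefix width, the choice to place NULLs first, and the ascending-only order are assumptions, Python's built-in sort stands in for an integer sort algorithm, and the abstract's exact NULL-based tie handling is not reproduced. All names are hypothetical.

```python
from itertools import groupby

PREFIX_BYTES = 4  # assumed fixed prefix width


def prefix(value):
    """Map a string (or None) to an integer that preserves byte order of its prefix."""
    if value is None:
        return 0  # assumption: NULLs sort before every non-NULL value
    head = value.encode("utf-8")[:PREFIX_BYTES].ljust(PREFIX_BYTES, b"\x00")
    return 1 + int.from_bytes(head, "big")


def multipass_sort(rows, column):
    """Sort rows ascending by `column` using an integer pass, then a tie-break pass."""
    # Pass 1: sort (prefix, row index) pairs; only integers are compared.
    keyed = sorted((prefix(r[column]), i) for i, r in enumerate(rows))
    result = []
    for _, group in groupby(keyed, key=lambda pair: pair[0]):
        tied = [rows[i] for _, i in group]
        if len(tied) > 1:
            # Pass 2: only tied ranges (rows sharing a prefix) need full comparison.
            tied.sort(key=lambda r: (r[column] or "").encode("utf-8"))
        result.extend(tied)
    return result
```

The design point is that most rows are ordered by cheap integer comparisons, and the more expensive full-value comparison runs only inside tied ranges.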
-
Publication No.: US20240265010A1
Publication date: 2024-08-08
Application No.: US18221735
Filing date: 2023-07-13
Applicant: Databricks, Inc.
Inventor: Saksham Garg , Bogdan Ionut Ghit , Christopher Stevens , Christian Stuart
IPC: G06F16/2453 , G06F16/25 , G06F16/28
CPC classification number: G06F16/24539 , G06F16/24542 , G06F16/256 , G06F16/285
Abstract: A multi-cluster computing system which includes a query result caching system is presented. The multi-cluster computing system may include a data processing service and client devices communicatively coupled over a network. The data processing service may include a control layer and a data layer. The control layer may be configured to receive and process requests from the client devices and manage resources in the data layer. The data layer may be configured to include instances of clusters of computing resources for executing jobs. The data layer may include a data storage system, which further includes a remote query result cache store. The query result cache store may include a cloud storage query result cache which stores data associated with results of previously executed requests. As such, when a cluster encounters a previously executed request, the cluster may efficiently retrieve the cached result of the request from the in-memory query result cache or the cloud storage query result cache.
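A two-tier lookup of the kind described here might be sketched as follows. The sketch is hypothetical throughout: a shared dictionary stands in for the cloud storage cache, the cache key is a hash of the exact query text, and cache invalidation when underlying data changes is not modeled.

```python
import hashlib


class QueryResultCache:
    """Per-cluster in-memory cache backed by a shared remote cache."""

    def __init__(self, remote_store):
        self.local = {}             # in-memory cache private to this cluster
        self.remote = remote_store  # assumption: a dict standing in for cloud storage

    @staticmethod
    def key(query):
        # Assumption: results are keyed by the exact query text.
        return hashlib.sha256(query.encode("utf-8")).hexdigest()

    def run(self, query, execute):
        """Return (result, source), where source is 'memory', 'cloud', or 'executed'."""
        k = self.key(query)
        if k in self.local:
            return self.local[k], "memory"
        if k in self.remote:
            self.local[k] = self.remote[k]  # promote to the faster tier
            return self.local[k], "cloud"
        result = execute(query)
        self.local[k] = result
        self.remote[k] = result
        return result, "executed"
```

With two clusters sharing one remote store, a query executed on the first cluster is served from the cloud tier on the second cluster, then from memory on repeat.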
-
Publication No.: US20240256539A1
Publication date: 2024-08-01
Application No.: US18160850
Filing date: 2023-01-27
Applicant: Databricks, Inc.
Inventor: Shoumik Palkar , Alexander Behm , Mostafa Mokhtar , Sriram Krishnamurthy
IPC: G06F16/2453 , G06F16/22
CPC classification number: G06F16/24539 , G06F16/221
Abstract: Disclosed herein is a method for determining whether to apply a lazy materialization technique to a query run. The method includes receiving a request to perform a new query in a columnar database containing a plurality of columns. A step in the method includes accessing a set of data in a column of the plurality of columns based on the query. The method includes generating an input to a machine-learned model comprising characteristics of the set of data in the column. From the machine-learned model, the method includes generating a likelihood value indicative of whether a filter of a first portion of the set of data in the column has greater efficiency than a download followed by a filter of the set of data in the column. The method further includes comparing the likelihood value to a threshold value. Based on the comparison, the method includes filtering the first portion of the set of data before downloading the set of data if the likelihood value is equal to or above the threshold value.
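The threshold comparison in this abstract can be sketched with a toy model. The logistic form, the feature list, the weights, and the fallback plan name are all assumptions made for illustration; the abstract does not specify the model or what happens below the threshold.

```python
import math


def likelihood(features, weights, bias):
    """Toy logistic model: likelihood that filtering first is more efficient."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def choose_plan(features, weights, bias, threshold=0.5):
    """Filter before downloading when the likelihood meets the threshold."""
    p = likelihood(features, weights, bias)
    # Assumption: below the threshold, fall back to download-then-filter.
    return "filter_first" if p >= threshold else "download_then_filter"


# Hypothetical features: (estimated filter selectivity, column size in MB).
WEIGHTS = [-6.0, 0.01]
BIAS = 1.0
selective_large = choose_plan([0.05, 500.0], WEIGHTS, BIAS)
unselective_small = choose_plan([0.9, 1.0], WEIGHTS, BIAS)
```

Note that the comparison is inclusive: a likelihood exactly equal to the threshold selects the filter-first plan, matching "equal to or above" in the abstract.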
-
Publication No.: US12045220B2
Publication date: 2024-07-23
Application No.: US17895890
Filing date: 2022-08-25
Applicant: Databricks, Inc.
Inventor: Bart Samwel , Tathagata Das , Lars Kroll , Yijia Cui , Juliusz Sompolski , Chirstos Stavrakakis
CPC classification number: G06F16/2282 , G06F9/4881
Abstract: A method, system, and computer system for performing an operation with respect to a target table are disclosed. The method includes performing first and second jobs, persisting, in one or more deletion vector files, one or more deletion vectors for corresponding rows of the one or more target table files, and obtaining a resulting table based at least in part on the second job resulting file(s). Performing the first job includes determining a set of matching target table files and storing target table information indicating, for each of the set of matching target table files, a particular set of rows having matching rows. Performing the second job includes performing a matching action based on matched rows and one or more deletion vectors associated with previously removed rows of the matching target table files and obtaining the second job resulting file(s).
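A rough sketch of the two-job flow with deletion vectors follows. It assumes the operation is an update, models files as in-memory lists of rows and deletion vectors as sets of row positions, and uses hypothetical function names; persistence to actual files is not shown.

```python
def find_matches(files, predicate):
    """Job 1: for each file containing matches, record the matching row positions."""
    matches = {}
    for name, rows in files.items():
        hit = {i for i, row in enumerate(rows) if predicate(row)}
        if hit:
            matches[name] = hit
    return matches


def apply_update(files, deletion_vectors, matches, update, new_file):
    """Job 2: write updated rows to a new file and mark the old rows as deleted."""
    rewritten = []
    for name, positions in matches.items():
        already_deleted = deletion_vectors.get(name, set())
        # Skip rows that an earlier operation already removed.
        live = sorted(positions - already_deleted)
        rewritten.extend(update(files[name][i]) for i in live)
        deletion_vectors[name] = already_deleted | set(live)
    if rewritten:
        files[new_file] = rewritten
    return files, deletion_vectors


def scan(files, deletion_vectors):
    """Read the resulting table, skipping rows marked in deletion vectors."""
    return [
        row
        for name, rows in files.items()
        for i, row in enumerate(rows)
        if i not in deletion_vectors.get(name, set())
    ]
```

The original data files are never rewritten; only the deletion vectors and one new file change, which is the main cost saving this approach aims for.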
-
Publication No.: US20240152338A1
Publication date: 2024-05-09
Application No.: US18501839
Filing date: 2023-11-03
Applicant: Databricks, Inc.
Inventor: Desmond Cheong Zhi Xi , Menelaos Karavelas
IPC: G06F8/41
CPC classification number: G06F8/452
Abstract: A data processing service generates code for iteratively applying a geospatial function to geospatial data. The generated code includes at least a first iterative loop and a second iterative loop. The data processing service compiles the generated code to generate compiled code that vectorizes at least the second iterative loop. The data processing service receives a request from a client device to perform one or more data processing operations including applying the geospatial function to a data table of geospatial cell indices. The data processing service compiles the request into one or more tasks including at least a vectorized operation based on the compiled code and executes the one or more tasks by at least invoking the vectorized operation on the set of worker nodes.