-
Publication No.: US20250103753A1
Publication date: 2025-03-27
Application No.: US18474708
Filing date: 2023-09-26
Applicant: Databricks, Inc.
Inventor: William Chau, Abhijit Chakankar, Stephen Michael Mahoney, Daniel Seth Morris, Itai Shlomo Weiss
IPC: G06F21/62
Abstract: A data processing service facilitates the creation and processing of data processing pipelines that process data processing jobs defined with respect to a set of tasks in a sequence and with data dependencies associated with each separate task such that the output from one task is used as input for a subsequent task. In various embodiments, the set of tasks includes at least one cleanroom task that is executed in a cleanroom station and at least one non-cleanroom task executed in an execution environment of a user, where each task is configured to read one or more input datasets and transform the one or more input datasets into one or more output datasets.
-
Publication No.: US20250086177A1
Publication date: 2025-03-13
Application No.: US18745847
Filing date: 2024-06-17
Applicant: Databricks, Inc.
Inventor: Michael Paul Armbrust, Tathagata Das, Shi Xin, Matei Zaharia
IPC: G06F16/2453, G06F16/2455
Abstract: A system for executing a streaming query includes an interface and a processor. The interface is configured to receive a logical query plan. The processor is configured to determine a physical query plan based at least in part on the logical query plan. The physical query plan comprises an ordered set of operators. Each operator of the ordered set of operators comprises an operator input mode and an operator output mode. The processor is further configured to execute the physical query plan using the operator input mode and the operator output mode for each operator of the query.
-
Publication No.: US12204523B2
Publication date: 2025-01-21
Application No.: US18135078
Filing date: 2023-04-14
Applicant: Databricks, Inc.
Inventor: Zhaoxing Li, Rayman Preet Singh, Fuat Can Efeoglu, Daniel Tenedorio, Sarah Cai
IPC: G06F16/00, G06F16/23, G06F16/2455
Abstract: A system for retrieving and caching metadata from a remote data source is described. The system may receive a request from a client device. The request is to perform a query operation on a set of data objects stored in the remote data source. The system may access a metadata cache storing metadata information on one or more data objects of the remote data source and identify metadata corresponding to the set of data objects for the query operation in the metadata cache. The system may determine whether the identified metadata for the set of data objects meets an update condition. In response to the identified metadata meeting the update condition, the system may fetch updated metadata for at least the set of data objects from the remote data source, and store the updated metadata in the metadata cache.
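The cache-then-refresh flow in this abstract can be illustrated with a minimal sketch. It is not the patented implementation: the class name MetadataCache, the fetch_remote callable, and the use of entry age as the "update condition" are all assumptions made for illustration, since the abstract does not say what the update condition is.

```python
import time


class MetadataCache:
    """Illustrative metadata cache; names and the age-based condition are assumptions."""

    def __init__(self, fetch_remote, max_age_seconds=60.0, clock=time.monotonic):
        # fetch_remote: hypothetical callable mapping a list of object names
        # to a dict {name: metadata} read from the remote data source.
        self._fetch_remote = fetch_remote
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = {}  # name -> (metadata, time the entry was fetched)

    def _needs_update(self, name):
        # Assumed update condition: the entry is missing or older than max_age.
        entry = self._entries.get(name)
        if entry is None:
            return True
        _, fetched_at = entry
        return self._clock() - fetched_at > self._max_age

    def get(self, names):
        """Return metadata for the requested objects, refreshing stale entries first."""
        names = list(names)
        stale = [n for n in names if self._needs_update(n)]
        if stale:
            now = self._clock()
            for name, metadata in self._fetch_remote(stale).items():
                self._entries[name] = (metadata, now)
        # Assumes fetch_remote returned an entry for every requested name.
        return {n: self._entries[n][0] for n in names}
```

A query planner would call get() with the objects a query touches; only entries that meet the assumed condition trigger a round trip to the remote source.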
-
Publication No.: US20250013644A1
Publication date: 2025-01-09
Application No.: US18769269
Filing date: 2024-07-10
Applicant: Databricks, Inc.
Inventor: Bart Samwel, Tathagata Das, Lars Kroll, Yijia Cui, Juliusz Sompolski, Tom Van Bussel, Prakhar Jain
IPC: G06F16/2453, G06F11/34, G06F16/22, G06F16/28
Abstract: A method, system, and computer system for performing an operation with respect to a target table are disclosed. The method includes performing first and second jobs, obtaining one or more other resulting files based at least in part on unmatched rows, and obtaining a set of processed files based at least in part on performing a post-processing operation with respect to the set of resulting files. The set of processed files has fewer files than the set of resulting files. Performing the first job includes determining a set of matching target table files and storing target table information indicating, for each of the set of matching target table files, a particular set of rows having matching rows. Performing the second job includes performing a matching action based on matched rows and obtaining the second job resulting file(s).
-
Publication No.: US12061586B2
Publication date: 2024-08-13
Application No.: US17738609
Filing date: 2022-05-06
Applicant: Databricks, Inc.
Inventor: Bart Samwel, Prakhar Jain
CPC classification number: G06F16/2246, G06F16/285
Abstract: A system for clustering data into corresponding files comprises one or more processors and a memory. The one or more processors is/are configured to: 1) determine to cluster a set of data into a set of files; 2) determine a set of split points in a corresponding set of dimensions of the set of data to determine the set of files, wherein each file of the set of files has an approximate target size; and 3) store one or more items of the set of data into a corresponding file of the set of files based at least in part on the set of split points. The memory is coupled to the one or more processors and configured to provide the processor with instructions.
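As a rough illustration of split points, the sketch below handles a single dimension only and measures file size in rows. Both simplifications, and the function names choose_split_points and assign_file, are assumptions; the abstract covers a set of dimensions and does not specify how the split points are chosen.

```python
from bisect import bisect_right


def choose_split_points(values, target_rows_per_file):
    """Pick split points so each file holds roughly target_rows_per_file rows."""
    if target_rows_per_file < 1:
        raise ValueError("target_rows_per_file must be at least 1")
    ordered = sorted(values)
    # Every target_rows_per_file-th value starts a new file.
    return [
        ordered[i]
        for i in range(target_rows_per_file, len(ordered), target_rows_per_file)
    ]


def assign_file(value, split_points):
    """Return the index of the file whose range contains value."""
    # Values equal to a split point go to the file that starts at that point.
    return bisect_right(split_points, value)


values = [5, 3, 9, 0, 7, 1, 8, 2, 6, 4]
splits = choose_split_points(values, target_rows_per_file=4)

files = {}
for v in values:
    files.setdefault(assign_file(v, splits), []).append(v)
```

With heavily duplicated values the files can drift away from the target size, which is consistent with the abstract describing the target size as approximate.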
-
Publication No.: US20240256543A1
Publication date: 2024-08-01
Application No.: US18160861
Filing date: 2023-01-27
Applicant: Databricks, Inc.
Inventor: Shoumik Palkar, Alexander Behm, Mostafa Mokhtar, Sriram Krishnamurthy
IPC: G06F16/2453, G06F11/34, G06F16/22
CPC classification number: G06F16/24545, G06F11/3409, G06F16/221
Abstract: Disclosed herein is a method for determining whether to apply a lazy materialization technique to a query run. A data processing service receives a request to perform a query identifying a filter column and a non-filter column in a columnar database. The data processing service accesses a first task of contiguous rows in the filter column from a cloud-based object storage. The data processing service applies a filter defined by the query to the first task. The data processing service generates filter results for the first task that may include a percentage of the first task discarded and a run-time. The data processing service determines, based on the filter results for the first task, a likelihood value that indicates a likelihood of gaining a performance benefit by applying the lazy materialization technique to a second task of the query.
-
Publication No.: US20240256531A1
Publication date: 2024-08-01
Application No.: US18161475
Filing date: 2023-01-30
Applicant: Databricks, Inc.
IPC: G06F16/242, G06F16/22, G06F21/60
CPC classification number: G06F16/2448, G06F16/2255, G06F16/2291, G06F21/602
Abstract: A system executes user defined functions (UDFs) invoked by database queries. The UDF includes UDF code specified using a programming language distinct from a database query language. A hash value from the UDF code provided by a client application for creating the UDF is compared with a hash value generated from UDF code invoked by database queries to determine whether the two UDF codes match. If the two hash values fail to match, the system takes an action, for example, storing an indication of UDF code mismatch or disabling subsequent executions of the database queries invoking the UDF. The system may use encoded UDF code that is decoded by the system at runtime using a key obtained from a separate system such as the client application. The client application can disable execution of database queries executing the UDF code by refusing to provide the key.
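The hash comparison described in this abstract might look like the following sketch. SHA-256, the UdfRegistry class, and its method names are assumptions for illustration; the abstract names neither a hash function nor an API, and the key-based decoding step is not covered here.

```python
import hashlib


def code_hash(udf_code: str) -> str:
    """Hash UDF source text (SHA-256 is an assumed choice)."""
    return hashlib.sha256(udf_code.encode("utf-8")).hexdigest()


class UdfRegistry:
    """Illustrative registry that disables a UDF after a code mismatch."""

    def __init__(self):
        self._registered = {}   # UDF name -> hash recorded at creation time
        self._disabled = set()  # UDF names with a detected mismatch

    def register(self, name: str, udf_code: str) -> None:
        # Record the hash of the code supplied by the client application.
        self._registered[name] = code_hash(udf_code)

    def check_before_execution(self, name: str, invoked_code: str) -> bool:
        """Return True only if the invoked code matches the registered code."""
        if name in self._disabled:
            return False
        if code_hash(invoked_code) != self._registered[name]:
            # One of the actions the abstract mentions: disable later executions.
            self._disabled.add(name)
            return False
        return True
```

Here a single mismatch disables the UDF for all later calls; recording the mismatch without disabling is the other action the abstract mentions.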
-
Publication No.: US20240256426A1
Publication date: 2024-08-01
Application No.: US18296876
Filing date: 2023-04-06
Applicant: Databricks, Inc.
Inventor: Gengliang Wang, Wenchen Fan, Serge Rielau, Entong Shen
IPC: G06F11/36, G06F16/25, G06F16/901
CPC classification number: G06F11/3612, G06F16/258, G06F16/9024
Abstract: A system executes database queries specified using a declarative database query language such as the structured query language (SQL). The system determines whether a runtime error is encountered during execution of a query, for example, a division by zero error, resource usage errors such as out of memory error, time out error, and so on. The system reports such runtime errors encountered during execution of a database query. The system identifies one or more origins of the runtime error in the database query. The origin identifies a portion of the database query that represents a cause of the runtime error. Reporting the origin of a runtime error in the database query simplifies the task of development and testing of database queries.
-
Publication No.: US12033041B2
Publication date: 2024-07-09
Application No.: US17896281
Filing date: 2022-08-26
Applicant: Databricks, Inc.
Inventor: Benjamin Thomas Wilson, Corey Zumar
IPC: G06N20/00, G06F18/20, G06F18/2132
CPC classification number: G06N20/00, G06F18/21322, G06F18/285, G06F18/21326
Abstract: The present application discloses a method, system, and computer system for building a model associated with a dataset. The method includes receiving a dataset, the dataset comprising a plurality of keys and a plurality of key-value relationships, determining a plurality of models to build based at least in part on the dataset, wherein determining the plurality of models to build comprises using the dataset format information to identify the plurality of models, building the plurality of models, and optimizing at least one of the plurality of models.
-
Publication No.: US20240152496A1
Publication date: 2024-05-09
Application No.: US18512028
Filing date: 2023-11-17
Applicant: Databricks, Inc.
Inventor: Aaron Daniel Davidson, Clemens Mewald, Tomas Nykodym
IPC: G06F16/21, G06F16/955
CPC classification number: G06F16/219, G06F16/955, G06N5/022
Abstract: A system includes an interface, a processor, and a memory. The interface is configured to receive a version of a model from a model registry. The processor is configured to store the version of the model, start a process running the version of the model, and update a proxy with version information associated with the version of the model, wherein the updated proxy indicates to redirect an indication to invoke the version of the model to the process. The memory is coupled to the processor and configured to provide the processor with instructions.