BUILD-SIDE SKEW HANDLING FOR HASH-PARTITIONING HASH JOINS

    Publication number: US20240134851A1

    Publication date: 2024-04-25

    Application number: US18047872

    Filing date: 2022-10-18

    Applicant: Snowflake Inc.

    CPC classification number: G06F16/24537 G06F16/2255

    Abstract: Provided herein are systems and methods for handling build-side skew. For example, a method includes computing a plurality of hash values for a join operation. The join operation uses a corresponding plurality of row sets. The plurality of hash values are sampled to detect a frequent hash value. A build-side row set is partitioned using the frequent hash value to generate a partitioned build-side row set. The build-side row set is selected from the plurality of row sets. The partitioned build-side row set is distributed to a plurality of hash-join-build (HJB) instances executing at a corresponding plurality of servers.
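
    Illustrative note: the sampling and partitioning steps in this abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed method. The names `detect_frequent_hashes` and `partition_build_side`, the sample size, the frequency threshold, and the round-robin spreading of rows that carry a frequent hash are all hypothetical; the abstract says only that the build-side row set is partitioned using the frequent hash value.

```python
import random
from collections import Counter


def detect_frequent_hashes(hashes, sample_size=1000, threshold=0.1, seed=0):
    """Sample hash values and return those whose share of the sample
    is at least `threshold` (sample size and threshold are assumptions)."""
    rng = random.Random(seed)
    if len(hashes) <= sample_size:
        sample = list(hashes)
    else:
        sample = rng.sample(hashes, sample_size)
    counts = Counter(sample)
    return {h for h, c in counts.items() if c / len(sample) >= threshold}


def partition_build_side(rows, key, num_instances, frequent):
    """Assign build-side rows to hash-join-build instances.

    Rows with a non-frequent hash go to the instance chosen by
    hash modulo num_instances. Rows with a frequent hash are spread
    round-robin across all instances (an assumed policy) so that no
    single instance receives the whole skewed key.
    """
    partitions = [[] for _ in range(num_instances)]
    next_slot = 0
    for row in rows:
        h = hash(row[key])
        if h in frequent:
            partitions[next_slot].append(row)
            next_slot = (next_slot + 1) % num_instances
        else:
            partitions[h % num_instances].append(row)
    return partitions


# Example: 90 of 100 build-side rows share the join key 7.
rows = [{"k": 7} for _ in range(90)] + [{"k": i} for i in range(10)]
hashes = [hash(r["k"]) for r in rows]
frequent = detect_frequent_hashes(hashes, threshold=0.5)
partitions = partition_build_side(rows, "k", 4, frequent)
```

    Spreading a frequent key over several instances means the matching probe-side rows would have to reach every instance that holds that key; how the patent handles the probe side is not stated in the abstract.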

    GENERATING DATA DICTIONARY METADATA
    Type: Invention publication

    Publication number: US20240111885A1

    Publication date: 2024-04-04

    Application number: US18306704

    Filing date: 2023-04-25

    Applicant: Snowflake Inc.

    CPC classification number: G06F21/6218 G06F21/604 G06F2221/2141

    Abstract: A data dictionary generation system utilizes a background service that is programmed to automatically populate and update a data dictionary for listings offering shared data. A data dictionary includes metadata describing the shared data overall as well as the individual objects included in the listing, such as the individual tables, schemas, views, and functions. To generate the data dictionary, the data dictionary generation system analyzes the shared data to identify objects, identifies a set of data fields associated with each identified object, and populates the set of data fields associated with each identified object based on the shared data offered by the listing. To ensure that a data dictionary for each listing remains up to date, the data dictionary generation system periodically scans the listings to identify any changes to share access granted to the listings.

    Handling system-characteristics drift in machine learning applications

    Publication number: US11934927B2

    Publication date: 2024-03-19

    Application number: US18087518

    Filing date: 2022-12-22

    Applicant: SNOWFLAKE INC.

    CPC classification number: G06N20/00 G06F16/24

    Abstract: Systems and methods for managing input and output error of a machine learning (ML) model in a database system are presented herein. A set of test queries is executed on a first version of a database system to generate first test data, wherein the first version of the system comprises an ML model to generate an output corresponding to a function of the database system. An error model is trained based on the first test data and second test data generated based on a previous version of the system. The error model determines an error associated with the ML model between the first and previous versions of the system. The first version of the system is deployed with the error model, which corrects an output or an input of the ML model until sufficient data has been produced by the error model to retrain the ML model.

    EFFICIENT DEDUPLICATION OF RANDOMIZED FILE PATHS

    Publication number: US20240086381A1

    Publication date: 2024-03-14

    Application number: US18513163

    Filing date: 2023-11-17

    Applicant: Snowflake Inc.

    CPC classification number: G06F16/215 G06F16/24552 G06F16/24573 G06F16/248

    Abstract: Disclosed are techniques for deduplicating files to be ingested by a database. A Bloom filter may be built for each of a first set of files to be ingested into a data exchange to generate a set of Bloom filters, wherein each of the set of Bloom filters is built with a number of hash functions that is based on a desired false positive rate. The set of Bloom filters may be stored in the metadata storage of the data exchange. In response to receiving a set of candidate files to be ingested, the set of Bloom filters is used to identify candidate files that are duplicative of a file in the first set of files, and each candidate file so identified is pruned from the set of candidate files.
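
    Illustrative note: the relationship between the desired false positive rate and the number of hash functions can be sketched with the standard Bloom filter sizing formulas. This is a minimal sketch under stated assumptions, not the patented implementation. The class `BloomFilter`, the helper `prune_candidates`, the SHA-256 double-hashing scheme, and the choice to key each filter on file path strings are all hypothetical.

```python
import hashlib
import math


class BloomFilter:
    """Bloom filter sized from an expected item count and a target
    false positive rate, using the standard formulas
    m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2."""

    def __init__(self, expected_items, false_positive_rate):
        n = max(1, expected_items)
        self.num_bits = max(
            1, math.ceil(-n * math.log(false_positive_rate) / (math.log(2) ** 2))
        )
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit values
        # taken from one SHA-256 digest (an assumed scheme).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(
            self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item)
        )


def prune_candidates(candidates, filters):
    """Keep only candidates that no filter reports as already seen.
    A false positive can prune a file that was never ingested."""
    return [c for c in candidates if not any(c in f for f in filters)]


# Example: one filter covering three already-ingested paths.
bf = BloomFilter(expected_items=3, false_positive_rate=0.01)
for path in ["a/1.csv", "a/2.csv", "a/3.csv"]:
    bf.add(path)
remaining = prune_candidates(["a/1.csv", "b/9.csv"], [bf])
```

    A Bloom filter never reports an inserted item as absent, so a true duplicate is always pruned; the cost is that a small fraction of new files, bounded approximately by the chosen false positive rate, may be pruned as well.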

    Fetching query result data using result batches

    Publication number: US11921733B2

    Publication date: 2024-03-05

    Application number: US17813662

    Filing date: 2022-07-20

    Applicant: Snowflake Inc.

    Abstract: Techniques for fetching query result data using result batches include generating a plurality of result batches based on query result information. The query result information is associated with query result data generated from execution of a query. Each result batch of the plurality of result batches includes a result data retrieval function for a corresponding data portion of a plurality of data portions of the query result data. The plurality of result batches are encoded for distribution to a corresponding plurality of computing nodes. The techniques further include causing retrieval of the plurality of data portions of the query result data by the corresponding plurality of computing nodes based on the result data retrieval function for each of the plurality of data portions.
