DATA FILE CLUSTERING WITH KD-EPSILON TREES

    Publication number: US20250013619A1

    Publication date: 2025-01-09

    Application number: US18218766

    Filing date: 2023-07-06

    Abstract: A data tree for managing data files of a data table and performing one or more transaction operations on the data table is described. The data tree is configured as a KD-epsilon tree and includes a plurality of nodes and edges. A node of the data tree may represent a splitting condition with respect to key-values for a respective key. A leaf node of the data tree may correspond to a data file for a data table that includes a subset of records having key-values that satisfy the condition for the node and the conditions associated with parent nodes of the node. A parent node may correspond to a file that includes a buffer storing changes to the data files reachable from that parent node, and that also includes dedicated storage for pointers to the child nodes. By using the data tree, the data processing system may efficiently cluster the data in the data table while reducing the number of data files that are rewritten.
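
    As an illustration only, the buffering idea in this abstract can be sketched as follows. This is a minimal sketch, not the patented method: the `Node` class, the `insert` and `flush` functions, the `capacity` threshold, and the `(key, payload)` record shape are all hypothetical names chosen for the example. Leaves stand in for data files, and a leaf counts as "rewritten" only when a batch of buffered changes reaches it.

```python
from typing import List, Optional, Tuple

Record = Tuple[int, str]  # (key-value, payload); hypothetical record shape


class Node:
    """Internal nodes hold a split key, a change buffer, and child pointers.
    Leaves (split is None) stand in for data files."""

    def __init__(self, split: Optional[int] = None,
                 left: "Optional[Node]" = None,
                 right: "Optional[Node]" = None):
        self.split = split
        self.left = left
        self.right = right
        self.buffer: List[Record] = []   # pending changes (internal nodes)
        self.records: List[Record] = []  # file contents (leaves)
        self.rewrites = 0                # how often this leaf was rewritten

    def is_leaf(self) -> bool:
        return self.split is None


def flush(node: Node, capacity: int) -> None:
    """Push buffered changes one level down, batching them per child."""
    pending, node.buffer = node.buffer, []
    batches = (
        (node.left, [r for r in pending if r[0] < node.split]),
        (node.right, [r for r in pending if r[0] >= node.split]),
    )
    for child, batch in batches:
        if not batch:
            continue
        if child.is_leaf():
            child.records.extend(batch)  # one rewrite absorbs the whole batch
            child.rewrites += 1
        else:
            child.buffer.extend(batch)
            if len(child.buffer) >= capacity:
                flush(child, capacity)


def insert(root: Node, record: Record, capacity: int) -> None:
    """Buffer a change at the root (assumed internal); flush when full."""
    root.buffer.append(record)
    if len(root.buffer) >= capacity:
        flush(root, capacity)
```

    With a root that splits at key 10 and a capacity of 3, inserting keys 1, 12, and 3 triggers a single flush, so each leaf is rewritten once rather than once per inserted record.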

    Efficient merge of tabular data with deletion indications

    Publication number: US12045220B2

    Publication date: 2024-07-23

    Application number: US17895890

    Filing date: 2022-08-25

    CPC classification number: G06F16/2282 G06F9/4881

    Abstract: A method, system, and computer system for performing an operation with respect to a target table are disclosed. The method includes performing first and second jobs, persisting, in one or more deletion vector files, one or more deletion vectors for corresponding rows of the one or more target table files, and obtaining a resulting table based at least in part on the second job resulting file(s). Performing the first job includes determining a set of matching target table files and storing target table information indicating, for each of the set of matching target table files, a particular set of rows having matching rows. Performing the second job includes performing a matching action based on matched rows and one or more deletion vectors associated with previously removed rows of the matching target table files, and obtaining the second job resulting file(s).
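
    A rough sketch of the two-job flow described in this abstract, under simplifying assumptions: files are in-memory lists of `(key, value)` rows, the matching action is an update, and a deletion vector is a set of row positions per file. The names `find_matches`, `apply_merge`, and `live_rows` are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Set, Tuple

Row = Tuple[int, str]  # (key, value); hypothetical row shape


def find_matches(files: Dict[str, List[Row]],
                 source: Dict[int, str]) -> Dict[str, Set[int]]:
    """Job 1: per target file, record the positions of rows whose key
    also appears in the source."""
    matches: Dict[str, Set[int]] = {}
    for name, rows in files.items():
        hits = {i for i, (key, _) in enumerate(rows) if key in source}
        if hits:
            matches[name] = hits
    return matches


def apply_merge(files: Dict[str, List[Row]],
                source: Dict[int, str],
                matches: Dict[str, Set[int]],
                deletion_vectors: Dict[str, Set[int]]) -> List[Row]:
    """Job 2: emit updated rows into a new result file and mark the old
    positions as deleted, skipping rows that were already removed."""
    result: List[Row] = []
    for name, positions in matches.items():
        removed = deletion_vectors.setdefault(name, set())
        for i in sorted(positions - removed):
            key, _ = files[name][i]
            result.append((key, source[key]))
        removed |= positions  # old file stays untouched on disk
    return result


def live_rows(files: Dict[str, List[Row]],
              deletion_vectors: Dict[str, Set[int]],
              result: List[Row]) -> List[Row]:
    """Resulting table: rows not masked by a deletion vector, plus new rows."""
    kept = [row for name, rows in files.items()
            for i, row in enumerate(rows)
            if i not in deletion_vectors.get(name, set())]
    return kept + result
```

    The point of the sketch is that the original files are never rewritten: an update touches only the deletion vector and a new result file.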

    EFFICIENT MERGE OF TABULAR DATA WITH DELETION INDICATIONS

    Publication number: US20240070138A1

    Publication date: 2024-02-29

    Application number: US17895890

    Filing date: 2022-08-25

    CPC classification number: G06F16/2282 G06F9/4881

    Abstract: A method, system, and computer system for performing an operation with respect to a target table are disclosed. The method includes performing first and second jobs, persisting, in one or more deletion vector files, one or more deletion vectors for corresponding rows of the one or more target table files, and obtaining a resulting table based at least in part on the second job resulting file(s). Performing the first job includes determining a set of matching target table files and storing target table information indicating, for each of the set of matching target table files, a particular set of rows having matching rows. Performing the second job includes performing a matching action based on matched rows and one or more deletion vectors associated with previously removed rows of the matching target table files, and obtaining the second job resulting file(s).

    Efficient Merging of Tabular Data with Post-Processing Compaction

    Publication number: US20250013644A1

    Publication date: 2025-01-09

    Application number: US18769269

    Filing date: 2024-07-10

    Abstract: A method, system, and computer system for performing an operation with respect to a target table are disclosed. The method includes performing first and second jobs, obtaining one or more other resulting files based at least in part on unmatched rows, and obtaining a set of processed files based at least in part on performing a post-processing operation with respect to the set of resulting files. The set of processed files has fewer files than the set of resulting files. Performing the first job includes determining a set of matching target table files and storing target table information indicating, for each of the set of matching target table files, a particular set of rows having matching rows. Performing the second job includes performing a matching action based on matched rows and obtaining the second job resulting file(s).

    K-D tree balanced splitting
    Granted patent

    Publication number: US12061586B2

    Publication date: 2024-08-13

    Application number: US17738609

    Filing date: 2022-05-06

    CPC classification number: G06F16/2246 G06F16/285

    Abstract: A system for clustering data into corresponding files comprises one or more processors and a memory. The one or more processors are configured to: 1) determine to cluster a set of data into a set of files; 2) determine a set of split points in a corresponding set of dimensions of the set of data to determine the set of files, wherein each file of the set of files has an approximate target size; and 3) store one or more items of the set of data into a corresponding file of the set of files based at least in part on the set of split points. The memory is coupled to the one or more processors and configured to provide the one or more processors with instructions.
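
    One way to picture the split-point idea is a classic median-based k-d split, sketched below. This is an assumption-laden illustration rather than the claimed system: the function name `kd_split`, the tuple record shape, and the use of a record count as the "target size" are all hypothetical, and the abstract does not say that medians or round-robin dimension order are used.

```python
from typing import List, Sequence, Tuple

Record = Tuple[float, ...]  # one value per clustering dimension (assumed)


def kd_split(records: Sequence[Record], target_size: int,
             depth: int = 0) -> List[List[Record]]:
    """Split records at the median of one dimension per level, cycling
    through dimensions, until each group has at most target_size records."""
    if target_size < 1:
        raise ValueError("target_size must be at least 1")
    if len(records) <= target_size:
        return [list(records)]
    dim = depth % len(records[0])             # round-robin over dimensions
    ordered = sorted(records, key=lambda r: r[dim])
    mid = len(ordered) // 2                   # median position = split point
    return (kd_split(ordered[:mid], target_size, depth + 1)
            + kd_split(ordered[mid:], target_size, depth + 1))
```

    Splitting at the median keeps the two halves within one record of each other, which is what makes the resulting files land near a common size.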

    DATA FILE CLUSTERING WITH KD-CLASSIFIER TREES

    Publication number: US20250013606A1

    Publication date: 2025-01-09

    Application number: US18218410

    Filing date: 2023-07-05

    Abstract: A data processing service generates a data classifier tree for managing data files of a data table. The data classifier tree may be configured as a KD-classifier tree and includes a plurality of nodes and edges. A node of the data classifier tree may represent a splitting condition with respect to key-values for a respective key. A node of the data classifier tree may be associated with one or more data files assigned to the node. The data files assigned to the node each include a subset of records having key-values that satisfy the conditions represented by the node and parent nodes of the node. The data processing service may efficiently cluster the data in the data table while reducing the number of data files that are rewritten when data is modified or added to the data table.

    Concurrent optimistic transactions for tables with deletion vectors

    Publication number: US12147412B2

    Publication date: 2024-11-19

    Application number: US18156109

    Filing date: 2023-01-18

    Abstract: A disclosed configuration receives a first indication that a first transaction is committed to update a first subset of records in a data table at a first version to generate a second version of the data table, and receives a second indication to commit a second transaction to update a second subset of records in a data file of the data table at the first version. The configuration determines a logical prerequisite based on whether the first subset of records changes content of one or more records in the second subset of records, and determines a physical prerequisite based on whether the second subset of records corresponds to respective data records in data files of the second version of the data table. The configuration commits the second transaction to generate a third version of the data table by updating elements of the deletion vector if the prerequisites are satisfied.
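
    The two prerequisites can be sketched as simple set checks. The sketch below makes strong simplifying assumptions that are not stated in the abstract: a record is identified by a hypothetical `(file name, row position)` pair, the logical check is reduced to "the two transactions touch disjoint records", and the physical check is reduced to "every file the second transaction touches still exists in the current version". `TableVersion`, `can_commit`, and `commit` are illustrative names.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

RowId = Tuple[str, int]  # (file name, row position); hypothetical identifier


@dataclass
class TableVersion:
    version: int
    files: Set[str]
    deletion_vectors: Dict[str, Set[int]] = field(default_factory=dict)


def can_commit(first_rows: Set[RowId], second_rows: Set[RowId],
               current: TableVersion) -> bool:
    """Check both prerequisites for the later (second) transaction."""
    # Logical: the committed transaction did not change records we touch.
    logical_ok = first_rows.isdisjoint(second_rows)
    # Physical: our records still live in files of the current version.
    physical_ok = all(name in current.files for name, _ in second_rows)
    return logical_ok and physical_ok


def commit(second_rows: Set[RowId], current: TableVersion) -> TableVersion:
    """Produce the next version by marking the touched rows in the
    deletion vectors; no data file is rewritten."""
    dvs = {name: set(pos) for name, pos in current.deletion_vectors.items()}
    for name, pos in second_rows:
        dvs.setdefault(name, set()).add(pos)
    return TableVersion(current.version + 1, set(current.files), dvs)
```

    In this simplified model, two transactions that update different rows of the same file can both commit, because the second one only adds positions to that file's deletion vector.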

    Data maintenance transaction rollbacks

    Publication number: US12072843B1

    Publication date: 2024-08-27

    Application number: US17580475

    Filing date: 2022-01-20

    CPC classification number: G06F16/174

    Abstract: The present application discloses a method, system, and computer system for managing data in a storage system. The method includes receiving a first transaction that modifies or deletes first data stored in a storage system, determining that the first data is subject to an intervening re-arrangement transaction, and in response to determining that the first data is subject to the intervening re-arrangement transaction, rolling back the re-arrangement transaction at least with respect to the first data and committing the first transaction.

    CONCURRENT OPTIMISTIC TRANSACTIONS FOR TABLES WITH DELETION VECTORS

    Publication number: US20240241877A1

    Publication date: 2024-07-18

    Application number: US18156109

    Filing date: 2023-01-18

    CPC classification number: G06F16/2315 G06F16/2358 G06F16/2379

    Abstract: A disclosed configuration receives a first indication that a first transaction is committed to update a first subset of records in a data table at a first version to generate a second version of the data table, and receives a second indication to commit a second transaction to update a second subset of records in a data file of the data table at the first version. The configuration determines a logical prerequisite based on whether the first subset of records changes content of one or more records in the second subset of records, and determines a physical prerequisite based on whether the second subset of records corresponds to respective data records in data files of the second version of the data table. The configuration commits the second transaction to generate a third version of the data table by updating elements of the deletion vector if the prerequisites are satisfied.

    K-D Tree Balanced Splitting
    Patent application

    Publication number: US20250086155A1

    Publication date: 2025-03-13

    Application number: US18772758

    Filing date: 2024-07-15

    Abstract: A system for clustering data into corresponding files comprises one or more processors and a memory. The one or more processors are configured to: 1) determine to cluster a set of data into a set of files; 2) determine a set of split points in a corresponding set of dimensions of the set of data to determine the set of files, wherein each file of the set of files has an approximate target size; and 3) store one or more items of the set of data into a corresponding file of the set of files based at least in part on the set of split points. The memory is coupled to the one or more processors and configured to provide the one or more processors with instructions.
