Efficient Merging of Tabular Data with Post-Processing Compaction

    Publication No.: US20250013644A1

    Publication date: 2025-01-09

    Application No.: US18769269

    Filing date: 2024-07-10

    Abstract: A method, system, and computer system for performing an operation with respect to a target table are disclosed. The method includes performing first and second jobs, obtaining one or more other resulting files based at least in part on unmatched rows, and obtaining a set of processed files based at least in part on performing a post-processing operation with respect to the set of resulting files. The set of processed files has fewer files than the set of resulting files. Performing the first job includes determining a set of matching target table files and storing target table information indicating, for each of the set of matching target table files, a particular set of rows having matching rows. Performing the second job includes performing a matching action based on matched rows and obtaining the second job resulting file(s).
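The flow in this abstract can be pictured with a small in-memory sketch. It is only an illustration under stated assumptions, not the patented method: files are modeled as lists of row dictionaries, the matched action is assumed to be an update, and the names `find_matches`, `apply_merge`, `compact`, `rows_per_new_file`, and `max_rows` are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def find_matches(target_files: Dict[str, List[Row]], source: List[Row], key: str) -> Dict[str, List[int]]:
    """First job: record, per target file, the row positions whose key appears in the source."""
    source_keys = {row[key] for row in source}
    matches: Dict[str, List[int]] = {}
    for name, rows in target_files.items():
        hits = [i for i, row in enumerate(rows) if row[key] in source_keys]
        if hits:  # only files that contain at least one matching row are kept
            matches[name] = hits
    return matches


def apply_merge(target_files: Dict[str, List[Row]], source: List[Row], key: str,
                matches: Dict[str, List[int]], rows_per_new_file: int = 1) -> List[List[Row]]:
    """Second job plus the unmatched-row step (assumption: matched action = update)."""
    by_key = {row[key]: row for row in source}
    resulting: List[List[Row]] = []
    matched_keys = set()
    for name, hits in matches.items():
        rows = [dict(r) for r in target_files[name]]  # rewrite a copy of the matched file
        for i in hits:
            k = rows[i][key]
            rows[i] = dict(by_key[k])
            matched_keys.add(k)
        resulting.append(rows)
    # Unmatched source rows become additional (possibly small) resulting files.
    unmatched = [dict(r) for r in source if r[key] not in matched_keys]
    for start in range(0, len(unmatched), rows_per_new_file):
        resulting.append(unmatched[start:start + rows_per_new_file])
    return resulting


def compact(files: List[List[Row]], max_rows: int) -> List[List[Row]]:
    """Post-processing: greedily pack resulting files into fewer files.

    An input file that already exceeds max_rows is kept whole rather than split.
    """
    compacted: List[List[Row]] = []
    current: List[Row] = []
    for rows in files:
        if current and len(current) + len(rows) > max_rows:
            compacted.append(current)
            current = []
        current = current + rows
    if current:
        compacted.append(current)
    return compacted
```

Calling `find_matches`, then `apply_merge`, then `compact` in that order mirrors the first job, the second job with the unmatched-row files, and the post-processing step; target files with no matching rows are never rewritten in this sketch.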

    K-D tree balanced splitting
    Granted patent

    Publication No.: US12061586B2

    Publication date: 2024-08-13

    Application No.: US17738609

    Filing date: 2022-05-06

    CPC classification number: G06F16/2246 G06F16/285

    Abstract: A system for clustering data into corresponding files comprises one or more processors and a memory. The one or more processors is/are configured to: 1) determine to cluster a set of data into a set of files; 2) determine a set of split points in a corresponding set of dimensions of the set of data to determine the set of files, wherein each file of the set of files has an approximate target size; and 3) store one or more items of the set of data into a corresponding file of the set of files based at least in part on the set of split points. The memory is coupled to the one or more processors and configured to provide the processor with instructions.
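One way to picture split points that yield files of roughly equal size is a median split that cycles through the dimensions. The sketch below is an illustration only: it assumes row count stands in for file size, and the name `balanced_split` and its parameters are hypothetical rather than taken from the patent.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def balanced_split(points: Sequence[Point], target_size: int, depth: int = 0) -> List[List[Point]]:
    """Split points at the median of one dimension per level until every
    group holds at most target_size points; each group models one file."""
    if target_size < 1:
        raise ValueError("target_size must be at least 1")
    if len(points) <= target_size:
        return [list(points)]
    dim = depth % len(points[0])                 # cycle through the dimensions
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2                      # median position is the split point
    return (balanced_split(ordered[:mid], target_size, depth + 1)
            + balanced_split(ordered[mid:], target_size, depth + 1))
```

Because each split halves the group by position rather than by value range, sibling groups differ in size by at most one row, which is what keeps the resulting files close to a common size.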

    DATA FILE CLUSTERING WITH KD-EPSILON TREES

    Publication No.: US20250013619A1

    Publication date: 2025-01-09

    Application No.: US18218766

    Filing date: 2023-07-06

    Abstract: A data tree for managing data files of a data table and performing one or more transaction operations to the data table is described. The data tree is configured as a KD-epsilon tree and includes a plurality of nodes and edges. A node of the data tree may represent a splitting condition with respect to key-values for a respective key. A leaf node of the data tree may correspond to a data file for a data table that includes a subset of records having key-values that satisfy the condition for the node and conditions associated with parent nodes of the node. A parent node may correspond to a file including a buffer that stores changes to data files reachable by this parent node, and also includes dedicated storage to pointers of the child nodes. By using the data tree, the data processing system may efficiently cluster the data in the data table while reducing the number of data files that are rewritten.
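The buffering idea in this abstract can be sketched as follows. This is a simplified illustration under assumptions, not the structure claimed in the patent: every inner node has exactly two children, a buffer is flushed once it exceeds a fixed `capacity`, and the names `Node`, `push`, and `rewrites` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Record = Dict[str, float]


@dataclass
class Node:
    key: Optional[str] = None            # splitting key; None marks a leaf (data file)
    split: float = 0.0                   # records with value < split go left
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    buffer: List[Record] = field(default_factory=list)  # pending changes held at an inner node
    rows: List[Record] = field(default_factory=list)    # contents of a leaf data file
    rewrites: int = 0                    # how many times this leaf file was rewritten


def push(node: Node, batch: List[Record], capacity: int) -> None:
    """Add a batch of records at node; flush downward only when the buffer overflows."""
    if node.key is None:
        node.rows.extend(batch)          # one file rewrite absorbs the whole batch
        node.rewrites += 1
        return
    node.buffer.extend(batch)
    if len(node.buffer) > capacity:
        pending, node.buffer = node.buffer, []
        left = [r for r in pending if r[node.key] < node.split]
        right = [r for r in pending if r[node.key] >= node.split]
        for child, part in ((node.left, left), (node.right, right)):
            if part:
                push(child, part, capacity)
```

Inserting records one at a time with `push(root, [record], capacity)` leaves the leaf files untouched until a buffer overflows, so several changes reach a data file in a single rewrite.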

    DATA FILE CLUSTERING WITH KD-CLASSIFIER TREES

    Publication No.: US20250013606A1

    Publication date: 2025-01-09

    Application No.: US18218410

    Filing date: 2023-07-05

    Abstract: A data processing service generates a data classifier tree for managing data files of a data table. The data classifier tree may be configured as a KD-classifier tree and includes a plurality of nodes and edges. A node of the data classifier tree may represent a splitting condition with respect to key-values for a respective key. A node of the data classifier tree may be associated with one or more data files assigned to the node. The data files assigned to the node each include a subset of records having key-values that satisfy the conditions represented by the node and parent nodes of the node. The data processing service may efficiently cluster the data in the data table while reducing the number of data files that are rewritten when data is modified or added to the data table.

    Data maintenance transaction rollbacks

    Publication No.: US12072843B1

    Publication date: 2024-08-27

    Application No.: US17580475

    Filing date: 2022-01-20

    CPC classification number: G06F16/174

    Abstract: The present application discloses a method, system, and computer system for managing data in a storage system. The method includes receiving a first transaction that modifies or deletes first data stored in a storage system, determining that the first data is subject to an intervening re-arrangement transaction, and in response to determining that the first data is subject to the intervening re-arrangement transaction, rolling back the re-arrangement transaction at least with respect to the first data and committing the first transaction.
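A toy model may help show the ordering of events described here. It is a sketch under assumptions, not the patented mechanism: files are in-memory lists, the re-arrangement is a merge of several files into one, the whole re-arrangement is rolled back rather than only the part touching the first data, and the names `Table`, `Rearrangement`, `rearrange`, and `commit_delete` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Rearrangement:
    removed: Dict[str, List[int]]   # original files the re-arrangement replaced
    added: Dict[str, List[int]]     # files the re-arrangement wrote


@dataclass
class Table:
    files: Dict[str, List[int]] = field(default_factory=dict)
    pending: List[Rearrangement] = field(default_factory=list)

    def rearrange(self, names: List[str], new_name: str) -> None:
        """Re-arrangement transaction: merge the named files into one new file."""
        removed = {n: self.files.pop(n) for n in names}
        merged = [v for rows in removed.values() for v in rows]
        self.files[new_name] = merged
        self.pending.append(Rearrangement(removed, {new_name: merged}))

    def commit_delete(self, file_name: str, value: int) -> None:
        """First transaction: delete a value from a file it read before the re-arrangement."""
        for r in list(self.pending):
            if file_name in r.removed:          # the data was re-arranged in between
                for n in r.added:
                    self.files.pop(n, None)     # roll back: drop the re-arranged file
                self.files.update(r.removed)    # and restore the original files
                self.pending.remove(r)
        self.files[file_name] = [v for v in self.files[file_name] if v != value]
```

The design choice illustrated is that the user-facing delete wins the conflict: the maintenance work is undone and can be retried later, instead of the delete being rejected.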

    Data ingestion using data file clustering with KD-epsilon trees

    Publication No.: US12072863B1

    Publication date: 2024-08-27

    Application No.: US18218400

    Filing date: 2023-07-05

    CPC classification number: G06F16/2246 G06F16/2358 G06F16/245 G06F16/285

    Abstract: A data tree for managing data files of a data table and performing one or more transaction operations to the data table is described. The data tree is configured as a KD-epsilon tree and includes a plurality of nodes and edges. A node of the data tree may represent a splitting condition with respect to key-values for a respective key. A leaf node of the data tree may correspond to a data file for a data table that includes a subset of records having key-values that satisfy the condition for the node and conditions associated with parent nodes of the node. A parent node may correspond to a file including a buffer that stores changes to data files reachable by this parent node, and also includes dedicated storage to pointers of the child nodes. By using the data tree, the data processing system may efficiently cluster the data in the data table while reducing the number of data files that are rewritten.

    K-D Tree Balanced Splitting
    Patent application

    Publication No.: US20250086155A1

    Publication date: 2025-03-13

    Application No.: US18772758

    Filing date: 2024-07-15

    Abstract: A system for clustering data into corresponding files comprises one or more processors and a memory. The one or more processors is/are configured to: 1) determine to cluster a set of data into a set of files; 2) determine a set of split points in a corresponding set of dimensions of the set of data to determine the set of files, wherein each file of the set of files has an approximate target size; and 3) store one or more items of the set of data into a corresponding file of the set of files based at least in part on the set of split points. The memory is coupled to the one or more processors and configured to provide the processor with instructions.

    K-D TREE BALANCED SPLITTING
    Published application

    Publication No.: US20230359602A1

    Publication date: 2023-11-09

    Application No.: US17738609

    Filing date: 2022-05-06

    CPC classification number: G06F16/2246

    Abstract: A system for clustering data into corresponding files comprises one or more processors and a memory. The one or more processors is/are configured to: 1) determine to cluster a set of data into a set of files; 2) determine a set of split points in a corresponding set of dimensions of the set of data to determine the set of files, wherein each file of the set of files has an approximate target size; and 3) store one or more items of the set of data into a corresponding file of the set of files based at least in part on the set of split points. The memory is coupled to the one or more processors and configured to provide the processor with instructions.
