OPTIMIZING REFERENCES TO CHANGING DATA SETS IN DISTRIBUTED DATA LAKES

    Publication No.: US20240248905A1

    Publication date: 2024-07-25

    Application No.: US18159667

    Filing date: 2023-01-25

    Applicant: VMware, Inc.

    CPC classification number: G06F16/254

    Abstract: References to changing data sets in distributed data lakes are optimized. As part of a transaction, a first message is received. The first message identifies a table and first data to be written to the table. Based on at least the table, the first message is routed to a first ingestion node of a plurality of ingestion nodes. The first data is persisted in temporary storage. Location information of the persisted first data is determined. A data available message comprising a self-describing reference to the first data is published, by the first ingestion node, to a first reader node of a plurality of reader nodes. The self-describing reference identifies the first ingestion node, the location information of the first data, and a range of the first data.
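
    The flow in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the hash-based routing rule, the in-memory dictionary standing in for temporary storage, and all names (SelfDescribingRef, route, ingest) are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SelfDescribingRef:
    """Hypothetical reference carrying everything a reader needs to fetch the data."""
    ingestion_node: str  # node that persisted the data
    location: str        # where the data sits in temporary storage
    start: int           # first row of the range (inclusive)
    end: int             # last row of the range (exclusive)


def route(table: str, nodes: list) -> str:
    # Assumed routing rule: a stable hash of the table name picks the node,
    # so all writes to one table land on the same ingestion node.
    digest = hashlib.sha256(table.encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def ingest(table: str, rows: list, nodes: list, temp_store: dict) -> SelfDescribingRef:
    node = route(table, nodes)
    location = f"{node}/{table}/{len(temp_store)}"
    temp_store[location] = list(rows)  # persist in (simulated) temporary storage
    # The returned reference is what a "data available" message would carry.
    return SelfDescribingRef(node, location, 0, len(rows))
```

    A reader holding the reference can locate the rows without asking a coordinator which node wrote them.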

    TRADING OFF CACHE SPACE AND WRITE AMPLIFICATION FOR Bε-TREES

    Publication No.: US20200233801A1

    Publication date: 2020-07-23

    Application No.: US16252488

    Filing date: 2019-01-18

    Applicant: VMware, Inc.

    Abstract: Certain aspects provide systems and methods for performing an operation on a Bε-tree. A method comprises writing a message associated with the operation to a first slot in a first buffer of a first non-leaf node of the Bε-tree in an append-only manner, wherein a first filter associated with the first slot is used for query operations associated with the first slot. The method further comprises determining that the first buffer is full and, upon determining to flush the message to a non-leaf child node, flushing the message in an append-only manner to a second slot in a second buffer of the non-leaf child node, wherein a second filter associated with the second slot is used for query operations associated with the second slot. The method further comprises, upon determining to flush the message to a leaf node, flushing the message to the leaf node in a sorted manner.
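
    A rough sketch of the buffering scheme described in this abstract, under simplifying assumptions: each non-leaf node has a single child, a Python set stands in for the per-slot filter (a real system would more likely use something like a Bloom filter), and the class name BeNode and its capacity rule are hypothetical.

```python
import bisect


class BeNode:
    def __init__(self, capacity, child=None):
        self.capacity = capacity  # messages a non-leaf buffer holds before flushing
        self.child = child        # None means this node is a leaf
        self.slots = []           # non-leaf: (filter, messages) pairs in arrival order
        self.items = []           # leaf: (key, value) pairs kept sorted by key

    def insert(self, batch):
        if self.child is None:
            # Leaf: merge each message into the sorted item list.
            for key, value in batch:
                i = bisect.bisect_left(self.items, (key,))
                if i < len(self.items) and self.items[i][0] == key:
                    self.items[i] = (key, value)
                else:
                    self.items.insert(i, (key, value))
            return
        # Non-leaf: append-only write into a new slot with its own filter.
        self.slots.append(({k for k, _ in batch}, list(batch)))
        if sum(len(msgs) for _, msgs in self.slots) >= self.capacity:
            for _, msgs in self.slots:  # flush oldest slot first
                self.child.insert(msgs)
            self.slots = []

    def get(self, key):
        if self.child is None:
            i = bisect.bisect_left(self.items, (key,))
            if i < len(self.items) and self.items[i][0] == key:
                return self.items[i][1]
            return None
        # Check the newest slots first; the filter lets us skip unrelated slots.
        for filt, msgs in reversed(self.slots):
            if key in filt:
                for k, v in reversed(msgs):
                    if k == key:
                        return v
        return self.child.get(key)
```

    Appending whole slots avoids re-sorting the buffer on every write; the per-slot filter is what keeps queries from scanning every unsorted slot.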

    RANGE LOOKUP OPERATIONS FOR Bε-TREES USING UPDATE MESSAGES

    Publication No.: US20190294709A1

    Publication date: 2019-09-26

    Application No.: US15927019

    Filing date: 2018-03-20

    Applicant: VMware, Inc.

    Abstract: Exemplary methods, apparatuses, and systems include a file system process inserting a first key/value pair and a second key/value pair into a first tree. The second key is a duplicate of the first key and the value of the second key/value pair is an operation changing the value. In response to a request for a range of key/value pairs, the process reads the second key/value pair and inserts it in a second tree. The process reads the first pair and determines, while inserting the first pair in the second tree, that the second key is a duplicate of the first key. The file system process determines an updated value by applying the operation in the second value to the first value. The file system process updates the second key/value pair in the second tree with the updated value and returns the requested range of key/value pairs.
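
    The duplicate-key resolution can be illustrated with a small sketch. It assumes entries arrive newest first, uses a dictionary as a stand-in for the second tree, represents an update operation as a Python callable, and drops operations that never meet a base value; none of these details come from the patent.

```python
def range_lookup(entries, lo, hi):
    """entries: (key, value) pairs, newest first; a callable value is an update operation."""
    result = {}  # stands in for the second tree
    for key, value in entries:
        if not (lo <= key <= hi):
            continue
        if key not in result:
            result[key] = value
        elif callable(result[key]):
            pending = result[key]
            if callable(value):
                # Two stacked operations: apply the older one first, then the newer.
                result[key] = lambda v, newer=pending, older=value: newer(older(v))
            else:
                # Duplicate key found: apply the pending operation to the older value.
                result[key] = pending(value)
        # Otherwise the key already holds a resolved value and older entries are shadowed.
    # Assumption: an operation with no base value in range is omitted from the result.
    return sorted((k, v) for k, v in result.items() if not callable(v))
```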

    MERGE UPDATES FOR KEY VALUE STORES

    Document type: Invention application

    Publication No.: US20190080107A1

    Publication date: 2019-03-14

    Application No.: US15703706

    Filing date: 2017-09-13

    Applicant: VMware, Inc.

    Abstract: Embodiments of the present disclosure relate to techniques for performing a merge update for a database. In particular, certain embodiments of a method include generating a message comprising a first key and a first transaction associated with the first key, the first transaction indicating a transaction to perform other than for key-value pairs comprising the first key. The method further includes storing the message in a database. The method further includes merging the message with a first key-value pair stored in the database, the first key-value pair comprising the first key. The method further includes performing the first transaction based on merging the message with the first key-value pair.

    DATA LAKE WITH TRANSACTIONAL SEMANTICS

    Document type: Invention publication

    Publication No.: US20230385265A1

    Publication date: 2023-11-30

    Application No.: US17827795

    Filing date: 2022-05-30

    Applicant: VMware, Inc.

    CPC classification number: G06F16/2365 G06F16/2282 G06F11/1435

    Abstract: A version control interface provides for accessing a data lake with transactional semantics. Examples generate a plurality of tables for data objects stored in the data lake. The tables each comprise a set of name fields and map a space of columns or rows to a set of the data objects. Transactions read and write data objects and may span a plurality of tables with properties of atomicity, consistency, isolation, durability (ACID). Performing the transaction comprises: accumulating transaction-incomplete messages, indicating that the transaction is incomplete, until a transaction-complete message is received, indicating that the transaction is complete. Upon this occurring, a master branch is updated to reference the data objects according to the transaction-incomplete messages and the transaction-complete message. Tables may be grouped into data groups that provide atomicity boundaries so that different groups may be served by different master branches, thereby improving the speed of master branch updates.
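
    The accumulate-then-publish behavior can be sketched as below. This is an illustrative model only: the Coordinator class, the message fields, and the representation of the master branch as a dictionary from table name to data object IDs are all assumptions.

```python
class Coordinator:
    def __init__(self):
        self.master = {}   # table -> data object IDs visible to readers
        self.pending = {}  # transaction ID -> accumulated (table, object_id) writes

    def on_message(self, txn_id, table, object_id, complete):
        """Record one write; return True once the transaction has been published."""
        self.pending.setdefault(txn_id, []).append((table, object_id))
        if not complete:
            return False  # transaction-incomplete: keep accumulating
        # Transaction-complete: build the next master state off to the side.
        new_master = {t: list(objs) for t, objs in self.master.items()}
        for t, obj in self.pending.pop(txn_id):
            new_master.setdefault(t, []).append(obj)
        # A single reference swap publishes every table's changes together.
        self.master = new_master
        return True
```

    Because readers only ever see the old or the new master, a transaction spanning several tables becomes visible all at once or not at all.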

    OPTIMIZING STORAGE FILE SIZE IN DISTRIBUTED DATA LAKES

    Publication No.: US20240248879A1

    Publication date: 2024-07-25

    Application No.: US18159677

    Filing date: 2023-01-25

    Applicant: VMware, Inc.

    CPC classification number: G06F16/172 G06F16/122 G06F16/1724

    Abstract: Storage file size in distributed data lakes is optimized. At a first ingestion node of a plurality of ingestion nodes, a merge advisory is received from a coordinator. The merge advisory indicates a transaction identifier (ID). Received data associated with the transaction ID is persisted, which includes: determining whether the received data, persisted together in a single file, will exceed a maximum desired file size; based on determining that the maximum desired file size will not be exceeded, persisting the received data in a single file; and based on determining that the maximum desired file size will be exceeded, persisting the received data in a plurality of files, none of which exceeds the maximum desired file size. A location of the persisted received data in the permanent storage is identified, by the first ingestion node, to the coordinator.
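
    The single-file versus multiple-file decision can be sketched with a greedy packer. The function name plan_files, the use of per-record sizes, and the greedy strategy are assumptions for illustration; the abstract does not say how the split is chosen.

```python
def plan_files(record_sizes, max_file_size):
    """Group record sizes into files so that no file exceeds max_file_size."""
    files, current, current_size = [], [], 0
    for size in record_sizes:
        if size > max_file_size:
            raise ValueError("a single record exceeds the maximum file size")
        # Start a new file when adding this record would overflow the current one.
        if current and current_size + size > max_file_size:
            files.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        files.append(current)
    return files
```

    When everything fits, the result is a single file; otherwise the data is spread across several files that each respect the limit.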

    TRANSACTION-AWARE TABLE PLACEMENT

    Document type: Invention publication

    Publication No.: US20240126744A1

    Publication date: 2024-04-18

    Application No.: US17967286

    Filing date: 2022-10-17

    Applicant: VMware, Inc.

    CPC classification number: G06F16/2379 G06F16/2282

    Abstract: Intelligent, transaction-aware table placement minimizes cross-host transactions while supporting full transactional semantics and delivering high throughput at low resource utilization. This placement reduces delays caused by cross-host transaction coordination. Examples determine a count of historical interactions between tables, based on at least a transaction history for a plurality of cross-table transactions. Each table provides an abstraction for data, such as by identifying data objects stored in a data lake. For tables on different hosts having a high count of historical interactions, the potential cost savings achievable by moving operational control of a first table to the same host as the second table is compared with the potential cost savings achievable by moving operational control of the second table to the same host as the first table. Based on comparing the relative cost savings, one of the tables may be selected. Operational control of the selected table is moved without moving any of the data objects.
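
    A small sketch of the counting and comparison steps. The cost model here is an assumption chosen for illustration: the saving of a move is the number of historical interactions that become same-host minus the number that become cross-host. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def interaction_counts(history):
    """history: one collection of table names per past transaction."""
    counts = Counter()
    for tables in history:
        for a, b in combinations(sorted(set(tables)), 2):
            counts[(a, b)] += 1
    return counts


def pick_table_to_move(a, b, host_of, counts):
    """Choose which of two tables on different hosts should change hosts."""
    def saving(mover, target_host):
        gained = lost = 0
        for (x, y), n in counts.items():
            if mover not in (x, y):
                continue
            other = y if x == mover else x
            if host_of[other] == target_host:
                gained += n  # these interactions become same-host
            elif host_of[other] == host_of[mover]:
                lost += n    # these interactions become cross-host
        return gained - lost

    save_a = saving(a, host_of[b])
    save_b = saving(b, host_of[a])
    return a if save_a >= save_b else b
```

    Only the host_of mapping would change after a move; the data objects themselves stay where they are in the data lake.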

    VERSION CONTROL INTERFACE FOR ACCESSING DATA LAKES

    Publication No.: US20230205757A1

    Publication date: 2023-06-29

    Application No.: US17564206

    Filing date: 2021-12-28

    Applicant: VMware, Inc.

    CPC classification number: G06F16/2379 G06F16/2246

    Abstract: A version control interface for data provides a layer of abstraction that permits multiple readers and writers to access data lakes concurrently. An overlay file system, based on a data structure such as a tree, is used on top of one or more underlying storage instances to implement the interface. Each tree node is identified and accessed by means of a universally unique identifier. Copy-on-write with the tree data structure implements snapshots of the overlay file system. The snapshots support a long-lived master branch, with point-in-time snapshots of its history, and one or more short-lived private branches. As data objects are written to the data lake, the private branch corresponding to a writer is updated. The private branches are merged back into the master branch using any merging logic, and conflict resolution policies are implemented. Readers read from the updated master branch or from any of the private branches.
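
    Copy-on-write snapshots over a tree can be sketched in a few lines. This is a simplified model with assumed names (Node, write): directories are dictionaries, leaves are data object references stored as strings, and every new node receives a fresh UUID.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    children: dict  # name -> Node, or a data object reference (str)
    node_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def write(root, path, obj_ref):
    """Return a new root with obj_ref stored at path; the old root is left untouched."""
    head, rest = path[0], path[1:]
    children = dict(root.children)  # copy only the nodes along the written path
    if rest:
        child = root.children.get(head, Node({}))
        children[head] = write(child, rest, obj_ref)
    else:
        children[head] = obj_ref
    return Node(children)
```

    Each write yields a new root that shares all untouched subtrees with the previous one, so keeping an old root around is enough to retain a point-in-time snapshot, and a private branch is just a root that has not yet been merged.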
