DATA LAKE WITH TRANSACTIONAL SEMANTICS
    Patent application publication

    Publication number: US20230385265A1

    Publication date: 2023-11-30

    Application number: US17827795

    Filing date: 2022-05-30

    Applicant: VMware, Inc.

    CPC classification number: G06F16/2365 G06F16/2282 G06F11/1435

    Abstract: A version control interface provides for accessing a data lake with transactional semantics. Examples generate a plurality of tables for data objects stored in the data lake. The tables each comprise a set of name fields and map a space of columns or rows to a set of the data objects. Transactions read and write data objects and may span a plurality of tables with the properties of atomicity, consistency, isolation, and durability (ACID). Performing a transaction comprises accumulating transaction-incomplete messages, each indicating that the transaction is incomplete, until a transaction-complete message is received, indicating that the transaction is complete. The master branch is then updated to reference the data objects according to the transaction-incomplete messages and the transaction-complete message. Tables may be grouped into data groups that provide atomicity boundaries so that different groups may be served by different master branches, thereby improving the speed of master branch updates.
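
The accumulate-then-commit flow in this abstract can be illustrated with a minimal sketch. The names `MasterBranch`, `TransactionBuffer`, `on_incomplete`, and `on_complete` are hypothetical and do not come from the patent; the message format and the in-memory dictionary are assumptions made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MasterBranch:
    """Maps (table, key) to a data-object reference (assumed representation)."""
    refs: dict = field(default_factory=dict)


class TransactionBuffer:
    """Holds writes from transaction-incomplete messages until commit."""

    def __init__(self, master):
        self.master = master
        self.pending = {}  # txn_id -> list of (table, key, object_ref)

    def on_incomplete(self, txn_id, table, key, object_ref):
        # Accumulate only; readers of the master branch see nothing yet.
        self.pending.setdefault(txn_id, []).append((table, key, object_ref))

    def on_complete(self, txn_id):
        # Build the new reference map off to the side, then swap it in with
        # one assignment so writes spanning several tables appear together.
        writes = self.pending.pop(txn_id, [])
        new_refs = dict(self.master.refs)
        for table, key, object_ref in writes:
            new_refs[(table, key)] = object_ref
        self.master.refs = new_refs
        return len(writes)
```

Keeping one `MasterBranch` per data group, rather than one for the whole lake, would correspond to the atomicity boundaries the abstract mentions; the sketch does not model that.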

    FAST ALGORITHM TO FIND FILE SYSTEM DIFFERENCE FOR DEDUPLICATION

    Publication number: US20210064580A1

    Publication date: 2021-03-04

    Application number: US16552965

    Filing date: 2019-08-27

    Applicant: VMware, Inc.

    Abstract: The disclosure provides techniques for deduplicating files. The techniques include, upon creating or modifying a file, placing a logical timestamp of the current logical time within a queue associated with the directory of the file. The techniques further include placing the logical timestamp within a queue of each parent directory of the directory of the file. To determine a set of files for deduplication, the techniques disclosed herein identify files that have been modified within a logical time range. The set of files modified within the logical time range is identified by traversing directories of a storage system, the directories being organized within a tree structure. If a directory's queue does not contain a timestamp that is within the logical time range, then all child directories can be skipped over for further processing, such that no files within the child directories end up being within the set of files for deduplication.
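
The pruning idea in this abstract can be sketched as follows. The `Directory` class, the `record_change` method, and the `changed_files` function are hypothetical names; the unbounded list used as a "queue" and the per-file timestamp map are simplifying assumptions, not details from the patent.

```python
class Directory:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.files = {}       # file name -> logical time of last change
        self.timestamps = []  # logical times of changes anywhere below here
        if parent is not None:
            parent.children.append(self)

    def record_change(self, file_name, logical_time):
        # Stamp this directory and every ancestor up to the root.
        self.files[file_name] = logical_time
        node = self
        while node is not None:
            node.timestamps.append(logical_time)
            node = node.parent


def changed_files(directory, start, end):
    # No timestamp in range means nothing below changed: skip the subtree.
    if not any(start <= t <= end for t in directory.timestamps):
        return []
    found = [
        f"{directory.name}/{name}"
        for name, t in sorted(directory.files.items())
        if start <= t <= end
    ]
    for child in directory.children:
        found.extend(changed_files(child, start, end))
    return found
```

With a root holding directories `a` (changed at logical time 5) and `b` (changed at logical time 12), a query for the range 10 to 20 visits `b` but never descends into `a`.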

    OPTIMIZING REFERENCES TO CHANGING DATA SETS IN DISTRIBUTED DATA LAKES

    Publication number: US20240248905A1

    Publication date: 2024-07-25

    Application number: US18159667

    Filing date: 2023-01-25

    Applicant: VMware, Inc.

    CPC classification number: G06F16/254

    Abstract: References to changing data sets in distributed data lakes are optimized. As part of a transaction, a first message is received. The first message identifies a table and first data to be written to the table. Based on at least the table, the first message is routed to a first ingestion node of a plurality of ingestion nodes. The first data is persisted in temporary storage. Location information of the persisted first data is determined. A data available message comprising a self-describing reference to the first data is published, by the first ingestion node, to a first reader node of a plurality of reader nodes. The self-describing reference identifies the first ingestion node, the location information of the first data, and a range of the first data.
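
A minimal sketch of the route, persist, and publish steps described in this abstract is given below. The names `DataRef`, `IngestionNode`, `route`, and `read` are hypothetical, as are the CRC-based routing rule and the `tmp/<table>/<n>` location format; the abstract only says that routing is based on at least the table.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRef:
    """Self-describing reference: which node, where, and which range."""
    node_id: int
    location: str
    start: int
    end: int


class IngestionNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.temp_storage = {}  # location -> list of rows

    def ingest(self, table, rows):
        # Persist to temporary storage and describe where the data landed.
        location = f"tmp/{table}/{len(self.temp_storage)}"
        self.temp_storage[location] = list(rows)
        return DataRef(self.node_id, location, 0, len(self.temp_storage[location]))


def route(table, nodes):
    # Assumed routing rule: a stable hash of the table name picks the node.
    return nodes[zlib.crc32(table.encode()) % len(nodes)]


def read(ref, nodes):
    # A reader needs only the reference to find and slice the data.
    return nodes[ref.node_id].temp_storage[ref.location][ref.start:ref.end]
```

Because the reference carries the node identity, location, and range, a reader node in this sketch needs no separate catalog lookup to resolve it.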

    OPTIMIZING FILE ACCESS STATISTICS COLLECTION

    Publication number: US20220292061A1

    Publication date: 2022-09-15

    Application number: US17202342

    Filing date: 2021-03-15

    Applicant: VMware, Inc.

    Abstract: Optimizing file access includes a process for identifying a file access event for a first accessed file, and incrementing a first access counter in an access list in a memory, which also includes access counters for other accessed files. The process further includes exporting the first access counter to a performance monitoring dashboard, or exporting to a storage allocator and, based on the value, moving the first accessed file between a first storage and a second storage. The process also includes determining whether the value of the first access counter meets a first threshold, or a sum of values of the access counters for the other accessed files meets a second threshold. Based on meeting the first threshold or meeting the second threshold, the process includes persisting the access counters on a storage media. The access counters also provide security monitoring (e.g., identifying excessive file access).
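
The two-threshold persistence rule in this abstract can be sketched as below. `AccessCounters` and its fields are hypothetical names, the `persisted` list stands in for storage media, and the sketch assumes counters are not reset after persisting because the abstract does not say either way.

```python
class AccessCounters:
    def __init__(self, per_file_threshold, others_threshold):
        self.per_file_threshold = per_file_threshold
        self.others_threshold = others_threshold
        self.counts = {}     # in-memory access list: path -> counter
        self.persisted = []  # stand-in for snapshots written to storage

    def record_access(self, path):
        """Count one access; return True if the counters were persisted."""
        self.counts[path] = self.counts.get(path, 0) + 1
        others = sum(c for p, c in self.counts.items() if p != path)
        if (self.counts[path] >= self.per_file_threshold
                or others >= self.others_threshold):
            self.persisted.append(dict(self.counts))
            return True
        return False
```

Batching writes this way keeps most access events in memory, and only threshold crossings touch storage.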

    TRADING OFF CACHE SPACE AND WRITE AMPLIFICATION FOR B(epsilon)-TREES

    Publication number: US20200233801A1

    Publication date: 2020-07-23

    Application number: US16252488

    Filing date: 2019-01-18

    Applicant: VMware, Inc.

    Abstract: Certain aspects provide systems and methods for performing an operation on a Bε-tree. A method comprises writing a message associated with the operation to a first slot in a first buffer of a first non-leaf node of the Bε-tree in an append-only manner, wherein a first filter associated with the first slot is used for query operations associated with the first slot. The method further comprises determining that the first buffer is full and, upon determining to flush the message to a non-leaf child node, flushing the message in an append-only manner to a second slot in a second buffer of the non-leaf child node, wherein a second filter associated with the second slot is used for query operations associated with the second slot. The method further comprises, upon determining to flush the message to a leaf node, flushing the message to the leaf node in a sorted manner.
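
The append-only buffers with per-slot filters can be illustrated with a simplified sketch. It assumes a single chain of nodes instead of a branching tree, uses a Python `set` as a stand-in for each slot's filter, and treats every message as an upsert; `Leaf`, `NonLeaf`, and their methods are hypothetical names.

```python
import bisect


class Leaf:
    def __init__(self):
        self.keys, self.values = [], []

    def apply(self, key, value):
        # Leaves keep entries sorted by key.
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None


class NonLeaf:
    def __init__(self, capacity, child):
        self.capacity = capacity
        self.child = child
        self.slots = []  # each slot: (messages in arrival order, key filter)

    def receive(self, messages):
        # Append-only: a batch becomes a new slot with its own filter.
        self.slots.append((list(messages), {k for k, _ in messages}))
        if sum(len(m) for m, _ in self.slots) >= self.capacity:
            self.flush()

    def put(self, key, value):
        self.receive([(key, value)])

    def flush(self):
        for messages, _ in self.slots:  # oldest slot first
            if isinstance(self.child, NonLeaf):
                self.child.receive(messages)  # appended, not merged
            else:
                for key, value in messages:
                    self.child.apply(key, value)  # sorted insert at the leaf
        self.slots = []

    def get(self, key):
        # Newest slot first; the filter lets a query skip irrelevant slots.
        for messages, key_filter in reversed(self.slots):
            if key in key_filter:
                for k, v in reversed(messages):
                    if k == key:
                        return v
        return self.child.get(key)
```

Appending whole slots on a flush avoids rewriting the child's buffer in sorted order, which is the write-amplification saving; the cost is that a query may have to consult several slot filters.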

    RANGE LOOKUP OPERATIONS FOR Bε-TREES USING UPDATE MESSAGES

    Publication number: US20190294709A1

    Publication date: 2019-09-26

    Application number: US15927019

    Filing date: 2018-03-20

    Applicant: VMware, Inc.

    Abstract: Exemplary methods, apparatuses, and systems include a file system process inserting a first key/value pair and a second key/value pair into a first tree. The second key is a duplicate of the first key, and the value of the second key/value pair is an operation changing the value. In response to a request for a range of key/value pairs, the process reads the second key/value pair and inserts it in a second tree. The process reads the first pair and determines, while inserting the first pair in the second tree, that the second key is a duplicate of the first key. The file system process determines an updated value of the first value by applying the operation in the second value to the first value. The file system process updates the second key/value pair in the second tree with the updated value and returns the requested range of key/value pairs.
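
The duplicate-key resolution described in this abstract can be sketched as below. The sketch assumes the first tree is presented as a list of entries read newest first, that each payload is tagged `"value"` or `"op"`, and that an operation is a Python callable; `range_lookup` and this encoding are hypothetical, and a dictionary stands in for the second tree.

```python
def range_lookup(first_tree, low, high):
    """Resolve update messages for keys in [low, high].

    first_tree: iterable of (key, (kind, payload)), newest entry first,
    where kind is "value" (payload is the value) or "op" (payload is a
    function that maps the old value to the new one).
    """
    second_tree = {}
    for key, (kind, payload) in first_tree:
        if not (low <= key <= high):
            continue
        if key not in second_tree:
            second_tree[key] = (kind, payload)
            continue
        # Duplicate key detected while inserting into the second tree.
        seen_kind, seen = second_tree[key]
        if seen_kind == "op" and kind == "value":
            second_tree[key] = ("value", seen(payload))
        elif seen_kind == "op" and kind == "op":
            # Two pending operations: apply the older one first.
            second_tree[key] = ("op", lambda v, new=seen, old=payload: new(old(v)))
        # If a value is already stored, older entries are shadowed.
    return sorted(
        (key, payload)
        for key, (kind, payload) in second_tree.items()
        if kind == "value"
    )
```

For example, with an older entry `k2 = 5` and a newer operation on `k2` that adds 10, a lookup over a range containing `k2` returns 15 for that key.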

    FILE SERVICE AUTO-REMEDIATION IN STORAGE SYSTEMS

    Publication number: US20210334178A1

    Publication date: 2021-10-28

    Application number: US16859944

    Filing date: 2020-04-27

    Applicant: VMware, Inc.

    Abstract: A system and method for automatic remediation for a distributed file system use a file system (FS) remediation module running in a cluster management server and FS remediation agents running in a cluster of host computers. The FS remediation module monitors the cluster of host computers for related events. When a first file system service (FSS)-impacting event is detected, a cluster-level remediation action is executed at the cluster management server by the FS remediation module in response to the detected first FSS-impacting event. When a second FSS-impacting event is detected, a host-level remediation action is executed at one or more of the host computers in the cluster by the FS remediation agents in response to the detected second FSS-impacting event.
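
The split between cluster-level and host-level remediation can be sketched as a simple dispatcher. The event names `host_removed_from_cluster` and `fs_service_crashed`, the classes `RemediationModule` and `RemediationAgent`, and the log strings are all hypothetical; the abstract does not say which events fall into which category.

```python
class RemediationAgent:
    """Runs on one host and performs host-level actions."""

    def __init__(self, host):
        self.host = host
        self.log = []

    def remediate(self, event):
        self.log.append(f"host-level action for {event} on {self.host}")


class RemediationModule:
    """Runs on the cluster management server and classifies events."""

    CLUSTER_EVENTS = {"host_removed_from_cluster"}  # assumed example
    HOST_EVENTS = {"fs_service_crashed"}            # assumed example

    def __init__(self, agents):
        self.agents = agents  # host name -> RemediationAgent
        self.log = []

    def on_event(self, event, hosts=()):
        if event in self.CLUSTER_EVENTS:
            self.log.append(f"cluster-level action for {event}")
            return "cluster"
        if event in self.HOST_EVENTS:
            # Delegate to the agents on the affected hosts only.
            for host in hosts:
                self.agents[host].remediate(event)
            return "host"
        return "ignored"
```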

    SYSTEM AND METHOD OF A HIGHLY CONCURRENT CACHE REPLACEMENT ALGORITHM

    Publication number: US20210141728A1

    Publication date: 2021-05-13

    Application number: US16679570

    Filing date: 2019-11-11

    Applicant: VMware, Inc.

    Abstract: Disclosed are a method and system for managing multi-threaded concurrent access to a cache data structure. The cache data structure includes a hash table and three queues. The hash table includes a list of elements for each hash bucket with each hash bucket containing a mutex object and elements in each of the queues containing lock objects. Multiple threads can each lock a different hash bucket to have access to the list, and multiple threads can each lock a different element in the queues. The locks permit highly concurrent access to the cache data structure without conflict. Also, atomic operations are used to obtain pointers to elements in the queues so that a thread can safely advance each pointer. Race conditions that are encountered with locking an element in the queues or entering an element into the hash table are detected, and the operation encountering the race condition is retried.
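
The per-bucket locking part of this abstract can be sketched as follows. The sketch covers only the hash table: it omits the three queues, the per-element locks, the atomic pointer updates, and the retry-on-race logic. `BucketedTable` is a hypothetical name, and `threading.Lock` stands in for the mutex object held by each bucket.

```python
import threading


class BucketedTable:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]
        self.locks = [threading.Lock() for _ in range(n_buckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        i = self._index(key)
        with self.locks[i]:  # only this bucket is locked
            bucket = self.buckets[i]
            for j, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[j] = (key, value)
                    return
            bucket.append((key, value))

    def get(self, key):
        i = self._index(key)
        with self.locks[i]:
            for k, v in self.buckets[i]:
                if k == key:
                    return v
        return None
```

Threads whose keys hash to different buckets never contend for the same lock, which is the source of the concurrency the abstract describes.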
