Abstract:
Computer-implemented systems and methods are disclosed for comparing and associating objects. In some embodiments, a method is provided for associating a first object with one or more objects within a plurality of objects, each object comprising a first plurality of properties, each property comprising data reflecting a characteristic of an entity represented by the object, the associated objects comprising matching data in corresponding properties for a second plurality of properties. The method may include executing, for each object within the plurality of objects and for the first object, the following: creating a slug for the object, the slug comprising the second plurality of properties from the object; and inputting the slug for the object into a Bloom filter. Further, the method may include creating, for a bin within the Bloom filter corresponding to the slug for the first object, an association between objects whose slugs correspond to the bin, if the slugs for those objects match.
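The slug-and-bin flow described in this abstract can be sketched as follows. This is a minimal illustration only, not the claimed implementation: it assumes objects are plain dictionaries, uses a single SHA-256-derived hash to pick a bin (a real Bloom filter would typically use several hash functions), and the names `make_slug`, `bin_index`, and `associate` are hypothetical.

```python
import hashlib
from collections import defaultdict

def make_slug(obj, match_props):
    # The slug holds only the properties that must match (assumed: dict objects).
    return tuple(obj.get(p) for p in match_props)

def bin_index(slug, num_bins):
    # Assumption: one hash function selects one bin; chosen for brevity.
    digest = hashlib.sha256(repr(slug).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bins

def associate(first, others, match_props, num_bins=64):
    bins = defaultdict(list)
    # Input the slug of every object, including the first object, into the bins.
    for obj in [first, *others]:
        slug = make_slug(obj, match_props)
        bins[bin_index(slug, num_bins)].append((slug, obj))
    # Only the bin of the first object's slug needs a full slug comparison.
    first_slug = make_slug(first, match_props)
    candidates = bins[bin_index(first_slug, num_bins)]
    return [obj for slug, obj in candidates
            if slug == first_slug and obj is not first]
```

Comparing full slugs inside the bin removes hash collisions, so two objects that merely share a bin are not associated.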
Abstract:
Methods, systems, and apparatus, including computer programs encoded on computer storage media for data security protection are provided. One of the methods includes: receiving a job associated with a project, wherein the project is associated with one or more data sources; identifying a plurality of inputs and a plurality of outputs associated with the job; determining a plurality of required permissions associated with the job, wherein each of the required permissions comprises an operation on a required data source, the operation corresponding to at least one of the inputs or the outputs; verifying that the one or more data sources associated with the project comprise the required data source associated with each of the required permissions; and generating a token associated with the job, the token encoding the required permissions associated with the job, wherein the token is required for execution of the job.
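One way to picture the permission check and token generation is the sketch below. It is an assumption-laden illustration: inputs are mapped to "read" operations and outputs to "write" operations, the token is an HMAC-signed JSON payload, and `SECRET`, `required_permissions`, `issue_token`, and `verify_token` are hypothetical names that do not come from the abstract.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # hypothetical signing key, for illustration only

def required_permissions(inputs, outputs):
    # Assumption: each input needs "read", each output needs "write".
    perms = {("read", src) for src in inputs} | {("write", src) for src in outputs}
    return sorted(perms)

def issue_token(job_id, inputs, outputs, project_sources):
    perms = required_permissions(inputs, outputs)
    # Verify every required data source belongs to the project.
    missing = sorted({src for _, src in perms if src not in project_sources})
    if missing:
        raise PermissionError(f"data sources not in project: {missing}")
    payload = json.dumps({"job": job_id, "perms": perms}, sort_keys=True).encode()
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + signature

def verify_token(token):
    # The executor would call this before running the job.
    body, signature = token.rsplit(".", 1)
    payload = base64.urlsafe_b64decode(body)
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("token signature mismatch")
    return json.loads(payload)
```

Because the permissions are encoded in the signed payload, the executor can check them without consulting the project again.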
Abstract:
A computer-implemented method comprises detecting, by a processor of a first host of one or more hosts in a distributed computing environment, a distributed task waiting to be started, from a replicated configuration system, the distributed task being represented by a pending tasks key. The method comprises starting, by the processor, the distributed task by performing an atomic compare and swap operation to add a started key to the replicated configuration system. The method also comprises writing a specification of the distributed task to the replicated configuration system under a new version of a current tasks key. In addition, the method comprises removing, following the writing, the pending tasks key from the replicated configuration system.
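The four steps (detect, start via compare-and-swap, write the specification, remove the pending key) can be sketched against an in-memory stand-in for the replicated configuration system. Everything here is assumed for illustration: the `ConfigStore` class, the `pending/` and `started/` key layout, and the representation of the current tasks key as a `(version, tasks)` pair are hypothetical.

```python
import threading

class ConfigStore:
    """In-memory stand-in for a replicated configuration system (assumption)."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def delete(self, key):
        with self._lock:
            self._data.pop(key, None)

    def keys(self, prefix):
        with self._lock:
            return sorted(k for k in self._data if k.startswith(prefix))

    def compare_and_swap(self, key, expected, new):
        # Atomically set key to new only if it currently equals expected
        # (expected=None means "key must be absent").
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True

def run_pending(store, host):
    started = []
    for key in store.keys("pending/"):
        task_id = key.split("/", 1)[1]
        spec = store.get(key)
        if spec is None:
            continue  # another host already removed it
        # Only one host can create the started key.
        if not store.compare_and_swap(f"started/{task_id}", None, host):
            continue
        # Write the specification under a new version of the current tasks key.
        while True:
            old = store.get("current")
            version, tasks = old if old else (0, {})
            new = (version + 1, {**tasks, task_id: spec})
            if store.compare_and_swap("current", old, new):
                break
        # Remove the pending key only after the write succeeded.
        store.delete(key)
        started.append(task_id)
    return started
```

Removing the pending key last means a host that crashes mid-way leaves evidence of the unfinished start rather than silently dropping the task.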
Abstract:
Systems and methods are provided for data migration. The system may comprise one or more processors and a memory storing instructions that, when executed by the one or more processors, cause the system to migrate at least one first table of a first database schema to at least one second table of a second database schema, determine a query for modifying the first table during the migration, modify the second table based at least in part on the query, and update a mutation table that at least describes the modification.
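A small sketch with Python's built-in `sqlite3` module may make the role of the mutation table concrete. The table names (`users_v1`, `users_v2`, `mutations`), their columns, and the helper functions are all hypothetical; the abstract does not specify a schema or a database engine.

```python
import sqlite3

def setup():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users_v1 (id INTEGER PRIMARY KEY, full_name TEXT);
        CREATE TABLE users_v2 (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE mutations (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            target TEXT, row_id INTEGER, description TEXT
        );
        INSERT INTO users_v1 VALUES (1, 'Ada'), (2, 'Grace');
    """)
    return conn

def migrate(conn):
    # Copy the first table into the second table of the new schema.
    with conn:
        conn.execute(
            "INSERT INTO users_v2 (id, name) SELECT id, full_name FROM users_v1"
        )

def rename_during_migration(conn, row_id, new_name):
    # A query aimed at the first table arrives mid-migration: apply it to
    # both tables and record what changed in the mutation table.
    with conn:
        conn.execute(
            "UPDATE users_v1 SET full_name = ? WHERE id = ?", (new_name, row_id)
        )
        conn.execute(
            "UPDATE users_v2 SET name = ? WHERE id = ?", (new_name, row_id)
        )
        conn.execute(
            "INSERT INTO mutations (target, row_id, description) VALUES (?, ?, ?)",
            ("users_v2", row_id, f"name set to {new_name!r}"),
        )
```

Running the three statements in one transaction keeps the second table and the mutation table from drifting apart if any statement fails.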
Abstract:
An apparatus and a method performed by one or more processors are disclosed. The method receives a build request associated with performing an external data processing task on a first data set, the task to be performed at a system external to a data processing platform and the first data set being stored in memory associated with the data processing platform. The method generates a task identifier for the data processing task, and provides, in association with the task identifier, the first data set to an agent associated with the external system with an indication of the data processing task, the agent being arranged to cause performance of the task at the external system, to receive a second data set resulting from performance of the task, and to provide the second data set and associated metadata indicative of the transformation. The method receives the second data set and metadata from the agent associated with the external system and stores the second data set and associated metadata.
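The hand-off between platform and agent can be sketched as below. This is only an illustration under stated assumptions: the external system is modelled as a plain Python callable, the metadata fields are invented, and `Agent` and `handle_build_request` are hypothetical names.

```python
import uuid

class Agent:
    """Stand-in for the agent associated with the external system (assumption)."""
    def __init__(self, external_fn):
        self.external_fn = external_fn  # models the external system

    def run(self, task_id, task_name, data):
        # Cause the task to be performed externally and collect the result.
        result = self.external_fn(data)
        # Metadata describing the transformation (fields are illustrative).
        metadata = {
            "task_id": task_id,
            "task": task_name,
            "rows_in": len(data),
            "rows_out": len(result),
        }
        return result, metadata

def handle_build_request(task_name, first_data_set, agent, store):
    # Generate a task identifier, hand the data to the agent, store what returns.
    task_id = str(uuid.uuid4())
    second_data_set, metadata = agent.run(task_id, task_name, first_data_set)
    store[task_id] = {"data": second_data_set, "metadata": metadata}
    return task_id
```

Keying the stored result by the task identifier lets the platform relate the second data set back to the build request that produced it.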
Abstract:
A method and system for serving assets is disclosed, comprising receiving an asset request to serve an asset, wherein the asset request originates at an application, and wherein the asset request comprises an advertisement of an asset to be served and a request for the network address of an asset server configured to serve the requested asset. The method further comprises causing a service discovery server to identify an asset server configured to serve the requested asset, and causing the requested asset to be served to the application.
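A minimal sketch of the lookup step, assuming the service discovery server is a simple in-memory registry keyed by asset name. The `ServiceDiscovery` class, the request shape, and the `fetch` callback are hypothetical and stand in for real network calls.

```python
class ServiceDiscovery:
    """In-memory stand-in for a service discovery server (assumption)."""
    def __init__(self):
        self._servers = {}

    def register(self, asset_name, address):
        self._servers.setdefault(asset_name, []).append(address)

    def lookup(self, asset_name):
        addresses = self._servers.get(asset_name)
        if not addresses:
            raise LookupError(f"no asset server advertises {asset_name!r}")
        return addresses[0]  # assumption: first registered server wins

def handle_asset_request(request, discovery, fetch):
    # The request names the asset; discovery supplies the server address.
    address = discovery.lookup(request["asset"])
    # fetch is a caller-supplied function that retrieves the asset from address.
    return fetch(address, request["asset"])
```

For example, registering `"logo.png"` at `"10.0.0.5:8080"` and passing a `fetch` function that returns `f"{address}/{asset}"` would yield `"10.0.0.5:8080/logo.png"`.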
Abstract:
Disclosed herein is a data structure which includes a sequence of events, each event associated with a sequence number indicating a temporal position of the event within the sequence of events; one or more read-offsets, each read-offset associated with a consumer, wherein each read-offset indicates a sequence number up to which the consumer has read events within the sequence of events; and at least one snapshot which represents events with sequence numbers smaller than the smallest read-offset in a compacted form. Also disclosed herein is a computer-implemented method of maintaining the data structure. Further disclosed herein is a computer-implemented method performed on a sequence of events accessible by a plurality of consumers, each event associated with a sequence number indicating a temporal position of the event within the sequence of events, and each consumer associated with a read-offset indicating the sequence number up to which the consumer has read events within the sequence of events. The method includes determining a smallest read-offset of all read-offsets; compacting events with sequence numbers smaller than the smallest read-offset into a snapshot; and replacing the events with sequence numbers smaller than the smallest read-offset with the snapshot. Corresponding computer-readable media and computing systems are also disclosed herein.
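The compaction step can be sketched as follows. The sketch assumes each event is a `(sequence_number, key, value)` tuple and that the "compacted form" is the latest value per key; both are illustrative choices, and the `EventLog` class is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class EventLog:
    events: list        # (seq, key, value) tuples in ascending seq order
    read_offsets: dict  # consumer name -> seq up to which it has read
    snapshot: dict = field(default_factory=dict)

    def compact(self):
        if not self.read_offsets:
            return 0  # no consumers registered, nothing is safe to compact
        # Determine the smallest read-offset across all consumers.
        cutoff = min(self.read_offsets.values())
        old = [e for e in self.events if e[0] < cutoff]
        # Fold old events into the snapshot (assumption: last write per key wins).
        for _seq, key, value in old:
            self.snapshot[key] = value
        # Replace the compacted events with the snapshot.
        self.events = [e for e in self.events if e[0] >= cutoff]
        return len(old)
```

Only events below the smallest read-offset are folded away, so the events at or after the slowest consumer's offset remain available individually.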
Abstract:
A database system comprising a decoupled compute layer and storage layer is implemented to store, build, and maintain a canonical dataset, a temporary buffer, and projection datasets. The canonical dataset is a set of batch-updated data. The data is appended in chunks to the canonical dataset such that the canonical dataset becomes a historical dataset over time. The buffer is a write-ahead log that contains the most recent chunks of data and provides atomicity and durability for the database system. The projection datasets are indexes of the canonical dataset and/or the buffer that may have single or multiple column sort-orders and/or particular data formats. The writes to the canonical dataset, projection datasets, and buffer may be asynchronous and therefore the database system is advantageously less resource constrained.
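The relationship between the canonical dataset, the buffer, and a projection can be sketched in a few lines. This is a toy in-memory model under stated assumptions: chunks are lists of dictionaries, a projection is simply a sorted copy of all rows, and the `ChunkStore` class is hypothetical.

```python
class ChunkStore:
    """Toy model of canonical dataset + buffer + projections (assumption)."""
    def __init__(self):
        self.canonical = []  # append-only list of chunks
        self.buffer = []     # most recent chunks, not yet in the canonical dataset

    def write(self, chunk):
        # New data lands in the buffer first, like a write-ahead log.
        self.buffer.append(list(chunk))

    def flush(self):
        # Append buffered chunks to the canonical dataset, then clear the buffer.
        self.canonical.extend(self.buffer)
        self.buffer.clear()

    def rows(self):
        return [row for chunk in self.canonical + self.buffer for row in chunk]

    def projection(self, *sort_columns):
        # A projection here is the rows re-sorted by one or more columns.
        return sorted(self.rows(), key=lambda row: tuple(row[c] for c in sort_columns))
```

Because `projection` reads from both the canonical dataset and the buffer, recently written rows are visible before they have been flushed.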
Abstract:
A database system comprising a decoupled compute layer and storage layer is implemented to store, build, and maintain a canonical dataset, a temporary buffer, and an edits dataset. The canonical dataset is a set of batch-updated data. The data is appended in chunks to the canonical dataset such that the canonical dataset becomes a historical dataset over time. The buffer is a write-ahead log that contains the most recent chunks of data and provides atomicity and durability for the database system. The edits dataset is the set of data that contains edits such as cell mutations, row appends and/or row deletions. The database system enables users to make cell or row-level edits to tables and observe those edits in analytical systems or downstream builds with minimal latency.
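How an edits dataset might be overlaid on the batch data at read time can be sketched as below. The edit record shapes (`set_cell`, `append_row`, `delete_row`), the `id` row key, and the `apply_edits` function are assumptions made for illustration; the abstract names the edit kinds but not their encoding.

```python
def apply_edits(base_rows, edits):
    # Index base rows by id, copying each so the batch data stays untouched.
    view = {row["id"]: dict(row) for row in base_rows}
    for edit in edits:
        op = edit["op"]
        if op == "set_cell":
            # Cell mutation: overwrite one column of an existing row.
            if edit["id"] in view:
                view[edit["id"]][edit["column"]] = edit["value"]
        elif op == "append_row":
            view[edit["row"]["id"]] = dict(edit["row"])
        elif op == "delete_row":
            view.pop(edit["id"], None)
        else:
            raise ValueError(f"unknown edit operation: {op!r}")
    return list(view.values())
```

Applying edits as an overlay means a reader sees them immediately, without waiting for the next batch update of the canonical dataset.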