-
Publication No.: US10936432B1
Publication Date: 2021-03-02
Application No.: US14495408
Filing Date: 2014-09-24
Applicant: Amazon Technologies, Inc.
Inventor: Tin-Yu Lee , Rejith George Joseph , Scott Michael Le Grand , Saurabh Dileep Baji
Abstract: Methods, systems, and computer-readable media for implementing a fault-tolerant parallel computation framework are disclosed. Execution of an application comprises execution of a plurality of processes in parallel. Process states for the processes are stored during the execution of the application. The processes use a message passing interface for exchanging messages with one another. The messages are exchanged and the process states are stored at a plurality of checkpoints during execution of the application. A final successful checkpoint is determined after the execution of the application is terminated. The final successful checkpoint represents the most recent checkpoint at which the processes exchanged messages successfully. Execution of the application is resumed from the final successful checkpoint using the process states stored at the final successful checkpoint.
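The "final successful checkpoint" selection described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation; the checkpoint records and the `exchange_ok` flag are hypothetical names for the stored process states and per-process message-exchange status.

```python
def final_successful_checkpoint(checkpoints):
    """Return the index of the most recent checkpoint at which every
    parallel process reported a successful message exchange."""
    successful = [
        c["index"] for c in checkpoints
        if all(p["exchange_ok"] for p in c["processes"])
    ]
    return max(successful) if successful else None

# Checkpoint 2 had a failed exchange, so execution would resume from checkpoint 1.
checkpoints = [
    {"index": 0, "processes": [{"exchange_ok": True}, {"exchange_ok": True}]},
    {"index": 1, "processes": [{"exchange_ok": True}, {"exchange_ok": True}]},
    {"index": 2, "processes": [{"exchange_ok": True}, {"exchange_ok": False}]},
]
resume_from = final_successful_checkpoint(checkpoints)
```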
-
Publication No.: US10148736B1
Publication Date: 2018-12-04
Application No.: US14281582
Filing Date: 2014-05-19
Applicant: Amazon Technologies, Inc.
Inventor: Tin-Yu Lee , Rejith George Joseph , Scott Michael Le Grand , Saurabh Dileep Baji , Peter Sirota
Abstract: A client may submit a job to a service provider that processes a large data set and that employs a message passing interface (MPI) to coordinate the collective execution of the job on multiple compute nodes. The framework may create a MapReduce cluster (e.g., within a VPC) and may generate a single key pair for the cluster, which may be downloaded by nodes in the cluster and used to establish secure node-to-node communication channels for MPI messaging. A single node may be assigned as a mapper process and may launch the MPI job, which may fork its commands to other nodes in the cluster (e.g., nodes identified in a hostfile associated with the MPI job), according to the MPI interface. A rankfile may be used to synchronize the MPI job and another MPI process used to download portions of the data set to respective nodes in the cluster.
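A hostfile like the one the abstract mentions simply lists the cluster's nodes (and available slots) so the MPI launcher can fork the job's commands across them. The helper below is a hypothetical sketch of rendering such a file; the node names and the `slots=` convention follow common MPI launcher practice, not the patent's specific format.

```python
def build_hostfile(nodes, slots_per_node=1):
    """Render a hostfile listing each cluster node and its slot count,
    one node per line, for an MPI launcher to distribute work over."""
    return "\n".join(f"{node} slots={slots_per_node}" for node in nodes)

# Two nodes, one slot each:
hostfile_text = build_hostfile(["node-1", "node-2"])
```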
-
Publication No.: US20200218569A1
Publication Date: 2020-07-09
Application No.: US16818297
Filing Date: 2020-03-13
Applicant: Amazon Technologies, Inc.
Inventor: Dougal Stuart Ballantyne , James Edward Kinney, JR. , Aswin Damodar , Chetan Hosmani , Rejith George Joseph , Chris William Ramsey , Kiuk Chung , Jason Roy Rupard
Abstract: A scheduler of a batch job management service determines that a set of resources of a client is insufficient to execute one or more jobs. The scheduler prepares a multi-dimensional statistical representation of resource requirements of the jobs, and transmits it to a resource controller. The resource controller uses the multi-dimensional representation and resource usage state information to make resource allocation change decisions.
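One plausible form of the "multi-dimensional statistical representation" is a histogram of pending jobs bucketed along several resource dimensions, so the scheduler transmits compact counts rather than every job's requirements. The bucket edges and field names below are assumptions for illustration only.

```python
from collections import Counter

def resource_histogram(jobs, cpu_buckets=(1, 2, 4, 8), mem_buckets=(1024, 4096, 16384)):
    """Summarize jobs as counts over (vCPU bucket, memory bucket) pairs."""
    def bucket(value, edges):
        # Smallest bucket edge that covers the value (clamped to the largest edge).
        for edge in edges:
            if value <= edge:
                return edge
        return edges[-1]

    hist = Counter()
    for job in jobs:
        hist[(bucket(job["vcpus"], cpu_buckets),
              bucket(job["memory_mib"], mem_buckets))] += 1
    return hist

jobs = [
    {"vcpus": 2, "memory_mib": 2048},
    {"vcpus": 2, "memory_mib": 3000},
    {"vcpus": 7, "memory_mib": 10000},
]
hist = resource_histogram(jobs)
```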
-
Publication No.: US20180143852A1
Publication Date: 2018-05-24
Application No.: US15360948
Filing Date: 2016-11-23
Applicant: Amazon Technologies, Inc.
Inventor: Dougal Stuart Ballantyne , James Edward Kinney, JR. , Aswin Damodar , Chetan Hosmani , Rejith George Joseph , Chris William Ramsey , Kiuk Chung , Jason Roy Rupard
CPC classification number: G06F9/4881 , G06F9/5016 , G06F9/5027 , G06F9/5072
Abstract: A scheduler of a batch job management service determines that a set of resources of a client is insufficient to execute one or more jobs. The scheduler prepares a multi-dimensional statistical representation of resource requirements of the jobs, and transmits it to a resource controller. The resource controller uses the multi-dimensional representation and resource usage state information to make resource allocation change decisions.
-
Publication No.: US20170193368A1
Publication Date: 2017-07-06
Application No.: US14984510
Filing Date: 2015-12-30
Applicant: Amazon Technologies, Inc.
Inventor: Scott Michael Le Grand , Rejith George Joseph
IPC: G06N3/08
Abstract: The present disclosure is directed to parallelization of artificial neural network processing by conditionally synchronizing, among multiple computer processors, either the input or output of individual operations, and by conditionally using either rows or columns of certain matrices used in the operations. The conditional processing may depend upon the relative sizes of the input and output of the specific operations to be performed. For example, if a current layer matrix of values is larger than a next layer matrix of values to be computed, then rows of a weight matrix may be used by the computer processors to compute the next layer matrix. If the current layer matrix is smaller than the next layer matrix, then columns of the weight matrix may be used by the computer processors to compute the next layer matrix.
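The row-versus-column choice can be made concrete with a small sketch: the matrix product is identical either way; what changes is which pieces each processor holds and what must be synchronized (a sum-reduction over partial products versus a concatenation of independent output slices). This is a single-process simulation under assumed shapes, not the patented multi-processor implementation.

```python
import numpy as np

def next_layer_partitioned(current, weights, parts=2):
    """Compute current @ weights, partitioned by weight-matrix rows when the
    current layer is larger, or by columns when the next layer is larger."""
    n_in, n_out = weights.shape
    if n_in > n_out:
        # Current layer larger: each "processor" holds a block of weight ROWS
        # and the matching input slice; partial products are summed (reduction).
        chunks = np.array_split(np.arange(n_in), parts)
        return sum(current[:, idx] @ weights[idx, :] for idx in chunks)
    # Next layer larger: each "processor" holds a block of weight COLUMNS and
    # computes an independent slice of the next layer (concatenation).
    chunks = np.array_split(np.arange(n_out), parts)
    return np.concatenate([current @ weights[:, idx] for idx in chunks], axis=1)
```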
-
Publication No.: US10896459B1
Publication Date: 2021-01-19
Application No.: US16842432
Filing Date: 2020-04-07
Applicant: Amazon Technologies, Inc.
Inventor: Rejith George Joseph , Oleg Rybakov
Abstract: Some aspects of the present disclosure relate to generating and training a neural network by separating historical item interaction data into both inputs and outputs. This may be done, for example, based on date. For example, a neural network machine learning technique may be used to generate a prediction model using a set of inputs that includes both a number of items purchased by a number of users before a certain date as well as some or all attributes of those items, and a set of outputs that includes the items purchased after that date. The items purchased before that date and the associated attributes can be subjected to a time-decay function.
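The time-decay function applied to pre-split purchases could, for example, weight each interaction by its age relative to the input/output split date. The exponential form and the half-life parameter below are assumptions for illustration; the abstract does not specify the decay function.

```python
from datetime import date

def time_decay_weight(purchase_date, split_date, half_life_days=90.0):
    """Weight a purchase by exponential decay of its age at the split date:
    a purchase exactly one half-life old contributes with weight 0.5."""
    age_days = (split_date - purchase_date).days
    return 0.5 ** (age_days / half_life_days)

# Jan 1 to Mar 31, 2020 is 90 days, i.e. exactly one half-life:
w = time_decay_weight(date(2020, 1, 1), date(2020, 3, 31), half_life_days=90.0)
```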
-
Publication No.: US10482380B2
Publication Date: 2019-11-19
Application No.: US14984510
Filing Date: 2015-12-30
Applicant: Amazon Technologies, Inc.
Inventor: Scott Michael Le Grand , Rejith George Joseph
IPC: G06N3/08
Abstract: The present disclosure is directed to parallelization of artificial neural network processing by conditionally synchronizing, among multiple computer processors, either the input or output of individual operations, and by conditionally using either rows or columns of certain matrices used in the operations. The conditional processing may depend upon the relative sizes of the input and output of the specific operations to be performed. For example, if a current layer matrix of values is larger than a next layer matrix of values to be computed, then rows of a weight matrix may be used by the computer processors to compute the next layer matrix. If the current layer matrix is smaller than the next layer matrix, then columns of the weight matrix may be used by the computer processors to compute the next layer matrix.
-
Publication No.: US09672122B1
Publication Date: 2017-06-06
Application No.: US14500762
Filing Date: 2014-09-29
Applicant: Amazon Technologies, Inc.
Inventor: Mohana Sudhan Gandhi , Rejith George Joseph , Bandish N. Chheda , Saurabh Dileep Baji
CPC classification number: G06F11/1425 , G06F9/5088 , G06F11/1438 , G06F11/1484 , G06F11/203 , G06F11/2035 , G06F11/2041 , G06F11/2048 , G06F11/3433 , G06F17/30215 , G06F2201/805 , G06F2201/84
Abstract: Data files in a distributed system sometimes become unavailable. A method for fault tolerance without data loss in a distributed file system includes allocating data nodes of the distributed file system among a plurality of compute groups, replicating a data file among a subset of the plurality of compute groups such that the data file is located in at least two compute zones, wherein the first compute zone is isolated from the second compute zone, monitoring the accessibility of the data files, and causing a distributed task requiring data in the data file to be executed by a compute instance in the subset of the plurality of compute groups. Upon detecting a failure in the accessibility of a data node with the data file, the task management node may redistribute the distributed task among other compute instances with access to any replica of the data file.
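The redistribution step can be sketched as picking any healthy node that holds a replica of the task's data file. The replica map, node names, and task shape below are hypothetical; they only illustrate the failover decision, not the patented task-management node.

```python
def reassign_task(task, replicas, healthy_nodes):
    """Return a compute instance that can reach a replica of the task's
    data file, skipping nodes whose accessibility check has failed."""
    for node in replicas[task["data_file"]]:
        if node in healthy_nodes:
            return node
    raise RuntimeError("no reachable replica for " + task["data_file"])

# The file is replicated across two isolated zones; zone-a's node has
# failed, so the task moves to the replica in zone-b.
replicas = {"part-0001": ["zone-a-node-3", "zone-b-node-7"]}
healthy = {"zone-b-node-7"}
target = reassign_task({"data_file": "part-0001"}, replicas, healthy)
```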
-
Publication No.: US10659523B1
Publication Date: 2020-05-19
Application No.: US14286724
Filing Date: 2014-05-23
Applicant: Amazon Technologies, Inc.
Inventor: Rejith George Joseph , Tin-Yu Lee , Scott Michael Le Grand , Saurabh Dileep Baji
Abstract: At the request of a customer, a distributed computing service provider may create multiple clusters under a single customer account, and may isolate them from each other. For example, various isolation mechanisms (or combinations of isolation mechanisms) may be applied when creating the clusters to isolate a given cluster of compute nodes from network traffic from compute nodes of other clusters (e.g., by creating the clusters in different VPCs); to restrict access to data, metadata, or resources that are within the given cluster of compute nodes or that are associated with the given cluster of compute nodes by compute nodes of other clusters in the distributed computing system (e.g., using an instance metadata tag and/or a storage system prefix); and/or to restrict access to application programming interfaces of the distributed computing service by the given cluster of compute nodes (e.g., using an identity and access manager).
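The storage-system-prefix mechanism mentioned in the abstract amounts to a simple access predicate: a request from a cluster may only touch objects under that cluster's own prefix. The prefix layout and field names below are assumptions for illustration, not the service's actual scheme.

```python
def prefix_allowed(request, cluster_tag):
    """Permit a storage request only if its object key falls under the
    prefix reserved for the requesting cluster's metadata tag."""
    return request["key"].startswith(f"clusters/{cluster_tag}/")

# A cluster tagged "c1" can read its own data but not another cluster's:
own = prefix_allowed({"key": "clusters/c1/data/part-0"}, "c1")
other = prefix_allowed({"key": "clusters/c2/data/part-0"}, "c1")
```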
-
Publication No.: US10650432B1
Publication Date: 2020-05-12
Application No.: US15362585
Filing Date: 2016-11-28
Applicant: Amazon Technologies, Inc.
Inventor: Rejith George Joseph , Oleg Rybakov
Abstract: Some aspects of the present disclosure relate to generating and training a neural network by separating historical item interaction data into both inputs and outputs. This may be done, for example, based on date. For example, a neural network machine learning technique may be used to generate a prediction model using a set of inputs that includes both a number of items purchased by a number of users before a certain date as well as some or all attributes of those items, and a set of outputs that includes the items purchased after that date. The items purchased before that date and the associated attributes can be subjected to a time-decay function.