-
Publication No.: US20210201156A1
Publication Date: 2021-07-01
Application No.: US17056640
Filing Date: 2019-05-20
Applicant: GOOGLE LLC
Inventor: Danijar Hafner, Jacob Buckman, Honglak Lee, Eugene Brevdo, George Jay Tucker
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for sample-efficient reinforcement learning. One of the methods includes maintaining an ensemble of Q networks, an ensemble of transition models, and an ensemble of reward models; obtaining a transition that includes a training reward; generating, using the ensemble of transition models, M trajectories; for each time step in each of the trajectories: generating, using the ensemble of reward models, N rewards for the time step, generating, using the ensemble of Q networks, L Q values for the time step, and determining, from the rewards, the Q values, and the training reward, L*N candidate target Q values for the trajectory and for the time step; for each of the time steps, combining the candidate target Q values; determining a final target Q value; and training at least one of the Q networks in the ensemble using the final target Q value.
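The abstract above describes an ensemble-based construction of target Q values. Below is a minimal Python sketch of that flow, assuming each model is a plain callable; the names (`transition_models`, `reward_models`, `q_networks`, `policy`) and the simple mean combination are illustrative assumptions, not details taken from the patent.

```python
# A minimal sketch of the ensemble target-Q construction, assuming each model
# is a plain callable; all names and the mean combination are illustrative.
import numpy as np

def final_target_q(transition, transition_models, reward_models, q_networks,
                   policy, horizon, gamma=0.99):
    """Roll out M imagined trajectories, collect L*N candidate targets per
    time step, and combine them (a plain mean here; an uncertainty-weighted
    combination is another natural choice)."""
    s, a, r_train, s_next = transition
    candidates = []
    for T in transition_models:                          # M trajectories
        state = s_next
        # One running discounted return per reward model (N of them),
        # each seeded with the training reward from the real transition.
        partials = [r_train for _ in reward_models]
        for h in range(horizon):
            action = policy(state)
            qs = [Q(state, action) for Q in q_networks]       # L Q values
            rs = [R(state, action) for R in reward_models]    # N rewards
            # L*N candidate target Q values for this time step.
            for partial in partials:
                for q in qs:
                    candidates.append(partial + gamma ** (h + 1) * q)
            # Extend each running return with its reward model's prediction.
            partials = [p + gamma ** (h + 1) * r for p, r in zip(partials, rs)]
            state = T(state, action)                     # advance the rollout
    return float(np.mean(candidates))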
-
Publication No.: US20200234117A1
Publication Date: 2020-07-23
Application No.: US16617461
Filing Date: 2018-08-24
Applicant: GOOGLE LLC
Inventor: Danijar Hafner
IPC: G06N3/08, G06N5/04, G06F16/901
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for batched reinforcement learning. For example, the batched reinforcement learning techniques can be used to determine a control policy for a robot in simulation, and the control policy can then be used to control the robot in the real world. In one aspect, a method includes obtaining a plurality of current observations, each current observation characterizing a current state of a respective environment replica; processing the current observations in parallel using an action selection neural network in accordance with current values of the network parameters to generate an action batch; obtaining a transition tuple batch comprising a respective transition tuple for each of the environment replicas, the respective transition tuple for each environment replica comprising: (i) a subsequent observation and (ii) a reward; and training the action selection neural network on the transition tuple batch.
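A minimal Python sketch of one batched step follows, assuming a simple environment-replica API; the method names `current_observation` and `step`, and the callable network interface, are hypothetical.

```python
# A minimal sketch of one batched reinforcement-learning step; the environment
# API (`current_observation`, `step`) and network interface are hypothetical.
import numpy as np

def batched_step(envs, action_selection_net, params):
    """One iteration: a single parallel forward pass over all replicas, then
    one transition tuple per replica for training."""
    obs_batch = np.stack([env.current_observation() for env in envs])
    action_batch = action_selection_net(obs_batch, params)   # action batch
    transition_tuples = []
    for env, action in zip(envs, action_batch):
        next_obs, reward = env.step(action)   # (i) subsequent observation, (ii) reward
        transition_tuples.append((next_obs, reward))
    return action_batch, transition_tuples    # the batch the network trains on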
-
Publication No.: US20240173854A1
Publication Date: 2024-05-30
Application No.: US18436684
Filing Date: 2024-02-08
Applicant: GOOGLE LLC
Inventor: Danijar Hafner
CPC classification number: B25J9/161, B25J9/163, B25J9/1661, G06N7/01
Abstract: Techniques are disclosed that enable model predictive control of a robot based on a latent dynamics model and a reward function. In many implementations, the latent space can be divided into a deterministic portion and a stochastic portion, allowing the model to be used to generate more likely robot trajectories. Additional or alternative implementations include multiple reward functions, where each reward function corresponds to a different robot task.
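A minimal Python sketch of model predictive control in the latent space follows (this also covers the two related records below, which share this abstract). It assumes a random-shooting planner; the callables `dynamics` and `reward_fn` are illustrative, and the `(h, z)` pair stands in for the deterministic/stochastic split of the latent state.

```python
# A minimal sketch of latent-space model predictive control, assuming a
# random-shooting planner; `dynamics` and `reward_fn` are illustrative.
import numpy as np

def plan_action(h, z, dynamics, reward_fn, action_dim,
                horizon=12, n_candidates=1000):
    """Sample candidate action sequences, score each by rolling the latent
    model forward, and execute only the first action of the best sequence."""
    sequences = np.random.uniform(-1.0, 1.0, (n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    for i, sequence in enumerate(sequences):
        h_i, z_i = h, z                            # deterministic / stochastic parts
        for action in sequence:
            h_i, z_i = dynamics(h_i, z_i, action)  # h_i evolves deterministically,
            returns[i] += reward_fn(h_i, z_i)      # z_i is sampled
    return sequences[np.argmax(returns)][0]        # replan at every control step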
-
Publication No.: US20210205984A1
Publication Date: 2021-07-08
Application No.: US17056104
Filing Date: 2019-05-17
Applicant: Google LLC
Inventor: Danijar Hafner
Abstract: Techniques are disclosed that enable model predictive control of a robot based on a latent dynamics model and a reward function. In many implementations, the latent space can be divided into a deterministic portion and a stochastic portion, allowing the model to be used to generate more likely robot trajectories. Additional or alternative implementations include multiple reward functions, where each reward function corresponds to a different robot task.
-
Publication No.: US11904467B2
Publication Date: 2024-02-20
Application No.: US17056104
Filing Date: 2019-05-17
Applicant: Google LLC
Inventor: Danijar Hafner
CPC classification number: B25J9/161, B25J9/163, B25J9/1661, G06N7/01
Abstract: Techniques are disclosed that enable model predictive control of a robot based on a latent dynamics model and a reward function. In many implementations, the latent space can be divided into a deterministic portion and a stochastic portion, allowing the model to be used to generate more likely robot trajectories. Additional or alternative implementations include multiple reward functions, where each reward function corresponds to a different robot task.
-
Publication No.: US20210158162A1
Publication Date: 2021-05-27
Application No.: US17103827
Filing Date: 2020-11-24
Applicant: Google LLC
Inventor: Danijar Hafner, Mohammad Norouzi, Timothy Paul Lillicrap
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training an action selection policy neural network used to select an action to be performed by an agent interacting with an environment. In one aspect, a method includes: receiving a latent representation characterizing a current state of the environment; generating a trajectory of latent representations that starts with the received latent representation; for each latent representation in the trajectory: determining a predicted reward, and processing the latent representation using a value neural network to generate a predicted state value; determining a corresponding target state value for each latent representation in the trajectory; determining, based on the target state values, an update to the current values of the policy neural network parameters; and determining an update to the current values of the value neural network parameters.
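A minimal Python sketch of the target state values over an imagined latent trajectory follows, assuming a TD(lambda)-style recursion computed backward from a bootstrap value; the exact target definition is an assumption, not taken from the patent.

```python
# A minimal sketch of target state values along an imagined trajectory,
# assuming a TD(lambda)-style recursion; the target definition is an assumption.
def target_state_values(rewards, values, gamma=0.99, lam=0.95):
    """rewards[t] and values[t] come from the reward and value predictions
    along the trajectory; `values` carries one extra bootstrap entry at the end."""
    target = values[-1]                       # bootstrap from the final state value
    targets = []
    for t in reversed(range(len(rewards))):
        target = rewards[t] + gamma * ((1.0 - lam) * values[t + 1] + lam * target)
        targets.append(target)
    return targets[::-1]                      # one target per latent representation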