SAMPLE-EFFICIENT REINFORCEMENT LEARNING

    Publication (Announcement) Number: US20210201156A1

    Publication (Announcement) Date: 2021-07-01

    Application Number: US17056640

    Application Date: 2019-05-20

    Applicant: GOOGLE LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for sample-efficient reinforcement learning. One of the methods includes maintaining an ensemble of Q networks, an ensemble of transition models, and an ensemble of reward models; obtaining a transition; generating, using the ensemble of transition models, M trajectories; for each time step in each of the trajectories: generating, using the ensemble of reward models, N rewards for the time step, generating, using the ensemble of Q networks, L Q values for the time step, and determining, from the rewards, the Q values, and the training reward, L*N candidate target Q values for the trajectory and for the time step; for each of the time steps, combining the candidate target Q values; determining a final target Q value; and training at least one of the Q networks in the ensemble using the final target Q value.
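
    A minimal sketch of the ensemble-based target-Q computation this abstract outlines, with toy linear maps standing in for the Q networks, transition models, and reward models. The horizon H and the rules for combining candidates (mean over the L*N candidates, minimum over time steps, mean over the M trajectories) are illustrative assumptions, not the patent's specified method.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, M, N, L, H, GAMMA = 4, 3, 2, 2, 5, 0.99

# Toy "ensembles": random linear maps stand in for the trained networks.
transition_models = [rng.normal(size=(STATE_DIM, STATE_DIM)) for _ in range(M)]
reward_models = [rng.normal(size=STATE_DIM) for _ in range(N)]
q_networks = [rng.normal(size=STATE_DIM) for _ in range(L)]

def target_q(next_state, train_reward):
    """Final target Q value computed from one observed transition."""
    per_trajectory = []
    for T in transition_models:                   # M imagined trajectories
        s, ret, disc = next_state.copy(), train_reward, GAMMA
        per_step = []
        for _ in range(H):                        # roll the transition model forward
            s = np.tanh(T @ s)
            rewards = [float(r @ s) for r in reward_models]   # N rewards per step
            q_values = [float(q @ s) for q in q_networks]     # L Q values per step
            # L*N candidate targets: return so far plus each (reward, Q) pairing.
            candidates = [ret + disc * (rw + GAMMA * qv)
                          for rw in rewards for qv in q_values]
            per_step.append(np.mean(candidates))  # combine candidates (assumed: mean)
            ret += disc * np.mean(rewards)        # accumulate imagined reward
            disc *= GAMMA
        per_trajectory.append(min(per_step))      # combine over steps (assumed: min)
    return float(np.mean(per_trajectory))         # final target (assumed: mean)

print(target_q(rng.normal(size=STATE_DIM), train_reward=1.0))
```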

    BATCHED REINFORCEMENT LEARNING
    Invention Application

    Publication (Announcement) Number: US20200234117A1

    Publication (Announcement) Date: 2020-07-23

    Application Number: US16617461

    Application Date: 2018-08-24

    Applicant: GOOGLE LLC

    Inventor: Danijar Hafner

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for batched reinforcement learning. For example, the batched reinforcement learning techniques can be used to determine a control policy for a robot in simulation and the control policy can then be used to control the robot in the real world. In one aspect, a method includes obtaining a plurality of current observations, each current observation characterizing a current state of a respective environment replica; processing the current observations in parallel using the action selection neural network in accordance with current values of the network parameters to generate an action batch; obtaining a transition tuple batch comprising a respective transition tuple for each of the environment replicas, the respective transition tuple for each environment replica comprising: (i) a subsequent observation and (ii) a reward; and training the action selection neural network on the batch of transition tuples.
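
    A minimal sketch of the batched loop this abstract outlines: one observation per environment replica is stacked into a batch, the action selection network processes the whole batch in a single forward pass, and the resulting batch of transition tuples drives one training step. The toy replicas, the linear softmax policy, and the REINFORCE-style update are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
B, OBS_DIM, N_ACTIONS, LR = 8, 3, 2, 0.01

params = rng.normal(scale=0.1, size=(OBS_DIM, N_ACTIONS))  # network parameters
states = rng.normal(size=(B, OBS_DIM))                     # one state per replica

def select_actions(obs_batch):
    """One parallel forward pass over the whole observation batch."""
    logits = obs_batch @ params
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    actions = np.array([rng.choice(N_ACTIONS, p=p) for p in probs])
    return actions, probs

def step_replicas(obs_batch, actions):
    """Toy environment replicas: subsequent observation and a reward each."""
    next_obs = 0.9 * obs_batch + rng.normal(scale=0.1, size=obs_batch.shape)
    rewards = np.where(actions == (obs_batch[:, 0] > 0), 1.0, 0.0)
    return next_obs, rewards

for _ in range(100):
    actions, probs = select_actions(states)                 # action batch
    next_states, rewards = step_replicas(states, actions)   # transition tuple batch
    # Train on the batch: simple policy-gradient step (assumed update rule).
    onehot = np.eye(N_ACTIONS)[actions]
    grad = states.T @ ((onehot - probs) * rewards[:, None]) / B
    params += LR * grad
    states = next_states

print("mean reward on last batch:", rewards.mean())
```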

    SYSTEM AND METHODS FOR PIXEL BASED MODEL PREDICTIVE CONTROL

    Publication (Announcement) Number: US20240173854A1

    Publication (Announcement) Date: 2024-05-30

    Application Number: US18436684

    Application Date: 2024-02-08

    Applicant: GOOGLE LLC

    Inventor: Danijar Hafner

    CPC classification number: B25J9/161 B25J9/163 B25J9/1661 G06N7/01

    Abstract: Techniques are disclosed that enable model predictive control of a robot based on a latent dynamics model and a reward function. In many implementations, the latent space can be divided into a deterministic portion and stochastic portion, allowing the model to be utilized in generating more likely robot trajectories. Additional or alternative implementations include many reward functions, where each reward function corresponds to a different robot task.
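
    A minimal sketch of planning with a latent dynamics model whose state is split into a deterministic portion and a stochastic portion, as this abstract describes. The linear dynamics, the Gaussian stochastic update, and the random-shooting planner are illustrative assumptions; swapping a different task-specific reward vector in for w_reward corresponds to targeting a different robot task.

```python
import numpy as np

rng = np.random.default_rng(0)
DET, STO, ACT, HORIZON, CANDIDATES = 4, 2, 1, 10, 64

W_det = rng.normal(scale=0.3, size=(DET, DET + STO + ACT))  # deterministic path
W_sto = rng.normal(scale=0.3, size=(2 * STO, DET))          # predicts mean, log-std
w_reward = rng.normal(size=DET + STO)                       # latent reward model

def step_latent(h, z, a):
    """Advance the split latent state one step."""
    h_next = np.tanh(W_det @ np.concatenate([h, z, a]))     # deterministic portion
    stats = W_sto @ h_next
    mean, log_std = stats[:STO], stats[STO:]
    z_next = mean + np.exp(log_std) * rng.normal(size=STO)  # stochastic portion
    return h_next, z_next

def plan(h, z):
    """Random-shooting MPC: imagine candidate action sequences in latent space."""
    best_return, best_first_action = -np.inf, None
    for _ in range(CANDIDATES):
        actions = rng.uniform(-1, 1, size=(HORIZON, ACT))
        hh, zz, total = h, z, 0.0
        for a in actions:
            hh, zz = step_latent(hh, zz, a)
            total += float(w_reward @ np.concatenate([hh, zz]))
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action

h0, z0 = np.zeros(DET), np.zeros(STO)   # e.g. inferred from the current image
print("chosen action:", plan(h0, z0))
```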

    SYSTEM AND METHODS FOR PIXEL BASED MODEL PREDICTIVE CONTROL

    Publication (Announcement) Number: US20210205984A1

    Publication (Announcement) Date: 2021-07-08

    Application Number: US17056104

    Application Date: 2019-05-17

    Applicant: Google LLC

    Inventor: Danijar Hafner

    Abstract: Techniques are disclosed that enable model predictive control of a robot based on a latent dynamics model and a reward function. In many implementations, the latent space can be divided into a deterministic portion and stochastic portion, allowing the model to be utilized in generating more likely robot trajectories. Additional or alternative implementations include many reward functions, where each reward function corresponds to a different robot task.

    System and methods for pixel based model predictive control

    Publication (Announcement) Number: US11904467B2

    Publication (Announcement) Date: 2024-02-20

    Application Number: US17056104

    Application Date: 2019-05-17

    Applicant: Google LLC

    Inventor: Danijar Hafner

    CPC classification number: B25J9/161 B25J9/163 B25J9/1661 G06N7/01

    Abstract: Techniques are disclosed that enable model predictive control of a robot based on a latent dynamics model and a reward function. In many implementations, the latent space can be divided into a deterministic portion and stochastic portion, allowing the model to be utilized in generating more likely robot trajectories. Additional or alternative implementations include many reward functions, where each reward function corresponds to a different robot task.

    TRAINING REINFORCEMENT LEARNING AGENTS TO LEARN FARSIGHTED BEHAVIORS BY PREDICTING IN LATENT SPACE

    Publication (Announcement) Number: US20210158162A1

    Publication (Announcement) Date: 2021-05-27

    Application Number: US17103827

    Application Date: 2020-11-24

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training an action selection policy neural network used to select an action to be performed by an agent interacting with an environment. In one aspect, a method includes: receiving a latent representation characterizing a current state of the environment; generating a trajectory of latent representations that starts with the received latent representation; for each latent representation in the trajectory: determining a predicted reward; and processing the state latent representation using a value neural network to generate a predicted state value; determining a corresponding target state value for each latent representation in the trajectory; determining, based on the target state values, an update to the current values of the policy neural network parameters; and determining an update to the current values of the value neural network parameters.
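
    A minimal sketch of the latent-trajectory value computation this abstract outlines: generate a trajectory of latent representations, predict a reward and a state value at each step, and form target state values by a backward bootstrapping recursion. A TD(lambda)-style return is an assumed choice here, and the linear models are toy stand-ins for the networks.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT, H, GAMMA, LAMBDA = 4, 5, 0.99, 0.95

W_dyn = rng.normal(scale=0.4, size=(LATENT, LATENT))  # latent dynamics
w_reward = rng.normal(size=LATENT)                    # predicted reward
w_value = rng.normal(size=LATENT)                     # value network

def imagine(z0):
    """Trajectory of latent representations starting at z0."""
    zs = [z0]
    for _ in range(H):
        zs.append(np.tanh(W_dyn @ zs[-1]))
    return zs

def lambda_targets(zs):
    """Target state value for each latent representation in the trajectory."""
    rewards = [float(w_reward @ z) for z in zs[1:]]
    values = [float(w_value @ z) for z in zs]
    target = values[-1]
    targets = []
    for t in reversed(range(H)):   # backward recursion over the horizon
        target = rewards[t] + GAMMA * ((1 - LAMBDA) * values[t + 1] + LAMBDA * target)
        targets.append(target)
    return targets[::-1]

zs = imagine(rng.normal(size=LATENT))
print("target state values:", np.round(lambda_targets(zs), 3))
```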
