Patent search ap:("DEEPMIND TECHNOLOGIES LIMITED") AND inv:"Marc Gendron-Bellemare" Page 2

11.

Granted patent
Training action selection neural networks using leave-one-out-updates (status: in force)

Publication No.: US11604997B2

Publication date: 2023-03-14

Application No.: US16603307

Filing date: 2018-06-11

Applicant: DeepMind Technologies Limited

Inventors: Marc Gendron-Bellemare, Mohammad Gheshlaghi Azar, Audrunas Gruslys, Remi Munos

IPC: G06N3/08, G06N3/04, G06N3/084

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a policy neural network. The policy neural network is used to select actions to be performed by an agent that interacts with an environment by receiving an observation characterizing a state of the environment and performing an action from a set of actions in response to the received observation. A trajectory is obtained from a replay memory, and a final update to current values of the policy network parameters is determined for each training observation in the trajectory. The final updates to the current values of the policy network parameters are determined from selected action updates and leave-one-out updates.

12.

Granted patent
Evaluating reinforcement learning policies (status: in force)

Publication No.: US11429898B1

Publication date: 2022-08-30

Application No.: US16601547

Filing date: 2019-10-14

Applicant: DeepMind Technologies Limited

Inventors: Joel William Veness, Marc Gendron-Bellemare

IPC: G06N20/00, G06N5/02

Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for evaluating reinforcement learning policies. One of the methods includes receiving a plurality of training histories for a reinforcement learning agent; determining a total reward for each training observation in the training histories; partitioning the training observations into a plurality of partitions; determining, for each partition and from the partitioned training observations, a probability that the reinforcement learning agent will receive the total reward for the partition if the reinforcement learning agent performs the action for the partition in response to receiving the current observation; determining, from the probabilities and for each total reward, a respective estimated value of performing each action in response to receiving the current observation; and selecting an action from the pre-determined set of actions from the estimated values in accordance with an action selection policy.

13.

Patent application
TRAINING ACTION SELECTION NEURAL NETWORKS (status: in force)

Publication No.: US20210110271A1

Publication date: 2021-04-15

Application No.: US16603307

Filing date: 2018-06-11

Applicant: DeepMind Technologies Limited

Inventors: Marc Gendron-Bellemare, Mohammad Gheshlaghi Azar, Audrunas Gruslys, Remi Munos

IPC: G06N3/08, G06N3/04

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a policy neural network. The policy neural network is used to select actions to be performed by an agent that interacts with an environment by receiving an observation characterizing a state of the environment and performing an action from a set of actions in response to the received observation. A trajectory is obtained from a replay memory, and a final update to current values of the policy network parameters is determined for each training observation in the trajectory. The final updates to the current values of the policy network parameters are determined from selected action updates and leave-one-out updates.

14.

Patent application
DISTRIBUTIONAL REINFORCEMENT LEARNING (status: under examination, published)

Publication No.: US20190332923A1

Publication date: 2019-10-31

Application No.: US16508046

Filing date: 2019-07-10

Applicant: DeepMind Technologies Limited

Inventors: Marc Gendron-Bellemare, William Clinton Dabney

IPC: G06N3/04, G06N3/08, G06F17/18

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for selecting an action to be performed by a reinforcement learning agent interacting with an environment. A current observation characterizing a current state of the environment is received. For each action in a set of multiple actions that can be performed by the agent to interact with the environment, a probability distribution is determined over possible Q returns for the action-current observation pair. For each action, a measure of central tendency of the possible Q returns with respect to the probability distributions for the action-current observation pair is determined. An action to be performed by the agent in response to the current observation is selected using the measures of central tendency.
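Illustrative sketch (not part of the patent record): the selection step described in this abstract can be pictured with a small Python example. It assumes, purely for illustration, that each action's distribution is given as probabilities over a shared, fixed list of possible returns, and that the measure of central tendency is the mean. The abstract specifies neither choice, and in practice the distributions would presumably come from a trained network. The names `expected_return` and `select_action` are hypothetical.

```python
from typing import Dict, Sequence


def expected_return(returns: Sequence[float], probs: Sequence[float]) -> float:
    """Mean of a discrete distribution over possible returns (assumed measure)."""
    if len(returns) != len(probs):
        raise ValueError("returns and probs must have the same length")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Normalise defensively in case the probabilities do not sum exactly to 1.
    return sum(z * p for z, p in zip(returns, probs)) / total


def select_action(returns: Sequence[float],
                  distributions: Dict[str, Sequence[float]]) -> str:
    """Pick the action whose return distribution has the highest mean."""
    means = {action: expected_return(returns, probs)
             for action, probs in distributions.items()}
    return max(means, key=means.get)


# Hypothetical data: three possible returns and two actions.
possible_returns = [-1.0, 0.0, 1.0]
per_action_probs = {
    "left": [0.5, 0.3, 0.2],   # mean = -0.3
    "right": [0.1, 0.3, 0.6],  # mean = 0.5
}
chosen = select_action(possible_returns, per_action_probs)
```

Here `chosen` is `"right"`, since its distribution has the higher mean. A different measure of central tendency, such as the median, would only require swapping out `expected_return`.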
