-
Publication Number: US20190244099A1
Publication Date: 2019-08-08
Application Number: US16268414
Filing Date: 2019-02-05
Applicant: DeepMind Technologies Limited
Inventor: Tom Schaul, Matteo Hessel, Hado Philip van Hasselt, Daniel J. Mankowitz
CPC classification number: G06N3/08, G05D1/0088, G06N3/04
Abstract: A method of training an action selection neural network for controlling an agent interacting with an environment to perform different tasks is described. The method includes obtaining a first trajectory of transitions generated while the agent was performing an episode of a first task of multiple tasks; and training the action selection neural network on the first trajectory to adjust the control policies for the multiple tasks. The training includes, for each transition in the first trajectory: generating respective policy outputs for the initial observation in the transition for each task in a subset of tasks that includes the first task and one other task; generating respective target policy outputs for each task using the reward in the transition; and determining an update to the current parameter values based on, for each task, a gradient of a loss between the policy output and the target policy output for the task.
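To make the per-transition training step concrete, the sketch below (Python with JAX) mirrors the structure the abstract describes: a policy output for each task in the subset, a reward-dependent target per task, and a parameter update taken from the gradient of the summed loss. The linear task-conditioned policy head, the placeholder target construction, and all names and shapes are illustrative assumptions, not the patent's actual method.

```python
import jax
import jax.numpy as jnp

def policy_output(params, observation, task_id):
    # Toy task-conditioned linear policy head producing one logit per action.
    weights, task_embeddings = params
    features = jnp.concatenate([observation, task_embeddings[task_id]])
    return features @ weights

def target_policy_output(policy_out, reward):
    # Placeholder target: the current output nudged by the observed reward.
    # The patent derives a reward-dependent target; its exact form is not reproduced here.
    return jax.lax.stop_gradient(policy_out + reward)

def transition_loss(params, observation, reward, task_subset):
    # Sum, over the subset of tasks, of a squared loss between each task's
    # policy output and its target for this transition.
    total = 0.0
    for task_id in task_subset:
        out = policy_output(params, observation, task_id)
        target = target_policy_output(out, reward)
        total = total + jnp.sum((out - target) ** 2)
    return total

# The per-transition parameter update follows the gradient of this loss.
grad_fn = jax.grad(transition_loss)

# Toy usage: 4-dim observations, 3 actions, 2 tasks with 2-dim task embeddings.
params = (jnp.zeros((6, 3)), jnp.zeros((2, 2)))
grads = grad_fn(params, jnp.ones(4), 1.0, (0, 1))
```

Because the same transition yields a gradient for every task in the subset, experience collected while performing one task also updates the control policies for the other tasks, which is how a single trajectory adjusts the policies for multiple tasks.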
-
Publication Number: US12154029B2
Publication Date: 2024-11-26
Application Number: US16268414
Filing Date: 2019-02-05
Applicant: DeepMind Technologies Limited
Inventor: Tom Schaul, Matteo Hessel, Hado Philip van Hasselt, Daniel J. Mankowitz
Abstract: A method of training an action selection neural network for controlling an agent interacting with an environment to perform different tasks is described. The method includes obtaining a first trajectory of transitions generated while the agent was performing an episode of a first task of multiple tasks; and training the action selection neural network on the first trajectory to adjust the control policies for the multiple tasks. The training includes, for each transition in the first trajectory: generating respective policy outputs for the initial observation in the transition for each task in a subset of tasks that includes the first task and one other task; generating respective target policy outputs for each task using the reward in the transition; and determining an update to the current parameter values based on, for each task, a gradient of a loss between the policy output and the target policy output for the task.
-
Publication Number: US20240267532A1
Publication Date: 2024-08-08
Application Number: US18565008
Filing Date: 2022-05-30
Applicant: DeepMind Technologies Limited
Inventor: Anton Zhernov, Chenjie Gu, Daniel J. Mankowitz, Julian Schrittwieser, Amol Balkishan Mandhane, Mary Elizabeth Rauh, Miaosen Wang, Thomas Keisuke Hubert
IPC: H04N19/149, H04N19/172
CPC classification number: H04N19/149, H04N19/172
Abstract: Systems and methods for training rate control neural networks through reinforcement learning. During training, the reward value for each training example is generated from both the current performance and the historical performance of the rate control neural network in encoding the video in that training example.
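As an illustration of the reward scheme the abstract describes, the sketch below compares the rate control network's current encoding performance on a video with its historical performance on the same video. The scoring function, the bitrate-overshoot penalty, and the use of a per-video running best as the "historical performance" are assumptions made only for this sketch.

```python
from collections import defaultdict

# Hypothetical per-video history of the best encoding score seen so far.
historical_best = defaultdict(lambda: float("-inf"))

def encoding_score(quality, bitrate, target_bitrate):
    # Toy performance measure: quality, penalized when the bitrate budget is exceeded.
    overshoot = max(0.0, bitrate - target_bitrate)
    return quality - overshoot

def reward_for_example(video_id, quality, bitrate, target_bitrate):
    # Reward the policy for beating its own history on this video.
    current = encoding_score(quality, bitrate, target_bitrate)
    best_so_far = historical_best[video_id]
    historical_best[video_id] = max(best_so_far, current)
    if best_so_far == float("-inf"):
        return 0.0  # no history for this video yet
    return current - best_so_far

# First pass over a video only establishes history; later passes are rewarded
# for improving on it.
print(reward_for_example("video_0", quality=40.0, bitrate=1.1, target_bitrate=1.0))  # 0.0
print(reward_for_example("video_0", quality=42.0, bitrate=0.9, target_bitrate=1.0))  # ~2.1
```

One plausible reason to compare against the network's own history is that it normalizes rewards across videos of very different difficulty; the abstract itself only states that both current and historical performance enter the reward.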
-
Publication Number: US20220343157A1
Publication Date: 2022-10-27
Application Number: US17620164
Filing Date: 2020-06-17
Applicant: DEEPMIND TECHNOLOGIES LIMITED
Inventor: Daniel J. Mankowitz, Nir Levine, Rae Chan Jeong, Abbas Abdolmaleki, Jost Tobias Springenberg, Todd Andrew Hester, Timothy Arthur Mann, Martin Riedmiller
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a policy neural network having policy parameters. One of the methods includes sampling a mini-batch comprising one or more observation-action-reward tuples generated as a result of interactions of a first agent with a first environment; determining an update to current values of the Q-value network parameters by minimizing a robust entropy-regularized temporal difference (TD) error that accounts for possible perturbations of the states of the first environment represented by the observations in the observation-action-reward tuples; and determining, using the Q-value neural network, an update to the policy network parameters using the sampled mini-batch of observation-action-reward tuples.
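A minimal sketch of what a robust, entropy-regularized TD error can look like, under assumptions the abstract does not spell out: perturbations of the next state are approximated by a fixed set of sampled offsets, the entropy regularization enters through a soft (logsumexp) state value, and the robustness comes from taking the worst case over the perturbed values. The Q-network, shapes, and hyperparameters are purely illustrative.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def q_values(q_params, state):
    # Toy linear Q-network producing one value per action.
    return state @ q_params

def soft_value(q_params, state, temperature):
    # Entropy-regularized state value: temperature * logsumexp(Q / temperature).
    return temperature * logsumexp(q_values(q_params, state) / temperature)

def robust_td_error(q_params, state, action, reward, next_state,
                    perturbations, discount=0.99, temperature=0.1):
    # Evaluate the soft value at perturbed copies of the next state and take the
    # worst case, so the TD target accounts for possible state perturbations.
    perturbed = next_state[None, :] + perturbations
    values = jax.vmap(lambda s: soft_value(q_params, s, temperature))(perturbed)
    target = jax.lax.stop_gradient(reward + discount * jnp.min(values))
    return (q_values(q_params, state)[action] - target) ** 2

# The Q-network update follows the gradient of this error, averaged over a mini-batch.
grad_fn = jax.grad(robust_td_error)

# Toy usage: 4-dim states, 3 actions, 8 sampled next-state perturbations.
q_params = jnp.zeros((4, 3))
perturbations = 0.01 * jax.random.normal(jax.random.PRNGKey(0), (8, 4))
grads = grad_fn(q_params, jnp.ones(4), 1, 0.5, jnp.ones(4), perturbations)
```

Taking the minimum over the perturbed next-state values is what makes the target pessimistic with respect to the modelled perturbations; the policy update described in the last step of the abstract would then reuse this trained Q-value network.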