-
Publication Number: US20190244099A1
Publication Date: 2019-08-08
Application Number: US16268414
Filing Date: 2019-02-05
Applicant: DeepMind Technologies Limited
Inventor: Tom Schaul, Matteo Hessel, Hado Philip van Hasselt, Daniel J. Mankowitz
CPC classification number: G06N3/08, G05D1/0088, G06N3/04
Abstract: A method of training an action selection neural network for controlling an agent interacting with an environment to perform different tasks is described. The method includes obtaining a first trajectory of transitions generated while the agent was performing an episode of a first task of multiple tasks; and training the action selection neural network on the first trajectory to adjust the control policies for the multiple tasks. The training includes, for each transition in the first trajectory: generating respective policy outputs for the initial observation in the transition for each task in a subset of tasks that includes the first task and one other task; generating respective target policy outputs for each task using the reward in the transition; and determining an update to the current parameter values based on, for each task, a gradient of a loss between the policy output and the target policy output for the task.
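To make the per-transition training step concrete, the sketch below (Python with JAX) mirrors the structure the abstract describes: a policy output for each task in the subset, a reward-dependent target per task, and a parameter update taken from the gradient of the summed loss. The linear task-conditioned policy head, the placeholder target construction, and all names and shapes are illustrative assumptions, not the patent's actual method.

```python
import jax
import jax.numpy as jnp

def policy_output(params, observation, task_id):
    # Toy task-conditioned linear policy head producing one logit per action.
    weights, task_embeddings = params
    features = jnp.concatenate([observation, task_embeddings[task_id]])
    return features @ weights

def target_policy_output(policy_out, reward):
    # Placeholder target: the current output nudged by the observed reward.
    # The patent derives a reward-dependent target; its exact form is not reproduced here.
    return jax.lax.stop_gradient(policy_out + reward)

def transition_loss(params, observation, reward, task_subset):
    # Sum, over the subset of tasks, of a squared loss between each task's
    # policy output and its target for this transition.
    total = 0.0
    for task_id in task_subset:
        out = policy_output(params, observation, task_id)
        target = target_policy_output(out, reward)
        total = total + jnp.sum((out - target) ** 2)
    return total

# The per-transition parameter update follows the gradient of this loss.
grad_fn = jax.grad(transition_loss)

# Toy usage: 4-dim observations, 3 actions, 2 tasks with 2-dim task embeddings.
params = (jnp.zeros((6, 3)), jnp.zeros((2, 2)))
grads = grad_fn(params, jnp.ones(4), 1.0, (0, 1))
```

Because the same transition yields a gradient for every task in the subset, experience collected while performing one task also updates the control policies for the other tasks, which is how a single trajectory adjusts the policies for multiple tasks.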
-
Publication Number: US12154029B2
Publication Date: 2024-11-26
Application Number: US16268414
Filing Date: 2019-02-05
Applicant: DeepMind Technologies Limited
Inventor: Tom Schaul, Matteo Hessel, Hado Philip van Hasselt, Daniel J. Mankowitz
Abstract: A method of training an action selection neural network for controlling an agent interacting with an environment to perform different tasks is described. The method includes obtaining a first trajectory of transitions generated while the agent was performing an episode of a first task of multiple tasks; and training the action selection neural network on the first trajectory to adjust the control policies for the multiple tasks. The training includes, for each transition in the first trajectory: generating respective policy outputs for the initial observation in the transition for each task in a subset of tasks that includes the first task and one other task; generating respective target policy outputs for each task using the reward in the transition; and determining an update to the current parameter values based on, for each task, a gradient of a loss between the policy output and the target policy output for the task.
-
Publication Number: US20240267532A1
Publication Date: 2024-08-08
Application Number: US18565008
Filing Date: 2022-05-30
Applicant: DeepMind Technologies Limited
Inventor: Anton Zhernov, Chenjie Gu, Daniel J. Mankowitz, Julian Schrittwieser, Amol Balkishan Mandhane, Mary Elizabeth Rauh, Miaosen Wang, Thomas Keisuke Hubert
IPC: H04N19/149, H04N19/172
CPC classification number: H04N19/149, H04N19/172
Abstract: Systems and methods for training rate control neural networks through reinforcement learning. During training, the reward value for each training example is generated from both the current performance and the historical performance of the rate control neural network in encoding the video in that training example.
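As an illustration of the reward scheme the abstract describes, the sketch below compares the rate control network's current encoding performance on a video with its historical performance on the same video. The scoring function, the bitrate-overshoot penalty, and the use of a per-video running best as the "historical performance" are assumptions made only for this sketch.

```python
from collections import defaultdict

# Hypothetical per-video history of the best encoding score seen so far.
historical_best = defaultdict(lambda: float("-inf"))

def encoding_score(quality, bitrate, target_bitrate):
    # Toy performance measure: quality, penalized when the bitrate budget is exceeded.
    overshoot = max(0.0, bitrate - target_bitrate)
    return quality - overshoot

def reward_for_example(video_id, quality, bitrate, target_bitrate):
    # Reward the policy for beating its own history on this video.
    current = encoding_score(quality, bitrate, target_bitrate)
    best_so_far = historical_best[video_id]
    historical_best[video_id] = max(best_so_far, current)
    if best_so_far == float("-inf"):
        return 0.0  # no history for this video yet
    return current - best_so_far

# First pass over a video only establishes history; later passes are rewarded
# for improving on it.
print(reward_for_example("video_0", quality=40.0, bitrate=1.1, target_bitrate=1.0))  # 0.0
print(reward_for_example("video_0", quality=42.0, bitrate=0.9, target_bitrate=1.0))  # ~2.1
```

One plausible reason to compare against the network's own history is that it normalizes rewards across videos of very different difficulty; the abstract itself only states that both current and historical performance enter the reward.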
-
Publication Number: US20220343157A1
Publication Date: 2022-10-27
Application Number: US17620164
Filing Date: 2020-06-17
Applicant: DEEPMIND TECHNOLOGIES LIMITED
Inventor: Daniel J. Mankowitz, Nir Levine, Rae Chan Jeong, Abbas Abdolmaleki, Jost Tobias Springenberg, Todd Andrew Hester, Timothy Arthur Mann, Martin Riedmiller
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a policy neural network having policy parameters. One of the methods includes sampling a mini-batch comprising one or more observation-action-reward tuples generated as a result of interactions of a first agent with a first environment; determining an update to current values of the Q-value network parameters by minimizing a robust entropy-regularized temporal difference (TD) error that accounts for possible perturbations of the states of the first environment represented by the observations in the observation-action-reward tuples; and determining, using the Q-value neural network, an update to the policy network parameters using the sampled mini-batch of observation-action-reward tuples.
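A minimal sketch of what a robust, entropy-regularized TD error can look like, under assumptions the abstract does not spell out: perturbations of the next state are approximated by a fixed set of sampled offsets, the entropy regularization enters through a soft (logsumexp) state value, and the robustness comes from taking the worst case over the perturbed values. The Q-network, shapes, and hyperparameters are purely illustrative.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def q_values(q_params, state):
    # Toy linear Q-network producing one value per action.
    return state @ q_params

def soft_value(q_params, state, temperature):
    # Entropy-regularized state value: temperature * logsumexp(Q / temperature).
    return temperature * logsumexp(q_values(q_params, state) / temperature)

def robust_td_error(q_params, state, action, reward, next_state,
                    perturbations, discount=0.99, temperature=0.1):
    # Evaluate the soft value at perturbed copies of the next state and take the
    # worst case, so the TD target accounts for possible state perturbations.
    perturbed = next_state[None, :] + perturbations
    values = jax.vmap(lambda s: soft_value(q_params, s, temperature))(perturbed)
    target = jax.lax.stop_gradient(reward + discount * jnp.min(values))
    return (q_values(q_params, state)[action] - target) ** 2

# The Q-network update follows the gradient of this error, averaged over a mini-batch.
grad_fn = jax.grad(robust_td_error)

# Toy usage: 4-dim states, 3 actions, 8 sampled next-state perturbations.
q_params = jnp.zeros((4, 3))
perturbations = 0.01 * jax.random.normal(jax.random.PRNGKey(0), (8, 4))
grads = grad_fn(q_params, jnp.ones(4), 1, 0.5, jnp.ones(4), perturbations)
```

Taking the minimum over the perturbed next-state values is what makes the target pessimistic with respect to the modelled perturbations; the policy update described in the last step of the abstract would then reuse this trained Q-value network.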