-
Publication number: US20240185084A1
Publication date: 2024-06-06
Application number: US18286504
Application date: 2022-05-27
Applicant: DeepMind Technologies Limited
Inventor: Abbas Abdolmaleki, Sandy Han Huang, Martin Riedmiller
IPC: G06N3/092
CPC classification number: G06N3/092
Abstract: Computer-implemented systems and methods for training an action selection policy neural network to select actions to be performed by an agent to control the agent to perform a task. The techniques are able to optimize multiple objectives, one of which may be to stay close to a behavioral policy of a teacher. The behavioral policy of the teacher may be defined by a predetermined dataset of behaviors, and the systems and methods may then learn offline. The described techniques provide a mechanism for explicitly defining a trade-off between the multiple objectives.
-
Publication number: US20220237488A1
Publication date: 2022-07-28
Application number: US17613687
Application date: 2020-05-22
Applicant: DeepMind Technologies Limited
Inventor: Markus Wulfmeier, Abbas Abdolmaleki, Roland Hafner, Jost Tobias Springenberg, Nicolas Manfred Otto Heess, Martin Riedmiller
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for controlling an agent. One of the methods includes obtaining an observation characterizing a current state of the environment and data identifying a task currently being performed by the agent; processing the observation and the data identifying the task using a high-level controller to generate a high-level probability distribution that assigns a respective probability to each of a plurality of low-level controllers; processing the observation using each of the plurality of low-level controllers to generate, for each of the plurality of low-level controllers, a respective low-level probability distribution; generating a combined probability distribution; and selecting, using the combined probability distribution, an action from the space of possible actions to be performed by the agent in response to the observation.
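Note: the combination step in the abstract above can be illustrated with a minimal sketch. It assumes a discrete action space and assumes the combined distribution is the mixture obtained by weighting each low-level distribution by its high-level probability; the abstract does not state either. The names `combine` and `select_action` and the example numbers are hypothetical.

```python
import random


def combine(high_level_probs, low_level_dists):
    """Assumed mixture rule: p(a) = sum_k p_high(k) * p_low_k(a)."""
    num_actions = len(low_level_dists[0])
    combined = [0.0] * num_actions
    for weight, dist in zip(high_level_probs, low_level_dists):
        for action, prob in enumerate(dist):
            combined[action] += weight * prob
    return combined


def select_action(combined, rng=random):
    """Sample one action index from the combined distribution."""
    return rng.choices(range(len(combined)), weights=combined, k=1)[0]


# Hypothetical example: two low-level controllers over three actions.
high = [0.5, 0.5]                      # high-level probabilities per controller
low = [[1.0, 0.0, 0.0],                # controller 0 always picks action 0
       [0.0, 0.5, 0.5]]                # controller 1 splits between actions 1 and 2
combined = combine(high, low)          # [0.5, 0.25, 0.25]
action = select_action(combined, random.Random(0))
```

In a real system the high-level and low-level distributions would come from neural networks conditioned on the observation and task; here they are fixed lists purely for illustration.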
-
Publication number: US20240220795A1
Publication date: 2024-07-04
Application number: US18401226
Application date: 2023-12-29
Applicant: DeepMind Technologies Limited
Inventor: Jingwei Zhang, Arunkumar Byravan, Jost Tobias Springenberg, Martin Riedmiller, Nicolas Manfred Otto Heess, Leonard Hasenclever, Abbas Abdolmaleki, Dushyant Rao
IPC: G06N3/08
CPC classification number: G06N3/08
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for controlling agents using jumpy trajectory decoder neural networks.
-
Publication number: US20230368037A1
Publication date: 2023-11-16
Application number: US18029992
Application date: 2021-10-01
Applicant: DeepMind Technologies Limited
Inventor: Sandy Han Huang, Abbas Abdolmaleki
IPC: G06N3/092
CPC classification number: G06N3/092
Abstract: A system and method that controls an agent to perform a task subject to one or more constraints. The system trains a preference neural network that learns which preferences produce constraint-satisfying action selection policies. Thus the system optimizes a hierarchical policy that is a product of a preference policy and a preference-conditioned action selection policy. Thus the system learns to jointly optimize a set of objectives relating to rewards and costs received during the task whilst also learning preferences, i.e. trade-offs between the rewards and costs, that are most likely to produce policies that satisfy the constraints.
-
Publication number: US20230082326A1
Publication date: 2023-03-16
Application number: US17797203
Application date: 2021-02-08
Applicant: DeepMind Technologies Limited
Inventor: Abbas Abdolmaleki, Sandy Han Huang
IPC: G06N3/08
Abstract: There is provided a method for training a neural network system by reinforcement learning, the neural network system being configured to receive an input observation characterizing a state of an environment interacted with by an agent and to select and output an action in accordance with a policy that aims to satisfy a plurality of objectives. The method comprises obtaining a set of one or more trajectories. Each trajectory comprises a state of an environment, an action applied by the agent to the environment according to a previous policy in response to the state, and a set of rewards for the action, each reward relating to a corresponding objective of the plurality of objectives. The method further comprises determining an action-value function for each of the plurality of objectives based on the set of one or more trajectories. Each action-value function determines an action value representing an estimated return according to the corresponding objective that would result from the agent performing a given action in response to a given state according to the previous policy. The method further comprises determining an updated policy based on a combination of the action-value functions for the plurality of objectives.
-
Publication number: US20220343157A1
Publication date: 2022-10-27
Application number: US17620164
Application date: 2020-06-17
Applicant: DEEPMIND TECHNOLOGIES LIMITED
Inventor: Daniel J. Mankowitz, Nir Levine, Rae Chan Jeong, Abbas Abdolmaleki, Jost Tobias Springenberg, Todd Andrew Hester, Timothy Arthur Mann, Martin Riedmiller
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a policy neural network having policy parameters. One of the methods includes sampling a mini-batch comprising one or more observation-action-reward tuples generated as a result of interactions of a first agent with a first environment; determining an update to current values of the Q network parameters by minimizing a robust entropy-regularized temporal difference (TD) error that accounts for possible perturbations of the states of the first environment represented by the observations in the observation-action-reward tuples; and determining, using the Q-value neural network, an update to the policy network parameters using the sampled mini-batch of observation-action-reward tuples.
-
Publication number: US10786900B1
Publication date: 2020-09-29
Application number: US16586846
Application date: 2019-09-27
Applicant: DeepMind Technologies Limited
Inventor: Steven Bohez, Abbas Abdolmaleki
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for determining a control policy for a vehicle or other robot through the performance of a reinforcement learning simulation of the robot.