-
Publication number: US20230082326A1
Publication date: 2023-03-16
Application number: US17797203
Application date: 2021-02-08
Applicant: DeepMind Technologies Limited
Inventor: Abbas Abdolmaleki, Sandy Han Huang
IPC: G06N3/08
Abstract: There is provided a method for training a neural network system by reinforcement learning, the neural network system being configured to receive an input observation characterizing a state of an environment interacted with by an agent and to select and output an action in accordance with a policy that aims to satisfy a plurality of objectives. The method comprises obtaining a set of one or more trajectories. Each trajectory comprises a state of an environment, an action applied by the agent to the environment according to a previous policy in response to the state, and a set of rewards for the action, each reward relating to a corresponding objective of the plurality of objectives. The method further comprises determining an action-value function for each of the plurality of objectives based on the set of one or more trajectories. Each action-value function determines an action value representing an estimated return according to the corresponding objective that would result from the agent performing a given action in response to a given state according to the previous policy. The method further comprises determining an updated policy based on a combination of the action-value functions for the plurality of objectives.
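The abstract does not say how the per-objective action-value functions are combined. As a minimal sketch for a single state with discrete actions, one possible combination reweights the previous policy by the exponentiated sum of temperature-scaled action values. The function name `combine_action_values`, the per-objective `temperatures`, and the softmax form are assumptions made for illustration, not details taken from the patent.

```python
import math


def combine_action_values(prior, q_values, temperatures):
    """Return an updated action distribution for one state.

    prior:        pi_old(a|s), one strictly positive probability per action
    q_values:     q_values[k][a] = Q_k(s, a) for objective k
    temperatures: one positive float per objective (hypothetical knob);
                  a smaller temperature gives that objective more weight
    """
    num_actions = len(prior)
    # Log of the previous policy plus the temperature-scaled action values,
    # summed over all objectives.
    logits = [
        math.log(prior[a]) + sum(q[a] / t for q, t in zip(q_values, temperatures))
        for a in range(num_actions)
    ]
    # Softmax with max subtraction for numerical stability.
    top = max(logits)
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


prior = [0.5, 0.5]
q_values = [[1.0, 0.0],   # objective 1 prefers action 0
            [0.0, 1.0]]   # objective 2 prefers action 1
balanced = combine_action_values(prior, q_values, [1.0, 1.0])
skewed = combine_action_values(prior, q_values, [0.5, 1.0])
```

In this toy setup, equal temperatures leave the two actions equally likely, while lowering the first temperature shifts probability toward the action that objective 1 prefers.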
-
Publication number: US20230368037A1
Publication date: 2023-11-16
Application number: US18029992
Application date: 2021-10-01
Applicant: DeepMind Technologies Limited
Inventor: Sandy Han Huang, Abbas Abdolmaleki
IPC: G06N3/092
CPC classification number: G06N3/092
Abstract: A system and method that control an agent to perform a task subject to one or more constraints. The system trains a preference neural network that learns which preferences produce constraint-satisfying action selection policies. Thus the system optimizes a hierarchical policy that is a product of a preference policy and a preference-conditioned action selection policy. In this way the system learns to jointly optimize a set of objectives relating to rewards and costs received during the task whilst also learning preferences, i.e. trade-offs between the rewards and costs, that are most likely to produce policies that satisfy the constraints.
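The "product of a preference policy and a preference-conditioned action selection policy" can be illustrated with a small tabular sketch: first sample a preference, then sample an action given that preference, so the joint probability is the product of the two factors. The preference names, action names, probabilities, and the function `sample_hierarchical` are hypothetical; the patent describes neural networks, not lookup tables.

```python
import random

# Hypothetical preference policy: probability of each trade-off setting.
preference_policy = {"favor_reward": 0.3, "favor_low_cost": 0.7}

# Hypothetical preference-conditioned action policy for one state.
action_policy = {
    "favor_reward": {"fast": 0.9, "slow": 0.1},
    "favor_low_cost": {"fast": 0.2, "slow": 0.8},
}


def sample_hierarchical(preference_policy, action_policy, rng):
    """Sample (preference, action) and return them with their joint probability."""
    prefs = list(preference_policy)
    pref = rng.choices(prefs, weights=[preference_policy[p] for p in prefs])[0]
    conditioned = action_policy[pref]
    actions = list(conditioned)
    action = rng.choices(actions, weights=[conditioned[a] for a in actions])[0]
    # Hierarchical policy: p(preference) * p(action | preference).
    joint = preference_policy[pref] * conditioned[action]
    return pref, action, joint


pref, action, joint = sample_hierarchical(
    preference_policy, action_policy, random.Random(0)
)
```

Because each factor is a normalized distribution, the joint probabilities over all (preference, action) pairs sum to one.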
-
Publication number: US20240185084A1
Publication date: 2024-06-06
Application number: US18286504
Application date: 2022-05-27
Applicant: DeepMind Technologies Limited
Inventor: Abbas Abdolmaleki, Sandy Han Huang, Martin Riedmiller
IPC: G06N3/092
CPC classification number: G06N3/092
Abstract: Computer-implemented systems and methods for training an action selection policy neural network to select actions to be performed by an agent to control the agent to perform a task. The techniques are able to optimize multiple objectives, one of which may be to stay close to a behavioral policy of a teacher. The behavioral policy of the teacher may be defined by a predetermined dataset of behaviors, and the systems and methods may then learn offline. The described techniques provide a mechanism for explicitly defining a trade-off between the multiple objectives.
-