MULTI-OBJECTIVE REINFORCEMENT LEARNING USING WEIGHTED POLICY PROJECTION

    Publication number: US20240185084A1

    Publication date: 2024-06-06

    Application number: US18286504

    Application date: 2022-05-27

    CPC classification number: G06N3/092

    Abstract: Computer-implemented systems and methods for training an action selection policy neural network to select actions to be performed by an agent to control the agent to perform a task. The techniques are able to optimize multiple objectives, one of which may be to stay close to a behavioral policy of a teacher. The behavioral policy of the teacher may be defined by a predetermined dataset of behaviors, and the systems and methods may then learn offline. The described techniques provide a mechanism for explicitly defining a trade-off between the multiple objectives.

    HIERARCHICAL POLICIES FOR MULTITASK TRANSFER

    Publication number: US20220237488A1

    Publication date: 2022-07-28

    Application number: US17613687

    Application date: 2020-05-22

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for controlling an agent. One of the methods includes obtaining an observation characterizing a current state of the environment and data identifying a task currently being performed by the agent; processing the observation and the data identifying the task using a high-level controller to generate a high-level probability distribution that assigns a respective probability to each of a plurality of low-level controllers; processing the observation using each of the plurality of low-level controllers to generate, for each of the plurality of low-level controllers, a respective low-level probability distribution; generating a combined probability distribution; and selecting, using the combined probability distribution, an action from the space of possible actions to be performed by the agent in response to the observation.
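
The steps in this abstract can be illustrated with a minimal sketch. The abstract does not say how the combined distribution is formed, so the code assumes a discrete action space and a mixture in which each low-level distribution is weighted by its high-level probability. The function names, logits, and probabilities are hypothetical stand-ins for controller outputs, not values from the patent.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combined_distribution(high_level_probs, low_level_probs):
    """Mix the low-level action distributions using the high-level weights.

    Assumption: the combination is a weighted mixture; the abstract only
    states that a combined distribution is generated.
    """
    num_actions = len(low_level_probs[0])
    return [
        sum(w * dist[a] for w, dist in zip(high_level_probs, low_level_probs))
        for a in range(num_actions)
    ]


def select_action(combined, rng):
    """Sample an action index from the combined distribution."""
    return rng.choices(range(len(combined)), weights=combined, k=1)[0]


# Hypothetical high-level scores for two low-level controllers, as if
# produced from the observation and the task identifier.
high = softmax([1.0, 0.0])
# Hypothetical low-level distributions over three actions, as if each
# controller had processed the observation alone.
low = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]

mix = combined_distribution(high, low)
action = select_action(mix, random.Random(0))
```

In this sketch only the high-level weights depend on the task, mirroring the abstract, where the low-level controllers process the observation without the task identifier.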

    CONSTRAINED REINFORCEMENT LEARNING NEURAL NETWORK SYSTEMS USING PARETO FRONT OPTIMIZATION

    Publication number: US20230368037A1

    Publication date: 2023-11-16

    Application number: US18029992

    Application date: 2021-10-01

    CPC classification number: G06N3/092

    Abstract: A system and method that controls an agent to perform a task subject to one or more constraints. The system trains a preference neural network that learns which preferences produce constraint-satisfying action selection policies. Thus, the system optimizes a hierarchical policy that is a product of a preference policy and a preference-conditioned action selection policy. In this way, the system learns to jointly optimize a set of objectives relating to rewards and costs received during the task while also learning preferences, i.e., trade-offs between the rewards and costs, that are most likely to produce policies that satisfy the constraints.

    TRAINING MULTI-OBJECTIVE NEURAL NETWORK REINFORCEMENT LEARNING SYSTEMS

    Publication number: US20230082326A1

    Publication date: 2023-03-16

    Application number: US17797203

    Application date: 2021-02-08

    Abstract: There is provided a method for training a neural network system by reinforcement learning, the neural network system being configured to receive an input observation characterizing a state of an environment interacted with by an agent and to select and output an action in accordance with a policy that aims to satisfy a plurality of objectives. The method comprises obtaining a set of one or more trajectories. Each trajectory comprises a state of an environment, an action applied by the agent to the environment according to a previous policy in response to the state, and a set of rewards for the action, each reward relating to a corresponding objective of the plurality of objectives. The method further comprises determining an action-value function for each of the plurality of objectives based on the set of one or more trajectories. Each action-value function determines an action value representing an estimated return according to the corresponding objective that would result from the agent performing a given action in response to a given state according to the previous policy. The method further comprises determining an updated policy based on a combination of the action-value functions for the plurality of objectives.
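
The final step, deriving an updated policy from a combination of per-objective action values, can be sketched for a single state with a small discrete action set. The abstract does not specify the combination rule, so this example assumes a weighted sum of action values followed by an exponentiated reweighting of the previous policy. The weights, temperature, and numbers are hypothetical.

```python
import math


def updated_policy(prev_policy, q_values_per_objective, weights, temperature=1.0):
    """Reweight the previous policy using combined per-objective action values.

    Assumptions (not taken from the abstract): objectives are combined by a
    weighted sum, and the new policy is proportional to
    prev_policy[a] * exp(combined_q[a] / temperature).
    """
    num_actions = len(prev_policy)
    combined_q = [
        sum(w * q[a] for w, q in zip(weights, q_values_per_objective))
        for a in range(num_actions)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(combined_q)
    unnorm = [
        p * math.exp((q - m) / temperature)
        for p, q in zip(prev_policy, combined_q)
    ]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical values for one state with two actions and two objectives.
prev = [0.5, 0.5]        # previous policy
q_task = [1.0, 0.0]      # action values under a task objective
q_energy = [0.0, 0.5]    # action values under an energy objective

new = updated_policy(prev, [q_task, q_energy], weights=[1.0, 1.0])
energy_only = updated_policy(prev, [q_task, q_energy], weights=[0.0, 1.0])
```

With equal weights the update shifts probability toward the first action, while putting all weight on the second objective shifts it toward the second action, which shows how the combination rule determines the trade-off.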
