TRAINING AN ACTION SELECTION SYSTEM USING RELATIVE ENTROPY Q-LEARNING

    Publication number: US20230214649A1

    Publication date: 2023-07-06

    Application number: US18008838

    Filing date: 2021-07-27

    CPC classification number: G06N3/08

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training an action selection system using reinforcement learning techniques. In one aspect, a method comprises at each of multiple iterations: obtaining a batch of experience, each experience tuple comprising: a first observation, an action, a second observation, and a reward; for each experience tuple, determining a state value for the second observation, comprising: processing the first observation using a policy neural network to generate an action score for each action in a set of possible actions; sampling multiple actions from the set of possible actions in accordance with the action scores; processing the second observation using a Q neural network to generate a Q value for each sampled action; and determining the state value for the second observation; and determining an update to current values of the Q neural network parameters using the state values.
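
The abstract does not say how the Q values of the sampled actions are combined into a state value. The sketch below is one possible reading, not the claimed method: it assumes a discrete action set, softmax sampling over the action scores, and a temperature-scaled log-mean-exp as the aggregation. The names `sample_actions`, `state_value`, `td_target`, `temperature`, and `discount` are hypothetical, and the action scores and Q values are taken as plain lists rather than produced by neural networks.

```python
import math
import random


def sample_actions(action_scores, num_samples, rng):
    """Sample action indices with probability proportional to softmax(action_scores)."""
    max_score = max(action_scores)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(score - max_score) for score in action_scores]
    return rng.choices(range(len(action_scores)), weights=weights, k=num_samples)


def state_value(q_values, sampled_actions, temperature):
    """Aggregate Q values of the sampled actions into one state value.

    ASSUMPTION: a temperature-scaled log-mean-exp is used here; the abstract
    only says the state value is determined from the sampled-action Q values.
    """
    scaled = [q_values[action] / temperature for action in sampled_actions]
    largest = max(scaled)
    mean_exp = sum(math.exp(x - largest) for x in scaled) / len(scaled)
    return temperature * (largest + math.log(mean_exp))


def td_target(reward, next_state_value, discount):
    """One-step regression target for the Q network (assumed form of the update)."""
    return reward + discount * next_state_value


# Usage with made-up numbers for a single experience tuple.
rng = random.Random(0)
scores = [0.2, 1.5, -0.3]      # policy action scores (illustrative)
next_q = [0.1, 0.9, 0.4]       # Q values at the second observation (illustrative)
actions = sample_actions(scores, num_samples=8, rng=rng)
value = state_value(next_q, actions, temperature=0.5)
target = td_target(reward=1.0, next_state_value=value, discount=0.99)
```

Under this assumed aggregation, the state value always lies between the mean and the maximum of the sampled Q values, and a smaller temperature pushes it toward the maximum.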

    HIERARCHICAL POLICIES FOR MULTITASK TRANSFER

    Publication number: US20220237488A1

    Publication date: 2022-07-28

    Application number: US17613687

    Filing date: 2020-05-22

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for controlling an agent. One of the methods includes obtaining an observation characterizing a current state of the environment and data identifying a task currently being performed by the agent; processing the observation and the data identifying the task using a high-level controller to generate a high-level probability distribution that assigns a respective probability to each of a plurality of low-level controllers; processing the observation using each of the plurality of low-level controllers to generate, for each of the plurality of low-level controllers, a respective low-level probability distribution; generating a combined probability distribution; and selecting, using the combined probability distribution, an action from the space of possible actions to be performed by the agent in response to the observation.
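
The abstract does not state how the combined probability distribution is formed. A natural reading, assumed here and not confirmed by the text, is a mixture: each low-level distribution is weighted by the probability the high-level controller assigns to its controller. The sketch below also assumes a discrete action space, and the names `combine_distributions` and `select_action` are hypothetical.

```python
import random


def combine_distributions(high_level_probs, low_level_dists):
    """Mix low-level action distributions using the high-level probabilities.

    ASSUMPTION: the combination is a weighted sum (mixture); the abstract
    only says that a combined distribution is generated.
    """
    num_actions = len(low_level_dists[0])
    combined = [0.0] * num_actions
    for weight, dist in zip(high_level_probs, low_level_dists):
        for action, prob in enumerate(dist):
            combined[action] += weight * prob
    return combined


def select_action(combined, rng):
    """Sample one action index from the combined distribution."""
    return rng.choices(range(len(combined)), weights=combined, k=1)[0]


# Usage with made-up numbers: two low-level controllers, three actions.
high_level = [0.7, 0.3]
low_level = [
    [0.8, 0.1, 0.1],
    [0.0, 0.5, 0.5],
]
mixture = combine_distributions(high_level, low_level)
action = select_action(mixture, random.Random(0))
```

Because the high-level probabilities and each low-level distribution sum to one, the mixture is itself a valid probability distribution over the same actions.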
