-
Publication number: US20210089908A1
Publication date: 2021-03-25
Application number: US17032562
Application date: 2020-09-25
Applicant: DeepMind Technologies Limited
Inventor: Tom Schaul, Diana Luiza Borsa, Fengning Ding, David Szepesvari, Georg Ostrovski, Simon Osindero, William Clinton Dabney
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for controlling an agent. One of the methods includes sampling a behavior modulation in accordance with a current probability distribution; for each of one or more time steps: processing an input comprising an observation characterizing a current state of the environment at the time step using an action selection neural network to generate a respective action score for each action in a set of possible actions that can be performed by the agent; modifying the action scores using the sampled behavior modulation; and selecting the action to be performed by the agent at the time step based on the modified action scores; determining a fitness measure corresponding to the sampled behavior modulation; and updating the current probability distribution over the set of possible behavior modulations using the fitness measure corresponding to the behavior modulation.
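The loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the claimed implementation: the behavior modulation is assumed to be a softmax temperature, the fitness measure is assumed to be the episode return, the distribution over modulations is assumed to be a softmax over per-modulation preferences with a simple bandit-style update, and `action_scores` and the toy reward are hypothetical stand-ins for the action selection neural network and the environment.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def action_scores(observation):
    """Hypothetical stand-in for the action selection neural network."""
    return [0.1 * observation, 0.5, 1.0 - 0.1 * observation]


def run_episode(temperature, rng, num_steps=5):
    """Act for a few time steps with one fixed modulation; return the fitness."""
    total_reward = 0.0
    observation = 0
    for _ in range(num_steps):
        scores = action_scores(observation)
        # Modify the action scores using the sampled modulation (assumed: temperature).
        modified = [s / temperature for s in scores]
        # Select the action based on the modified scores.
        action = sample_index(softmax(modified), rng)
        # Toy environment (assumption): action 2 is rewarded.
        total_reward += 1.0 if action == 2 else 0.0
        observation = (observation + 1) % 3
    return total_reward


def update_preferences(prefs, index, fitness, baseline, step_size=0.1):
    """Raise or lower the sampled modulation's preference by its fitness advantage."""
    new_prefs = list(prefs)
    new_prefs[index] += step_size * (fitness - baseline)
    return new_prefs


def train(modulations, num_episodes=200, seed=0):
    """Adapt the probability distribution over modulations from observed fitness."""
    rng = random.Random(seed)
    prefs = [0.0] * len(modulations)
    baseline = 0.0
    for episode in range(1, num_episodes + 1):
        probs = softmax(prefs)            # current probability distribution
        idx = sample_index(probs, rng)    # sample a behavior modulation
        fitness = run_episode(modulations[idx], rng)
        prefs = update_preferences(prefs, idx, fitness, baseline)
        baseline += (fitness - baseline) / episode  # running mean of fitness
    return softmax(prefs)


if __name__ == "__main__":
    print(train([0.1, 1.0, 10.0]))
```

In this sketch the modulation is sampled once per episode and held fixed across its time steps, which mirrors the order of operations in the abstract; other modulations (for example an epsilon for epsilon-greedy selection) would slot into `run_episode` in the same way.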
-
Publication number: US12061964B2
Publication date: 2024-08-13
Application number: US17032562
Application date: 2020-09-25
Applicant: DeepMind Technologies Limited
Inventor: Tom Schaul, Diana Luiza Borsa, Fengning Ding, David Szepesvari, Georg Ostrovski, Simon Osindero, William Clinton Dabney
IPC: G06N3/006, G06F18/214, G06F18/2415, G06N3/08, G06V10/764, G06V10/82, G06V40/20
CPC classification number: G06N3/006, G06F18/2148, G06F18/2415, G06N3/08, G06V10/764, G06V10/82, G06V40/20
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for controlling an agent. One of the methods includes sampling a behavior modulation in accordance with a current probability distribution; for each of one or more time steps: processing an input comprising an observation characterizing a current state of the environment at the time step using an action selection neural network to generate a respective action score for each action in a set of possible actions that can be performed by the agent; modifying the action scores using the sampled behavior modulation; and selecting the action to be performed by the agent at the time step based on the modified action scores; determining a fitness measure corresponding to the sampled behavior modulation; and updating the current probability distribution over the set of possible behavior modulations using the fitness measure corresponding to the behavior modulation.
-