Publication No.: US20210383218A1
Publication Date: 2021-12-09
Application No.: US17289514
Filing Date: 2019-10-29
Applicant: Google LLC
Inventor: Tian Lu, Dale Eric Schuurmans, Craig Edgar Boutilier
IPC: G06N3/08
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for determining a control policy for an agent interacting with an environment. One of the methods includes updating the control policy through Q-learning with policy-consistent backups. To determine a policy-consistent backup for the control policy at the current observation-current action pair, the system: identifies, for each of a plurality of actions in a set of possible actions that can be performed by the agent, Q values that the control policy assigns to next observation-action pairs and that are justified by at least one of the information sets; prunes, from the identified Q values, any Q values that are justified only by information sets that are not policy-class consistent; and determines the policy-consistent backup from the reward and only the identified Q values that were not pruned.
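The identify, prune, and back-up steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Everything named here is hypothetical: the `Candidate` record, the representation of an information set as a frozenset of (observation, action) assignments, the toy consistency test `no_conflicting_actions`, the discount factor, and the use of a max over the surviving Q values. The abstract says only that the backup is determined from the reward and the unpruned Q values, and a real policy-class consistency test would depend on the policy class in use.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Hashable, Optional, Sequence, Tuple

# Assumption: an information set is a set of (observation, action) assignments
# that one policy would have to make for a Q value to be attainable.
InfoSet = FrozenSet[Tuple[Hashable, Hashable]]


@dataclass(frozen=True)
class Candidate:
    """A Q value for one next observation-action pair (hypothetical record)."""
    action: Hashable
    q_value: float
    info_sets: Tuple[InfoSet, ...]  # information sets that justify q_value


def no_conflicting_actions(info_set: InfoSet) -> bool:
    """Toy stand-in for a policy-class consistency test.

    Treats an information set as consistent when it never assigns two
    different actions to the same observation.
    """
    chosen = {}
    for obs, action in info_set:
        if chosen.setdefault(obs, action) != action:
            return False
    return True


def policy_consistent_backup(
    reward: float,
    candidates: Sequence[Candidate],
    is_consistent: Callable[[InfoSet], bool],
    discount: float = 0.99,  # assumption: standard discounted Q-learning target
) -> Optional[float]:
    """Return reward + discount * max over Q values that survive pruning."""
    # Prune Q values justified only by inconsistent information sets,
    # i.e. keep a candidate if at least one justifying set is consistent.
    kept = [c for c in candidates if any(is_consistent(s) for s in c.info_sets)]
    if not kept:
        return None  # nothing left to back up from
    return reward + discount * max(c.q_value for c in kept)


candidates = [
    # Justified only by a set that requires both actions at observation "s1".
    Candidate("left", 5.0, (frozenset({("s1", "left"), ("s1", "right")}),)),
    # Justified by a set that a single policy can realize.
    Candidate("right", 3.0, (frozenset({("s1", "right")}),)),
]
backup = policy_consistent_backup(1.0, candidates, no_conflicting_actions, discount=0.5)
print(backup)  # 2.5
```

In this toy case an unpruned max would pick the Q value 5.0 and give a target of 3.5. Pruning removes that value because its only justification is inconsistent, so the target falls to 2.5.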