-
Publication No.: US20230073326A1
Publication Date: 2023-03-09
Application No.: US17794797
Application Date: 2021-01-28
Applicant: DeepMind Technologies Limited
Inventor: Julian Schrittwieser, Ioannis Antonoglou, Thomas Keisuke Hubert
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for selecting actions to be performed by an agent interacting with an environment to cause the agent to perform a task. One of the methods includes: receiving a current observation characterizing a current environment state of the environment; performing a plurality of planning iterations to generate plan data that indicates, for each action in a set of actions, a respective value to performing the task of the agent performing that action in the environment starting from the current environment state, wherein performing each planning iteration comprises selecting a sequence of actions to be performed by the agent starting from the current environment state based on outputs generated by a dynamics model and a prediction model; and selecting, from the set of actions, an action to be performed by the agent in response to the current observation based on the plan data.
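Illustrative sketch (not taken from the patent): the planning loop described in the abstract can be approximated as follows. Everything here is an assumption made for illustration: the three-action toy environment, the hand-written `dynamics_model` and `prediction_model` stand-ins (in practice these would be learned models), the exploration bonus, and the use of simple rollouts in place of a full tree search. The observation is treated directly as the state.

```python
import math

ACTIONS = [0, 1, 2]  # hypothetical action set: 0 = left, 1 = stay, 2 = right
GOAL = 3             # hypothetical goal state for the toy task


def dynamics_model(state, action):
    """Toy stand-in for a learned dynamics model: returns (next_state, reward)."""
    next_state = state + (action - 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def prediction_model(state):
    """Toy stand-in for a learned prediction model: returns (policy prior, value)."""
    prior = {a: 1.0 / len(ACTIONS) for a in ACTIONS}
    value = -0.1 * abs(GOAL - state)
    return prior, value


def greedy_action(state, discount):
    """Pick the action with the best one-step lookahead under both models."""
    def lookahead(action):
        next_state, reward = dynamics_model(state, action)
        _, value = prediction_model(next_state)
        return reward + discount * value
    return max(ACTIONS, key=lookahead)


def plan(root_state, num_iterations=30, depth=3, discount=0.9, c_explore=1.0):
    """Run planning iterations; return plan data mapping each action to a value."""
    prior, _ = prediction_model(root_state)
    visits = {a: 0 for a in ACTIONS}
    totals = {a: 0.0 for a in ACTIONS}
    for iteration in range(num_iterations):
        # Choose the first action by mean value plus a prior-weighted bonus.
        def score(action):
            mean = totals[action] / visits[action] if visits[action] else 0.0
            bonus = c_explore * prior[action] * math.sqrt(iteration + 1) / (1 + visits[action])
            return mean + bonus
        first_action = max(ACTIONS, key=score)

        # Roll the action sequence forward inside the model, not the real environment.
        state, ret, scale, action = root_state, 0.0, 1.0, first_action
        for _ in range(depth):
            state, reward = dynamics_model(state, action)
            ret += scale * reward
            scale *= discount
            action = greedy_action(state, discount)

        # Bootstrap from the predicted value of the final imagined state.
        _, leaf_value = prediction_model(state)
        ret += scale * leaf_value
        visits[first_action] += 1
        totals[first_action] += ret
    return {a: totals[a] / visits[a] for a in ACTIONS if visits[a]}


def select_action(observation):
    """Select the action with the highest planned value."""
    plan_data = plan(observation)
    return max(plan_data, key=plan_data.get)
```

Calling `select_action(0)` in this toy setup picks action `2` (move right, toward the hypothetical goal). The plan data is simply a per-action value estimate; a fuller implementation would likely keep a search tree with per-node statistics rather than flat rollouts.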
-
Publication No.: US20240104353A1
Publication Date: 2024-03-28
Application No.: US18274748
Application Date: 2022-02-08
Applicant: DeepMind Technologies Limited
Inventor: Rémi Bertrand Francis Leblond, Jean-Baptiste Alayrac, Laurent Sifre, Miruna Pîslar, Jean-Baptiste Lespiau, Ioannis Antonoglou, Karen Simonyan, David Silver, Oriol Vinyals
IPC: G06N3/0455
CPC classification number: G06N3/0455
Abstract: A computer-implemented method for generating an output token sequence from an input token sequence. The method combines a look ahead tree search, such as a Monte Carlo tree search, with a sequence-to-sequence neural network system. The sequence-to-sequence neural network system has a policy output defining a next token probability distribution, and may include a value neural network providing a value output to evaluate a sequence. An initial partial output sequence is extended using the look ahead tree search guided by the policy output and, in implementations, the value output, of the sequence-to-sequence neural network system until a complete output sequence is obtained.
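Illustrative sketch (not taken from the patent): one way to combine a look ahead tree search with policy and value outputs when decoding a token sequence. The vocabulary, the copy-the-input toy task, the hand-written `policy_output` and `value_output` stand-ins (which would be neural network outputs in practice), and the PUCT-style selection rule are all assumptions made for illustration.

```python
import math

EOS = "<eos>"
VOCAB = ["a", "b", EOS]  # hypothetical toy vocabulary


def policy_output(input_tokens, partial):
    """Toy stand-in for the policy output: a next-token distribution
    that mildly favors copying the input and then emitting EOS."""
    pos = len(partial)
    target = input_tokens[pos] if pos < len(input_tokens) else EOS
    return {tok: (0.5 if tok == target else 0.25) for tok in VOCAB}


def value_output(input_tokens, partial):
    """Toy stand-in for the value output: 1.0 while the partial output
    is still a correct prefix of the copied input, else 0.0."""
    target = list(input_tokens) + [EOS]
    return 1.0 if partial == target[:len(partial)] else 0.0


class Node:
    def __init__(self, prior):
        self.prior = prior
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}

    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0


def search_next_token(input_tokens, partial, num_simulations=20, c_puct=1.0):
    """Run a small tree search from the partial output and return the next token."""
    root = Node(prior=1.0)
    for _ in range(num_simulations):
        node, seq, path = root, list(partial), [root]
        # Selection: descend by mean value plus a prior-weighted exploration term.
        while node.children:
            parent = node
            token, node = max(
                parent.children.items(),
                key=lambda kv: kv[1].mean_value()
                + c_puct * kv[1].prior * math.sqrt(parent.visits) / (1 + kv[1].visits),
            )
            seq.append(token)
            path.append(node)
        # Expansion: add children from the policy unless the sequence is complete.
        if not seq or seq[-1] != EOS:
            for tok, p in policy_output(input_tokens, seq).items():
                node.children[tok] = Node(prior=p)
        # Evaluation and backup along the visited path.
        v = value_output(input_tokens, seq)
        for visited in path:
            visited.visits += 1
            visited.value_sum += v
    # Commit to the most-visited token at the root.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]


def decode(input_tokens, max_len=10, num_simulations=20):
    """Extend the partial output token by token until EOS or max_len."""
    partial = []
    while len(partial) < max_len:
        token = search_next_token(input_tokens, partial, num_simulations)
        partial.append(token)
        if token == EOS:
            break
    return partial
```

For example, `decode(["a", "b"])` extends the empty partial output one token at a time until the end-of-sequence token is chosen. In this toy task the policy alone would already decode correctly; the search matters when the policy is unreliable and the value output can steer it away from poor continuations.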
-