IMITATION LEARNING BASED ON PREDICTION OF OUTCOMES

Invention Publication

US20240185082A1 IMITATION LEARNING BASED ON PREDICTION OF OUTCOMES (Status: Under examination, published)

Patent Title: IMITATION LEARNING BASED ON PREDICTION OF OUTCOMES
Application No.: US18275722

Application Date: 2022-02-04
Publication No.: US20240185082A1

Publication Date: 2024-06-06
Inventor: Andrew Coulter Jaegle, Yury Sulsky, Gregory Duncan Wayne, Robert David Fergus
Applicant: DeepMind Technologies Limited
Applicant Address: GB London
Assignee: DeepMind Technologies Limited
Current Assignee: DeepMind Technologies Limited
Current Assignee Address: GB London
International Application: PCT/EP2022/052792 (2022-02-04)
National Phase Entry Date: 2023-08-03
Main IPC: G06N3/092
IPC: G06N3/092

Abstract:

A method is proposed of training a policy model to generate action data for controlling an agent to perform a task in an environment. The method comprises: obtaining, for each of a plurality of performances of the task, a corresponding demonstrator trajectory comprising a plurality of sets of state data characterizing the environment at each of a plurality of corresponding successive time steps during the performance of the task; using the demonstrator trajectories to generate a demonstrator model, the demonstrator model being operative to generate, for any said demonstrator trajectory, a value indicative of the probability of the demonstrator trajectory occurring; and jointly training an imitator model and a policy model. The joint training is performed by: generating a plurality of imitation trajectories, each imitation trajectory being generated by repeatedly receiving state data indicating a state of the environment, using the policy model to generate action data indicative of an action, and causing the action to be performed by the agent; training the imitator model using the imitation trajectories, the imitator model being operative to generate, for any said imitation trajectory, a value indicative of the probability of the imitation trajectory occurring; and training the policy model using a reward function which is a measure of the similarity of the demonstrator model and the imitator model.
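
The joint training loop described in the abstract can be illustrated with a minimal Python sketch. This is not the implementation disclosed in the publication. Everything concrete here is an assumption made for illustration: the five-cell chain environment, the Laplace-smoothed histogram used for both the demonstrator and imitator models, the treatment of a trajectory's log-probability as a sum of per-state log-probabilities, the per-state log-ratio used as the similarity-based reward, and the REINFORCE policy update. All names (step, StateDensityModel, rollout, train) are hypothetical.

```python
import numpy as np

N_STATES, HORIZON = 5, 6  # hypothetical toy chain, not from the publication


def step(state, action):
    """Toy environment: action 1 moves right, action 0 moves left; walls clamp."""
    return min(max(state + (1 if action == 1 else -1), 0), N_STATES - 1)


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


class StateDensityModel:
    """Laplace-smoothed histogram over states (an assumed stand-in for the
    demonstrator and imitator models). A trajectory's log-probability is
    taken to be the sum of its per-state log-probabilities."""

    def __init__(self, n_states):
        self.counts = np.ones(n_states)

    def fit(self, trajectories):
        self.counts = np.ones_like(self.counts)  # reset to the smoothing prior
        for traj in trajectories:
            np.add.at(self.counts, traj, 1.0)

    def log_prob_state(self, state):
        return float(np.log(self.counts[state] / self.counts.sum()))

    def log_prob(self, trajectory):
        return sum(self.log_prob_state(s) for s in trajectory)


def rollout(logits, rng):
    """Generate one imitation trajectory: observe state, sample action, act."""
    state, states, actions = 0, [0], []
    for _ in range(HORIZON):
        action = int(rng.choice(2, p=softmax(logits[state])))
        state = step(state, action)
        states.append(state)
        actions.append(action)
    return states, actions


def train(demo_trajectories, iterations=200, batch=16, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    demonstrator = StateDensityModel(N_STATES)
    demonstrator.fit(demo_trajectories)  # built once, from state data only
    imitator = StateDensityModel(N_STATES)
    logits = np.zeros((N_STATES, 2))     # tabular softmax policy
    for _ in range(iterations):
        rollouts = [rollout(logits, rng) for _ in range(batch)]
        imitator.fit([states for states, _ in rollouts])
        grad = np.zeros_like(logits)
        for states, actions in rollouts:
            # Assumed reward: log-ratio of demonstrator to imitator density.
            rewards = [demonstrator.log_prob_state(s) - imitator.log_prob_state(s)
                       for s in states[1:]]
            returns = np.cumsum(rewards[::-1])[::-1]  # undiscounted reward-to-go
            for s, a, g in zip(states[:-1], actions, returns):
                grad[s] -= g * softmax(logits[s])     # REINFORCE: g * (onehot - pi)
                grad[s, a] += g
        logits += lr * grad / batch
    return logits, demonstrator, imitator
```

A call such as train([[0, 1, 2, 3, 4, 4, 4]] * 10) fits the demonstrator model to ten identical "always move right" demonstrations and then alternates between refitting the imitator model and updating the policy. Note that the demonstrator trajectories contain only state data, matching the abstract, so the sketch never uses demonstrator actions. The sketch makes no claim about convergence speed or about how the publication's models or reward are actually parameterized.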
