IMITATION LEARNING BASED ON PREDICTION OF OUTCOMES

    Publication number: US20240185082A1

    Publication date: 2024-06-06

    Application number: US18275722

    Filing date: 2022-02-04

    CPC classification number: G06N3/092

    Abstract: A method is proposed of training a policy model to generate action data for controlling an agent to perform a task in an environment. The method comprises: obtaining, for each of a plurality of performances of the task, a corresponding demonstrator trajectory comprising a plurality of sets of state data characterizing the environment at each of a plurality of corresponding successive time steps during the performance of the task; using the demonstrator trajectories to generate a demonstrator model, the demonstrator model being operative to generate, for any said demonstrator trajectory, a value indicative of the probability of the demonstrator trajectory occurring; and jointly training an imitator model and a policy model. The joint training is performed by: generating a plurality of imitation trajectories, each imitation trajectory being generated by repeatedly receiving state data indicating a state of the environment, using the policy model to generate action data indicative of an action, and causing the action to be performed by the agent; training the imitator model using the imitation trajectories, the imitator model being operative to generate, for any said imitation trajectory, a value indicative of the probability of the imitation trajectory occurring; and training the policy model using a reward function which is a measure of the similarity of the demonstrator model and the imitator model.
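The joint training described in the abstract can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration and is not taken from the patent: the five-state chain environment, the first-order Markov models over states used as the demonstrator and imitator models, the tabular softmax policy, the REINFORCE update, and the choice of reward as log p_demo(trajectory) - log p_imit(trajectory), whose average over imitation trajectories estimates the negative KL divergence from the imitator model to the demonstrator model. All function and variable names are hypothetical.

```python
import math
import random

N_STATES = 5        # hypothetical toy environment: a chain of states 0..4
HORIZON = 6         # time steps per trajectory
ACTIONS = (-1, +1)  # action 0 moves left, action 1 moves right


def step(state, action):
    """Toy dynamics (assumption): move along the chain, clipped at both ends."""
    return min(N_STATES - 1, max(0, state + action))


def fit_trajectory_model(trajectories, smoothing=1.0):
    """Fit a first-order Markov model over states; returns log p(s_next | s)."""
    counts = [[smoothing] * N_STATES for _ in range(N_STATES)]
    for traj in trajectories:
        for s, s_next in zip(traj, traj[1:]):
            counts[s][s_next] += 1.0
    return [[math.log(c / sum(row)) for c in row] for row in counts]


def log_prob(model, traj):
    """Value indicative of the probability of a state trajectory under a model."""
    return sum(model[s][s_next] for s, s_next in zip(traj, traj[1:]))


def action_probs(logits, state):
    """Softmax over the two action logits of the tabular policy."""
    top = max(logits[state])
    exps = [math.exp(x - top) for x in logits[state]]
    total = sum(exps)
    return [e / total for e in exps]


def rollout(logits, rng):
    """Generate one imitation trajectory by repeatedly acting with the policy."""
    state, states, choices = 0, [0], []
    for _ in range(HORIZON):
        a = 0 if rng.random() < action_probs(logits, state)[0] else 1
        choices.append((state, a))
        state = step(state, ACTIONS[a])
        states.append(state)
    return states, choices


def train_policy(demos, iterations=200, batch=20, lr=0.1, seed=0):
    """Jointly refit the imitator model and update the policy (REINFORCE)."""
    rng = random.Random(seed)
    demo_model = fit_trajectory_model(demos)        # demonstrator model, fitted once
    logits = [[0.0, 0.0] for _ in range(N_STATES)]  # policy parameters
    for _ in range(iterations):
        rollouts = [rollout(logits, rng) for _ in range(batch)]
        # Imitator model is refitted on the current imitation trajectories.
        imit_model = fit_trajectory_model([states for states, _ in rollouts])
        # Assumed similarity reward: log p_demo(traj) - log p_imit(traj).
        rewards = [log_prob(demo_model, s) - log_prob(imit_model, s)
                   for s, _ in rollouts]
        baseline = sum(rewards) / batch             # variance-reduction baseline
        grads = [[0.0, 0.0] for _ in range(N_STATES)]
        for (_, choices), reward in zip(rollouts, rewards):
            for state, a in choices:
                probs = action_probs(logits, state)
                for k in range(2):
                    indicator = 1.0 if k == a else 0.0
                    grads[state][k] += (reward - baseline) * (indicator - probs[k]) / batch
        for s in range(N_STATES):
            for k in range(2):
                logits[s][k] += lr * grads[s][k]
    return logits
```

For example, calling `train_policy` with ten copies of the demonstrator trajectory `[0, 1, 2, 3, 4, 4, 4]` should push the policy toward moving right from state 0. Note that both models here score state sequences only, mirroring the abstract's description of trajectories as sets of state data; a practical system would likely replace the count-based models and the tabular policy with learned function approximators.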
