Inverse reinforcement learning by density ratio estimation

    Publication No.: US10896382B2

    Publication Date: 2021-01-19

    Application No.: US15329690

    Application Date: 2015-08-07

    Abstract: A method of inverse reinforcement learning for estimating cost and value functions of behaviors of a subject includes acquiring data representing changes in state variables that define the behaviors of the subject; applying a modified Bellman equation given by Eq. (1) to the acquired data: q(x) + gV(y) − V(x) = −ln{pi(y|x)/p(y|x)} (1), where q(x) and V(x) denote a cost function and a value function, respectively, at state x, g represents a discount factor, and p(y|x) and pi(y|x) denote state transition probabilities before and after learning, respectively; estimating a density ratio pi(y|x)/p(y|x) in Eq. (1); estimating q(x) and V(x) in Eq. (1) using the least-squares method in accordance with the estimated density ratio pi(y|x)/p(y|x); and outputting the estimated q(x) and V(x).
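    Once q(x) and V(x) are represented with linear function approximators, the estimation step named in the abstract reduces to ordinary least squares. The Python sketch below illustrates only that step, under the assumption of linear feature models; the function name fit_cost_and_value, the feature matrices phi_x, psi_x, psi_y, and the precomputed per-transition log density ratio are illustrative placeholders, not part of the patent.

        # Minimal sketch of the least-squares step for Eq. (1), assuming linear
        # models q(x) = phi(x)·w_q and V(x) = psi(x)·w_V, and assuming the log
        # density ratio ln(pi(y|x)/p(y|x)) was already estimated per transition.
        import numpy as np

        def fit_cost_and_value(phi_x, psi_x, psi_y, log_ratio, gamma=0.95):
            # Eq. (1): q(x) + gamma*V(y) - V(x) = -ln(pi(y|x)/p(y|x))
            # becomes the linear system [phi(x), gamma*psi(y) - psi(x)] @ w = -log_ratio.
            A = np.hstack([phi_x, gamma * psi_y - psi_x])
            w, *_ = np.linalg.lstsq(A, -log_ratio, rcond=None)
            d = phi_x.shape[1]
            return w[:d], w[d:]  # weight vectors of the cost and value functions

    Called with (N, d)-shaped feature matrices for the observed states x and successor states y, the routine returns the weights that define the estimated q(x) and V(x).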

    Direct inverse reinforcement learning with density ratio estimation

    Publication No.: US10896383B2

    Publication Date: 2021-01-19

    Application No.: US15425924

    Application Date: 2017-02-06

    Abstract: A method of inverse reinforcement learning for estimating reward and value functions of behaviors of a subject includes: acquiring data representing changes in state variables that define the behaviors of the subject; applying a modified Bellman equation given by Eq. (1) to the acquired data: r(x) + γV(y) − V(x) = ln(π(y|x)/b(y|x)) (1) = ln(π(x,y)/b(x,y)) − ln(π(x)/b(x)) (2), where r(x) and V(x) denote a reward function and a value function, respectively, at state x, γ represents a discount factor, and b(y|x) and π(y|x) denote state transition probabilities before and after learning, respectively; estimating a logarithm of the density ratio π(x)/b(x) in Eq. (2); estimating r(x) and V(x) in Eq. (2) from the result of estimating a log of the density ratio π(x,y)/b(x,y); and outputting the estimated r(x) and V(x).
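    The abstract does not fix a particular density-ratio estimator for the terms in Eq. (2). As one illustration only, a probabilistic classifier trained to separate samples observed under π from samples observed under b yields the pointwise log ratio; the arrays X_pi and X_b and the helper name below are hypothetical.

        # Minimal sketch: estimate ln(pi(x)/b(x)) by logistic regression between
        # state samples observed under pi (after learning) and under b (before
        # learning). Applying the same routine to stacked (x, y) pairs gives an
        # estimate of ln(pi(x,y)/b(x,y)).
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def log_density_ratio(X_pi, X_b):
            X = np.vstack([X_pi, X_b])
            labels = np.concatenate([np.ones(len(X_pi)), np.zeros(len(X_b))])
            clf = LogisticRegression(max_iter=1000).fit(X, labels)
            # decision_function is ln(P(pi|x)/P(b|x)); add the class-prior
            # correction to recover ln(pi(x)/b(x)).
            return clf.decision_function(X_pi) + np.log(len(X_b) / len(X_pi))

    Subtracting the marginal-ratio estimate from the joint-ratio estimate reproduces the right-hand side of Eq. (2), after which r(x) and V(x) can be fit by least squares in the same way as in the previous sketch.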

    RECORDING MEDIUM, POLICY IMPROVING METHOD, AND POLICY IMPROVING APPARATUS

    Publication No.: US20190086876A1

    Publication Date: 2019-03-21

    Application No.: US16130469

    Application Date: 2018-09-13

    Abstract: A non-transitory, computer-readable recording medium stores a program of reinforcement learning by a state-value function. The program causes a computer to execute a process including calculating a TD error based on an estimated state-value function, the TD error being calculated by giving a perturbation to each component of a feedback coefficient matrix that provides a policy; calculating, based on the TD error and the perturbation, an estimated gradient function matrix acquired by estimating a gradient function matrix of the state-value function with respect to the feedback coefficient matrix for a state of a controlled object, when state variation of the controlled object in the reinforcement learning is described by a linear difference equation and an immediate cost or an immediate reward of the controlled object is described in a quadratic form of the state and an input; and updating the feedback coefficient matrix using the estimated gradient function matrix.
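    The abstract describes the gradient estimation only at a high level, so the Python sketch below substitutes a simple finite-difference approximation for illustration: each component of the feedback coefficient matrix F (the policy u = Fx) is perturbed, the change in the observed TD error is used as a gradient estimate, and F is updated. The callback td_error_at, the step size, and the perturbation size are assumptions for this sketch, not the patent's estimator.

        # Rough sketch of one policy-improvement update for a linear controlled
        # object with quadratic immediate cost; td_error_at(F) is assumed to
        # return the TD error observed when feedback matrix F is applied.
        import numpy as np

        def improve_policy(F, td_error_at, step=1e-3, eps=1e-2):
            grad = np.zeros_like(F)
            for i in range(F.shape[0]):
                for j in range(F.shape[1]):
                    dF = np.zeros_like(F)
                    dF[i, j] = eps  # perturb a single component of F
                    # finite-difference estimate of the TD error's sensitivity to F[i, j]
                    grad[i, j] = (td_error_at(F + dF) - td_error_at(F)) / eps
            # move F against the estimated gradient (flip the sign when the
            # objective is an immediate reward rather than an immediate cost)
            return F - step * grad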
