Patent search ap:("Okinawa Institute of Science AND Technology School Corporation") AND inv:"Eiji UCHIBE" Page 1

1.

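Invention application
DIRECT INVERSE REINFORCEMENT LEARNING WITH DENSITY RATIO ESTIMATION (under examination, published)

Publication (announcement) number: US20170147949A1

Publication (announcement) date: 2017-05-25

Application number: US15425924

Application date: 2017-02-06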

Applicant: Okinawa Institute of Science and Technology School Corporation

Inventor: Eiji UCHIBE, Kenji DOYA

IPC: G06N99/00, G06N7/00, G06F17/18

CPC classification number: G06N20/00, G06K9/6297, G06N7/005

Abstract: A method of inverse reinforcement learning for estimating reward and value functions of behaviors of a subject includes: acquiring data representing changes in state variables that define the behaviors of the subject; applying a modified Bellman equation given by Eq. (1) to the acquired data:

$$
\begin{aligned}
r(x) + \gamma V(y) - V(x) &= \ln \frac{\pi(y \mid x)}{b(y \mid x)}, && (1) \\
&= \ln \frac{\pi(x, y)}{b(x, y)} - \ln \frac{\pi(x)}{b(x)}, && (2)
\end{aligned}
$$

where r(x) and V(x) denote a reward function and a value function, respectively, at state x, and γ represents a discount factor, and b(y|x) and π(y|x) denote state transition probabilities before and after learning, respectively; estimating a logarithm of the density ratio π(x)/b(x) in Eq. (2); estimating r(x) and V(x) in Eq. (2) from the result of estimating a log of the density ratio π(x,y)/b(x,y); and outputting the estimated r(x) and V(x).
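The step from Eq. (1) to Eq. (2) rests on π(y|x) = π(x,y)/π(x) and b(y|x) = b(x,y)/b(x), so the conditional log ratio splits into a joint log ratio minus a marginal log ratio. The following is a minimal sketch of that decomposition for discrete states, using plain empirical frequencies in place of a density ratio estimator. The function name `log_ratio_terms` and the toy transition lists are hypothetical and are not taken from the patent; the abstract does not say which estimator is used.

```python
import math
from collections import Counter


def log_ratio_terms(pi_transitions, b_transitions):
    """Return a function giving the two log-ratio terms of Eq. (2) for (x, y).

    Both inputs are lists of (x, y) transitions over discrete states.
    Probabilities are empirical frequencies (an assumption of this sketch),
    so every queried (x, y) must appear in both lists at least once.
    """
    pi_joint, b_joint = Counter(pi_transitions), Counter(b_transitions)
    pi_marg = Counter(x for x, _ in pi_transitions)
    b_marg = Counter(x for x, _ in b_transitions)
    n_pi, n_b = len(pi_transitions), len(b_transitions)

    def terms(x, y):
        # ln pi(x, y) / b(x, y)
        log_joint = math.log((pi_joint[(x, y)] / n_pi) / (b_joint[(x, y)] / n_b))
        # ln pi(x) / b(x)
        log_marg = math.log((pi_marg[x] / n_pi) / (b_marg[x] / n_b))
        return log_joint, log_marg

    return terms


# Hypothetical toy data with two states, 0 and 1.
pi_data = [(0, 1), (0, 1), (0, 0), (1, 0), (1, 1), (1, 0)]  # after learning
b_data = [(0, 0), (0, 1), (1, 0), (1, 1)]                   # before learning

terms = log_ratio_terms(pi_data, b_data)
log_joint, log_marg = terms(0, 1)

rhs_eq2 = log_joint - log_marg           # right-hand side of Eq. (2)
rhs_eq1 = math.log((2 / 3) / (1 / 2))    # ln pi(1|0) / b(1|0), from the counts
print(rhs_eq2, rhs_eq1)
```

With frequency estimates the two right-hand sides agree up to floating-point error. For continuous states the counts would have to be replaced by an actual density ratio estimator, which this sketch does not attempt.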

2.

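Invention application
INVERSE REINFORCEMENT LEARNING BY DENSITY RATIO ESTIMATION (under examination, published)

Publication (announcement) number: US20170213151A1

Publication (announcement) date: 2017-07-27

Application number: US15329690

Application date: 2015-08-07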

Applicant: Okinawa Institute of Science and Technology School Corporation

Inventor: Eiji UCHIBE, Kenji DOYA

IPC: G06N99/00, G06N7/00

CPC classification number: G06N20/00, G06N7/005

Abstract: A method of inverse reinforcement learning for estimating cost and value functions of behaviors of a subject includes: acquiring data representing changes in state variables that define the behaviors of the subject; applying a modified Bellman equation given by Eq. (1) to the acquired data:

$$
q(x) + \gamma V(y) - V(x) = -\ln \frac{\pi(y \mid x)}{p(y \mid x)}, \qquad (1)
$$

where q(x) and V(x) denote a cost function and a value function, respectively, at state x, γ represents a discount factor, and p(y|x) and π(y|x) denote state transition probabilities before and after learning, respectively; estimating a density ratio π(y|x)/p(y|x) in Eq. (1); estimating q(x) and V(x) in Eq. (1) using the least squares method in accordance with the estimated density ratio π(y|x)/p(y|x); and outputting the estimated q(x) and V(x).
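Once the density ratio has been estimated, Eq. (1) is linear in the unknowns q and V, which is what makes a least squares fit possible. The sketch below illustrates this for a tabular case with one q and one V entry per discrete state. The names `fit_cost_and_value` and `solve_linear`, the small ridge term, and the synthetic samples are all assumptions of this example and are not specified in the abstract. The ridge term is included because shifting V by a constant c and q by (1 − γ)c leaves Eq. (1) unchanged, so the unregularized tabular system is singular.

```python
import math


def solve_linear(a, b):
    """Solve a @ z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * z[c] for c in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def fit_cost_and_value(samples, n_states, gamma, ridge=1e-8):
    """Fit tabular q and V from (x, y, ratio) samples by ridge least squares.

    Each sample contributes one linear equation
        q[x] + gamma * V[y] - V[x] = -ln(ratio),
    where ratio is an estimate of pi(y|x) / p(y|x).
    """
    dim = 2 * n_states  # unknowns: q[0..n-1] followed by V[0..n-1]
    ata = [[ridge if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    atb = [0.0] * dim
    for x, y, ratio in samples:
        row = [0.0] * dim
        row[x] += 1.0                # coefficient of q[x]
        row[n_states + y] += gamma   # coefficient of V[y]
        row[n_states + x] -= 1.0     # coefficient of V[x]
        target = -math.log(ratio)
        for i in range(dim):
            atb[i] += row[i] * target
            for j in range(dim):
                ata[i][j] += row[i] * row[j]
    z = solve_linear(ata, atb)  # normal equations (A^T A + ridge I) z = A^T t
    return z[:n_states], z[n_states:]


# Synthetic check: ratios generated from hypothetical q and V values.
# They only exercise the linear algebra and are not normalized densities.
gamma = 0.9
true_q, true_v = [0.5, 1.0], [2.0, 3.0]
samples = [
    (x, y, math.exp(-(true_q[x] + gamma * true_v[y] - true_v[x])))
    for x in range(2)
    for y in range(2)
]
q_hat, v_hat = fit_cost_and_value(samples, n_states=2, gamma=gamma)
residuals = [
    q_hat[x] + gamma * v_hat[y] - v_hat[x] + math.log(r) for x, y, r in samples
]
print(max(abs(r) for r in residuals))
```

The check looks at the residuals of Eq. (1) rather than at the recovered values, since q and V are only determined up to the shift described above. A practical version would likely use basis functions for q and V instead of one entry per state.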
