Inverse reinforcement learning by density ratio estimation

    Publication No.: US10896382B2

    Publication Date: 2021-01-19

    Application No.: US15329690

    Application Date: 2015-08-07

    Abstract: A method of inverse reinforcement learning for estimating cost and value functions of behaviors of a subject includes acquiring data representing changes in state variables that define the behaviors of the subject; applying a modified Bellman equation given by Eq. (1) to the acquired data: q(x) + gV(y) − V(x) = −ln{pi(y|x)/p(y|x)} (1), where q(x) and V(x) denote a cost function and a value function, respectively, at state x, g represents a discount factor, and p(y|x) and pi(y|x) denote state transition probabilities before and after learning, respectively; estimating a density ratio pi(y|x)/p(y|x) in Eq. (1); estimating q(x) and V(x) in Eq. (1) using the least-squares method in accordance with the estimated density ratio pi(y|x)/p(y|x); and outputting the estimated q(x) and V(x).
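    Once q(x) and V(x) are represented with linear function approximators, the estimation step named in the abstract reduces to ordinary least squares. The Python sketch below illustrates only that step, under the assumption of linear feature models; the function name fit_cost_and_value, the feature matrices phi_x, psi_x, psi_y, and the precomputed per-transition log density ratio are illustrative placeholders, not part of the patent.

        # Minimal sketch of the least-squares step for Eq. (1), assuming linear
        # models q(x) = phi(x)·w_q and V(x) = psi(x)·w_V, and assuming the log
        # density ratio ln(pi(y|x)/p(y|x)) was already estimated per transition.
        import numpy as np

        def fit_cost_and_value(phi_x, psi_x, psi_y, log_ratio, gamma=0.95):
            # Eq. (1): q(x) + gamma*V(y) - V(x) = -ln(pi(y|x)/p(y|x))
            # becomes the linear system [phi(x), gamma*psi(y) - psi(x)] @ w = -log_ratio.
            A = np.hstack([phi_x, gamma * psi_y - psi_x])
            w, *_ = np.linalg.lstsq(A, -log_ratio, rcond=None)
            d = phi_x.shape[1]
            return w[:d], w[d:]  # weight vectors of the cost and value functions

    Called with (N, d)-shaped feature matrices for the observed states x and successor states y, the routine returns the weights that define the estimated q(x) and V(x).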

    Direct inverse reinforcement learning with density ratio estimation

    Publication No.: US10896383B2

    Publication Date: 2021-01-19

    Application No.: US15425924

    Application Date: 2017-02-06

    Abstract: A method of inverse reinforcement learning for estimating reward and value functions of behaviors of a subject includes: acquiring data representing changes in state variables that define the behaviors of the subject; applying a modified Bellman equation given by Eq. (1) to the acquired data: r(x) + γV(y) − V(x) = ln(π(y|x)/b(y|x)) (1) = ln(π(x,y)/b(x,y)) − ln(π(x)/b(x)) (2), where r(x) and V(x) denote a reward function and a value function, respectively, at state x, γ represents a discount factor, and b(y|x) and π(y|x) denote state transition probabilities before and after learning, respectively; estimating a logarithm of the density ratio π(x)/b(x) in Eq. (2); estimating r(x) and V(x) in Eq. (2) from the result of estimating a log of the density ratio π(x,y)/b(x,y); and outputting the estimated r(x) and V(x).
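    The abstract does not fix a particular density-ratio estimator for the terms in Eq. (2). As one illustration only, a probabilistic classifier trained to separate samples observed under π from samples observed under b yields the pointwise log ratio; the arrays X_pi and X_b and the helper name below are hypothetical.

        # Minimal sketch: estimate ln(pi(x)/b(x)) by logistic regression between
        # state samples observed under pi (after learning) and under b (before
        # learning). Applying the same routine to stacked (x, y) pairs gives an
        # estimate of ln(pi(x,y)/b(x,y)).
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def log_density_ratio(X_pi, X_b):
            X = np.vstack([X_pi, X_b])
            labels = np.concatenate([np.ones(len(X_pi)), np.zeros(len(X_b))])
            clf = LogisticRegression(max_iter=1000).fit(X, labels)
            # decision_function is ln(P(pi|x)/P(b|x)); add the class-prior
            # correction to recover ln(pi(x)/b(x)).
            return clf.decision_function(X_pi) + np.log(len(X_b) / len(X_pi))

    Subtracting the marginal-ratio estimate from the joint-ratio estimate reproduces the right-hand side of Eq. (2), after which r(x) and V(x) can be fit by least squares in the same way as in the previous sketch.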

    RECORDING MEDIUM, POLICY IMPROVING METHOD, AND POLICY IMPROVING APPARATUS

    Publication No.: US20190086876A1

    Publication Date: 2019-03-21

    Application No.: US16130469

    Application Date: 2018-09-13

    Abstract: A non-transitory, computer-readable recording medium stores a program of reinforcement learning by a state-value function. The program causes a computer to execute a process including calculating a TD error based on an estimated state-value function, the TD error being calculated by giving a perturbation to each component of a feedback coefficient matrix that provides a policy; calculating, based on the TD error and the perturbation, an estimated gradient function matrix acquired by estimating a gradient function matrix of the state-value function with respect to the feedback coefficient matrix for a state of a controlled object, when state variation of the controlled object in the reinforcement learning is described by a linear difference equation and an immediate cost or an immediate reward of the controlled object is described in a quadratic form of the state and an input; and updating the feedback coefficient matrix using the estimated gradient function matrix.
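    The abstract describes the gradient estimation only at a high level, so the Python sketch below substitutes a simple finite-difference approximation for illustration: each component of the feedback coefficient matrix F (the policy u = Fx) is perturbed, the change in the observed TD error is used as a gradient estimate, and F is updated. The callback td_error_at, the step size, and the perturbation size are assumptions for this sketch, not the patent's estimator.

        # Rough sketch of one policy-improvement update for a linear controlled
        # object with quadratic immediate cost; td_error_at(F) is assumed to
        # return the TD error observed when feedback matrix F is applied.
        import numpy as np

        def improve_policy(F, td_error_at, step=1e-3, eps=1e-2):
            grad = np.zeros_like(F)
            for i in range(F.shape[0]):
                for j in range(F.shape[1]):
                    dF = np.zeros_like(F)
                    dF[i, j] = eps  # perturb a single component of F
                    # finite-difference estimate of the TD error's sensitivity to F[i, j]
                    grad[i, j] = (td_error_at(F + dF) - td_error_at(F)) / eps
            # move F against the estimated gradient (flip the sign when the
            # objective is an immediate reward rather than an immediate cost)
            return F - step * grad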
