-
Publication (announcement) number: US20180032864A1
Publication (announcement) date: 2018-02-01
Application number: US15280784
Application date: 2016-09-29
Applicant: Google Inc.
Inventor: Thore Kurt Hartwig Graepel, Shih-Chieh Huang, David Silver, Arthur Clement Guez, Laurent Sifre, Ilya Sutskever, Christopher Maddison
CPC classification number: G06N3/08, G06F16/9027, G06N3/0427, G06N3/0454, G06N5/003, G16B40/00, G16H50/20
Abstract: Methods, systems and apparatus, including computer programs encoded on computer storage media, for training a value neural network that is configured to receive an observation characterizing a state of an environment being interacted with by an agent and to process the observation in accordance with parameters of the value neural network to generate a value score. One of the systems performs operations that include training a supervised learning policy neural network; initializing initial values of parameters of a reinforcement learning policy neural network having a same architecture as the supervised learning policy network to the trained values of the parameters of the supervised learning policy neural network; training the reinforcement learning policy neural network on second training data; and training the value neural network to generate a value score for the state of the environment that represents a predicted long-term reward resulting from the environment being in the state.
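Illustrative sketch (not part of the patent record): the four operations named in the abstract can be outlined on a toy problem as below. Everything here is an assumption made for illustration: the one-step environment, the two-parameter logistic "policy network", the tanh "value network", the REINFORCE update, and all function names are hypothetical stand-ins, not the applicant's implementation. In particular, the "second training data" is modeled simply as episodes sampled by the policy itself.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(x, a):
    # Toy one-step environment (assumption): action 1 is correct when x > 0.
    return 1.0 if a == (1 if x > 0 else 0) else -1.0

def p_action1(params, x):
    # Probability of action 1 under a two-parameter logistic policy.
    w, b = params
    return sigmoid(w * x + b)

def train_sl_policy(data, lr=0.5, epochs=50):
    # Operation 1: supervised learning on (observation, expert action) pairs.
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, a in data:
            g = a - p_action1((w, b), x)  # log-likelihood gradient w.r.t. the logit
            w += lr * g * x
            b += lr * g
    return [w, b]

def train_rl_policy(sl_params, rng, lr=0.1, episodes=500):
    # Operation 2: same architecture, initialized to the trained SL values.
    params = list(sl_params)
    # Operation 3: policy-gradient (REINFORCE) updates on sampled episodes.
    for _ in range(episodes):
        x = rng.choice([-1.0, 1.0])
        p = p_action1(params, x)
        a = 1 if rng.random() < p else 0
        g = (a - p) * reward(x, a)        # reward times grad of log-probability
        params[0] += lr * g * x
        params[1] += lr * g
    return params

def train_value_net(rl_params, rng, lr=0.05, episodes=2000):
    # Operation 4: regress v(x) = tanh(u*x + c) toward the outcome obtained
    # when the RL policy acts from state x.
    u, c = 0.0, 0.0
    for _ in range(episodes):
        x = rng.choice([-1.0, 1.0])
        a = 1 if rng.random() < p_action1(rl_params, x) else 0
        z = reward(x, a)
        v = math.tanh(u * x + c)
        g = (z - v) * (1.0 - v * v)       # descent direction for squared error
        u += lr * g * x
        c += lr * g
    return [u, c]

rng = random.Random(0)
expert_data = [(-1.0, 0), (1.0, 1)] * 10  # hypothetical expert demonstrations
sl = train_sl_policy(expert_data)
rl = train_rl_policy(sl, rng)
value = train_value_net(rl, rng)
```

In this sketch the value score for a state is `math.tanh(value[0] * x + value[1])`, a stand-in for the predicted long-term reward described in the abstract.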
-
Publication (announcement) number: US20180032863A1
Publication (announcement) date: 2018-02-01
Application number: US15280711
Application date: 2016-09-29
Applicant: Google Inc.
Inventor: Thore Kurt Hartwig Graepel, Shih-Chieh Huang, David Silver, Arthur Clement Guez, Laurent Sifre, Ilya Sutskever, Christopher Maddison
CPC classification number: G06N3/08, G05B13/027, G06N3/04, G06N3/0427, G06N3/0454, G16B40/00, G16H50/20
Abstract: Methods, systems and apparatus, including computer programs encoded on computer storage media, for training a value neural network that is configured to receive an observation characterizing a state of an environment being interacted with by an agent and to process the observation in accordance with parameters of the value neural network to generate a value score. One of the systems performs operations that include training a supervised learning policy neural network; initializing initial values of parameters of a reinforcement learning policy neural network having a same architecture as the supervised learning policy network to the trained values of the parameters of the supervised learning policy neural network; training the reinforcement learning policy neural network on second training data; and training the value neural network to generate a value score for the state of the environment that represents a predicted long-term reward resulting from the environment being in the state.
-