-
公开(公告)号:US20240412063A1
公开(公告)日:2024-12-12
申请号:US18698218
申请日:2022-10-05
Applicant: DeepMind Technologies Limited
Inventor: Oleg O. Sushkov , Todor Bozhinov Davchev , Jonathan Karl Scholz
IPC: G06N3/08
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a reinforcement learning system to select actions to be performed by an agent interacting with an environment to perform a particular task. In one aspect, one of the methods includes obtaining a training sequence comprising a respective training observations at each of a plurality of time steps; obtaining demonstration data comprising one or more demonstration sequences; generating a new training sequence from the training sequence and the demonstration data; and training the goal-conditioned policy neural network on the new training sequence through reinforcement learning.
-
公开(公告)号:US20230023189A1
公开(公告)日:2023-01-26
申请号:US17962008
申请日:2022-10-07
Applicant: DeepMind Technologies Limited
Inventor: Olivier Pietquin , Martin Riedmiller , Wang Fumin , Bilal Piot , Mel Vecerik , Todd Andrew Hester , Thomas Rothoerl , Thomas Lampe , Nicolas Manfred Otto Heess , Jonathan Karl Scholz
Abstract: An off-policy reinforcement learning actor-critic neural network system configured to select actions from a continuous action space to be performed by an agent interacting with an environment to perform a task. An observation defines environment state data and reward data. The system has an actor neural network which learns a policy function mapping the state data to action data. A critic neural network learns an action-value (Q) function. A replay buffer stores tuples of the state data, the action data, the reward data and new state data. The replay buffer also includes demonstration transition data comprising a set of the tuples from a demonstration of the task within the environment. The neural network system is configured to train the actor neural network and the critic neural network off-policy using stored tuples from the replay buffer comprising tuples both from operation of the system and from the demonstration transition data.
-
公开(公告)号:US11868882B2
公开(公告)日:2024-01-09
申请号:US16624245
申请日:2018-06-28
Applicant: DEEPMIND TECHNOLOGIES LIMITED
Inventor: Olivier Claude Pietquin , Martin Riedmiller , Wang Fumin , Bilal Piot , Mel Vecerik , Todd Andrew Hester , Thomas Rothoerl , Thomas Lampe , Nicolas Manfred Otto Heess , Jonathan Karl Scholz
Abstract: An off-policy reinforcement learning actor-critic neural network system configured to select actions from a continuous action space to be performed by an agent interacting with an environment to perform a task. An observation defines environment state data and reward data. The system has an actor neural network which learns a policy function mapping the state data to action data. A critic neural network learns an action-value (Q) function. A replay buffer stores tuples of the state data, the action data, the reward data and new state data. The replay buffer also includes demonstration transition data comprising a set of the tuples from a demonstration of the task within the environment. The neural network system is configured to train the actor neural network and the critic neural network off-policy using stored tuples from the replay buffer comprising tuples both from operation of the system and from the demonstration transition data.
-
公开(公告)号:US11836599B2
公开(公告)日:2023-12-05
申请号:US17331614
申请日:2021-05-26
Applicant: DeepMind Technologies Limited
Inventor: Richard Andrew Evans , Jim Gao , Michael C. Ryan , Gabriel Dulac-Arnold , Jonathan Karl Scholz , Todd Andrew Hester
CPC classification number: G06N3/045
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for improving operational efficiency within a data center by modeling data center performance and predicting power usage efficiency. An example method receives a state input characterizing a current state of a data center. For each data center setting slate, the state input and the data center setting slate are processed through an ensemble of machine learning models. Each machine learning model is configured to receive and process the state input and the data center setting slate to generate an efficiency score that characterizes a predicted resource efficiency of the data center if the data center settings defined by the data center setting slate are adopted t. The method selects, based on the efficiency scores for the data center setting slates, new values for the data center settings.
-
公开(公告)号:US11468321B2
公开(公告)日:2022-10-11
申请号:US16624245
申请日:2018-06-28
Applicant: DEEPMIND TECHNOLOGIES LIMITED
Inventor: Olivier Claude Pietquin , Martin Riedmiller , Wang Fumin , Bilal Piot , Mel Vecerik , Todd Andrew Hester , Thomas Rothoerl , Thomas Lampe , Nicolas Manfred Otto Heess , Jonathan Karl Scholz
Abstract: An off-policy reinforcement learning actor-critic neural network system configured to select actions from a continuous action space to be performed by an agent interacting with an environment to perform a task. An observation defines environment state data and reward data. The system has an actor neural network which learns a policy function mapping the state data to action data. A critic neural network learns an action-value (Q) function. A replay buffer stores tuples of the state data, the action data, the reward data and new state data. The replay buffer also includes demonstration transition data comprising a set of the tuples from a demonstration of the task within the environment. The neural network system is configured to train the actor neural network and the critic neural network off-policy using stored tuples from the replay buffer comprising tuples both from operation of the system and from the demonstration transition data.
-
公开(公告)号:US20200272889A1
公开(公告)日:2020-08-27
申请号:US16863357
申请日:2020-04-30
Applicant: DeepMind Technologies Limited
Inventor: Richard Andrew Evans , Jim Gao , Michael C. Ryan , Gabriel Dulac-Arnold , Jonathan Karl Scholz , Todd Andrew Hester
IPC: G06N3/04
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for improving operational efficiency within a data center by modeling data center performance and predicting power usage efficiency. An example method receives a state input characterizing a current state of a data center. For each data center setting slate, the state input and the data center setting slate are processed through an ensemble of machine learning models. Each machine learning model is configured to receive and process the state input and the data center setting slate to generate an efficiency score that characterizes a predicted resource efficiency of the data center if the data center settings defined by the data center setting slate are adopted t. The method selects, based on the efficiency scores for the data center setting slates, new values for the data center settings.
-
公开(公告)号:US20200151562A1
公开(公告)日:2020-05-14
申请号:US16624245
申请日:2018-06-28
Applicant: DEEPMIND TECHNOLOGIES LIMITED
Inventor: Olivier Pietquin , Martin Riedmiller , Wang Fumin , Bilal Piot , Mel Vecerik , Todd Andrew Hester , Thomas Rothörl , Thomas Lampe , Nicolas Manfred Otto Heess , Jonathan Karl Scholz
Abstract: An off-policy reinforcement learning actor-critic neural network system configured to select actions from a continuous action space to be performed by an agent interacting with an environment to perform a task. An observation defines environment state data and reward data. The system has an actor neural network which learns a policy function mapping the state data to action data. A critic neural network learns an action-value (Q) function. A replay buffer stores tuples of the state data, the action data, the reward data and new state data. The replay buffer also includes demonstration transition data comprising a set of the tuples from a demonstration of the task within the environment. The neural network system is configured to train the actor neural network and the critic neural network off-policy using stored tuples from the replay buffer comprising tuples both from operation of the system and from the demonstration transition data.
-
公开(公告)号:US20240042600A1
公开(公告)日:2024-02-08
申请号:US18331632
申请日:2023-06-08
Applicant: DeepMind Technologies Limited
Inventor: Serkan Cabi , Ziyu Wang , Alexander Novikov , Ksenia Konyushkova , Sergio Gomez Colmenarejo , Scott Ellison Reed , Misha Man Ray Denil , Jonathan Karl Scholz , Oleg O. Sushkov , Rae Chan Jeong , David Barker , David Budden , Mel Vecerik , Yusuf Aytar , Joao Ferdinando Gomes de Freitas
IPC: B25J9/16
CPC classification number: B25J9/161 , B25J9/163 , B25J9/1661
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for data-driven robotic control. One of the methods includes maintaining robot experience data; obtaining annotation data; training, on the annotation data, a reward model; generating task-specific training data for the particular task, comprising, for each experience in a second subset of the experiences in the robot experience data: processing the observation in the experience using the trained reward model to generate a reward prediction, and associating the reward prediction with the experience; and training a policy neural network on the task-specific training data for the particular task, wherein the policy neural network is configured to receive a network input comprising an observation and to generate a policy output that defines a control policy for a robot performing the particular task.
-
公开(公告)号:US11886997B2
公开(公告)日:2024-01-30
申请号:US17962008
申请日:2022-10-07
Applicant: DeepMind Technologies Limited
Inventor: Olivier Pietquin , Martin Riedmiller , Wang Fumin , Bilal Piot , Mel Vecerik , Todd Andrew Hester , Thomas Rothoerl , Thomas Lampe , Nicolas Manfred Otto Heess , Jonathan Karl Scholz
Abstract: An off-policy reinforcement learning actor-critic neural network system configured to select actions from a continuous action space to be performed by an agent interacting with an environment to perform a task. An observation defines environment state data and reward data. The system has an actor neural network which learns a policy function mapping the state data to action data. A critic neural network learns an action-value (Q) function. A replay buffer stores tuples of the state data, the action data, the reward data and new state data. The replay buffer also includes demonstration transition data comprising a set of the tuples from a demonstration of the task within the environment. The neural network system is configured to train the actor neural network and the critic neural network off-policy using stored tuples from the replay buffer comprising tuples both from operation of the system and from the demonstration transition data.
-
公开(公告)号:US11712799B2
公开(公告)日:2023-08-01
申请号:US17020294
申请日:2020-09-14
Applicant: DeepMind Technologies Limited
Inventor: Serkan Cabi , Ziyu Wang , Alexander Novikov , Ksenia Konyushkova , Sergio Gomez Colmenarejo , Scott Ellison Reed , Misha Man Ray Denil , Jonathan Karl Scholz , Oleg O. Sushkov , Rae Chan Jeong , David Barker , David Budden , Mel Vecerik , Yusuf Aytar , Joao Ferdinando Gomes de Freitas
IPC: B25J9/16
CPC classification number: B25J9/161 , B25J9/163 , B25J9/1661
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for data-driven robotic control. One of the methods includes maintaining robot experience data; obtaining annotation data; training, on the annotation data, a reward model; generating task-specific training data for the particular task, comprising, for each experience in a second subset of the experiences in the robot experience data: processing the observation in the experience using the trained reward model to generate a reward prediction, and associating the reward prediction with the experience; and training a policy neural network on the task-specific training data for the particular task, wherein the policy neural network is configured to receive a network input comprising an observation and to generate a policy output that defines a control policy for a robot performing the particular task.
-
-
-
-
-
-
-
-
-