TRAINING A POLICY NEURAL NETWORK AND A VALUE NEURAL NETWORK

    Publication Number: US20180032863A1

    Publication Date: 2018-02-01

    Application Number: US15280711

    Filing Date: 2016-09-29

    Applicant: Google Inc.

    Abstract: Methods, systems and apparatus, including computer programs encoded on computer storage media, for training a value neural network that is configured to receive an observation characterizing a state of an environment being interacted with by an agent and to process the observation in accordance with parameters of the value neural network to generate a value score. One of the systems performs operations that include training a supervised learning policy neural network; initializing initial values of parameters of a reinforcement learning policy neural network having a same architecture as the supervised learning policy network to the trained values of the parameters of the supervised learning policy neural network; training the reinforcement learning policy neural network on second training data; and training the value neural network to generate a value score for the state of the environment that represents a predicted long-term reward resulting from the environment being in the state.
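
    A minimal, illustrative sketch (not the patented implementation) of the pipeline this abstract outlines: train a supervised policy network, seed an identically shaped reinforcement-learning policy network with its weights, train that policy further, and train a value network to predict long-term reward. The network sizes, PyTorch optimizers, and placeholder data below are all assumptions.

```python
# Sketch only: supervised policy -> RL policy init -> RL training -> value network.
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, N_ACTIONS, BATCH = 16, 4, 256  # assumed toy dimensions

def make_policy():
    # Same architecture for the supervised and RL policy networks.
    return nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

# 1) Train the supervised-learning (SL) policy on labelled (observation, action) pairs.
sl_policy = make_policy()
sl_opt = torch.optim.Adam(sl_policy.parameters(), lr=1e-3)
obs = torch.randn(BATCH, OBS_DIM)                      # placeholder observations
expert_actions = torch.randint(0, N_ACTIONS, (BATCH,)) # placeholder expert labels
for _ in range(10):
    loss = F.cross_entropy(sl_policy(obs), expert_actions)
    sl_opt.zero_grad(); loss.backward(); sl_opt.step()

# 2) Initialize the RL policy's parameters from the trained SL policy (same architecture).
rl_policy = make_policy()
rl_policy.load_state_dict(sl_policy.state_dict())

# 3) Train the RL policy with a simple policy-gradient step on placeholder returns.
rl_opt = torch.optim.Adam(rl_policy.parameters(), lr=1e-4)
logits = rl_policy(obs)
actions = torch.distributions.Categorical(logits=logits).sample()
returns = torch.randn(BATCH)                           # placeholder long-term rewards
log_probs = F.log_softmax(logits, dim=-1)[torch.arange(BATCH), actions]
pg_loss = -(log_probs * returns).mean()
rl_opt.zero_grad(); pg_loss.backward(); rl_opt.step()

# 4) Train the value network to score a state by its predicted long-term reward.
value_net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
v_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)
v_loss = F.mse_loss(value_net(obs).squeeze(-1), returns)
v_opt.zero_grad(); v_loss.backward(); v_opt.step()
```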

    SELECTING REINFORCEMENT LEARNING ACTIONS USING GOALS AND OBSERVATIONS
    Invention application; status: pending (published)

    Publication Number: US20160292568A1

    Publication Date: 2016-10-06

    Application Number: US15091840

    Filing Date: 2016-04-06

    Applicant: Google Inc.

    CPC classification number: G06N3/08 G06N3/0454 G06N20/00

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for reinforcement learning using goals and observations. One of the methods includes receiving an observation characterizing a current state of the environment; receiving a goal characterizing a target state from a set of target states of the environment; processing the observation using an observation neural network to generate a numeric representation of the observation; processing the goal using a goal neural network to generate a numeric representation of the goal; combining the numeric representation of the observation and the numeric representation of the goal to generate a combined representation; processing the combined representation using an action score neural network to generate a respective score for each action in the predetermined set of actions; and selecting the action to be performed using the respective scores for the actions in the predetermined set of actions.
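
    A minimal, illustrative sketch (not the patented implementation) of the forward pass this abstract describes: separate observation and goal networks produce numeric representations that are combined and scored per action. The embedding sizes and the use of concatenation to combine the two representations are assumptions.

```python
# Sketch only: goal-conditioned action selection over a predetermined action set.
import torch
import torch.nn as nn

OBS_DIM, GOAL_DIM, EMB, N_ACTIONS = 16, 8, 32, 5  # assumed toy dimensions

observation_net = nn.Sequential(nn.Linear(OBS_DIM, EMB), nn.ReLU())
goal_net = nn.Sequential(nn.Linear(GOAL_DIM, EMB), nn.ReLU())
action_score_net = nn.Sequential(nn.Linear(2 * EMB, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

def select_action(observation: torch.Tensor, goal: torch.Tensor) -> int:
    obs_repr = observation_net(observation)              # numeric representation of the observation
    goal_repr = goal_net(goal)                            # numeric representation of the goal
    combined = torch.cat([obs_repr, goal_repr], dim=-1)   # combined representation (concatenation assumed)
    scores = action_score_net(combined)                   # one score per action in the set
    return int(scores.argmax())                           # select the action using the scores

action = select_action(torch.randn(OBS_DIM), torch.randn(GOAL_DIM))
```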

    TRAINING NEURAL NETWORKS USING A PRIORITIZED EXPERIENCE MEMORY

    Publication Number: US20170140269A1

    Publication Date: 2017-05-18

    Application Number: US15349894

    Filing Date: 2016-11-11

    Applicant: Google Inc.

    CPC classification number: G06N3/08 G06N3/088 Y04S10/54

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a neural network used to select actions performed by a reinforcement learning agent interacting with an environment. In one aspect, a method includes maintaining a replay memory, where the replay memory stores pieces of experience data generated as a result of the reinforcement learning agent interacting with the environment. Each piece of experience data is associated with a respective expected learning progress measure that is a measure of an expected amount of progress made in the training of the neural network if the neural network is trained on the piece of experience data. The method further includes selecting a piece of experience data from the replay memory by prioritizing for selection pieces of experience data having relatively higher expected learning progress measures and training the neural network on the selected piece of experience data.
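
    A minimal, illustrative sketch (not the patented implementation) of a replay memory in which each piece of experience carries an expected-learning-progress measure and sampling favours higher measures. Using the absolute TD error as that measure, and plain proportional sampling instead of a sum-tree, are assumptions made for brevity.

```python
# Sketch only: prioritized replay memory with progress-based sampling.
import random
from dataclasses import dataclass, field

@dataclass
class PrioritizedReplay:
    capacity: int = 10_000
    data: list = field(default_factory=list)        # stored experience tuples
    priorities: list = field(default_factory=list)  # expected-learning-progress measures

    def add(self, experience, priority: float) -> None:
        if len(self.data) >= self.capacity:         # drop the oldest entry when full
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(experience)
        self.priorities.append(priority)

    def sample(self):
        # Prioritize selection of experience with relatively higher measures.
        idx = random.choices(range(len(self.data)), weights=self.priorities, k=1)[0]
        return idx, self.data[idx]

    def update_priority(self, idx: int, priority: float) -> None:
        # Refresh the measure after training on the sampled experience.
        self.priorities[idx] = priority

memory = PrioritizedReplay()
memory.add(("state", "action", 1.0, "next_state"), priority=abs(0.7))  # |TD error| as a proxy
idx, experience = memory.sample()
memory.update_priority(idx, priority=abs(0.1))
```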

    CONTINUOUS CONTROL WITH DEEP REINFORCEMENT LEARNING
    Invention application; status: pending (published)

    Publication Number: US20170024643A1

    Publication Date: 2017-01-26

    Application Number: US15217758

    Filing Date: 2016-07-22

    Applicant: Google Inc.

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training an actor neural network used to select actions to be performed by an agent interacting with an environment. One of the methods includes obtaining a minibatch of experience tuples; and updating current values of the parameters of the actor neural network, comprising: for each experience tuple in the minibatch: processing the training observation and the training action in the experience tuple using a critic neural network to determine a neural network output for the experience tuple, and determining a target neural network output for the experience tuple; updating current values of the parameters of the critic neural network using errors between the target neural network outputs and the neural network outputs; and updating the current values of the parameters of the actor neural network using the critic neural network.
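
    A minimal, illustrative sketch (not the patented implementation) of one minibatch update in the actor-critic scheme this abstract describes: the critic is updated using the error between its outputs and target outputs, and the actor is updated through the critic. The use of target networks, the tiny architectures, and the hyperparameters are assumptions.

```python
# Sketch only: one actor-critic minibatch update on placeholder experience tuples.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, ACT_DIM, GAMMA, BATCH = 16, 4, 0.99, 64  # assumed toy settings

actor = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, ACT_DIM), nn.Tanh())
critic = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Placeholder minibatch of experience tuples (observation, action, reward, next observation).
obs, act = torch.randn(BATCH, OBS_DIM), torch.randn(BATCH, ACT_DIM)
reward, next_obs = torch.randn(BATCH, 1), torch.randn(BATCH, OBS_DIM)

# Critic update: error between the target network output and the critic output.
with torch.no_grad():
    target_q = reward + GAMMA * target_critic(torch.cat([next_obs, target_actor(next_obs)], dim=-1))
q = critic(torch.cat([obs, act], dim=-1))
critic_loss = F.mse_loss(q, target_q)
critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

# Actor update: adjust the actor so the critic scores its actions more highly.
actor_loss = -critic(torch.cat([obs, actor(obs)], dim=-1)).mean()
actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```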

    DISTRIBUTED TRAINING OF REINFORCEMENT LEARNING SYSTEMS
    Invention application; status: pending (published)

    Publication Number: US20160232445A1

    Publication Date: 2016-08-11

    Application Number: US15016173

    Filing Date: 2016-02-04

    Applicant: Google Inc.

    CPC classification number: G06N3/08 G06N3/0454 G06N3/0472

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for distributed training of reinforcement learning systems. One of the methods includes receiving, by a learner, current values of the parameters of the Q network from a parameter server, wherein each learner maintains a respective learner Q network replica and a respective target Q network replica; updating, by the learner, the parameters of the learner Q network replica maintained by the learner using the current values; selecting, by the learner, an experience tuple from a respective replay memory; computing, by the learner, a gradient from the experience tuple using the learner Q network replica maintained by the learner and the target Q network replica maintained by the learner; and providing, by the learner, the computed gradient to the parameter server.
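
    A minimal, illustrative sketch (not the patented implementation) of one learner iteration in the distributed scheme this abstract describes. A real deployment would communicate with a remote parameter server; here the server is simulated in-process, and the tiny Q network, replay contents, and update rule are assumptions.

```python
# Sketch only: one learner step with Q-network and target-Q-network replicas.
import copy
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, N_ACTIONS, GAMMA = 8, 3, 0.99  # assumed toy settings

parameter_server = nn.Sequential(nn.Linear(OBS_DIM, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
learner_q = copy.deepcopy(parameter_server)   # learner's Q network replica
target_q = copy.deepcopy(parameter_server)    # learner's target Q network replica
replay = [(torch.randn(OBS_DIM), random.randrange(N_ACTIONS), 1.0, torch.randn(OBS_DIM))
          for _ in range(100)]                # placeholder replay memory

# 1) Receive current parameter values from the parameter server and update the replica.
learner_q.load_state_dict(parameter_server.state_dict())

# 2) Select an experience tuple from the learner's replay memory.
obs, action, reward, next_obs = random.choice(replay)

# 3) Compute a gradient from the tuple using the learner and target replicas.
with torch.no_grad():
    target = reward + GAMMA * target_q(next_obs).max()
loss = F.smooth_l1_loss(learner_q(obs)[action], target)  # loss on the TD error
loss.backward()
gradient = [p.grad.clone() for p in learner_q.parameters()]

# 4) Provide the computed gradient to the parameter server (applied centrally).
for server_param, grad in zip(parameter_server.parameters(), gradient):
    server_param.data -= 1e-3 * grad          # simple SGD step standing in for the server
```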

    Pose estimation using long range features
    Granted patent; status: in force

    Publication Number: US09062979B1

    Publication Date: 2015-06-23

    Application Number: US13936522

    Filing Date: 2013-07-08

    Applicant: Google Inc.

    CPC classification number: G01C21/30 G01C21/20 G01C21/26 G01S17/875 G01S17/89

    Abstract: Aspects of the present disclosure relate to using an object detected at long range to increase the accuracy of a location and heading estimate based on near range information. For example, an autonomous vehicle may use data points collected from a sensor such as a laser to generate an environmental map of environmental features. The environmental map is then compared to pre-stored map data to determine the vehicle's geographic location and heading. A second sensor, such as a laser or camera, having a longer range than the first sensor may detect an object outside of the range and field of view of the first sensor. For example, the object may have retroreflective properties which make it identifiable in a camera image or from laser data points. The location of the object is then compared to the pre-stored map data and used to refine the vehicle's estimated location and heading.
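
    A minimal, illustrative sketch (not the patented implementation) of the refinement step this abstract describes: an initial pose estimate from near-range map matching is adjusted using one long-range landmark whose position is known from the pre-stored map. The 2-D geometry, the single-landmark case, and the blending weight are assumptions.

```python
# Sketch only: refine a (x, y, heading) estimate using one long-range landmark.
import math

def refine_pose(est_x, est_y, est_heading, landmark_map_xy, landmark_range,
                landmark_bearing, weight=0.3):
    """Blend the near-range pose estimate with a correction derived from a
    long-range landmark observed at (range, bearing) relative to the vehicle."""
    map_x, map_y = landmark_map_xy
    # Where the landmark would be if the current pose estimate were exact.
    pred_x = est_x + landmark_range * math.cos(est_heading + landmark_bearing)
    pred_y = est_y + landmark_range * math.sin(est_heading + landmark_bearing)
    # Shift the position estimate by a fraction of the observed discrepancy.
    new_x = est_x + weight * (map_x - pred_x)
    new_y = est_y + weight * (map_y - pred_y)
    # Nudge the heading so the predicted bearing to the landmark matches the map.
    map_bearing = math.atan2(map_y - new_y, map_x - new_x)
    pred_bearing = est_heading + landmark_bearing
    new_heading = est_heading + weight * (map_bearing - pred_bearing)
    return new_x, new_y, new_heading

pose = refine_pose(est_x=10.0, est_y=5.0, est_heading=0.1,
                   landmark_map_xy=(110.0, 6.0), landmark_range=100.0,
                   landmark_bearing=-0.05)
```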
