REINFORCEMENT LEARNING BY SOLUTION OF A CONVEX MARKOV DECISION PROCESS

    Publication Number: US20240249151A1

    Publication Date: 2024-07-25

    Application Number: US18558894

    Filing Date: 2022-05-27

    CPC classification number: G06N3/092 G06N3/045

    Abstract: The actions of an agent in an environment are selected using a policy model neural network which implements a policy model defining, for any observed state of the environment characterized by an observation received by the policy model neural network, a state-action distribution over the set of possible actions the agent can perform. The policy model neural network is jointly trained with a cost model neural network which, upon receiving an observation characterizing the environment, outputs a reward vector. The reward vector comprises a corresponding reward value for every possible action. The training involves a sequence of iterations, in each of which (a) a cost model is derived based on the state-action distribution of a candidate policy model defined in one or more previous iterations, and subsequently (b) a new candidate policy model is obtained based on the reward vector(s) defined by the cost model derived in step (a) of that iteration.
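
    The following is a minimal tabular sketch of the alternating cost-model / policy-model iteration the abstract describes, written as a Frank-Wolfe-style update for a convex objective on state-action occupancies. The toy MDP, the entropy objective f, and helper names such as occupancy and greedy_policy are illustrative assumptions, not the patent's construction (which uses neural networks for both models).

    import numpy as np

    S, A, gamma, H = 4, 2, 0.9, 60               # states, actions, discount, truncation horizon
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(S), size=(S, A))   # transition probabilities P[s, a, s']

    def occupancy(policy):
        # Discounted state-action occupancy of a deterministic policy, truncated at horizon H.
        d = np.zeros((S, A))
        mu = np.full(S, 1.0 / S)                 # uniform start-state distribution
        for t in range(H):
            d += (gamma ** t) * mu[:, None] * policy
            mu = np.einsum("sa,sat->t", mu[:, None] * policy, P)
        return (1 - gamma) * d

    def greedy_policy(reward):
        # Solve the standard MDP for a fixed reward vector via value iteration.
        q = np.zeros((S, A))
        for _ in range(200):
            q = reward + gamma * np.einsum("sat,t->sa", P, q.max(axis=1))
        pi = np.zeros((S, A))
        pi[np.arange(S), q.argmax(axis=1)] = 1.0
        return pi

    def f(d):
        # Example convex objective on occupancies: negative entropy (an exploration objective).
        return np.sum(d * np.log(d + 1e-12))

    d = occupancy(greedy_policy(np.zeros((S, A))))   # occupancy of an initial candidate policy
    for k in range(1, 50):
        # (a) cost model: a reward value for every (state, action), derived from the
        #     state-action distribution of the candidate policy from previous iterations.
        reward = -(np.log(d + 1e-12) + 1.0)          # negative gradient of f at d
        # (b) candidate policy model: best response to the reward vector from step (a).
        pi_k = greedy_policy(reward)
        d = (1 - 2 / (k + 2)) * d + (2 / (k + 2)) * occupancy(pi_k)   # Frank-Wolfe mixing

    print("final objective f(d):", float(f(d)))

    In the patented setting, step (a) would be realized by training the cost model neural network and step (b) by training the policy model neural network against the cost model's reward vectors; the tabular solvers above simply stand in for those learned components.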

    METHODS AND SYSTEMS FOR CONSTRAINED REINFORCEMENT LEARNING

    Publication Number: US20240265263A1

    Publication Date: 2024-08-08

    Application Number: US18424437

    Filing Date: 2024-01-26

    CPC classification number: G06N3/091

    Abstract: A method is described for iteratively training a policy model, such as a neural network, of a computer-implemented action selection system to control an agent interacting with an environment to perform a task subject to one or more constraints. The task has a reward associated with performance of the task. Each constraint limits, to a corresponding threshold, the expected value of the total of a corresponding constraint function when the future actions of the agent are chosen according to the policy model, and each constraint is associated with a corresponding multiplier variable. In each iteration, a mixed reward function is generated based on values for the multiplier variables generated in the preceding iteration, and on estimates of the reward and of the constraint function values obtained if the actions are chosen based on the policy model generated in the preceding iteration.
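
    Below is a minimal tabular sketch of the Lagrangian-style iteration the abstract describes: each iteration forms a mixed reward from the previous multiplier values, improves the policy against it, and adjusts the multipliers using estimated constraint totals. The toy MDP, the single constraint, and helper names such as solve and expected_total are illustrative assumptions rather than the patent's method.

    import numpy as np

    S, A, gamma = 4, 2, 0.9
    rng = np.random.default_rng(1)
    P = rng.dirichlet(np.ones(S), size=(S, A))    # transition probabilities P[s, a, s']
    reward = rng.uniform(size=(S, A))             # task reward
    cost = rng.uniform(size=(S, A))               # one constraint function
    threshold = 3.0                               # threshold on the expected constraint total
    lam, lr_lam = 0.0, 0.05                       # multiplier variable and its step size

    def solve(r):
        # Greedy policy for reward r via value iteration (stands in for policy-model training).
        q = np.zeros((S, A))
        for _ in range(200):
            q = r + gamma * np.einsum("sat,t->sa", P, q.max(axis=1))
        pi = np.zeros((S, A))
        pi[np.arange(S), q.argmax(axis=1)] = 1.0
        return pi

    def expected_total(r, pi):
        # Expected discounted total of r under policy pi from a uniform start-state distribution.
        total, mu = 0.0, np.full(S, 1.0 / S)
        for t in range(300):
            total += (gamma ** t) * np.sum(mu[:, None] * pi * r)
            mu = np.einsum("sa,sat->t", mu[:, None] * pi, P)
        return total

    for it in range(50):
        mixed = reward - lam * cost                   # mixed reward from the preceding multiplier
        pi = solve(mixed)                             # policy model for this iteration
        c_total = expected_total(cost, pi)            # estimated constraint total under pi
        lam = max(0.0, lam + lr_lam * (c_total - threshold))  # raise multiplier if constraint is violated

    print("multiplier:", lam, "constraint total:", expected_total(cost, pi))

    The projected ascent step on the multiplier drives the policy toward satisfying the constraint: whenever the estimated constraint total exceeds its threshold, the multiplier grows and the mixed reward penalizes the constrained behavior more heavily in the next iteration.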
