REWARD-MODEL BASED REINFORCEMENT LEARNING FOR PERFORMING REASONING TASKS

    Publication number: US20240104391A1

    Publication date: 2024-03-28

    Application number: US18475743

    Application date: 2023-09-27

    CPC classification number: G06N3/092

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a language model for performing a reasoning task. The system obtains a plurality of training examples. Each training example includes a respective sample query text sequence characterizing a respective sample query and a respective reference response text sequence that includes a reference final answer to the respective sample query. The system trains a reward model on the plurality of training examples. The reward model is configured to receive an input including a query text sequence characterizing a query and one or more reasoning steps that have been generated in response to the query and process the input to compute a reward score indicating how successful the one or more reasoning steps are in yielding a correct final answer to the query. The system trains the language model using the trained reward model.
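
The interfaces described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `TrainingExample`, `RewardModel`, and `make_toy_reward_model` are hypothetical, and the "reward model" here is a lookup-based placeholder rather than a trained model. It only shows the shape of the inputs (a query plus one or more reasoning steps) and the output (a scalar reward score). How the scores are then used to update the language model is not specified in the abstract and is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class TrainingExample:
    """One training example, following the abstract's description."""
    query: str               # sample query text sequence
    reference_response: str  # reference response text sequence
    reference_answer: str    # reference final answer contained in the response


# Assumed interface: (query text, reasoning steps so far) -> reward score.
RewardModel = Callable[[str, Sequence[str]], float]


def make_toy_reward_model(examples: List[TrainingExample]) -> RewardModel:
    """Build a placeholder reward function from the training examples.

    Assumption for illustration only: instead of training a neural model,
    this memorizes each reference final answer and returns 1.0 when the
    last reasoning step contains it, and 0.0 otherwise.
    """
    answers: Dict[str, str] = {ex.query: ex.reference_answer for ex in examples}

    def reward(query: str, steps: Sequence[str]) -> float:
        target = answers.get(query)
        if target is None or not steps:
            return 0.0
        return 1.0 if target in steps[-1] else 0.0

    return reward


examples = [
    TrainingExample(
        query="What is 6 * 7?",
        reference_response="6 * 7 = 42. The answer is 42.",
        reference_answer="42",
    )
]
reward_model = make_toy_reward_model(examples)

# Score two candidate reasoning chains generated for the same query.
good_steps = ["6 * 7 = 42", "The answer is 42"]
bad_steps = ["6 + 7 = 13", "The answer is 13"]
good_score = reward_model("What is 6 * 7?", good_steps)
bad_score = reward_model("What is 6 * 7?", bad_steps)
```

In a real system the placeholder would be replaced by a learned model that can also score partial reasoning (one or more intermediate steps) for queries it has not seen.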
