GENERATING SYNTHETIC TRAINING DATA FOR PROGRAMMING LANGUAGE TRANSLATION

    Publication number: US20240086164A1

    Publication date: 2024-03-14

    Application number: US17940618

    Application date: 2022-09-08

    Applicant: Google LLC

    Inventors: Lucas Kramer, Bin Ni

    CPC classification numbers: G06F8/51, G06F8/36, G06N20/00

    Abstract: Techniques are described herein for generating synthetic paired source code snippets that are semantically equivalent but syntactically distinct. In various implementations, few-shot learning may be performed to prompt a large language model, based on demonstration source code snippet(s) in syntactically constrained pseudocode, to generate additional source code snippets in the syntactically constrained pseudocode. Based on additional source code snippets in additional programming language(s), the large language model may be used to generate more training source code snippets in the syntactically constrained pseudocode. The training source code snippets in the syntactically constrained pseudocode may be programmatically translated to generate synthetic training pairs of semantically equivalent source code snippets. Each synthetic training pair of the plurality of synthetic training pairs may include training snippets in first and second programming languages, and may be usable to train a machine learning translation model to translate between the first and second programming languages.
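
The final step of the abstract, programmatically translating a constrained pseudocode snippet into a pair of semantically equivalent snippets, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the patent: the pseudocode keywords (FUNC, DECLARE, SET, FOR, RETURN, END), the rule tables, and the helper names `translate` and `make_training_pair` are all hypothetical, and Python and JavaScript merely stand in for the "first and second programming languages". The large language model step that would produce the pseudocode is not shown; a hand-written snippet takes its place.

```python
import re

# Hypothetical constrained pseudocode: one statement per line, blocks closed by END.
# Each rule maps a pseudocode pattern to a template in the target language.
RULES = {
    "python": [
        (r"FUNC (\w+)\((.*)\)", r"def \1(\2):"),
        (r"FOR (\w+) IN (.+)", r"for \1 in \2:"),
        (r"DECLARE (\w+) = (.+)", r"\1 = \2"),
        (r"SET (\w+) = (.+)", r"\1 = \2"),
        (r"RETURN (.+)", r"return \1"),
    ],
    "javascript": [
        (r"FUNC (\w+)\((.*)\)", r"function \1(\2) {"),
        (r"FOR (\w+) IN (.+)", r"for (const \1 of \2) {"),
        (r"DECLARE (\w+) = (.+)", r"let \1 = \2;"),
        (r"SET (\w+) = (.+)", r"\1 = \2;"),
        (r"RETURN (.+)", r"return \1;"),
    ],
}

# Text emitted when a block ends; Python relies on indentation alone.
BLOCK_CLOSE = {"python": None, "javascript": "}"}


def translate(pseudocode: str, target: str) -> str:
    """Rule-based translation of the toy pseudocode into the target language."""
    out, depth = [], 0
    for raw in pseudocode.strip().splitlines():
        line = raw.strip()
        if line == "END":
            depth -= 1
            if BLOCK_CLOSE[target]:
                out.append("    " * depth + BLOCK_CLOSE[target])
            continue
        for pattern, template in RULES[target]:
            match = re.fullmatch(pattern, line)
            if match:
                out.append("    " * depth + match.expand(template))
                break
        else:
            raise ValueError(f"statement outside the constrained syntax: {line!r}")
        if line.startswith(("FUNC ", "FOR ")):
            depth += 1  # FUNC and FOR open a new block
    return "\n".join(out)


def make_training_pair(pseudocode: str) -> dict:
    """Build one synthetic pair: the same program rendered in two languages."""
    return {lang: translate(pseudocode, lang) for lang in ("python", "javascript")}


# Stand-in for a snippet that a language model would have generated.
DEMO = """
FUNC add_all(xs)
  DECLARE total = 0
  FOR x IN xs
    SET total = total + x
  END
  RETURN total
END
"""

pair = make_training_pair(DEMO)
print(pair["python"])
print(pair["javascript"])
```

Because both snippets derive from the same pseudocode through deterministic rules, they are equivalent by construction, which is the property that makes such pairs usable as translation training data. A realistic pseudocode grammar would need a proper parser rather than line-level regular expressions.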
