Diffusion Models for Generation of Audio Data Based on Descriptive Textual Prompts

    Publication number: US20240282294A1

    Publication date: 2024-08-22

    Application number: US18651296

    Filing date: 2024-04-30

    Applicant: Google LLC

    CPC classification number: G10L15/063 G10L15/16

    Abstract: A corpus of textual data is generated with a machine-learned text generation model. The corpus of textual data includes a plurality of sentences. Each sentence is descriptive of a type of audio. For each of a plurality of audio recordings, the audio recording is processed with a machine-learned audio classification model to obtain training data including the audio recording and one or more sentences of the plurality of sentences closest to the audio recording within a joint audio-text embedding space of the machine-learned audio classification model. The sentence(s) are processed with a machine-learned generation model to obtain an intermediate representation of the one or more sentences. The intermediate representation is processed with a machine-learned cascaded diffusion model to obtain audio data. The machine-learned cascaded diffusion model is trained based on a difference between the audio data and the audio recording.
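
The pairing step in the abstract, where each audio recording is matched with the sentence(s) closest to it in the joint audio-text embedding space, can be illustrated with a minimal sketch. This is not the applicant's implementation: the abstract does not name a distance measure, so cosine similarity is an assumption here, and the function names (`cosine`, `closest_sentences`), the toy sentences, and the three-dimensional embedding vectors are all hypothetical stand-ins for the outputs of a real audio classification model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_sentences(audio_embedding, sentence_embeddings, sentences, k=1):
    """Return the k sentences whose embeddings are most similar to the audio embedding.

    Assumption: "closest" is measured by cosine similarity; the source does not specify.
    """
    scored = sorted(
        zip(sentences, sentence_embeddings),
        key=lambda pair: cosine(audio_embedding, pair[1]),
        reverse=True,
    )
    return [sentence for sentence, _ in scored[:k]]


# Toy data (hypothetical): in practice both kinds of embeddings would come from
# the joint audio-text embedding space of the audio classification model.
sentences = [
    "a dog barking in the distance",
    "rain falling on a tin roof",
    "a slow piano melody",
]
sentence_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
audio_embedding = [0.1, 0.9, 0.2]

# Pseudo-labels for this recording: the two nearest sentences.
pseudo_labels = closest_sentences(audio_embedding, sentence_embeddings, sentences, k=2)
print(pseudo_labels)
```

Each recording together with its retrieved sentences would then form one training example for the generation and diffusion stages described in the abstract.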
