-
公开(公告)号:US20240282294A1
公开(公告)日:2024-08-22
申请号:US18651296
申请日:2024-04-30
Applicant: Google LLC
Inventor: Qingqing Huang , Daniel Sung-Joon Park , Aren Jansen , Timo Immanuel Denk , Yue Li , Ravi Ganti , Dan Ellis , Tao Wang , Wei Han , Joonseok Lee
CPC classification number: G10L15/063 , G10L15/16
Abstract: A corpus of textual data is generated with a machine-learned text generation model. The corpus of textual data includes a plurality of sentences. Each sentence is descriptive of a type of audio. For each of a plurality of audio recordings, the audio recording is processed with a machine-learned audio classification model to obtain training data including the audio recording and one or more sentences of the plurality of sentences closest to the audio recording within a joint audio-text embedding space of the machine-learned audio classification model. The sentence(s) are processed with a machine-learned generation model to obtain an intermediate representation of the one or more sentences. The intermediate representation is processed with a machine-learned cascaded diffusion model to obtain audio data. The machine-learned cascaded diffusion model is trained based on a difference between the audio data and the audio recording.