1.
Publication Number: US20240290321A1
Publication Date: 2024-08-29
Application Number: US18585168
Filing Date: 2024-02-23
Applicant: Google LLC
Inventor: Yongqiang Wang, Yu Zhang, Wei Han, Parisa Haghani, Pedro J. Moreno Mengibar
CPC classification number: G10L15/063, G10L15/26
Abstract: A method includes receiving training data including a corpus of multilingual unspoken textual utterances, a corpus of multilingual un-transcribed non-synthetic speech utterances, and a corpus of multilingual transcribed non-synthetic speech utterances. For each un-transcribed non-synthetic speech utterance, the method includes generating a target quantized vector token and a target token index, generating contrastive context vectors from corresponding masked audio features, and deriving a contrastive loss term. The method also includes generating an alignment output, generating a first probability distribution over possible speech recognition hypotheses for the alignment output, and determining an alignment output loss term. The method also includes generating a second probability distribution over possible speech recognition hypotheses and determining a non-synthetic speech loss term. The method also includes pre-training an audio encoder based on the contrastive loss term, the alignment output loss term, and the non-synthetic speech loss term.
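A minimal PyTorch sketch of the three-term pre-training objective summarized in this abstract is given below, under illustrative assumptions: a small GRU stands in for the audio encoder, a frozen random-projection quantizer supplies the target token indices, the contrastive term is simplified to cross-entropy against those indices, and a CTC head stands in for the first and second probability distributions over speech recognition hypotheses. None of these module choices, names, or hyperparameters come from the patent.

```python
# Illustrative sketch only: assumed modules and shapes, not the patented method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PretrainStep(nn.Module):
    def __init__(self, feat_dim=80, enc_dim=256, codebook_size=320, vocab_size=128):
        super().__init__()
        # Audio encoder being pre-trained (a GRU stand-in for illustration).
        self.audio_encoder = nn.GRU(feat_dim, enc_dim, batch_first=True)
        # Frozen random projection and codebook used to produce target token indices.
        self.register_buffer("quant_proj", torch.randn(feat_dim, enc_dim))
        self.register_buffer("codebook", torch.randn(codebook_size, enc_dim))
        # Output head standing in for the probability distributions over hypotheses.
        self.ctc_head = nn.Linear(enc_dim, vocab_size)

    def contrastive_loss(self, masked_feats, clean_feats):
        # Target quantized vector token / token index from the un-masked features.
        with torch.no_grad():
            q = F.normalize(clean_feats @ self.quant_proj, dim=-1)
            codes = F.normalize(self.codebook, dim=-1)
            targets = (q @ codes.t()).argmax(-1)
        # Contrastive context vectors from the masked features; the contrastive
        # term is simplified here to cross-entropy against the codebook indices.
        ctx, _ = self.audio_encoder(masked_feats)
        logits = ctx @ F.normalize(self.codebook, dim=-1).t()
        return F.cross_entropy(logits.flatten(0, 1), targets.flatten())

    def ctc_loss(self, feats, labels, label_lens):
        # Shared by the alignment-output branch and the transcribed-speech branch.
        enc, _ = self.audio_encoder(feats)
        log_probs = self.ctc_head(enc).log_softmax(-1).transpose(0, 1)
        in_lens = torch.full((feats.size(0),), feats.size(1), dtype=torch.long)
        return F.ctc_loss(log_probs, labels, in_lens, label_lens)


step = PretrainStep()
speech = torch.randn(2, 50, 80)                    # un-transcribed non-synthetic speech features
masked = speech * (torch.rand(2, 50, 1) > 0.3)     # crude feature masking
text_align = torch.randn(2, 50, 80)                # alignment outputs from unspoken text (assumed precomputed)
labeled = torch.randn(2, 50, 80)                   # transcribed non-synthetic speech features
labels = torch.randint(1, 128, (2, 12))
lens = torch.full((2,), 12)

loss = (step.contrastive_loss(masked, speech)      # contrastive loss term
        + step.ctc_loss(text_align, labels, lens)  # alignment output loss term
        + step.ctc_loss(labeled, labels, lens))    # non-synthetic speech loss term
loss.backward()
```

The abstract does not specify how the three loss terms are combined, so an unweighted sum is used here.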
-
2.
Publication Number: US20240013777A1
Publication Date: 2024-01-11
Application Number: US18320458
Filing Date: 2023-05-19
Applicant: Google LLC
Inventor: Zhiyun Lu, Yu Zhang, Wei Han, Yongqiang Wang, Parisa Haghani, Zhehuai Chen
CPC classification number: G10L15/16, G10L15/063
Abstract: A method includes obtaining a corpus of unlabeled training data including a plurality of spoken utterances, where each corresponding spoken utterance of the plurality of spoken utterances includes audio data characterizing the corresponding spoken utterance. The method also includes receiving a target domain. The method also includes selecting, using a contrastive data selection model, a subset of the utterances from the corpus of unlabeled training data that correspond to the target domain. The method includes training an automatic speech recognition (ASR) model on the subset of utterances.
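A minimal Python sketch of the selection step summarized above, under the common assumption that the contrastive data selection model scores each utterance by the difference between a target-domain model's score and a background model's score; the function names, data structures, and keep fraction are illustrative assumptions, not taken from the patent.

```python
# Illustrative sketch only: assumed scoring interface, not the patented method.
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Utterance:
    uid: str
    audio: Sequence[float]  # audio data characterizing the spoken utterance


def contrastive_select(
    corpus: List[Utterance],
    target_score: Callable[[Utterance], float],      # e.g. log-likelihood under a target-domain model
    background_score: Callable[[Utterance], float],  # e.g. log-likelihood under a background model
    keep_fraction: float = 0.1,
) -> List[Utterance]:
    """Keep the utterances whose scores are most indicative of the target domain."""
    ranked = sorted(
        corpus,
        key=lambda u: target_score(u) - background_score(u),
        reverse=True,
    )
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


# Usage (hypothetical scoring models): select the in-domain subset, then train
# the ASR model on it with whatever training loop is in use.
# subset = contrastive_select(corpus, target_model.score, background_model.score)
```

The returned subset would then feed an ordinary ASR training loop, which is omitted here.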
-