-
Publication number: US11908448B2
Publication date: 2024-02-20
Application number: US17327076
Filing date: 2021-05-21
Applicant: Google LLC
Inventor: Isaac Elias , Jonathan Shen , Yu Zhang , Ye Jia , Ron J. Weiss , Yonghui Wu , Byungha Chun
IPC: G10L13/08 , G10L13/047 , G06F40/126 , G10L21/10 , G06N3/08 , G06N3/088 , G06N3/044 , G06N3/045 , G06N3/048
CPC classification number: G10L13/08 , G06F40/126 , G06N3/044 , G06N3/045 , G06N3/08 , G06N3/088 , G10L13/047 , G10L21/10 , G06N3/048
Abstract: A method for training a non-autoregressive TTS model includes receiving training data that includes a reference audio signal and a corresponding input text sequence. The method also includes encoding the reference audio signal into a variational embedding that disentangles the style/prosody information from the reference audio signal and encoding the input text sequence into an encoded text sequence. The method also includes predicting a phoneme duration for each phoneme in the input text sequence and determining a phoneme duration loss based on the predicted phoneme durations and a reference phoneme duration. The method also includes generating one or more predicted mel-frequency spectrogram sequences for the input text sequence and determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence. The method also includes training the TTS model based on the final spectrogram loss and the corresponding phoneme duration loss.
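The training objective in this abstract combines a final spectrogram loss with a phoneme duration loss. The following is a minimal sketch of that combination, not the patented implementation. The abstract does not state the distance measure or how the two losses are weighted, so the L1 distance, the summation over predicted sequences, and the names `mean_abs_error`, `total_training_loss`, and `duration_weight` are all assumptions made for illustration. Spectrograms are flattened to plain lists of numbers to keep the example self-contained.

```python
def mean_abs_error(predicted, reference):
    """Mean absolute difference between two equal-length sequences."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def total_training_loss(predicted_spectrograms, reference_spectrogram,
                        predicted_durations, reference_durations,
                        duration_weight=1.0):
    """Combine a spectrogram loss and a phoneme duration loss (assumed form)."""
    # Assumed: the final spectrogram loss is the sum of the L1 losses of
    # every predicted spectrogram sequence against the single reference.
    spectrogram_loss = sum(
        mean_abs_error(predicted, reference_spectrogram)
        for predicted in predicted_spectrograms
    )
    # Assumed: the duration loss is an L1 loss over per-phoneme durations.
    duration_loss = mean_abs_error(predicted_durations, reference_durations)
    # Assumed: the two losses are added with a scalar weight.
    return spectrogram_loss + duration_weight * duration_loss


loss = total_training_loss(
    predicted_spectrograms=[[1.0, 2.0], [1.5, 2.5]],
    reference_spectrogram=[1.0, 3.0],
    predicted_durations=[2.0, 4.0],
    reference_durations=[3.0, 4.0],
)
```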
-
Publication number: US20240029715A1
Publication date: 2024-01-25
Application number: US18355508
Filing date: 2023-07-20
Applicant: Google LLC
Inventor: Andrew Rosenberg , Zhehuai Chen , Ankur Bapna , Yu Zhang , Bhuvana Ramabhadran
IPC: G10L15/06
CPC classification number: G10L15/063
Abstract: A method includes receiving training data that includes unspoken textual utterances in a target language. Each unspoken textual utterance is not paired with any corresponding spoken utterance of non-synthetic speech. The method also includes generating a corresponding alignment output for each unspoken textual utterance using an alignment model trained on transcribed speech utterances in one or more training languages each different than the target language. The method also includes generating a corresponding encoded textual representation for each alignment output using a text encoder and training a speech recognition model on the encoded textual representations generated for the alignment outputs. Training the speech recognition model teaches the speech recognition model to learn how to recognize speech in the target language.
-
Publication number: US11475874B2
Publication date: 2022-10-18
Application number: US17163007
Filing date: 2021-01-29
Applicant: Google LLC
Inventor: Yu Zhang , Bhuvana Ramabhadran , Andrew Rosenberg , Yonghui Wu , Byungha Chun , Ron Weiss , Yuan Cao
Abstract: A method of generating diverse and natural text-to-speech (TTS) samples includes receiving a text and generating a speech sample based on the text using a TTS model. A training process trains the TTS model to generate the speech sample by receiving training samples. Each training sample includes a spectrogram and a training text corresponding to the spectrogram. For each training sample, the training process identifies speech units associated with the training text. For each speech unit, the training process generates a speech embedding, aligns the speech embedding with a portion of the spectrogram, extracts a latent feature from the aligned portion of the spectrogram, and assigns a quantized embedding to the latent feature. The training process generates the speech sample by decoding a concatenation of the speech embeddings and the quantized embeddings for the speech units associated with the training text corresponding to the spectrogram.
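The step of assigning a quantized embedding to each latent feature can be illustrated with a nearest-neighbor codebook lookup, as used in vector quantization. This is a sketch under stated assumptions: the abstract does not say how the assignment is made, so the squared-Euclidean nearest-entry rule, the per-unit concatenation, and the names `assign_quantized_embedding`, `decoder_input`, and `codebook` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_quantized_embedding(latent, codebook):
    """Return (index, entry) of the codebook entry nearest to `latent`.

    Assumption: 'assigning a quantized embedding' is modeled here as a
    nearest-neighbor lookup; the abstract does not specify the rule.
    """
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(latent, codebook[i]))
    return index, codebook[index]


def decoder_input(speech_embeddings, latents, codebook):
    """Concatenate each speech embedding with its quantized embedding."""
    rows = []
    for embedding, latent in zip(speech_embeddings, latents):
        _, quantized = assign_quantized_embedding(latent, codebook)
        # Assumed: concatenation happens per speech unit.
        rows.append(list(embedding) + list(quantized))
    return rows


codebook = [[0.0, 0.0], [1.0, 1.0]]
index, entry = assign_quantized_embedding([0.9, 0.8], codebook)
rows = decoder_input([[0.5]], [[0.1, -0.2]], codebook)
```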
-
Publication number: US20220301543A1
Publication date: 2022-09-22
Application number: US17326542
Filing date: 2021-05-21
Applicant: Google LLC
Inventor: Isaac Elias , Byungha Chun , Jonathan Shen , Ye Jia , Yu Zhang , Yonghui Wu
Abstract: A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence. The method also includes determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence and training the TTS model based on the final spectrogram loss.
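To make the relationship between phoneme durations and the number of output frames concrete, the sketch below derives per-token frame intervals from integer durations and upsamples by simple repetition. Repetition is a deliberately simpler stand-in: the abstract describes a learned upsampling that uses an interval representation and an auxiliary attention context representation, which this example does not implement. The function names and the integer-duration assumption are illustrative only.

```python
def token_intervals(durations):
    """Return the (start_frame, end_frame) interval covered by each token."""
    intervals = []
    start = 0
    for duration in durations:
        intervals.append((start, start + duration))
        start += duration
    return intervals


def upsample_by_repetition(sequence, durations):
    """Repeat each token representation for its duration in frames.

    Stand-in for the learned upsampling in the abstract; assumes
    durations are non-negative integers.
    """
    frames = []
    for representation, duration in zip(sequence, durations):
        frames.extend([representation] * duration)
    return frames


durations = [2, 1, 3]
intervals = token_intervals(durations)
frames = upsample_by_repetition(["a", "b", "c"], durations)
```

The upsampled output has as many frames as the sum of the durations, which is the quantity the abstract refers to as the number of frames.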
-
Publication number: US20220270597A1
Publication date: 2022-08-25
Application number: US17182592
Filing date: 2021-02-23
Applicant: Google LLC
Inventor: David Qiu , Qiujia Li , Yanzhang He , Yu Zhang , Bo Li , Liangliang Cao , Rohit Prabhavalkar , Deepti Bhatia , Wei Li , Ke Hu , Tara Sainath , Ian Mcgraw
Abstract: A method includes receiving a speech recognition result, and using a confidence estimation module (CEM), for each sub-word unit in a sequence of hypothesized sub-word units for the speech recognition result: obtaining a respective confidence embedding that represents a set of confidence features; generating, using a first attention mechanism, a confidence feature vector; generating, using a second attention mechanism, an acoustic context vector; and generating, as output from an output layer of the CEM, a respective confidence output score for each corresponding sub-word unit based on the confidence feature vector and the acoustic context vector received as input by the output layer of the CEM. For each of the one or more words formed by the sequence of hypothesized sub-word units, the method also includes determining a respective word-level confidence score for the word. The method also includes determining an utterance-level confidence score by aggregating the word-level confidence scores.
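The roll-up from sub-word scores to word-level and utterance-level confidence can be sketched as follows. The abstract does not specify how a word-level score is derived or which aggregation is used, so two assumptions are made here: a word takes the score of its final sub-word unit, and the utterance score is the arithmetic mean of the word scores. The `_` word-boundary marker and the function names are also hypothetical.

```python
def word_confidences(subword_units, subword_scores, boundary="_"):
    """Group sub-word units into words and give each word a score.

    Assumptions: a unit starting with `boundary` begins a new word, and
    a word's score is the score of its last sub-word unit.
    """
    words = []
    for unit, score in zip(subword_units, subword_scores):
        starts_word = unit.startswith(boundary)
        text = unit[len(boundary):] if starts_word else unit
        if starts_word or not words:
            words.append([text, score])
        else:
            words[-1][0] += text
            words[-1][1] = score  # keep the score of the final unit
    return [(text, score) for text, score in words]


def utterance_confidence(word_scores):
    """Aggregate word-level scores with an arithmetic mean (assumed)."""
    if not word_scores:
        raise ValueError("at least one word score is required")
    return sum(word_scores) / len(word_scores)


words = word_confidences(["_good", "_morn", "ing"], [0.9, 0.6, 0.8])
utterance = utterance_confidence([score for _, score in words])
```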
-
Publication number: US20220189456A1
Publication date: 2022-06-16
Application number: US17455667
Filing date: 2021-11-18
Applicant: Google LLC
Inventor: Ruoming Pang , Andros Tjandra , Yu Zhang , Shigeki Karita
IPC: G10L13/027 , G10L21/0308
Abstract: A linguistic content and speaking style disentanglement model includes a content encoder, a style encoder, and a decoder. The content encoder is configured to receive input speech as input and generate a latent representation of linguistic content for the input speech as output. The content encoder is trained to disentangle speaking style information from the latent representation of linguistic content. The style encoder is configured to receive the input speech as input and generate a latent representation of speaking style for the input speech as output. The style encoder is trained to disentangle linguistic content information from the latent representation of speaking style. The decoder is configured to generate output speech based on the latent representation of linguistic content for the input speech and the latent representation of speaking style for the same or different input speech.
-
Publication number: US11080599B2
Publication date: 2021-08-03
Application number: US16692538
Filing date: 2019-11-22
Applicant: Google LLC
Inventor: Navdeep Jaitly , Yu Zhang , William Chan
Abstract: A speech recognition neural network system includes an encoder neural network and a decoder neural network. The encoder neural network generates an encoded sequence from an input acoustic sequence that represents an utterance. The input acoustic sequence includes a respective acoustic feature representation at each of a plurality of input time steps, the encoded sequence includes a respective encoded representation at each of a plurality of time-reduced time steps, and the number of time-reduced time steps is less than the number of input time steps. The encoder neural network includes a time reduction subnetwork, a convolutional LSTM subnetwork, and a network-in-network subnetwork. The decoder neural network receives the encoded sequence and processes the encoded sequence to generate, for each position in an output sequence order, a set of substring scores that includes a respective substring score for each substring in a set of substrings.
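The property that the encoded sequence has fewer time steps than the input can be illustrated with one common form of time reduction: concatenating the feature vectors of adjacent frames. This is an assumption for illustration, since the abstract does not describe how the time reduction subnetwork works internally. The function name `reduce_time`, the `factor` parameter, and the choice to drop trailing frames that do not fill a group are hypothetical.

```python
def reduce_time(frames, factor=2):
    """Concatenate every `factor` adjacent frames into one longer frame.

    Assumed mechanism; trailing frames that do not fill a complete
    group are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    reduced = []
    for start in range(0, len(frames) - factor + 1, factor):
        merged = []
        for frame in frames[start:start + factor]:
            merged.extend(frame)
        reduced.append(merged)
    return reduced


acoustic_frames = [[1], [2], [3], [4], [5]]
reduced_frames = reduce_time(acoustic_frames, factor=2)
```

With a factor of 2, five input time steps become two time-reduced steps, each carrying twice as many features.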
-
Publication number: US12272363B2
Publication date: 2025-04-08
Application number: US17722264
Filing date: 2022-04-15
Applicant: Google LLC
Inventor: Andrew Rosenberg , Zhehuai Chen , Bhuvana Ramabhadran , Pedro J. Moreno Mengibar , Yuan Wang , Yu Zhang
Abstract: A method includes receiving training data that includes unspoken textual utterances, un-transcribed non-synthetic speech utterances, and transcribed non-synthetic speech utterances. Each unspoken textual utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. Each transcribed non-synthetic speech utterance is paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances, the un-transcribed non-synthetic speech utterances, and the transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
-
Publication number: US20250095637A1
Publication date: 2025-03-20
Application number: US18886581
Filing date: 2024-09-16
Applicant: Google LLC
Inventor: Ke Hu , Tara N. Sainath , Bo Li , Yu Zhang , Yong Cheng , Tao Wang , Yujing Zhang , Frederick Liu
Abstract: A method includes receiving a textual prompt in a first language and obtaining a fine-tuned prompt embedding configured to guide a large language model (LLM) to generate text in a target language from textual prompts in the first language. The method also includes processing, using the LLM, the textual prompt conditioned on the fine-tuned prompt embedding to generate output text in the target language and concatenating the textual prompt and the generated output text to provide an unspoken textual utterance. The method also includes training a multilingual automatic speech recognition (ASR) model to learn how to recognize speech in the target language by injecting the unspoken textual utterance into a text encoder associated with the multilingual ASR model.
-
Publication number: US12249315B2
Publication date: 2025-03-11
Application number: US18499031
Filing date: 2023-10-31
Applicant: Google LLC
Inventor: Isaac Elias , Byungha Chun , Jonathan Shen , Ye Jia , Yu Zhang , Yonghui Wu
Abstract: A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence. The method also includes determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence and training the TTS model based on the final spectrogram loss.