Training Speech Synthesis to Generate Distinct Speech Sounds

    Publication No.: US20230009613A1

    Publication date: 2023-01-12

    Application No.: US17756995

    Filing date: 2019-12-13

    Applicant: Google LLC

    Abstract: A method (800) of training a text-to-speech (TTS) model (108) includes obtaining training data (150) including reference input text (104) that includes a sequence of characters, a sequence of reference audio features (402) representative of the sequence of characters, and a sequence of reference phone labels (502) representative of distinct speech sounds of the reference audio features. For each of a plurality of time steps, the method includes generating a corresponding predicted audio feature (120) based on a respective portion of the reference input text for the time step and generating, using a phone label mapping network (510), a corresponding predicted phone label (520) associated with the predicted audio feature. The method also includes aligning the predicted phone label with the reference phone label to determine a corresponding predicted phone label loss (622) and updating the TTS model based on the corresponding predicted phone label loss.
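
The abstract above does not say how the predicted phone label loss is computed. As a minimal sketch, assuming the mapping network outputs a probability distribution over phone labels at each time step and the loss is a mean cross-entropy against the aligned reference labels, it might look like the following. The function name, the three-label inventory, and the cross-entropy choice are all illustrative assumptions, not details taken from the patent.

```python
import math

def phone_label_loss(predicted_probs, reference_labels):
    """Mean cross-entropy between per-step phone distributions and reference labels.

    predicted_probs: one probability distribution (list of floats) per time step.
    reference_labels: one integer phone-label index per time step.
    Both sequences are assumed to be already aligned step for step.
    """
    if not reference_labels:
        raise ValueError("at least one time step is required")
    if len(predicted_probs) != len(reference_labels):
        raise ValueError("sequences must cover the same number of time steps")
    total = 0.0
    for probs, ref in zip(predicted_probs, reference_labels):
        # Penalize low probability assigned to the reference phone label.
        total += -math.log(probs[ref])
    return total / len(reference_labels)

# Two time steps over a hypothetical inventory of three phone labels.
predicted = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
reference = [0, 1]
loss = phone_label_loss(predicted, reference)
print(f"phone label loss: {loss:.4f}")
```

In a training loop, a value like this would typically be added to the audio-feature loss before the model parameters are updated.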

    Multi-Task Learning for End-To-End Automated Speech Recognition Confidence and Deletion Estimation

    Publication No.: US20220310080A1

    Publication date: 2022-09-29

    Application No.: US17643826

    Filing date: 2021-12-11

    Applicant: Google LLC

    Abstract: A method includes receiving a speech recognition result corresponding to a transcription of an utterance spoken by a user. For each sub-word unit in a sequence of hypothesized sub-word units of the speech recognition result, the method uses a confidence estimation module to: obtain a respective confidence embedding associated with the corresponding output step when the corresponding sub-word unit was output from a first speech recognizer; generate a confidence feature vector; generate an acoustic context vector; and generate a respective confidence output score for the corresponding sub-word unit based on the confidence feature vector and the acoustic context vector received as input by an output layer of the confidence estimation module. The method also includes determining, based on the respective confidence output score generated for each sub-word unit in the sequence of hypothesized sub-word units, an utterance-level confidence score for the transcription of the utterance.
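
The abstract states that an utterance-level confidence score is derived from the per-sub-word scores but does not name the aggregation. The sketch below shows two common candidates, an arithmetic mean and a product; both are assumptions chosen for illustration, and the function names and example scores are hypothetical.

```python
import math

def utterance_confidence_mean(subword_scores):
    """Average the per-sub-word confidence scores (assumed to lie in [0, 1])."""
    if not subword_scores:
        raise ValueError("at least one sub-word score is required")
    return sum(subword_scores) / len(subword_scores)

def utterance_confidence_product(subword_scores):
    """Multiply the scores, treating each sub-word as independently correct."""
    if not subword_scores:
        raise ValueError("at least one sub-word score is required")
    return math.prod(subword_scores)

# Hypothetical scores for a four-unit hypothesis.
scores = [0.9, 0.8, 1.0, 0.5]
mean_score = utterance_confidence_mean(scores)
product_score = utterance_confidence_product(scores)
print(mean_score, product_score)
```

The product penalizes a single low-confidence unit much more heavily than the mean does, which matters when the score is used to decide whether to reject a whole transcription.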

    Generating Diverse and Natural Text-To-Speech Samples

    Publication No.: US20220246132A1

    Publication date: 2022-08-04

    Application No.: US17163007

    Filing date: 2021-01-29

    Applicant: Google LLC

    Abstract: A method of generating diverse and natural text-to-speech (TTS) samples includes receiving a text and generating a speech sample based on the text using a TTS model. A training process trains the TTS model to generate the speech sample by receiving training samples. Each training sample includes a spectrogram and a training text corresponding to the spectrogram. For each training sample, the training process identifies speech units associated with the training text. For each speech unit, the training process generates a speech embedding, aligns the speech embedding with a portion of the spectrogram, extracts a latent feature from the aligned portion of the spectrogram, and assigns a quantized embedding to the latent feature. The training process generates the speech sample by decoding a concatenation of the speech embeddings and the quantized embeddings for the speech units associated with the training text corresponding to the spectrogram.
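
The step of assigning a quantized embedding to a latent feature can be pictured as a codebook lookup. The sketch below assumes a nearest-neighbor rule under squared Euclidean distance, as used in vector-quantized autoencoders; the abstract does not confirm this rule, and the codebook, the vectors, and the `quantize` helper are hypothetical.

```python
def quantize(latent, codebook):
    """Return (index, entry) of the codebook vector nearest to `latent`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)),
                key=lambda i: squared_distance(latent, codebook[i]))
    return index, codebook[index]

# Hypothetical two-dimensional codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

speech_embedding = [0.3, 0.5]  # embedding of one speech unit (made-up values)
latent_feature = [0.9, 1.2]    # latent from the aligned spectrogram portion

index, quantized = quantize(latent_feature, codebook)

# The decoder input for this unit concatenates both embeddings.
decoder_input = speech_embedding + quantized
print(index, decoder_input)
```

Restricting the latent to a finite set of codebook entries is what lets a model of this kind sample different entries at inference time to vary the generated speech.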

    Two-level text-to-speech systems using synthetic training data

    Publication No.: US12260851B2

    Publication date: 2025-03-25

    Application No.: US17305809

    Filing date: 2021-07-14

    Applicant: Google LLC

    Abstract: A method includes obtaining training data including a plurality of training audio signals and corresponding transcripts. Each training audio signal is spoken by a target speaker in a first accent/dialect. For each training audio signal of the training data, the method includes generating a training synthesized speech representation spoken by the target speaker in a second accent/dialect different than the first accent/dialect and training a text-to-speech (TTS) system based on the corresponding transcript and the training synthesized speech representation. The method also includes receiving an input text utterance to be synthesized into speech in the second accent/dialect. The method also includes obtaining conditioning inputs that include a speaker embedding and an accent/dialect identifier that identifies the second accent/dialect. The method also includes generating an output audio waveform corresponding to a synthesized speech representation of the input text utterance that clones the voice of the target speaker in the second accent/dialect.

    Phonemes And Graphemes for Neural Text-to-Speech

    Publication No.: US20240339106A1

    Publication date: 2024-10-10

    Application No.: US18746809

    Filing date: 2024-06-18

    Applicant: Google LLC

    CPC classification number: G10L13/086 G06F40/263 G06F40/279 G06N3/08 G10L13/047

    Abstract: A method includes receiving a text input including a sequence of words represented as an input encoder embedding. The input encoder embedding includes a plurality of tokens, with the plurality of tokens including a first set of grapheme tokens representing the text input as respective graphemes and a second set of phoneme tokens representing the text input as respective phonemes. The method also includes, for each respective phoneme token of the second set of phoneme tokens: identifying a respective word of the sequence of words corresponding to the respective phoneme token and determining a respective grapheme token representing the respective word of the sequence of words corresponding to the respective phoneme token. The method also includes generating an output encoder embedding based on a relationship between each respective phoneme token and the corresponding grapheme token determined to represent a same respective word as the respective phoneme token.
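
The word-level pairing of phoneme tokens with grapheme tokens described above can be sketched as a lookup keyed on word position. The token layout (text, kind, word index), the ASCII phoneme symbols, and the `pair_phonemes_with_graphemes` helper are assumptions made for illustration; the abstract does not specify how tokens record which word they belong to.

```python
def pair_phonemes_with_graphemes(tokens):
    """Pair each phoneme token with the grapheme tokens of the same word.

    tokens: list of (text, kind, word_index) tuples, where kind is
    "grapheme" or "phoneme". Returns a list of
    (phoneme_text, [grapheme_text, ...]) pairs in phoneme order.
    """
    graphemes_by_word = {}
    for text, kind, word_index in tokens:
        if kind == "grapheme":
            graphemes_by_word.setdefault(word_index, []).append(text)

    pairs = []
    for text, kind, word_index in tokens:
        if kind == "phoneme":
            pairs.append((text, graphemes_by_word.get(word_index, [])))
    return pairs

# Hypothetical encoding of the two-word input "my cat".
tokens = [
    ("my", "grapheme", 0), ("cat", "grapheme", 1),
    ("m", "phoneme", 0), ("ay", "phoneme", 0),
    ("k", "phoneme", 1), ("ae", "phoneme", 1), ("t", "phoneme", 1),
]
pairs = pair_phonemes_with_graphemes(tokens)
for phoneme, graphemes in pairs:
    print(phoneme, "->", graphemes)
```

An encoder could use pairs like these to relate each phoneme to the spelling of its word when building the output embedding.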
