Text-to-speech using duration prediction

    Publication Number: US12100382B2

    Publication Date: 2024-09-24

    Application Number: US17492543

    Application Date: 2021-10-01

    Applicant: Google LLC

    CPC classification number: G10L13/027 G10L13/04

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for synthesizing audio data from text data using duration prediction. One of the methods includes processing an input text sequence that includes a respective text element at each of multiple input time steps using a first neural network to generate a modified input sequence comprising, for each input time step, a representation of the corresponding text element in the input text sequence; processing the modified input sequence using a second neural network to generate, for each input time step, a predicted duration of the corresponding text element in the output audio sequence; upsampling the modified input sequence according to the predicted durations to generate an intermediate sequence comprising a respective intermediate element at each of a plurality of intermediate time steps; and generating an output audio sequence using the intermediate sequence.
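    To make the upsampling step concrete, the following is a minimal PyTorch sketch of duration-based upsampling. The tiny GRU encoder, linear duration head, and dimensions are illustrative stand-ins rather than the patented networks; only the repeat-by-duration operation mirrors the abstract.

        import torch
        import torch.nn as nn

        encoder = nn.GRU(input_size=64, hidden_size=64)   # stand-in for the first neural network
        duration_head = nn.Linear(64, 1)                  # stand-in for the second neural network

        tokens = torch.randn(12, 1, 64)                   # 12 input time steps, batch of 1
        reps, _ = encoder(tokens)                         # one representation per text element
        log_dur = duration_head(reps.squeeze(1)).squeeze(-1)
        durations = torch.clamp(log_dur.exp().round().long(), min=1)  # predicted frame counts

        # Upsample: repeat each text-element representation for its predicted duration,
        # producing the intermediate sequence from which the output audio is generated.
        intermediate = torch.repeat_interleave(reps.squeeze(1), durations, dim=0)
        print(intermediate.shape)                         # (sum of durations, 64)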

    MIXTURE-OF-EXPERT CONFORMER FOR STREAMING MULTILINGUAL ASR

    Publication Number: US20240304185A1

    Publication Date: 2024-09-12

    Application Number: US18598885

    Application Date: 2024-03-07

    Applicant: Google LLC

    CPC classification number: G10L15/197 G10L15/02 G10L15/063

    Abstract: A method of a multilingual ASR model includes receiving a sequence of acoustic frames characterizing an utterance of speech. At a plurality of output steps, the method further includes generating a first higher order feature representation for an acoustic frame by a first encoder that includes a first plurality of multi-head attention layers; generating a second higher order feature representation for a corresponding first higher order feature representation by a second encoder that includes a second plurality of multi-head attention layers; and generating, by a first decoder, a first probability distribution over possible speech recognition hypotheses based on the second higher order feature representation and a sequence of N previous non-blank symbols. A gating layer of each respective mixture-of-experts (MoE) layer is configured to dynamically route an output from a previous multi-head attention layer at each of the plurality of output steps to a respective pair of feed-forward expert networks.
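    The routing step can be illustrated with a minimal sketch, assuming a softmax gate that selects the top two feed-forward experts per frame; the expert count, sizes, and gating function are illustrative choices, not taken from the patent.

        import torch
        import torch.nn as nn

        class MoEFeedForward(nn.Module):
            """Gating layer that routes each frame to a pair of feed-forward experts."""
            def __init__(self, dim=256, num_experts=4, k=2):
                super().__init__()
                self.gate = nn.Linear(dim, num_experts)
                self.experts = nn.ModuleList(
                    nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
                    for _ in range(num_experts))
                self.k = k

            def forward(self, x):  # x: (T, dim) output of the previous multi-head attention layer
                weights, idx = torch.softmax(self.gate(x), dim=-1).topk(self.k, dim=-1)
                out = torch.zeros_like(x)
                for slot in range(self.k):          # dispatch each frame to its chosen experts
                    for e, expert in enumerate(self.experts):
                        mask = idx[:, slot] == e
                        if mask.any():
                            out[mask] += weights[mask, slot, None] * expert(x[mask])
                return out

        frames = torch.randn(10, 256)               # 10 output steps
        print(MoEFeedForward()(frames).shape)       # torch.Size([10, 256])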

    CHUNK-WISE ATTENTION FOR LONGFORM ASR

    Publication Number: US20240290321A1

    Publication Date: 2024-08-29

    Application Number: US18585168

    Application Date: 2024-02-23

    Applicant: Google LLC

    CPC classification number: G10L15/063 G10L15/26

    Abstract: A method includes receiving training data including a corpus of multilingual unspoken textual utterances, a corpus of multilingual un-transcribed non-synthetic speech utterances, and a corpus of multilingual transcribed non-synthetic speech utterances. For each un-transcribed non-synthetic speech utterance, the method includes generating a target quantized vector token and a target token index, generating contrastive context vectors from corresponding masked audio features, and deriving a contrastive loss term. The method also includes generating an alignment output, generating a first probability distribution over possible speech recognition hypotheses for the alignment output, and determining an alignment output loss term. The method also includes generating a second probability distribution over possible speech recognition hypotheses and determining a non-synthetic speech loss term. The method also includes pre-training an audio encoder based on the contrastive loss term, the alignment output loss term, and the non-synthetic speech loss term.
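    The contrastive loss term can be sketched in an InfoNCE style, assuming cosine similarity between each masked frame's context vector and its target quantized token, with the other targets serving as distractors; the similarity measure, temperature, and shapes are illustrative.

        import torch
        import torch.nn.functional as F

        def contrastive_loss(context, targets, temperature=0.1):
            # context, targets: (T, D); row i of targets is frame i's true quantized token.
            sims = F.cosine_similarity(context.unsqueeze(1), targets.unsqueeze(0), dim=-1)
            labels = torch.arange(context.size(0))  # index of the matching target per frame
            return F.cross_entropy(sims / temperature, labels)

        ctx = torch.randn(8, 128)  # contrastive context vectors from masked audio features
        tgt = torch.randn(8, 128)  # target quantized vector tokens
        print(contrastive_loss(ctx, tgt))

    In the full method, this term is combined with the alignment output loss term and the non-synthetic speech loss term to pre-train the audio encoder.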

    USING SPEECH RECOGNITION TO IMPROVE CROSS-LANGUAGE SPEECH SYNTHESIS

    Publication Number: US20240282292A1

    Publication Date: 2024-08-22

    Application Number: US18654278

    Application Date: 2024-05-03

    Applicant: Google LLC

    CPC classification number: G10L13/047 G10L13/086 G10L13/10

    Abstract: A method for training a speech recognition model includes obtaining a multilingual text-to-speech (TTS) model. The method also includes generating a native synthesized speech representation for an input text sequence in a first language that is conditioned on speaker characteristics of a native speaker of the first language. The method also includes generating a cross-lingual synthesized speech representation for the input text sequence in the first language that is conditioned on speaker characteristics of a native speaker of a different second language. The method also includes generating a first speech recognition result for the native synthesized speech representation and a second speech recognition result for the cross-lingual synthesized speech representation. The method also includes determining a consistent loss term based on the first speech recognition result and the second speech recognition result and updating parameters of the speech recognition model based on the consistent loss term.
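    One plausible form of such a penalty is a symmetric KL divergence between the recognizer's output distributions for the two renditions of the same text; the sketch below assumes this formulation, which the abstract does not specify.

        import torch
        import torch.nn.functional as F

        def consistency_loss(native_logits, cross_logits):
            """Penalizes disagreement between recognition distributions for the native
            and cross-lingual synthesized speech of the same input text."""
            p = F.log_softmax(native_logits, dim=-1)
            q = F.log_softmax(cross_logits, dim=-1)
            return 0.5 * (F.kl_div(q, p.exp(), reduction="batchmean")
                          + F.kl_div(p, q.exp(), reduction="batchmean"))

        native = torch.randn(4, 30, requires_grad=True)  # (frames, vocab) logits, native rendition
        cross = torch.randn(4, 30, requires_grad=True)   # logits, cross-lingual rendition
        consistency_loss(native, cross).backward()       # gradients would update the ASR model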

    Unsupervised Parallel Tacotron Non-Autoregressive and Controllable Text-To-Speech

    Publication Number: US20240062743A1

    Publication Date: 2024-02-22

    Application Number: US18499031

    Application Date: 2023-10-31

    Applicant: Google LLC

    CPC classification number: G10L13/08 G10L13/04

    Abstract: A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence. The method also includes determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence and training the TTS model based on the final spectrogram loss.
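    The upsampling step can be sketched as a soft, differentiable assignment of output frames to phoneme intervals derived from the predicted durations; the sigmoid-based interval membership below is an illustrative stand-in for the learned interval and auxiliary attention context representations.

        import torch

        def soft_upsample(reps, durations, num_frames):
            ends = torch.cumsum(durations, dim=0)       # phoneme interval end boundaries
            starts = ends - durations                   # phoneme interval start boundaries
            t = torch.arange(num_frames).float() + 0.5  # output frame centres
            # Soft membership of each frame in each phoneme's [start, end) interval.
            inside = (torch.sigmoid(10 * (t[:, None] - starts[None]))
                      * torch.sigmoid(10 * (ends[None] - t[:, None])))
            weights = inside / inside.sum(-1, keepdim=True).clamp(min=1e-8)
            return weights @ reps                       # (num_frames, D) upsampled output

        reps = torch.randn(6, 32)  # sequence representation (encoded text + variational embedding)
        durations = torch.tensor([2., 3., 1., 4., 2., 3.])          # predicted durations in frames
        print(soft_upsample(reps, durations, num_frames=15).shape)  # torch.Size([15, 32])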

    Dynamic Adjustment of Delivery Location Based on User Location

    Publication Number: US20230410032A1

    Publication Date: 2023-12-21

    Application Number: US18317614

    Application Date: 2023-05-15

    Applicant: Google LLC

    Inventor: Yu Zhang

    CPC classification number: G06Q10/08355

    Abstract: A user places an order on a merchant website associated with a merchant system via a user computing device. The user selects an option for delivery to the user computing device location within a delivery area during a delivery time window and authorizes a delivery system to log the location of the user computing device during the delivery time window and/or for a period of time before it. When the delivery time window arrives, the delivery system provides a delivery route to a delivery agent computing device. When the delivery agent arrives at the user computing device's location, the user receives an alert that the delivery agent has arrived and receives a package from the delivery agent. If the user does not remain within the delivery area, the user may cancel the order and the delivery, may reschedule the delivery, and/or may accept delivery of the order to a fixed shipping address.

    Injecting Text in Self-Supervised Speech Pre-training

    Publication Number: US20230017892A1

    Publication Date: 2023-01-19

    Application Number: US17808091

    Application Date: 2022-06-21

    Applicant: Google LLC

    Abstract: A method includes receiving training data that includes unspoken text utterances and un-transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances and the un-transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
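    A minimal sketch of the joint pre-training idea follows, with a stand-in linear "TTS" module and a small GRU encoder (the real models, objective, and feature shapes differ): unspoken text is synthesized into speech-like features and batched with real un-transcribed speech so that a single audio encoder sees both.

        import torch
        import torch.nn as nn

        tts = nn.Linear(64, 80)          # stand-in TTS model: text features -> mel-like frames
        audio_encoder = nn.GRU(80, 128)  # audio encoder being pre-trained

        def pretrain_step(unspoken_text, untranscribed_speech):
            synthetic = tts(unspoken_text)  # synthetic speech representations for text-only data
            batch = torch.cat([synthetic, untranscribed_speech], dim=1)
            encoded, _ = audio_encoder(batch)  # shared representation space for speech and text
            # A real self-supervised objective (e.g. masked prediction) would go here;
            # this stand-in loss only keeps the sketch runnable.
            return encoded.pow(2).mean()

        text = torch.randn(20, 2, 64)    # 2 unspoken text utterances, 20 steps each
        speech = torch.randn(20, 2, 80)  # 2 un-transcribed speech utterances
        print(pretrain_step(text, speech))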

    Advancing the Use of Text and Speech in ASR Pretraining With Consistency and Contrastive Losses

    Publication Number: US20230013587A1

    Publication Date: 2023-01-19

    Application Number: US17722264

    Application Date: 2022-04-15

    Applicant: Google LLC

    Abstract: A method includes receiving training data that includes unspoken text utterances, un-transcribed non-synthetic speech utterances, and transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. Each transcribed non-synthetic speech utterance is paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken textual utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken textual utterances, the un-transcribed non-synthetic speech utterances, and the transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
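    Because the encoder is pre-trained on all three corpora at once, the overall objective is naturally a weighted combination of per-corpus loss terms; the weights below are illustrative, as the abstract does not fix them.

        import torch

        def pretraining_loss(synthetic_loss, untranscribed_loss, transcribed_loss,
                             w_syn=1.0, w_untr=1.0, w_tr=1.0):
            # Weighted sum of the loss terms from synthetic speech (unspoken text),
            # un-transcribed speech, and transcribed speech.
            return w_syn * synthetic_loss + w_untr * untranscribed_loss + w_tr * transcribed_loss

        print(pretraining_loss(torch.tensor(0.8), torch.tensor(0.3), torch.tensor(1.2)))  # tensor(2.3000)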
