SEQUENCE PROCESSING USING ONLINE ATTENTION
    31.
    发明申请

    公开(公告)号:US20190332919A1

    公开(公告)日:2019-10-31

    申请号:US16504924

    申请日:2019-07-08

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating a target sequence including a respective output at each of multiple output time steps from respective encoded representations of inputs in an input sequence. The method includes, for each output time step, starting from the position, in the input order, of the encoded representation that was selected as a preceding context vector at a preceding output time step, traversing the encoded representations until an encoded representation is selected as a current context vector at the output time step. A decoder neural network processes the current context vector and a preceding output at the preceding output time step to generate a respective output score for each possible output and to update the hidden state of the decoder recurrent neural network. An output is selected for the output time step using the output scores.

    Parallel Tacotron Non-Autoregressive and Controllable TTS

    公开(公告)号:US20220122582A1

    公开(公告)日:2022-04-21

    申请号:US17327076

    申请日:2021-05-21

    Applicant: Google LLC

    Abstract: A method for training a non-autoregressive TTS model includes receiving training data that includes a reference audio signal and a corresponding input text sequence. The method also includes encoding the reference audio signal into a variational embedding that disentangles the style/prosody information from the reference audio signal and encoding the input text sequence into an encoded text sequence. The method also includes predicting a phoneme duration for each phoneme in the input text sequence and determining a phoneme duration loss based on the predicted phoneme durations and a reference phoneme duration. The method also includes generating one or more predicted mel-frequency spectrogram sequences for the input text sequence and determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence. The method also includes training the TTS model based on the final spectrogram loss and the corresponding phoneme duration loss.

    Multi-dialect and multilingual speech recognition

    公开(公告)号:US11238845B2

    公开(公告)日:2022-02-01

    申请号:US16684483

    申请日:2019-11-14

    Applicant: GOOGLE LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer-readable media, for speech recognition using multi-dialect and multilingual models. In some implementations, audio data indicating audio characteristics of an utterance is received. Input features determined based on the audio data are provided to a speech recognition model that has been trained to output score indicating the likelihood of linguistic units for each of multiple different language or dialects. The speech recognition model can be one that has been trained using cluster adaptive training. Output that the speech recognition model generated in response to receiving the input features determined based on the audio data is received. A transcription of the utterance generated based on the output of the speech recognition model is provided.

Patent Agency Ranking