-
Publication No.: US20250095639A1
Publication Date: 2025-03-20
Application No.: US18962686
Filing Date: 2024-11-27
Applicant: Google LLC
Inventor: Andrew M. Rosenberg , Gary Wang , Bhuvana Ramabhadran , Fadi Biadsy
IPC: G10L15/06 , G10L13/02 , G10L15/16 , G10L15/197 , G10L15/22 , G10L19/00 , G10L19/038 , G10L21/003
Abstract: A method includes receiving a set of training utterances each including a non-synthetic speech representation of a corresponding utterance, and for each training utterance, generating a corresponding synthetic speech representation by using a voice conversion model. The non-synthetic speech representation and the synthetic speech representation form a corresponding training utterance pair. At each of a plurality of output steps for each training utterance pair, the method also includes generating, for output by a speech recognition model, a first probability distribution over possible non-synthetic speech recognition hypotheses for the non-synthetic speech representation and a second probability distribution over possible synthetic speech recognition hypotheses for the synthetic speech representation. The method also includes determining a consistent loss term for the corresponding training utterance pair based on the first and second probability distributions and updating parameters of the speech recognition model based on the consistent loss term.
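The abstract describes a consistency loss computed from the two probability distributions the recognizer produces for the non-synthetic and synthetic versions of the same utterance. One plausible instantiation of such a loss, sketched here as an assumption (the patent does not specify the divergence used), is a symmetric KL divergence per output step:

```python
import math

def consistency_loss(p_real, p_synth, eps=1e-12):
    """Symmetric KL divergence between the recognizer's output
    distributions for the non-synthetic and synthetic speech
    representations of one utterance at one output step.
    An illustrative choice, not necessarily the patent's."""
    kl_rs = sum(p * math.log((p + eps) / (q + eps))
                for p, q in zip(p_real, p_synth))
    kl_sr = sum(q * math.log((q + eps) / (p + eps))
                for p, q in zip(p_real, p_synth))
    return 0.5 * (kl_rs + kl_sr)

# Identical distributions incur zero loss; divergent ones are penalized,
# pushing the model to recognize real and synthesized speech consistently.
same = consistency_loss([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])
diff = consistency_loss([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
```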
-
Publication No.: US12087272B2
Publication Date: 2024-09-10
Application No.: US17756995
Filing Date: 2019-12-13
Applicant: Google LLC
Inventor: Andrew Rosenberg , Bhuvana Ramabhadran , Fadi Biadsy , Yu Zhang
IPC: G10L15/16 , G10L13/047 , G10L13/08 , G10L15/06
CPC classification number: G10L13/047 , G10L13/086 , G10L15/063 , G10L15/16
Abstract: A method (800) of training a text-to-speech (TTS) model (108) includes obtaining training data (150) including reference input text (104) that includes a sequence of characters, a sequence of reference audio features (402) representative of the sequence of characters, and a sequence of reference phone labels (502) representative of distinct speech sounds of the reference audio features. For each of a plurality of time steps, the method includes generating a corresponding predicted audio feature (120) based on a respective portion of the reference input text for the time step and generating, using a phone label mapping network (510), a corresponding predicted phone label (520) associated with the predicted audio feature. The method also includes aligning the predicted phone label with the reference phone label to determine a corresponding predicted phone label loss (622) and updating the TTS model based on the corresponding predicted phone label loss.
-
Publication No.: US20230360632A1
Publication Date: 2023-11-09
Application No.: US17661832
Filing Date: 2022-05-03
Applicant: Google LLC
Inventor: Fadi Biadsy , Dirk Ryan Padfield , Victoria Zayats
Abstract: A method includes receiving a reference audio signal corresponding to reference speech spoken by a target speaker with atypical speech, and generating, by a speaker embedding network configured to receive the reference audio signal as input, a speaker embedding for the target speaker. The speaker embedding conveys speaker characteristics of the target speaker. The method also includes receiving a speech conversion request that includes input audio data corresponding to an utterance spoken by the target speaker associated with the atypical speech. The method also includes biasing, using the speaker embedding generated for the target speaker by the speaker embedding network, a speech conversion model to convert the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into an output canonical representation of the utterance spoken by the target speaker.
-
Publication No.: US20230335122A1
Publication Date: 2023-10-19
Application No.: US17659836
Filing Date: 2022-04-19
Applicant: Google LLC
Inventor: Fadi Biadsy , Pedro J. Moreno Mengibar
IPC: G10L15/183 , G06N3/04
CPC classification number: G10L15/183 , G06N3/04
Abstract: A method for contextual biasing for speech recognition includes obtaining a base automatic speech recognition (ASR) model trained on non-biased data and a sub-model trained on biased data representative of a particular domain. The method includes receiving a speech recognition request including audio data characterizing an utterance captured in streaming audio. The method further includes determining whether the speech recognition request includes a contextual indicator indicating the particular domain. When the speech recognition request does not include the contextual indicator, the method includes generating, using the base ASR model, a first speech recognition result of the utterance by processing the audio data. When the speech recognition request includes the contextual indicator, the method includes biasing, using the sub-model, the base ASR model toward the particular domain and generating, using the biased base ASR model, a second speech recognition result of the utterance by processing the audio data.
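The dispatch logic the abstract walks through (base model alone vs. sub-model-biased base model, keyed on a contextual indicator) can be sketched as follows; all names and the request shape are illustrative assumptions, not the patent's actual API:

```python
def recognize(request, base_asr, sub_models):
    """Dispatch a speech recognition request.

    Use the base ASR model alone unless the request carries a
    contextual indicator naming a known domain, in which case
    bias the base model with that domain's sub-model first.
    """
    domain = request.get("context")   # contextual indicator; may be absent
    audio = request["audio"]
    if domain is None or domain not in sub_models:
        # no contextual indicator: first speech recognition result
        return base_asr(audio)
    # contextual indicator present: bias toward the particular domain
    # and produce the second speech recognition result
    return base_asr(audio, bias=sub_models[domain])
```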
-
Publication No.: US20230267949A1
Publication Date: 2023-08-24
Application No.: US18163848
Filing Date: 2023-02-02
Applicant: Google LLC
Inventor: Oleg Rybakov , Liyang Jiang , Fadi Biadsy
Abstract: A method includes receiving a current spectrogram frame and reconstructing a phase of the current spectrogram frame by, for each corresponding committed spectrogram frame in a sequence of M number of committed spectrogram frames preceding the current spectrogram frame, obtaining a value of a committed phase of the corresponding committed spectrogram frame and estimating the phase of the current spectrogram frame based on a magnitude of the current spectrogram frame and the value of the committed phase of each corresponding committed spectrogram frame in the sequence of M number of committed spectrogram frames preceding the current spectrogram frame. The method also includes synthesizing, for the current spectrogram frame, a new time-domain audio waveform frame based on the estimated phase of the current spectrogram frame.
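The abstract's core loop (keep a window of M committed phases, estimate the current frame's phase from them) can be sketched with a deliberately simple estimator. The linear extrapolation rule and the class interface below are illustrative assumptions; the patent's estimator also conditions on the current frame's magnitude:

```python
import math
from collections import deque

class StreamingPhaseEstimator:
    """Keep the M most recent committed phase frames and estimate
    the next frame's phase per frequency bin by linearly extrapolating
    the committed phase trajectory. A simplified stand-in for the
    committed-phase reconstruction the abstract describes."""

    def __init__(self, m, num_bins):
        self.committed = deque(maxlen=m)  # sequence of M committed frames
        self.num_bins = num_bins

    def commit(self, phase_frame):
        """Record a committed phase frame (one value per frequency bin)."""
        self.committed.append(list(phase_frame))

    def estimate(self):
        """Estimate the current frame's phase from the committed history."""
        if len(self.committed) < 2:
            return [0.0] * self.num_bins
        prev, last = self.committed[-2], self.committed[-1]
        # advance each bin by its most recent per-frame phase increment
        return [(l + (l - p)) % (2 * math.pi) for p, l in zip(prev, last)]
```

The estimated phase, combined with the current frame's magnitude, is what a synthesizer would use to produce the new time-domain waveform frame.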
-
Publication No.: US20220310056A1
Publication Date: 2022-09-29
Application No.: US17655030
Filing Date: 2022-03-16
Applicant: Google LLC
Inventor: Bhuvana Ramabhadran , Zhehuai Chen , Fadi Biadsy , Pedro J. Moreno Mengibar
IPC: G10L13/027 , G10L25/18 , G10L15/22 , G10L15/16 , G10L13/047
Abstract: A method for speech conversion includes receiving, as input to an encoder of a speech conversion model, an input spectrogram corresponding to an utterance, the encoder including a stack of self-attention blocks. The method further includes generating, as output from the encoder, an encoded spectrogram and receiving, as input to a spectrogram decoder of the speech conversion model, the encoded spectrogram generated as output from the encoder. The method further includes generating, as output from the spectrogram decoder, an output spectrogram corresponding to a synthesized speech representation of the utterance.
-
Publication No.: US20240127807A1
Publication Date: 2024-04-18
Application No.: US18391781
Filing Date: 2023-12-21
Applicant: Google LLC
Inventor: Fadi Biadsy , Diamantino Antonio Caseiro
IPC: G10L15/197 , G10L15/02 , G10L15/18 , G10L15/32
CPC classification number: G10L15/197 , G10L15/02 , G10L15/18 , G10L15/32 , G10L15/183
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for language models using domain-specific model components. In some implementations, context data for an utterance is obtained. A domain-specific model component is selected from among multiple domain-specific model components of a language model based on the non-linguistic context of the utterance. A score for a candidate transcription for the utterance is generated using the selected domain-specific model component and a baseline model component of the language model that is domain-independent. A transcription for the utterance is determined using the score, and the transcription is provided as output of an automated speech recognition system.
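Combining a domain-independent baseline component with a context-selected domain-specific component, as this abstract describes, is often done with a log-linear interpolation. The fixed weight and function names below are assumptions for illustration, not the patent's specified combination rule:

```python
def combined_score(candidate, baseline_lm, domain_lms, domain, weight=0.5):
    """Score a candidate transcription by combining the domain-independent
    baseline component with the domain-specific component selected from
    the non-linguistic context. Log-linear combination with a fixed
    weight is one plausible choice, not necessarily the patent's."""
    base = baseline_lm(candidate)  # baseline log-probability
    # fall back to the baseline alone when no component matches the domain
    dom = domain_lms[domain](candidate) if domain in domain_lms else 0.0
    return base + weight * dom
```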
-
Publication No.: US11823685B2
Publication Date: 2023-11-21
Application No.: US18159601
Filing Date: 2023-01-25
Applicant: Google LLC
Inventor: Fadi Biadsy , Pedro J. Moreno Mengibar
CPC classification number: G10L17/22 , G10L15/22 , G10L17/02 , G10L17/04 , G10L17/14 , G10L17/26 , G10L2015/227
Abstract: A method includes receiving acoustic features of a first utterance spoken by a first user that speaks with typical speech and processing the acoustic features of the first utterance using a general speech recognizer to generate a first transcription of the first utterance. The method also includes analyzing the first transcription of the first utterance to identify one or more bias terms in the first transcription and biasing the alternative speech recognizer on the one or more bias terms identified in the first transcription. The method also includes receiving acoustic features of a second utterance spoken by a second user that speaks with atypical speech and processing, using the alternative speech recognizer biased on the one or more terms identified in the first transcription, the acoustic features of the second utterance to generate a second transcription of the second utterance.
-
Publication No.: US20230122941A1
Publication Date: 2023-04-20
Application No.: US18069070
Filing Date: 2022-12-20
Applicant: Google LLC
Inventor: Fadi Biadsy , Diamantino Antonio Caseiro
IPC: G10L15/197 , G10L15/02 , G10L15/32 , G10L15/18
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for language models using domain-specific model components. In some implementations, context data for an utterance is obtained. A domain-specific model component is selected from among multiple domain-specific model components of a language model based on the non-linguistic context of the utterance. A score for a candidate transcription for the utterance is generated using the selected domain-specific model component and a baseline model component of the language model that is domain-independent. A transcription for the utterance is determined using the score, and the transcription is provided as output of an automated speech recognition system.
-
Publication No.: US11557289B2
Publication Date: 2023-01-17
Application No.: US17060347
Filing Date: 2020-10-01
Applicant: Google LLC
Inventor: Fadi Biadsy , Diamantino Antonio Caseiro
IPC: G10L15/18 , G10L15/197 , G10L15/02 , G10L15/32 , G10L15/183 , G10L15/19 , G10L15/22
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for language models using domain-specific model components. In some implementations, context data for an utterance is obtained. A domain-specific model component is selected from among multiple domain-specific model components of a language model based on the non-linguistic context of the utterance. A score for a candidate transcription for the utterance is generated using the selected domain-specific model component and a baseline model component of the language model that is domain-independent. A transcription for the utterance is determined using the score, and the transcription is provided as output of an automated speech recognition system.