-
Publication No.: US12032920B2
Publication Date: 2024-07-09
Application No.: US17056554
Filing Date: 2020-03-07
Applicant: Google LLC
Inventor: Ye Jia, Zhifeng Chen, Yonghui Wu, Melvin Johnson, Fadi Biadsy, Ron Weiss, Wolfgang Macherey
Abstract: The present disclosure provides systems and methods that train and use machine-learned models such as, for example, sequence-to-sequence models, to perform direct and text-free speech-to-speech translation. In particular, aspects of the present disclosure provide an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation.
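The abstract describes an attention-based sequence-to-sequence network that maps source-language speech directly to target-language speech with no intermediate text. The PyTorch sketch below is a minimal illustration of that shape (spectrogram frames in, spectrogram frames out); the module choices, dimensions, and the name S2STranslator are assumptions for illustration, not the patented architecture.

```python
import torch
import torch.nn as nn

class S2STranslator(nn.Module):
    """Toy attention seq2seq mapping source-language spectrogram frames
    directly to target-language spectrogram frames (no text bottleneck)."""

    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.decoder = nn.LSTM(n_mels, 2 * hidden, batch_first=True)
        self.out = nn.Linear(4 * hidden, n_mels)

    def forward(self, src_spec, tgt_spec):
        memory, _ = self.encoder(src_spec)      # (B, T_src, 2H) encoded source speech
        dec_states, _ = self.decoder(tgt_spec)  # (B, T_tgt, 2H) teacher-forced decoder
        context, _ = self.attn(dec_states, memory, memory)  # attend over source frames
        return self.out(torch.cat([dec_states, context], dim=-1))

model = S2STranslator()
src = torch.randn(2, 120, 80)   # source-language log-mel frames
tgt = torch.randn(2, 100, 80)   # target-language frames (teacher forcing)
print(model(src, tgt).shape)    # torch.Size([2, 100, 80])
```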
-
Publication No.: US20240021190A1
Publication Date: 2024-01-18
Application No.: US17813322
Filing Date: 2022-07-18
Applicant: Google LLC
Inventor: Fadi Biadsy, Pedro Jose Moreno Mengibar
CPC classification number: G10L15/063, G10L15/16, G10L13/02, G10L15/22, G10L2015/0635
Abstract: A method for training a sub-model for contextual biasing for speech recognition includes obtaining a base speech recognition model trained on non-biased data. The method includes obtaining a set of training utterances representative of a particular domain, each training utterance in the set of training utterances including audio data characterizing the training utterance and a ground truth transcription of the training utterance. The method further includes, for each corresponding training utterance in the set of training utterances, determining, using an embedding encoder, a corresponding document embedding from the ground truth transcription of the corresponding training utterance. The method includes training, using the corresponding document embeddings determined from the ground truth transcriptions of the set of training utterances, a sub-model to bias the base speech recognition model to recognize speech in the particular domain.
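One way to picture the training recipe: the base recognizer stays frozen, each ground-truth transcription is reduced to a document embedding, and only a small sub-model that biases the recognizer's scores is updated. The PyTorch sketch below assumes this reading; the components (embed_doc, BiasSubModel, the additive bias over output scores, and the crude frame-to-token alignment) are invented for illustration.

```python
import torch
import torch.nn as nn

VOCAB, EMB_DIM, FEAT_DIM = 1000, 64, 80

def embed_doc(token_ids, table):
    """Toy document embedding: mean of the transcript's token embeddings."""
    return table(token_ids).mean(dim=0)

class BiasSubModel(nn.Module):
    """Maps a document embedding to a per-class bias over the vocabulary."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(EMB_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, VOCAB))
    def forward(self, doc_emb):
        return self.net(doc_emb)

base_asr = nn.Linear(FEAT_DIM, VOCAB)        # stand-in for the base recognizer
for p in base_asr.parameters():
    p.requires_grad = False                   # base model is not updated

table = nn.Embedding(VOCAB, EMB_DIM)
sub_model = BiasSubModel()
opt = torch.optim.Adam(sub_model.parameters(), lr=1e-3)

# One illustrative (audio, transcript) training pair from the target domain.
audio_feats = torch.randn(50, FEAT_DIM)       # 50 frames of audio features
transcript = torch.randint(0, VOCAB, (12,))   # ground-truth token ids

doc_emb = embed_doc(transcript, table)
logits = base_asr(audio_feats) + sub_model(doc_emb)  # biased recognition scores
# Crudely align the first 12 frames to the 12 tokens, for illustration only.
loss = nn.functional.cross_entropy(logits[:12], transcript)
loss.backward()
opt.step()                                    # only the sub-model moves
print(float(loss))
```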
-
Publication No.: US11875789B2
Publication Date: 2024-01-16
Application No.: US18069070
Filing Date: 2022-12-20
Applicant: Google LLC
Inventor: Fadi Biadsy, Diamantino Antonio Caseiro
IPC: G10L15/18, G10L15/197, G10L15/02, G10L15/32, G10L15/08, G10L15/183, G10L15/19, G10L15/22
CPC classification number: G10L15/197, G10L15/02, G10L15/08, G10L15/32, G10L15/183, G10L15/19, G10L2015/226, G10L2015/228
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for language models using domain-specific model components. In some implementations, context data for an utterance is obtained. A domain-specific model component is selected from among multiple domain-specific model components of a language model based on the non-linguistic context of the utterance. A score for a candidate transcription for the utterance is generated using the selected domain-specific model component and a baseline model component of the language model that is domain-independent. A transcription for the utterance is determined using the score, and the transcription is provided as output of an automated speech recognition system.
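A minimal sketch of the scoring scheme this abstract describes: a domain-independent baseline score is combined with a score from a domain-specific component selected by non-linguistic context (here, the active app). The toy scores, the selection rule, and the log-linear interpolation weight are all illustrative assumptions.

```python
# Toy per-component log-probabilities for two candidate transcriptions.
baseline_logprob = {"play some jazz": -4.2, "play sum jars": -7.9}
domain_components = {
    "media": {"play some jazz": -1.1, "play sum jars": -9.0},
    "maps":  {"play some jazz": -8.5, "play sum jars": -8.7},
}

def select_component(context):
    """Pick a domain component from non-linguistic context (e.g. active app)."""
    return "media" if context.get("app") == "music_player" else "maps"

def score(candidate, context, lam=0.5):
    """Log-linear interpolation of baseline and domain-specific scores."""
    domain = domain_components[select_component(context)]
    return (1 - lam) * baseline_logprob[candidate] + lam * domain[candidate]

context = {"app": "music_player"}
candidates = ["play some jazz", "play sum jars"]
best = max(candidates, key=lambda c: score(c, context))
print(best)  # "play some jazz" wins once the media component is selected
```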
-
Publication No.: US20230298574A1
Publication Date: 2023-09-21
Application No.: US18184630
Filing Date: 2023-03-15
Applicant: Google LLC
Inventor: Fadi Biadsy, Youzheng Chen, Xia Zhang, Oleg Rybakov, Andrew M. Rosenberg, Pedro J. Moreno Mengibar
CPC classification number: G10L15/16, G10L15/063, G10L15/02, G10L2015/025
Abstract: A method for speech conversion includes obtaining a speech conversion model configured to convert input utterances of human speech directly into corresponding output utterances of synthesized speech. The method further includes receiving a speech conversion request including input audio data corresponding to an utterance spoken by a target speaker associated with atypical speech and a speaker identifier uniquely identifying the target speaker. The method includes activating, using the speaker identifier, a particular sub-model for biasing the speech conversion model to recognize a type of the atypical speech associated with the target speaker identified by the speaker identifier. The method includes converting, using the speech conversion model biased by the activated particular sub-model, the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into output audio data corresponding to a synthesized canonical fluent speech representation of the utterance spoken by the target speaker.
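The key mechanism above is a speaker identifier that activates a per-speaker sub-model, which biases the conversion model toward that speaker's type of atypical speech. A minimal sketch follows, assuming the sub-model acts as a residual adapter on the encoder output; the names, the registry keyed by speaker id, and the adapter form are illustrative.

```python
import torch
import torch.nn as nn

FEAT = 80

class ConversionModel(nn.Module):
    """Stand-in speech conversion model; a sub-model can bias its encoder."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(FEAT, FEAT)
        self.decoder = nn.Linear(FEAT, FEAT)

    def forward(self, spec, sub_model=None):
        h = self.encoder(spec)
        if sub_model is not None:     # bias toward this speaker's speech type
            h = h + sub_model(spec)
        return self.decoder(h)

# One lightweight adapter registered per target speaker with atypical speech.
sub_models = {"speaker_42": nn.Linear(FEAT, FEAT)}

def convert(model, spec, speaker_id):
    """Activate the sub-model registered for speaker_id, then convert."""
    sub = sub_models.get(speaker_id)  # speaker id uniquely selects the adapter
    return model(spec, sub_model=sub)

model = ConversionModel()
utterance = torch.randn(1, 60, FEAT)  # input spectrogram frames
out = convert(model, utterance, "speaker_42")
print(out.shape)  # features of the synthesized canonical fluent speech
```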
-
Publication No.: US11580994B2
Publication Date: 2023-02-14
Application No.: US17153495
Filing Date: 2021-01-20
Applicant: Google LLC
Inventor: Fadi Biadsy, Pedro Jose Moreno Mengibar
Abstract: A method includes receiving acoustic features of a first utterance spoken by a first user that speaks with typical speech and processing the acoustic features of the first utterance using a general speech recognizer to generate a first transcription of the first utterance. The method also includes analyzing the first transcription of the first utterance to identify one or more bias terms in the first transcription and biasing the alternative speech recognizer on the one or more bias terms identified in the first transcription. The method also includes receiving acoustic features of a second utterance spoken by a second user that speaks with atypical speech and processing, using the alternative speech recognizer biased on the one or more terms identified in the first transcription, the acoustic features of the second utterance to generate a second transcription of the second utterance.
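The two-recognizer flow lends itself to a simple sketch: transcribe the typical speaker with the general recognizer, pull candidate bias terms from that transcription, and boost alternative-recognizer hypotheses that contain those terms. The term-extraction heuristic and additive score boost below are assumptions, not the patented method.

```python
import re

COMMON = {"the", "a", "to", "and", "is", "can", "you"}

def extract_bias_terms(transcription):
    """Pick candidate bias terms: any word not in a small common-word list."""
    words = re.findall(r"[a-z']+", transcription.lower())
    return {w for w in words if w not in COMMON}

def rescore(hypotheses, bias_terms, boost=2.0):
    """Pick the alternative-recognizer hypothesis after a per-term bonus."""
    def biased_score(hyp):
        text, score = hyp
        hits = sum(1 for w in text.lower().split() if w in bias_terms)
        return score + boost * hits
    return max(hypotheses, key=biased_score)

# First utterance (typical speech), transcribed by the general recognizer.
first_transcription = "Can you call Saoirse about the quarterly figures"
terms = extract_bias_terms(first_transcription)

# N-best list from the alternative recognizer for the second (atypical) utterance.
nbest = [("call sear shah about figures", -12.0),
         ("call saoirse about figures", -12.4)]
print(rescore(nbest, terms))  # the extra bias-term hit "saoirse" flips the choice
```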
-
Publication No.: US10134394B2
Publication Date: 2018-11-20
Application No.: US14708465
Filing Date: 2015-05-11
Applicant: Google LLC
Inventor: Diamantino Antonio Caseiro, Fadi Biadsy
IPC: G10L15/197, G06F17/27
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, relating to generating log-linear models. In some implementations, n-gram parameter values derived from an n-gram language model are obtained. N-gram features for a log-linear language model are determined based on the n-grams corresponding to the obtained n-gram parameter values. A weight for each of the determined n-gram features is determined, where the weight is determined based on (i) an n-gram parameter value that is derived from the n-gram language model and that corresponds to a particular n-gram, and (ii) an n-gram parameter value that is derived from the n-gram language model and that corresponds to an n-gram that is a sub-sequence within the particular n-gram. A log-linear language model having the determined n-gram features is generated, where the determined n-gram features in the log-linear language model have weights that are initialized based on the determined weights.
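The weight rule in the abstract pairs each n-gram's parameter with that of a sub-sequence n-gram. A common instantiation (assumed here, not taken from the patent text) is the difference of log-probabilities between an n-gram and its shortened-history sub-sequence, which makes the log-linear model reproduce the n-gram model's scores at initialization:

```python
# Toy n-gram log probabilities, ARPA style: history -> {word: logprob}.
ngram_logprob = {
    (): {"york": -3.0, "new": -2.0},
    ("new",): {"york": -0.5},
}

def feature_weight(history, word):
    """Weight for feature (history, word): this n-gram's parameter minus the
    parameter of its sub-sequence (same word, history shortened by one)."""
    full = ngram_logprob[history][word]
    sub = ngram_logprob[history[1:]][word]  # sub-sequence within the n-gram
    return full - sub

def loglinear_score(history, word):
    """Sum the weights of all active n-gram features for this prediction."""
    score = 0.0
    for k in range(len(history) + 1):
        h = history[k:]
        if h in ngram_logprob and word in ngram_logprob[h]:
            if h:
                score += feature_weight(h, word)
            else:
                score += ngram_logprob[()][word]  # unigram feature weight
    return score

# At initialization the log-linear score matches the n-gram model exactly:
print(loglinear_score(("new",), "york"))  # -0.5, the bigram parameter
print(loglinear_score((), "york"))        # -3.0, the unigram parameter
```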
-
Publication No.: US12272348B2
Publication Date: 2025-04-08
Application No.: US17655030
Filing Date: 2022-03-16
Applicant: Google LLC
Inventor: Bhuvana Ramabhadran, Zhehuai Chen, Fadi Biadsy, Pedro J. Moreno Mengibar
IPC: G10L13/027, G10L13/047, G10L15/16, G10L15/22, G10L25/18
Abstract: A method for speech conversion includes receiving, as input to an encoder of a speech conversion model, an input spectrogram corresponding to an utterance, the encoder including a stack of self-attention blocks. The method further includes generating, as output from the encoder, an encoded spectrogram and receiving, as input to a spectrogram decoder of the speech conversion model, the encoded spectrogram generated as output from the encoder. The method further includes generating, as output from the spectrogram decoder, an output spectrogram corresponding to a synthesized speech representation of the utterance.
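A minimal sketch of the encoder/decoder split the abstract describes: a stack of self-attention blocks encodes the input spectrogram, and a spectrogram decoder emits output frames for the synthesized speech. The Transformer encoder layers and toy decoder below stand in for the model's actual blocks, which the abstract does not specify further.

```python
import torch
import torch.nn as nn

class SpecToSpecConverter(nn.Module):
    """Encoder = stack of self-attention blocks over the input spectrogram;
    decoder maps the encoded spectrogram to output spectrogram frames."""

    def __init__(self, n_mels=80, d_model=256, n_blocks=4):
        super().__init__()
        self.proj_in = nn.Linear(n_mels, d_model)
        block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=n_blocks)
        self.decoder = nn.Sequential(   # stand-in spectrogram decoder
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, n_mels))

    def forward(self, spec):
        encoded = self.encoder(self.proj_in(spec))  # encoded spectrogram
        return self.decoder(encoded)                # synthesized-speech frames

model = SpecToSpecConverter()
utterance = torch.randn(1, 120, 80)  # input spectrogram for one utterance
print(model(utterance).shape)        # torch.Size([1, 120, 80])
```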
-
Publication No.: US20250037700A1
Publication Date: 2025-01-30
Application No.: US18919366
Filing Date: 2024-10-17
Applicant: Google LLC
Inventor: Fadi Biadsy, Dirk Ryan Padfield, Victoria Zayats
Abstract: A method includes receiving a reference audio signal corresponding to reference speech spoken by a target speaker with atypical speech, and generating, by a speaker embedding network configured to receive the reference audio signal as input, a speaker embedding for the target speaker. The speaker embedding conveys speaker characteristics of the target speaker. The method also includes receiving a speech conversion request that includes input audio data corresponding to an utterance spoken by the target speaker associated with the atypical speech. The method also includes biasing, using the speaker embedding generated for the target speaker by the speaker embedding network, a speech conversion model to convert the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into an output canonical representation of the utterance spoken by the target speaker.
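A minimal sketch of the speaker-embedding biasing described in this abstract (the related publication US12136410B2 below shares the same abstract): a speaker embedding network summarizes reference audio from the target speaker, and that embedding conditions the conversion model. The GRU summarizer and additive conditioning below are assumptions, not the patented networks.

```python
import torch
import torch.nn as nn

FEAT, EMB = 80, 128

class SpeakerEmbeddingNet(nn.Module):
    """Summarizes a reference audio signal into a fixed speaker embedding."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(FEAT, EMB, batch_first=True)
    def forward(self, ref_audio):
        _, h = self.rnn(ref_audio)
        return h[-1]                   # (B, EMB): speaker characteristics

class BiasedConversionModel(nn.Module):
    """Conversion model conditioned on the speaker embedding at every frame."""
    def __init__(self):
        super().__init__()
        self.cond = nn.Linear(EMB, FEAT)
        self.body = nn.Sequential(nn.Linear(FEAT, 256), nn.ReLU(),
                                  nn.Linear(256, FEAT))
    def forward(self, spec, spk_emb):
        biased = spec + self.cond(spk_emb).unsqueeze(1)  # inject speaker bias
        return self.body(biased)       # canonical representation of utterance

embed_net = SpeakerEmbeddingNet()
converter = BiasedConversionModel()

reference = torch.randn(1, 200, FEAT)  # reference speech from target speaker
utterance = torch.randn(1, 90, FEAT)   # new utterance to convert
spk_emb = embed_net(reference)
canonical = converter(utterance, spk_emb)
print(canonical.shape)                 # torch.Size([1, 90, 80])
```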
-
Publication No.: US12136410B2
Publication Date: 2024-11-05
Application No.: US17661832
Filing Date: 2022-05-03
Applicant: Google LLC
Inventor: Fadi Biadsy, Dirk Ryan Padfield, Victoria Zayats
Abstract: A method includes receiving a reference audio signal corresponding to reference speech spoken by a target speaker with atypical speech, and generating, by a speaker embedding network configured to receive the reference audio signal as input, a speaker embedding for the target speaker. The speaker embedding conveys speaker characteristics of the target speaker. The method also includes receiving a speech conversion request that includes input audio data corresponding to an utterance spoken by the target speaker associated with the atypical speech. The method also includes biasing, using the speaker embedding generated for the target speaker by the speaker embedding network, a speech conversion model to convert the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into an output canonical representation of the utterance spoken by the target speaker.
-
Publication No.: US20230298565A1
Publication Date: 2023-09-21
Application No.: US17660487
Filing Date: 2022-04-25
Applicant: Google LLC
Inventor: Andrew M. Rosenberg, Gary Wang, Bhuvana Ramabhadran, Fadi Biadsy
IPC: G10L15/06, G10L15/197, G10L13/02, G10L19/038, G10L15/22
CPC classification number: G10L15/063, G10L15/197, G10L13/02, G10L19/038, G10L15/22, G10L2015/0635, G10L2019/0001
Abstract: A method includes receiving a set of training utterances each including a non-synthetic speech representation of a corresponding utterance, and for each training utterance, generating a corresponding synthetic speech representation by using a voice conversion model. The non-synthetic speech representation and the synthetic speech representation form a corresponding training utterance pair. At each of a plurality of output steps for each training utterance pair, the method also includes generating, for output by a speech recognition model, a first probability distribution over possible non-synthetic speech recognition hypotheses for the non-synthetic speech representation and a second probability distribution over possible synthetic speech recognition hypotheses for the synthetic speech representation. The method also includes determining a consistent loss term for the corresponding training utterance pair based on the first and second probability distributions and updating parameters of the speech recognition model based on the consistent loss term.
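The training signal described above compares the recognizer's output distributions for the non-synthetic and synthetic versions of the same utterance. A minimal sketch follows, assuming a symmetric KL divergence as the consistency term; the abstract does not fix the exact divergence, and the stand-in recognizer is illustrative.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_real, logits_synth):
    """Symmetric KL between the model's output distributions for the
    non-synthetic and synthetic speech representations of one utterance."""
    p = F.log_softmax(logits_real, dim=-1)
    q = F.log_softmax(logits_synth, dim=-1)
    kl_pq = F.kl_div(q, p, log_target=True, reduction="batchmean")  # KL(p||q)
    kl_qp = F.kl_div(p, q, log_target=True, reduction="batchmean")  # KL(q||p)
    return 0.5 * (kl_pq + kl_qp)

# Stand-in recognizer producing per-step distributions over hypotheses.
recognizer = torch.nn.Linear(80, 500)
real_feats = torch.randn(1, 100, 80)   # non-synthetic speech features
synth_feats = torch.randn(1, 100, 80)  # voice-converted (synthetic) counterpart

loss = consistency_loss(recognizer(real_feats), recognizer(synth_feats))
loss.backward()                        # gradients for the parameter update
print(float(loss))
```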