-
Publication No.: US11335324B2
Publication Date: 2022-05-17
Application No.: US17008278
Application Date: 2020-08-31
Applicant: Google LLC
Inventor: Fadi Biadsy , Liyang Jiang , Pedro J. Moreno Mengibar , Andrew Rosenberg
Abstract: A method for training a speech conversion model personalized for a target speaker with atypical speech includes obtaining a plurality of transcriptions in a set of spoken training utterances and obtaining a plurality of unspoken training text utterances. Each spoken training utterance is spoken by a target speaker associated with atypical speech and includes a corresponding transcription paired with a corresponding non-synthetic speech representation. The method also includes adapting, using the set of spoken training utterances, a text-to-speech (TTS) model to synthesize speech in a voice of the target speaker and that captures the atypical speech. For each unspoken training text utterance, the method also includes generating, as output from the adapted TTS model, a synthetic speech representation that includes the voice of the target speaker and that captures the atypical speech. The method also includes training the speech conversion model based on the synthetic speech representations.
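The training recipe in this abstract (adapt a TTS model on the target speaker's real utterances, synthesize speech for unspoken text, then train the conversion model on the result) can be outlined as a short sketch. Every name below (adapt_step, synthesize, train_step) is a hypothetical placeholder, not an API from the patent:

```python
# Hedged sketch of the three-stage recipe; all interfaces are assumptions.

def train_personalized_conversion(spoken_utterances, unspoken_texts,
                                  tts_model, conversion_model):
    """spoken_utterances: (transcription, non_synthetic_audio) pairs
    spoken by the target speaker with atypical speech."""
    # 1. Adapt the TTS model to synthesize speech in the target
    #    speaker's voice while capturing the atypical speech.
    for transcription, audio in spoken_utterances:
        tts_model.adapt_step(text=transcription, reference_audio=audio)

    # 2. Generate synthetic speech representations for text utterances
    #    that were never actually spoken.
    synthetic_pairs = [(text, tts_model.synthesize(text))
                       for text in unspoken_texts]

    # 3. Train the speech conversion model on real plus synthetic pairs.
    for transcription, audio in spoken_utterances + synthetic_pairs:
        conversion_model.train_step(input_audio=audio,
                                    target_text=transcription)
```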
-
Publication No.: US20220122579A1
Publication Date: 2022-04-21
Application No.: US17310732
Application Date: 2019-11-26
Applicant: Google LLC
Inventor: Fadi Biadsy , Ron J. Weiss , Aleksandar Kracun , Pedro J. Moreno Mengibar
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for end-to-end speech conversion are disclosed. In one aspect, a method includes the actions of receiving first audio data of a first utterance of one or more first terms spoken by a user. The actions further include providing the first audio data as an input to a model that is configured to receive first given audio data in a first voice and output second given audio data in a synthesized voice without performing speech recognition on the first given audio data. The actions further include receiving second audio data of a second utterance of the one or more first terms spoken in the synthesized voice. The actions further include providing, for output, the second audio data of the second utterance of the one or more first terms spoken in the synthesized voice.
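A minimal PyTorch sketch of the direct audio-to-audio mapping the abstract describes: input-voice features go straight to synthesized-voice features with no intermediate recognition step. The layer types and dimensions are illustrative assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn

class DirectSpeechConverter(nn.Module):
    """Maps first-voice audio features directly to synthesized-voice
    features; no speech recognition is performed on the input."""

    def __init__(self, feat_dim=80, hidden_dim=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, first_audio):              # (batch, time, feat_dim)
        encoded, _ = self.encoder(first_audio)   # latent sequence, no ASR
        decoded, _ = self.decoder(encoded)
        return self.proj(decoded)                # synthesized-voice features
```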
-
Publication No.: US20230395061A1
Publication Date: 2023-12-07
Application No.: US18319410
Application Date: 2023-05-17
Applicant: Google LLC
Inventor: Fadi Biadsy , Oleg Rybakov
IPC: G10L13/047 , G10L15/30 , G10L15/04 , G10L15/16
CPC classification number: G10L13/047 , G10L15/30 , G10L15/04 , G10L15/16
Abstract: A method for turn detection in a speech-to-speech model includes receiving, as input to the speech-to-speech (S2S) model, a sequence of acoustic frames corresponding to an utterance. The method further includes, at each of a plurality of output steps, generating, by an audio encoder of the S2S model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames, and determining, by a turn detector of the S2S model, based on the higher order feature representation generated by the audio encoder at the corresponding output step, whether the utterance is at a breakpoint at the corresponding output step. When the turn detector determines that the utterance is at the breakpoint, the method includes synthesizing a sequence of output audio frames output by a speech decoder of the S2S model into a time-domain audio waveform of synthesized speech representing the utterance spoken by the user.
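The per-frame control flow in this abstract (encode each frame, let the turn detector test for a breakpoint, synthesize only once a breakpoint is reached) can be sketched as below. The method names on s2s_model (audio_encoder, turn_detector, speech_decoder, vocoder) are assumptions for illustration:

```python
def stream_with_turn_detection(s2s_model, acoustic_frames):
    """Hedged sketch of the S2S turn-detection loop."""
    buffered = []
    for frame in acoustic_frames:
        # Higher-order feature representation for the current frame.
        feature = s2s_model.audio_encoder(frame)
        buffered.append(feature)
        # At each output step, decide whether the utterance is at a
        # breakpoint based on the encoder's representation.
        if s2s_model.turn_detector(feature):
            # At a breakpoint: decode the buffered features into output
            # audio frames and synthesize a time-domain waveform.
            output_frames = s2s_model.speech_decoder(buffered)
            yield s2s_model.vocoder(output_frames)
            buffered = []
```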
-
Publication No.: US20230230572A1
Publication Date: 2023-07-20
Application No.: US18188524
Application Date: 2023-03-23
Applicant: Google LLC
Inventor: Fadi Biadsy , Ron J. Weiss , Aleksandar Kracun , Pedro J. Moreno Mengibar
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for end-to-end speech conversion are disclosed. In one aspect, a method includes the actions of receiving first audio data of a first utterance of one or more first terms spoken by a user. The actions further include providing the first audio data as an input to a model that is configured to receive first given audio data in a first voice and output second given audio data in a synthesized voice without performing speech recognition on the first given audio data. The actions further include receiving second audio data of a second utterance of the one or more first terms spoken in the synthesized voice. The actions further include providing, for output, the second audio data of the second utterance of the one or more first terms spoken in the synthesized voice.
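This continuation shares its abstract with US20220122579A1 above. As a usage note continuing the hypothetical DirectSpeechConverter sketch from that record, inference is a single forward pass that never produces a transcription:

```python
import torch

# Uses the hypothetical DirectSpeechConverter defined in the earlier
# sketch; the input below is placeholder data, not real features.
model = DirectSpeechConverter()
first_audio = torch.randn(1, 200, 80)   # first utterance, first voice
with torch.no_grad():
    second_audio = model(first_audio)   # same terms, synthesized voice
```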
-
Publication No.: US20230169983A1
Publication Date: 2023-06-01
Application No.: US18159601
Application Date: 2023-01-25
Applicant: Google LLC
Inventor: Fadi Biadsy , Pedro J. Moreno Mengibar
Abstract: A method includes receiving acoustic features of a first utterance spoken by a first user who speaks with typical speech and processing the acoustic features of the first utterance using a general speech recognizer to generate a first transcription of the first utterance. The method also includes analyzing the first transcription of the first utterance to identify one or more bias terms in the first transcription and biasing an alternative speech recognizer on the one or more bias terms identified in the first transcription. The method also includes receiving acoustic features of a second utterance spoken by a second user who speaks with atypical speech and processing, using the alternative speech recognizer biased on the one or more bias terms identified in the first transcription, the acoustic features of the second utterance to generate a second transcription of the second utterance.
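The two-recognizer flow reads naturally as a short sketch: transcribe the typical speaker, harvest bias terms, then bias the alternative recognizer before it handles the atypical speaker. The recognizer interfaces (transcribe, set_bias_terms) and the length-based term selector are assumptions, not the patent's method:

```python
def recognize_atypical_speech(general_asr, alternative_asr,
                              first_audio, second_audio):
    """Hedged sketch of the cross-speaker biasing flow."""
    # Transcribe the typical speaker with the general recognizer.
    first_transcription = general_asr.transcribe(first_audio)

    # Identify bias terms in the transcript; a real system would use a
    # smarter selector than this longer-word placeholder heuristic.
    bias_terms = [w for w in first_transcription.split() if len(w) > 4]

    # Bias the alternative recognizer toward those terms, then use it
    # on the atypical speaker's utterance.
    alternative_asr.set_bias_terms(bias_terms)
    return alternative_asr.transcribe(second_audio)
```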
-
Publication No.: US20230009613A1
Publication Date: 2023-01-12
Application No.: US17756995
Application Date: 2019-12-13
Applicant: Google LLC
Inventor: Andrew Rosenberg , Bhuvana Ramabhadran , Fadi Biadsy , Yu Zhang
IPC: G10L13/047 , G10L13/08 , G10L15/16 , G10L15/06
Abstract: A method (800) of training a text-to-speech (TTS) model (108) includes obtaining training data (150) including reference input text (104) that includes a sequence of characters, a sequence of reference audio features (402) representative of the sequence of characters, and a sequence of reference phone labels (502) representative of distinct speech sounds of the reference audio features. For each of a plurality of time steps, the method includes generating a corresponding predicted audio feature (120) based on a respective portion of the reference input text for the time step and generating, using a phone label mapping network (510), a corresponding predicted phone label (520) associated with the predicted audio feature. The method also includes aligning the predicted phone label with the reference phone label to determine a corresponding predicted phone label loss (622) and updating the TTS model based on the corresponding predicted phone label loss.
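One training step of this recipe might look like the sketch below, where tts_model and phone_mapper stand in for the TTS model (108) and the phone label mapping network (510), and cross-entropy is used as a stand-in for whatever alignment-based phone label loss (622) the patent actually specifies:

```python
import torch.nn.functional as F

def phone_label_loss_step(tts_model, phone_mapper, input_text,
                          reference_phone_labels, optimizer):
    """Hedged sketch of one TTS update on the phone label loss."""
    # Predict audio features from the reference input text, one per
    # time step.
    predicted_features = tts_model(input_text)   # (batch, time, feat_dim)

    # Map each predicted audio feature to a predicted phone label
    # distribution via the phone label mapping network.
    phone_logits = phone_mapper(predicted_features)  # (batch, time, phones)

    # Align predicted phone labels with the reference phone labels and
    # compute the predicted phone label loss.
    loss = F.cross_entropy(phone_logits.transpose(1, 2),
                           reference_phone_labels)

    # Update the TTS model based on the phone label loss.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```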
-
Publication No.: US20190244610A1
Publication Date: 2019-08-08
Application No.: US16257856
Application Date: 2019-01-25
Applicant: Google LLC
Inventor: Fadi Biadsy , Pedro J. Moreno Mengibar
IPC: G10L15/22
CPC classification number: G10L15/22 , G10L2015/223
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating expressions associated with voice commands. The methods, systems, and apparatus include actions of obtaining segments of one or more expressions associated with a voice command. Further actions include combining the segments into a candidate expression and scoring the candidate expression using a text corpus. Additional actions include selecting the candidate expression as an expression associated with the voice command based on the scoring of the candidate expression.
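The combine-score-select pipeline can be sketched in a few lines; corpus_score is an assumed callable (e.g. a language-model probability against the text corpus), and exhaustive permutation is a placeholder for whatever combination strategy the patent uses:

```python
from itertools import permutations

def select_command_expressions(segments, corpus_score, threshold=0.5):
    """Hedged sketch: combine expression segments, score candidates
    against a text corpus, and select the well-scoring ones."""
    selected = []
    # Combine the segments into candidate expressions.
    for ordering in permutations(segments):
        candidate = " ".join(ordering)
        # Score the candidate using the text corpus and select it as an
        # expression for the voice command if it scores well enough.
        if corpus_score(candidate) >= threshold:
            selected.append(candidate)
    return selected
```

For instance, segments like "set", "an alarm", "for 7 am" would yield candidates such as "set an alarm for 7 am", with the corpus score filtering out implausible orderings.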
-
Publication No.: US12230258B2
Publication Date: 2025-02-18
Application No.: US17659836
Application Date: 2022-04-19
Applicant: Google LLC
Inventor: Fadi Biadsy , Pedro J. Moreno Mengibar
IPC: G10L15/183 , G06N3/04
Abstract: A method for contextual biasing for speech recognition includes obtaining a base automatic speech recognition (ASR) model trained on non-biased data and a sub-model trained on biased data representative of a particular domain. The method includes receiving a speech recognition request including audio data characterizing an utterance captured in streaming audio. The method further includes determining whether the speech recognition request includes a contextual indicator indicating the particular domain. When the speech recognition request does not include the contextual indicator, the method includes generating, using the base ASR model, a first speech recognition result of the utterance by processing the audio data. When the speech recognition request includes the contextual indicator, the method includes biasing, using the sub-model, the base ASR model toward the particular domain and generating, using the biased base ASR model, a second speech recognition result of the utterance by processing the audio data.
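The request-routing logic is straightforward to sketch. The model interfaces (bias_with, recognize) and the request fields are assumptions for illustration only:

```python
def handle_recognition_request(base_asr, domain_submodels, request):
    """Hedged sketch of routing between the base and biased ASR paths."""
    audio = request["audio_data"]
    domain = request.get("contextual_indicator")  # e.g. "medical", or None

    if domain is None:
        # No contextual indicator: decode with the unbiased base model.
        return base_asr.recognize(audio)

    # Contextual indicator present: bias the base model toward the
    # domain using the sub-model trained on biased data, then decode.
    biased_asr = base_asr.bias_with(domain_submodels[domain])
    return biased_asr.recognize(audio)
```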
-
Publication No.: US12205578B2
Publication Date: 2025-01-21
Application No.: US17788183
Application Date: 2021-01-07
Applicant: Google LLC
Inventor: Fadi Biadsy , Johan Schalkwyk , Jason Pelecanos
Abstract: Implementations disclosed herein are directed to techniques for selectively enabling and/or disabling non-transient storage of one or more instances of assistant interaction data for turn(s) of a dialog between a user and an automated assistant. Implementations are additionally or alternatively directed to techniques for retroactive wiping of non-transiently stored assistant interaction data from previous assistant interaction(s).
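One way to picture the two capabilities (selective non-transient storage per turn, plus retroactive wiping) is the sketch below; the class, its storage scheme, and all names are assumptions, not the patent's implementation:

```python
import time

class AssistantInteractionLog:
    """Hedged sketch of selectively persisted dialog turns."""

    def __init__(self):
        self._turns = []  # (timestamp, data, persist_flag)

    def record_turn(self, data, persist=False):
        # Non-transient storage happens only when explicitly enabled
        # for this turn of the dialog.
        self._turns.append((time.time(), data, persist))

    def persisted_turns(self):
        return [d for _, d, persist in self._turns if persist]

    def wipe_since(self, cutoff_timestamp):
        # Retroactively drop stored assistant interaction data from
        # previous interactions, e.g. on a user's request to forget.
        self._turns = [t for t in self._turns if t[0] < cutoff_timestamp]
```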
-
Publication No.: US12190862B2
Publication Date: 2025-01-07
Application No.: US17660487
Application Date: 2022-04-25
Applicant: Google LLC
Inventor: Andrew M. Rosenberg , Gary Wang , Bhuvana Ramabhadran , Fadi Biadsy
IPC: G10L15/06 , G10L13/02 , G10L15/197 , G10L15/22 , G10L19/038 , G10L21/003 , G10L15/16 , G10L19/00
Abstract: A method includes receiving a set of training utterances each including a non-synthetic speech representation of a corresponding utterance, and for each training utterance, generating a corresponding synthetic speech representation by using a voice conversion model. The non-synthetic speech representation and the synthetic speech representation form a corresponding training utterance pair. At each of a plurality of output steps for each training utterance pair, the method also includes generating, for output by a speech recognition model, a first probability distribution over possible non-synthetic speech recognition hypotheses for the non-synthetic speech representation and a second probability distribution over possible synthetic speech recognition hypotheses for the synthetic speech representation. The method also includes determining a consistent loss term for the corresponding training utterance pair based on the first and second probability distributions and updating parameters of the speech recognition model based on the consistent loss term.
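The consistency term compares the recognizer's hypothesis distributions for a real/synthetic utterance pair. A minimal PyTorch sketch, assuming the ASR model returns per-step log-probabilities and using KL divergence as one plausible choice of consistency measure (the patent does not specify this particular form):

```python
import torch.nn.functional as F

def consistency_loss(asr_model, real_audio, synthetic_audio):
    """Hedged sketch of the consistency term for one utterance pair."""
    # First distribution: hypotheses for the non-synthetic speech.
    real_log_probs = asr_model(real_audio)        # (steps, vocab), log-probs
    # Second distribution: hypotheses for the synthetic speech.
    synth_log_probs = asr_model(synthetic_audio)  # (steps, vocab), log-probs

    # Penalize disagreement between the two distributions; the speech
    # recognition model's parameters are then updated on this term.
    return F.kl_div(synth_log_probs, real_log_probs,
                    log_target=True, reduction="batchmean")
```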