Synthesized data augmentation using voice conversion and speech recognition models

    Publication number: US11335324B2

    Publication date: 2022-05-17

    Application number: US17008278

    Application date: 2020-08-31

    Applicant: Google LLC

    Abstract: A method for training a speech conversion model personalized for a target speaker with atypical speech includes obtaining a plurality of transcriptions in a set of spoken training utterances and obtaining a plurality of unspoken training text utterances. Each spoken training utterance is spoken by a target speaker associated with atypical speech and includes a corresponding transcription paired with a corresponding non-synthetic speech representation. The method also includes adapting, using the set of spoken training utterances, a text-to-speech (TTS) model to synthesize speech in a voice of the target speaker and that captures the atypical speech. For each unspoken training text utterance, the method also includes generating, as output from the adapted TTS model, a synthetic speech representation that includes the voice of the target speaker and that captures the atypical speech. The method also includes training the speech conversion model based on the synthetic speech representations.
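    The augmentation flow this abstract describes can be outlined as a pipeline: adapt a TTS model on the target speaker's (transcription, audio) pairs, synthesize the unspoken texts in that voice, then train the conversion model on the synthetic output. The sketch below is a minimal Python illustration of that flow only; every function name and data shape is a hypothetical stand-in, not an API from the patent or any real library.

```python
def adapt_tts(base_tts, spoken_utterances):
    """Adapt a base TTS model on (transcription, audio) pairs from the target speaker."""
    # Stub: retain the speaker's pairs so synthesis can mimic the voice.
    return {"base": base_tts, "speaker_pairs": list(spoken_utterances)}

def synthesize(adapted_tts, text):
    """Produce a synthetic speech representation in the target speaker's voice."""
    return {"text": text, "voice": "target", "synthetic": True}

def train_conversion_model(examples):
    """Train the speech conversion model on the synthetic representations."""
    return {"trained_on": len(examples)}

# Toy data: two spoken pairs from the target speaker, three unspoken texts.
spoken = [("hello world", b"\x00\x01"), ("good morning", b"\x02\x03")]
unspoken_texts = ["see you later", "thank you", "how are you"]

tts = adapt_tts("base-tts", spoken)
synthetic = [synthesize(tts, t) for t in unspoken_texts]
model = train_conversion_model(synthetic)
```

The point of the structure is that the unspoken texts never need a human recording: the adapted TTS supplies the audio side of every new training example.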

    END-TO-END SPEECH CONVERSION
    Invention application

    Publication number: US20220122579A1

    Publication date: 2022-04-21

    Application number: US17310732

    Application date: 2019-11-26

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for end to end speech conversion are disclosed. In one aspect, a method includes the actions of receiving first audio data of a first utterance of one or more first terms spoken by a user. The actions further include providing the first audio data as an input to a model that is configured to receive first given audio data in a first voice and output second given audio data in a synthesized voice without performing speech recognition on the first given audio data. The actions further include receiving second audio data of a second utterance of the one or more first terms spoken in the synthesized voice. The actions further include providing, for output, the second audio data of the second utterance of the one or more first terms spoken in the synthesized voice.
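    The key property in this abstract is the model's interface: audio in, audio out, with no speech-recognition step in between. A minimal Python sketch of that interface follows; the toy model and all names are illustrative assumptions, not the patent's implementation.

```python
def convert_speech(model, first_audio):
    # The model maps audio in the user's voice directly to audio in a
    # synthesized voice; no transcription is produced at any point.
    return model(first_audio)

# Toy "model": preserves the spoken content, replaces the voice attribute.
toy_model = lambda audio: {"content": audio["content"], "voice": "synthesized"}

first_utterance = {"content": "turn on the lights", "voice": "user"}
second_audio = convert_speech(toy_model, first_utterance)
```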

    Streaming Speech-to-speech Model With Automatic Speaker Turn Detection

    Publication number: US20230395061A1

    Publication date: 2023-12-07

    Application number: US18319410

    Application date: 2023-05-17

    Applicant: Google LLC

    CPC classification number: G10L13/047 G10L15/30 G10L15/04 G10L15/16

    Abstract: A method for turn detection in a speech-to-speech model includes receiving, as input to the speech-to-speech (S2S) model, a sequence of acoustic frames corresponding to an utterance. The method further includes, at each of a plurality of output steps, generating, by an audio encoder of the S2S model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames, and determining, by a turn detector of the S2S model, based on the higher order feature representation generated by the audio encoder at the corresponding output step, whether the utterance is at a breakpoint at the corresponding output step. When the turn detector determines that the utterance is at the breakpoint, the method includes synthesizing a sequence of output audio frames output by a speech decoder of the S2S model into a time-domain audio waveform of synthesized speech representing the utterance spoken by the user.
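    The per-frame loop in this abstract (encode each frame, decode an output frame, check the turn detector, and vocode the pending frames when a breakpoint is found) can be sketched in a few lines. The stand-ins below are deliberately trivial (frames are integers, a zero frame marks a breakpoint); none of this reflects the patent's actual models.

```python
def stream_s2s(frames, encode, detect_turn, decode, vocode):
    """Process acoustic frames one at a time; emit a waveform per detected turn."""
    pending, waveforms = [], []
    for frame in frames:
        feature = encode(frame)          # audio encoder: higher-order feature
        pending.append(decode(feature))  # speech decoder: output audio frame
        if detect_turn(feature):         # turn detector: is this a breakpoint?
            waveforms.append(vocode(pending))  # synthesize pending frames
            pending = []
    return waveforms

# Toy stand-ins for the four components of the S2S model.
turns = stream_s2s(
    [1, 2, 0, 3, 0],
    encode=lambda f: f,
    detect_turn=lambda feat: feat == 0,
    decode=lambda feat: feat * 2,
    vocode=tuple,
)
```

Because synthesis is triggered by the detector rather than by end-of-input, the model can stream: each speaker turn is emitted as soon as its breakpoint is seen.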

    END-TO-END SPEECH CONVERSION
    Invention publication

    Publication number: US20230230572A1

    Publication date: 2023-07-20

    Application number: US18188524

    Application date: 2023-03-23

    Applicant: Google LLC

    CPC classification number: G10L13/02 G06N3/08 G10L21/10 G10L25/30 H04L51/02

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for end to end speech conversion are disclosed. In one aspect, a method includes the actions of receiving first audio data of a first utterance of one or more first terms spoken by a user. The actions further include providing the first audio data as an input to a model that is configured to receive first given audio data in a first voice and output second given audio data in a synthesized voice without performing speech recognition on the first given audio data. The actions further include receiving second audio data of a second utterance of the one or more first terms spoken in the synthesized voice. The actions further include providing, for output, the second audio data of the second utterance of the one or more first terms spoken in the synthesized voice.

    SPEECH RECOGNITION
    Invention publication
    Status: under examination, published

    Publication number: US20230169983A1

    Publication date: 2023-06-01

    Application number: US18159601

    Application date: 2023-01-25

    Applicant: Google LLC

    Abstract: A method includes receiving acoustic features of a first utterance spoken by a first user that speaks with typical speech and processing the acoustic features of the first utterance using a general speech recognizer to generate a first transcription of the first utterance. The operations also include analyzing the first transcription of the first utterance to identify one or more bias terms in the first transcription and biasing the alternative speech recognizer on the one or more bias terms identified in the first transcription. The operations also include receiving acoustic features of a second utterance spoken by a second user that speaks with atypical speech and processing, using the alternative speech recognizer biased on the one or more terms identified in the first transcription, the acoustic features of the second utterance to generate a second transcription of the second utterance.
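    The two-recognizer flow described here (transcribe the typical speaker with a general recognizer, mine bias terms from that transcription, then bias an alternative recognizer for the atypical speaker) is sketched below. The term-extraction rule and the "recognizer" are hypothetical toys; the patent does not specify these implementations.

```python
def extract_bias_terms(transcription, vocabulary):
    # Illustrative rule: bias toward any word from a known rare-word vocabulary
    # that appears in the first transcription.
    return [w for w in transcription.split() if w in vocabulary]

def biased_recognize(hypotheses, bias_terms):
    # Toy alternative recognizer: among candidate transcriptions, prefer one
    # containing a bias term; otherwise fall back to the top hypothesis.
    for hyp in hypotheses:
        if any(term in hyp for term in bias_terms):
            return hyp
    return hypotheses[0]

rare_words = {"propranolol"}
first_transcription = "did you take the propranolol today"   # typical speaker
terms = extract_bias_terms(first_transcription, rare_words)
second = biased_recognize(["proper nold", "propranolol"], terms)  # atypical speaker
```

The design intuition: the typical speaker's context supplies rare terms the atypical speaker is likely to repeat, so the biased recognizer can recover words it would otherwise misrecognize.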

    Training Speech Synthesis to Generate Distinct Speech Sounds

    Publication number: US20230009613A1

    Publication date: 2023-01-12

    Application number: US17756995

    Application date: 2019-12-13

    Applicant: Google LLC

    Abstract: A method (800) of training a text-to-speech (TTS) model (108) includes obtaining training data (150) including reference input text (104) that includes a sequence of characters, a sequence of reference audio features (402) representative of the sequence of characters, and a sequence of reference phone labels (502) representative of distinct speech sounds of the reference audio features. For each of a plurality of time steps, the method includes generating a corresponding predicted audio feature (120) based on a respective portion of the reference input text for the time step and generating, using a phone label mapping network (510), a corresponding predicted phone label (520) associated with the predicted audio feature. The method also includes aligning the predicted phone label with the reference phone label to determine a corresponding predicted phone label loss (622) and updating the TTS model based on the corresponding predicted phone label loss.
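    The phone-label loss in this abstract penalizes predicted phone labels that disagree with the reference labels after alignment. A minimal sketch, assuming a trivial one-to-one alignment and a mismatch-rate loss (the patent's actual alignment and loss are not specified here):

```python
def phone_label_loss(predicted, reference):
    """Fraction of predicted phone labels that disagree with the reference
    labels, under a toy one-to-one alignment (illustrative only)."""
    assert len(predicted) == len(reference)
    return sum(p != r for p, r in zip(predicted, reference)) / len(reference)

# "hello": predicted phones vs. reference phones, one mismatch out of four.
loss = phone_label_loss(["h", "eh", "l", "ow"], ["h", "ah", "l", "ow"])
```

Updating the TTS model on this extra loss, alongside the usual audio-feature loss, pushes it to produce acoustically distinct speech sounds rather than merely plausible audio.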

    FACTOR GRAPH FOR SEMANTIC PARSING
    Invention application

    Publication number: US20190244610A1

    Publication date: 2019-08-08

    Application number: US16257856

    Application date: 2019-01-25

    Applicant: Google LLC

    CPC classification number: G10L15/22 G10L2015/223

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating expressions associated with voice commands. The methods, systems, and apparatus include actions of obtaining segments of one or more expressions associated with a voice command. Further actions include combining the segments into a candidate expression and scoring the candidate expression using a text corpus. Additional actions include selecting the candidate expression as an expression associated with the voice command based on the scoring of the candidate expression.
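    The combine-then-score loop this abstract describes can be illustrated directly: enumerate candidate expressions from segment groups, score each against a text corpus, and keep the best. The corpus score below (count of corpus sentences containing the candidate) is a hypothetical stand-in for whatever scoring the factor graph actually uses.

```python
from itertools import product

def candidate_expressions(segment_groups):
    # Combine one segment from each group into a candidate expression.
    return [" ".join(parts) for parts in product(*segment_groups)]

def corpus_score(candidate, corpus):
    # Toy score: number of corpus sentences that contain the candidate.
    return sum(candidate in sentence for sentence in corpus)

groups = [["turn", "switch"], ["on the lights"]]
corpus = ["please turn on the lights", "turn on the lights now"]
best = max(candidate_expressions(groups), key=lambda c: corpus_score(c, corpus))
```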

    Sub-models for neural contextual biasing

    Publication number: US12230258B2

    Publication date: 2025-02-18

    Application number: US17659836

    Application date: 2022-04-19

    Applicant: Google LLC

    Abstract: A method for contextual biasing for speech recognition includes obtaining a base automatic speech recognition (ASR) model trained on non-biased data and a sub-model trained on biased data representative of a particular domain. The method includes receiving a speech recognition request including audio data characterizing an utterance captured in streaming audio. The method further includes determining whether the speech recognition request includes a contextual indicator indicating the particular domain. When the speech recognition request does not include the contextual indicator, the method includes generating, using the base ASR model, a first speech recognition result of the utterance by processing the audio data. When the speech recognition request includes the contextual indicator the method includes biasing, using the sub-model, the base ASR model toward the particular domain and generating, using the biased base ASR model, a second speech recognition result of the utterance by processing the audio data.
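    The dispatch logic in this abstract is simple to state in code: if the request carries no contextual indicator, run the base ASR model as-is; if it does, wrap the base model with the domain sub-model first. The sketch below assumes a trivial rescoring-table sub-model and a lowercasing "base ASR"; both are invented for illustration.

```python
class SubModel:
    """Domain sub-model that biases a base recognizer (illustrative only)."""
    def __init__(self, lexicon):
        self.lexicon = lexicon  # hypothetical domain-term rescoring table

    def bias(self, base_asr):
        def biased(audio):
            hypothesis = base_asr(audio)
            # Rescore toward the domain: prefer in-domain spellings.
            return self.lexicon.get(hypothesis, hypothesis)
        return biased

def recognize(request, base_asr, sub_model):
    if request.get("context") is None:
        return base_asr(request["audio"])                  # first result: unbiased
    return sub_model.bias(base_asr)(request["audio"])      # second result: biased

base = lambda audio: audio.lower()                          # toy base ASR model
medical = SubModel({"laser t k": "LASIK"})
plain_result  = recognize({"audio": "Laser T K"}, base, medical)
biased_result = recognize({"audio": "Laser T K", "context": "medical"}, base, medical)
```

Keeping the base model untouched and attaching the bias only per-request is what lets one ASR model serve both general and domain-specific traffic.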

    Using non-parallel voice conversion for speech conversion models

    Publication number: US12190862B2

    Publication date: 2025-01-07

    Application number: US17660487

    Application date: 2022-04-25

    Applicant: Google LLC

    Abstract: A method includes receiving a set of training utterances each including a non-synthetic speech representation of a corresponding utterance, and for each training utterance, generating a corresponding synthetic speech representation by using a voice conversion model. The non-synthetic speech representation and the synthetic speech representation form a corresponding training utterance pair. At each of a plurality of output steps for each training utterance pair, the method also includes generating, for output by a speech recognition model, a first probability distribution over possible non-synthetic speech recognition hypotheses for the non-synthetic speech representation and a second probability distribution over possible synthetic speech recognition hypotheses for the synthetic speech representation. The method also includes determining a consistent loss term for the corresponding training utterance pair based on the first and second probability distributions and updating parameters of the speech recognition model based on the consistent loss term.
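    The consistency term compares the recognizer's two hypothesis distributions: one for the non-synthetic audio and one for its voice-converted synthetic counterpart. As a minimal sketch, assuming a symmetric KL divergence as the comparison (the patent abstract does not name the exact divergence):

```python
import math

def consistency_loss(p_nonsynthetic, q_synthetic):
    """Symmetric KL divergence between the hypothesis distributions for the
    real and synthetic versions of the same utterance (illustrative choice)."""
    def kl(p, q):
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl(p_nonsynthetic, q_synthetic) + kl(q_synthetic, p_nonsynthetic)

# Identical distributions incur zero loss; divergent ones are penalized.
identical = consistency_loss([0.7, 0.3], [0.7, 0.3])
different = consistency_loss([0.9, 0.1], [0.5, 0.5])
```

Minimizing this term pushes the recognizer to treat synthetic and real speech of the same utterance the same way, which is what makes non-parallel voice-converted data useful for training.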
