Sub-models for Neural Contextual Biasing with Attention and Embedding Space

    公开(公告)号:US20240021190A1

    公开(公告)日:2024-01-18

    申请号:US17813322

    申请日:2022-07-18

    Applicant: Google LLC

    Abstract: A method for training a sub-model for contextual biasing for speech recognition includes obtaining a base speech recognition model trained on non-biased data. The method includes obtaining a set of training utterances representative of a particular domain, each training utterance in the set of training utterances including audio data characterizing the training utterances and a ground truth transcription of the training utterance. The method further includes, for each corresponding training utterance in the set of training utterances, determining, using an embedding encoder, a corresponding document embedding from the ground truth transcription of the corresponding training utterance. The method includes training, using the corresponding document embeddings determined from the ground truth transcriptions of the set of training utterances, a sub-model to bias the base speech recognition model to recognize speech in the particular domain.

    Scalable Model Specialization Framework for Speech Model Personalization

    公开(公告)号:US20230298574A1

    公开(公告)日:2023-09-21

    申请号:US18184630

    申请日:2023-03-15

    Applicant: Google LLC

    CPC classification number: G10L15/16 G10L15/063 G10L15/02 G10L2015/025

    Abstract: A method for speech conversion includes obtaining a speech conversion model configured to convert input utterances of human speech directly into corresponding output utterances of synthesized speech. The method further includes receiving a speech conversion request including input audio data corresponding to an utterance spoken by a target speaker associated with atypical speech and a speaker identifier uniquely identifying the target speaker. The method includes activating, using the speaker identifier, a particular sub-model for biasing the speech conversion model to recognize a type of the atypical speech associated with the target speaker identified by the speaker identifier. The method includes converting, using the speech conversion model biased by the activated particular sub-model, the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into output audio data corresponding to a synthesized canonical fluent speech representation of the utterance spoken by the target speaker.

    Speech recognition
    25.
    发明授权

    公开(公告)号:US11580994B2

    公开(公告)日:2023-02-14

    申请号:US17153495

    申请日:2021-01-20

    Applicant: Google LLC

    Abstract: A method includes receiving acoustic features of a first utterance spoken by a first user that speaks with typical speech and processing the acoustic features of the first utterance using a general speech recognizer to generate a first transcription of the first utterance. The operations also include analyzing the first transcription of the first utterance to identify one or more bias terms in the first transcription and biasing the alternative speech recognizer on the one or more bias terms identified in the first transcription. The operations also include receiving acoustic features of a second utterance spoken by a second user that speaks with atypical speech and processing, using the alternative speech recognizer biased on the one or more terms identified in the first transcription, the acoustic features of the second utterance to generate a second transcription of the second utterance.

    Speech recognition using log-linear model

    公开(公告)号:US10134394B2

    公开(公告)日:2018-11-20

    申请号:US14708465

    申请日:2015-05-11

    Applicant: GOOGLE LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, relating to generating log-linear models. In some implementations, n-gram parameter values derived from an n-gram language model are obtained. N-gram features for a log-linear language model are determined based on the n-grams corresponding to the obtained n-gram parameter values. A weight for each of the determined n-gram features is determined, where the weight is determined based on (i) an n-gram parameter value that is derived from the n-gram language model and that corresponds to a particular n-gram, and (ii) an n-gram parameter value that is derived from the n-gram language model and that corresponds to an n-gram that is a sub-sequence within the particular n-gram. A log-linear language model having the determined n-gram features is generated, where the determined n-gram features in the log-linear language model have weights that are initialized based on the determined weights.

    Conformer-based speech conversion model

    公开(公告)号:US12272348B2

    公开(公告)日:2025-04-08

    申请号:US17655030

    申请日:2022-03-16

    Applicant: Google LLC

    Abstract: A method for speech conversion includes receiving, as input to an encoder of a speech conversion model, an input spectrogram corresponding to an utterance, the encoder including a stack of self-attention blocks. The method further includes generating, as output from the encoder, an encoded spectrogram and receiving, as input to a spectrogram decoder of the speech conversion model, the encoded spectrogram generated as output from the encoder. The method further includes generating, as output from the spectrogram decoder, an output spectrogram corresponding to a synthesized speech representation of the utterance.

    SPEAKER EMBEDDINGS FOR IMPROVED AUTOMATIC SPEECH RECOGNITION

    公开(公告)号:US20250037700A1

    公开(公告)日:2025-01-30

    申请号:US18919366

    申请日:2024-10-17

    Applicant: Google LLC

    Abstract: A method includes receiving a reference audio signal corresponding to reference speech spoken by a target speaker with atypical speech, and generating, by a speaker embedding network configured to receive the reference audio signal as input, a speaker embedding for the target speaker. The speaker embedding conveys speaker characteristics of the target speaker. The method also includes receiving a speech conversion request that includes input audio data corresponding to an utterance spoken by the target speaker associated with the atypical speech. The method also includes biasing, using the speaker embedding generated for the target speaker by the speaker embedding network, a speech conversion model to convert the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into an output canonical representation of the utterance spoken by the target speaker.

    Speaker embeddings for improved automatic speech recognition

    公开(公告)号:US12136410B2

    公开(公告)日:2024-11-05

    申请号:US17661832

    申请日:2022-05-03

    Applicant: Google LLC

    Abstract: A method includes receiving a reference audio signal corresponding to reference speech spoken by a target speaker with atypical speech, and generating, by a speaker embedding network configured to receive the reference audio signal as input, a speaker embedding for the target speaker. The speaker embedding conveys speaker characteristics of the target speaker. The method also includes receiving a speech conversion request that includes input audio data corresponding to an utterance spoken by the target speaker associated with the atypical speech. The method also includes biasing, using the speaker embedding generated for the target speaker by the speaker embedding network, a speech conversion model to convert the input audio data corresponding to the utterance spoken by the target speaker associated with atypical speech into an output canonical representation of the utterance spoken by the target speaker.

    Using Non-Parallel Voice Conversion for Speech Conversion Models

    公开(公告)号:US20230298565A1

    公开(公告)日:2023-09-21

    申请号:US17660487

    申请日:2022-04-25

    Applicant: Google LLC

    Abstract: A method includes receiving a set of training utterances each including a non-synthetic speech representation of a corresponding utterance, and for each training utterance, generating a corresponding synthetic speech representation by using a voice conversion model. The non-synthetic speech representation and the synthetic speech representation form a corresponding training utterance pair. At each of a plurality of output steps for each training utterance pair, the method also includes generating, for output by a speech recognition model, a first probability distribution over possible non-synthetic speech recognition hypotheses for the non-synthetic speech representation and a second probability distribution over possible synthetic speech recognition hypotheses for the synthetic speech representation. The method also includes determining a consistent loss term for the corresponding training utterance pair based on the first and second probability distributions and updating parameters of the speech recognition model based on the consistent loss term.

Patent Agency Ranking