USING SPEECH RECOGNITION TO IMPROVE CROSS-LANGUAGE SPEECH SYNTHESIS

    Publication (Announcement) No.: EP4407605A3

    Publication (Announcement) Date: 2024-10-23

    Application No.: EP24182760.9

    Application Date: 2021-10-20

    Applicant: Google LLC

    Abstract: Disclosed herein is a computer-implemented method for training a speech recognition model, the method comprising: obtaining a multilingual text-to-speech (TTS) model; generating, using the multilingual TTS model, a native synthesized speech representation for an input text sequence in a first language that is conditioned on speaker characteristics of a native speaker of the first language; generating, using the multilingual TTS model, a cross-lingual synthesized speech representation for the input text sequence in the first language that is conditioned on speaker characteristics of a native speaker of a different second language; generating, using the speech recognition model, a first speech recognition result for the native synthesized speech representation and a second speech recognition result for the cross-lingual synthesized speech representation; determining a consistent loss term based on the first speech recognition result and the second speech recognition result; and updating parameters of the speech recognition model based on the consistent loss term.

    USING SPEECH RECOGNITION TO IMPROVE CROSS-LANGUAGE SPEECH SYNTHESIS

    Publication (Announcement) No.: EP4407605A2

    Publication (Announcement) Date: 2024-07-31

    Application No.: EP24182760.9

    Application Date: 2021-10-20

    Applicant: Google LLC

    CPC classification number: G10L15/16 G10L15/063 G10L13/02

    Abstract: Disclosed herein is a computer-implemented method for training a speech recognition model, the method comprising: obtaining a multilingual text-to-speech (TTS) model; generating, using the multilingual TTS model, a native synthesized speech representation for an input text sequence in a first language that is conditioned on speaker characteristics of a native speaker of the first language; generating, using the multilingual TTS model, a cross-lingual synthesized speech representation for the input text sequence in the first language that is conditioned on speaker characteristics of a native speaker of a different second language; generating, using the speech recognition model, a first speech recognition result for the native synthesized speech representation and a second speech recognition result for the cross-lingual synthesized speech representation; determining a consistent loss term based on the first speech recognition result and the second speech recognition result; and updating parameters of the speech recognition model based on the consistent loss term.
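
    A minimal sketch of the training step described in the abstract above, written in Python/PyTorch. The model objects, the speaker-embedding arguments, and the use of a mean-squared distance as the consistent loss term are assumptions for illustration only; the patent does not specify this implementation.

    # Hypothetical sketch of the consistency-based training step (not the patent's code).
    import torch

    def training_step(tts_model, asr_model, optimizer, text_seq,
                      native_speaker_emb, cross_lingual_speaker_emb):
        # Synthesize the same first-language text under two speaker conditions.
        native_speech = tts_model(text_seq, speaker=native_speaker_emb)
        cross_speech = tts_model(text_seq, speaker=cross_lingual_speaker_emb)

        # Run the speech recognition model on both synthesized representations.
        native_result = asr_model(native_speech)   # e.g. per-frame output distributions
        cross_result = asr_model(cross_speech)

        # Consistent loss term: penalize disagreement between the two recognition
        # results (a mean-squared distance is assumed here, with outputs assumed
        # to have matching shapes).
        consistent_loss = torch.mean((native_result - cross_result) ** 2)

        # Update the speech recognition model's parameters based on the loss.
        optimizer.zero_grad()
        consistent_loss.backward()
        optimizer.step()
        return consistent_loss.item()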

    MULTILINGUAL RE-SCORING MODELS FOR AUTOMATIC SPEECH RECOGNITION

    Publication (Announcement) No.: EP4488993A2

    Publication (Announcement) Date: 2025-01-08

    Application No.: EP24215267.6

    Application Date: 2022-03-22

    Applicant: GOOGLE LLC

    Abstract: A method (400) includes receiving a sequence of acoustic frames (110) extracted from audio data corresponding to an utterance (106). During a first pass (301), the method includes processing the sequence of acoustic frames to generate N candidate hypotheses (204) for the utterance. During a second pass (302), and for each candidate hypothesis, the method includes: generating a respective un-normalized likelihood score (325); generating a respective external language model score (315); generating a standalone score (205) that models prior statistics of the corresponding candidate hypothesis; and generating a respective overall score (355) for the candidate hypothesis based on the un-normalized likelihood score, the external language model score, and the standalone score. The method also includes selecting the candidate hypothesis having the highest respective overall score from among the N candidate hypotheses as a final transcription (120) of the utterance.
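
    A minimal sketch of the second-pass rescoring described in the abstract above, in Python. The scoring callables and the interpolation weights are placeholders; the patent does not prescribe this particular combination.

    # Hypothetical sketch of second-pass rescoring (not the patent's code).
    from typing import Callable, List

    def rescore(candidates: List[str],
                likelihood_score: Callable[[str], float],   # un-normalized first-pass score
                external_lm_score: Callable[[str], float],  # external language model score
                standalone_score: Callable[[str], float],   # prior statistics of the hypothesis
                lm_weight: float = 0.5,
                prior_weight: float = 0.5) -> str:
        """Return the candidate hypothesis with the highest overall score."""
        best_hyp, best_score = None, float("-inf")
        for hyp in candidates:
            overall = (likelihood_score(hyp)
                       + lm_weight * external_lm_score(hyp)
                       + prior_weight * standalone_score(hyp))
            if overall > best_score:
                best_hyp, best_score = hyp, overall
        return best_hyp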

    LANGUAGE-AGNOSTIC MULTILINGUAL MODELING USING EFFECTIVE SCRIPT NORMALIZATION

    Publication (Announcement) No.: EP4361897A3

    Publication (Announcement) Date: 2024-07-17

    Application No.: EP24162746.2

    Application Date: 2021-01-19

    Applicant: GOOGLE LLC

    Abstract: A method (600) includes obtaining a plurality of training data sets (202), each associated with a respective native language and including a plurality of respective training data samples (204). For each respective training data sample of each training data set in the respective native language, the method includes transliterating the corresponding transcription in the respective native script into corresponding transliterated text (121) representing the respective native language of the corresponding audio in a target script, and associating the corresponding transliterated text in the target script with the corresponding audio (210) in the respective native language to generate a respective normalized training data sample (240). The method also includes training, using the normalized training data samples, a multilingual model (300) to predict speech recognition results (120) in the target script for corresponding speech utterances (106) spoken in any of the different native languages associated with the plurality of training data sets.
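
    A minimal sketch of the script-normalization step described in the abstract above, in Python. The transliterate() helper, the data layout, and the choice of target script are assumptions for illustration; they are not taken from the patent.

    # Hypothetical sketch of building script-normalized training samples (not the patent's code).
    from typing import Callable, Dict, List

    def normalize_training_sets(training_sets: Dict[str, List[dict]],
                                transliterate: Callable[[str, str, str], str],
                                target_script: str = "Latn") -> List[dict]:
        """Pair each audio sample with its transcription transliterated into the target script."""
        normalized = []
        for native_language, samples in training_sets.items():
            for sample in samples:
                transliterated = transliterate(sample["transcription"],
                                               native_language, target_script)
                # Keep the original audio; associate it with the transliterated text.
                normalized.append({"audio": sample["audio"],
                                   "text": transliterated,
                                   "language": native_language})
        return normalized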
