-
公开(公告)号:US20220310081A1
公开(公告)日:2022-09-29
申请号:US17701635
申请日:2022-03-22
Applicant: Google LLC
Inventor: Neeraj Gaur , Tongzhou Chen , Ehsan Variani , Bhuvana Ramabhadran , Parisa Haghani , Pedro J. Moreno Mengibar
IPC: G10L15/197 , G10L15/16 , G10L15/22 , G10L15/00
Abstract: A method includes receiving a sequence of acoustic frames extracted from audio data corresponding to an utterance. During a first pass, the method includes processing the sequence of acoustic frames to generate N candidate hypotheses for the utterance. During a second pass, and for each candidate hypothesis, the method includes generating a respective un-normalized likelihood score; generating a respective external language model score; generating a standalone score that models prior statistics of the corresponding candidate hypothesis, and generating a respective overall score for the candidate hypothesis based on the un-normalized likelihood score, the external language model score, and the standalone score. The method also includes selecting the candidate hypothesis having the highest respective overall score from among the N candidate hypotheses as a final transcription of the utterance.
-
公开(公告)号:US20240420692A1
公开(公告)日:2024-12-19
申请号:US18818010
申请日:2024-08-28
Applicant: Google LLC
Inventor: Neeraj Gaur , Tongzhou Chen , Ehsan Variani , Bhuvana Ramabhadran , Parisa Haghani , Pedro J. Moreno Mengibar
IPC: G10L15/197 , G10L15/00 , G10L15/16 , G10L15/22
Abstract: A method includes receiving a sequence of acoustic frames extracted from audio data corresponding to an utterance. During a first pass, the method includes processing the sequence of acoustic frames to generate N candidate hypotheses for the utterance. During a second pass, and for each candidate hypothesis, the method includes: generating a respective un-normalized likelihood score; generating a respective external language model score; generating a standalone score that models prior statistics of the corresponding candidate hypothesis; and generating a respective overall score for the candidate hypothesis based on the un-normalized likelihood score, the external language model score, and the standalone score. The method also includes selecting the candidate hypothesis having the highest respective overall score from among the N candidate hypotheses as a final transcription of the utterance.
-
3.
公开(公告)号:US20250022458A1
公开(公告)日:2025-01-16
申请号:US18896830
申请日:2024-09-25
Applicant: Google LLC
Inventor: Kartik Audhkhasi , Bhuvana Ramabhadran , Tongzhou Chen , Pedro J. Moreno Mengibar
IPC: G10L15/16 , G06F1/03 , G06N3/04 , G06N3/0455 , G10L19/16
Abstract: A method for an automated speech recognition (ASR) model for unifying streaming and non-streaming speech recognition including receiving a sequence of acoustic frames. The method includes generating, using an audio encoder of an automatic speech recognition (ASR) model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method further includes generating, using a joint encoder of the ASR model, a probability distribution over possible speech recognition hypothesis at the corresponding time step based on the higher order feature representation generated by the audio encoder at the corresponding time step. The audio encoder comprises a neural network that applies mixture model (MiMo) attention to compute an attention probability distribution function (PDF) using a set of mixture components of softmaxes over a context window.
-
4.
公开(公告)号:US20220310074A1
公开(公告)日:2022-09-29
申请号:US17644344
申请日:2021-12-15
Applicant: Google LLC
Inventor: Kartik Audhkhasi , Bhuvana Ramabhadran , Tongzhou Chen , Pedro J. Moreno Mengibar
Abstract: A method for an automated speech recognition (ASR) model for unifying streaming and non-streaming speech recognition including receiving a sequence of acoustic frames. The method includes generating, using an audio encoder of an automatic speech recognition (ASR) model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method further includes generating, using a joint encoder of the ASR model, a probability distribution over possible speech recognition hypothesis at the corresponding time step based on the higher order feature representation generated by the audio encoder at the corresponding time step. The audio encoder comprises a neural network that applies mixture model (MiMo) attention to compute an attention probability distribution function (PDF) using a set of mixture components of softmaxes over a context window.
-
公开(公告)号:US12254875B2
公开(公告)日:2025-03-18
申请号:US18589220
申请日:2024-02-27
Applicant: Google LLC
Inventor: Neeraj Gaur , Tongzhou Chen , Ehsan Variani , Bhuvana Ramabhadran , Parisa Haghani , Pedro J. Moreno Mengibar
IPC: G10L15/197 , G10L15/00 , G10L15/16 , G10L15/22
Abstract: A method includes receiving a sequence of acoustic frames extracted from audio data corresponding to an utterance. During a first pass, the method includes processing the sequence of acoustic frames to generate N candidate hypotheses for the utterance. During a second pass, and for each candidate hypothesis, the method includes: generating a respective un-normalized likelihood score; generating a respective external language model score; generating a standalone score that models prior statistics of the corresponding candidate hypothesis; and generating a respective overall score for the candidate hypothesis based on the un-normalized likelihood score, the external language model score, and the standalone score. The method also includes selecting the candidate hypothesis having the highest respective overall score from among the N candidate hypotheses as a final transcription of the utterance.
-
公开(公告)号:US20230298570A1
公开(公告)日:2023-09-21
申请号:US18187222
申请日:2023-03-21
Applicant: Google LLC
Inventor: Weiran Wang , Tongzhou Chen , Tara N. Sainath , Ehsan Variani , Rohit Prakash Prabhavalkar , Ronny Huang , Bhuvana Ramabhadran , Neeraj Gaur , Sepand Mavandadi , Charles Caleb Peyser , Trevor Strohman , Yangzhang He , David Rybach
CPC classification number: G10L15/063 , G10L15/19 , G10L15/22 , G10L15/16 , G10L15/02
Abstract: A method includes generating, using an audio encoder, a higher-order feature representation for each acoustic frame in a sequence of acoustic frames; generating, using a decoder, based on the higher-order feature representation, a plurality of speech recognition hypotheses, each hypotheses corresponding to a candidate transcription of an utterance and having an associated first likelihood score; generating, using an external language model, for each speech recognition hypothesis, a second likelihood score; determining, using a learnable fusion module, for each speech recognition hypothesis, a set of fusion weights based on the higher-order feature representation and the speech recognition hypothesis; and generating, using the learnable fusion module, for each speech recognition hypothesis, a third likelihood score based on the first likelihood score, the second likelihood score, and the set of fusion weights, the audio encoder and decoder trained using minimum additive error rate training in the presence of the external language model.
-
7.
公开(公告)号:US20220310073A1
公开(公告)日:2022-09-29
申请号:US17644343
申请日:2021-12-15
Applicant: Google LLC
Inventor: Kartik Audhkhasi , Bhuvana Ramabhadran , Tongzhou Chen , Pedro J. Moreno Mengibar
IPC: G10L15/16
Abstract: A method for an automated speech recognition (ASR) model for unifying streaming and non-streaming speech recognition including receiving a sequence of acoustic frames. The method includes generating, using an audio encoder of an automatic speech recognition (ASR) model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method further includes generating, using a joint encoder of the ASR model, a probability distribution over possible speech recognition hypothesis at the corresponding time step based on the higher order feature representation generated by the audio encoder at the corresponding time step. The audio encoder comprises a neural network that applies mixture model (MiMo) attention to compute an attention probability distribution function (PDF) using a set of mixture components of softmaxes over a context window.
-
8.
公开(公告)号:US12136415B2
公开(公告)日:2024-11-05
申请号:US17644343
申请日:2021-12-15
Applicant: Google LLC
Inventor: Kartik Audhkhasi , Bhuvana Ramabhadran , Tongzhou Chen , Pedro J. Moreno Mengibar
IPC: G10L15/16 , G06F1/03 , G06N3/04 , G06N3/0455 , G10L19/16
Abstract: A method for an automated speech recognition (ASR) model for unifying streaming and non-streaming speech recognition including receiving a sequence of acoustic frames. The method includes generating, using an audio encoder of an automatic speech recognition (ASR) model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method further includes generating, using a joint encoder of the ASR model, a probability distribution over possible speech recognition hypothesis at the corresponding time step based on the higher order feature representation generated by the audio encoder at the corresponding time step. The audio encoder comprises a neural network that applies mixture model (MiMo) attention to compute an attention probability distribution function (PDF) using a set of mixture components of softmaxes over a context window.
-
公开(公告)号:US12080283B2
公开(公告)日:2024-09-03
申请号:US17701635
申请日:2022-03-22
Applicant: Google LLC
Inventor: Neeraj Gaur , Tongzhou Chen , Ehsan Variani , Bhuvana Ramabhadran , Parisa Haghani , Pedro J. Moreno Mengibar
IPC: G10L15/197 , G10L15/00 , G10L15/16 , G10L15/22
CPC classification number: G10L15/197 , G10L15/005 , G10L15/16 , G10L15/22
Abstract: A method includes receiving a sequence of acoustic frames extracted from audio data corresponding to an utterance. During a first pass, the method includes processing the sequence of acoustic frames to generate N candidate hypotheses for the utterance. During a second pass, and for each candidate hypothesis, the method includes: generating a respective un-normalized likelihood score; generating a respective external language model score; generating a standalone score that models prior statistics of the corresponding candidate hypothesis; and generating a respective overall score for the candidate hypothesis based on the un-normalized likelihood score, the external language model score, and the standalone score. The method also includes selecting the candidate hypothesis having the highest respective overall score from among the N candidate hypotheses as a final transcription of the utterance.
-
公开(公告)号:US20240203409A1
公开(公告)日:2024-06-20
申请号:US18589220
申请日:2024-02-27
Applicant: Google LLC
Inventor: Neeraj Gaur , Tongzhou Chen , Ehsan Variani , Bhuvana Ramabhadran , Parisa Haghani , Pedro J. Moreno Mengibar
IPC: G10L15/197 , G10L15/00 , G10L15/16 , G10L15/22
CPC classification number: G10L15/197 , G10L15/005 , G10L15/16 , G10L15/22
Abstract: A method includes receiving a sequence of acoustic frames extracted from audio data corresponding to an utterance. During a first pass, the method includes processing the sequence of acoustic frames to generate N candidate hypotheses for the utterance. During a second pass, and for each candidate hypothesis, the method includes: generating a respective un-normalized likelihood score; generating a respective external language model score; generating a standalone score that models prior statistics of the corresponding candidate hypothesis; and generating a respective overall score for the candidate hypothesis based on the un-normalized likelihood score, the external language model score, and the standalone score. The method also includes selecting the candidate hypothesis having the highest respective overall score from among the N candidate hypotheses as a final transcription of the utterance.
-
-
-
-
-
-
-
-
-