Patent search ap:("GOOGLE LLC") AND inv:"Arun Narayanan" Page 2

11.

发明申请
CONTEXT AWARE BEAMFORMING OF AUDIO DATA 有权

公开(公告)号：US20220319498A1

公开(公告)日：2022-10-06

申请号：US17221220

申请日：2021-04-02

Applicant: Google LLC

Inventor： Joseph Caroselli, JR. , Yiteng Huang , Arun Narayanan

IPC: G10L15/08 , G10L21/0216 , G10L15/05 , G06N20/00

Abstract: Implementations disclosed herein are directed to initializing and utilizing a beamformer in processing of audio data received at a computing device. The computing device can: receive audio data that captures a spoken utterance of a user, determine that a first audio data segment of the audio data includes one or more particular words or phrases; obtain a preceding audio data segment that precedes the first audio data segment; estimate a spatial correlation matrix based on the first audio data segment and based on the preceding audio data segment; initialize the beamformer based on the estimated spatial correlation matrix; and cause the initialized beamformer to be utilized in processing of at least a second audio data segment of the audio data. Additionally, or alternatively, the computing device can transmit the spatial correlation matrix to server(s), and the server(s) can transmit the initialized beamformer back to the computing device.

12.

发明申请
TWO-PASS END TO END SPEECH RECOGNITION 有权

公开(公告)号：US20220238101A1

公开(公告)日：2022-07-28

申请号：US17616135

申请日：2020-12-03

Applicant: GOOGLE LLC

Inventor： Tara N. Sainath , Yanzhang He , Bo Li , Arun Narayanan , Ruoming Pang , Antoine Jean Bruguier , Shuo-yiin Chang , Wei Li

IPC: G10L15/16 , G10L15/05 , G10L15/22 , G10L15/06 , G06N3/08

Abstract: Two-pass automatic speech recognition (ASR) models can be used to perform streaming on-device ASR to generate a text representation of an utterance captured in audio data. Various implementations include a first-pass portion of the ASR model used to generate streaming candidate recognition(s) of an utterance captured in audio data. For example, the first-pass portion can include a recurrent neural network transformer (RNN-T) decoder. Various implementations include a second-pass portion of the ASR model used to revise the streaming candidate recognition(s) of the utterance and generate a text representation of the utterance. For example, the second-pass portion can include a listen attend spell (LAS) decoder. Various implementations include a shared encoder shared between the RNN-T decoder and the LAS decoder.

13.

发明申请
Fast Emit Low-latency Streaming ASR with Sequence-level Emission Regularization 有权

公开(公告)号：US20220122586A1

公开(公告)日：2022-04-21

申请号：US17447285

申请日：2021-09-09

Applicant: Google LLC

Inventor： Jiahui Yu , Chung-cheng Chiu , Bo Li , Shuo-yiin Chang , Tara Sainath , Wei Han , Anmol Gulati , Yanzhang He , Arun Narayanan , Yonghui Wu , Ruoming Pang

IPC: G10L15/06 , G10L15/22 , G10L15/30 , G10L15/16

Abstract: A computer-implemented method of training a streaming speech recognition model that includes receiving, as input to the streaming speech recognition model, a sequence of acoustic frames. The streaming speech recognition model is configured to learn an alignment probability between the sequence of acoustic frames and an output sequence of vocabulary tokens. The vocabulary tokens include a plurality of label tokens and a blank token. At each output step, the method includes determining a first probability of emitting one of the label tokens and determining a second probability of emitting the blank token. The method also includes generating the alignment probability at a sequence level based on the first probability and the second probability. The method also includes applying a tuning parameter to the alignment probability at the sequence level to maximize the first probability of emitting one of the label tokens.

14.

发明申请
ENHANCED MULTI-CHANNEL ACOUSTIC MODELS 审中-公开

公开(公告)号：US20190259409A1

公开(公告)日：2019-08-22

申请号：US16278830

申请日：2019-02-19

Applicant: Google LLC

Inventor： Ehsan Variani , Kevin William Wilson , Ron J. Weiss , Tara N. Sainath , Arun Narayanan

IPC: G10L25/30 , G10L21/028 , G10L19/008 , G10L15/20 , G10L15/16 , G10L21/0388

Abstract: This specification describes computer-implemented methods and systems. One method includes receiving, by a neural network of a speech recognition system, first data representing a first raw audio signal and second data representing a second raw audio signal. The first raw audio signal and the second raw audio signal describe audio occurring at a same period of time. The method further includes generating, by a spatial filtering layer of the neural network, a spatial filtered output using the first data and the second data, and generating, by a spectral filtering layer of the neural network, a spectral filtered output using the spatial filtered output. Generating the spectral filtered output comprises processing frequency-domain data representing the spatial filtered output. The method still further includes processing, by one or more additional layers of the neural network, the spectral filtered output to predict sub-word units encoded in both the first raw audio signal and the second raw audio signal.

15.

发明申请
Joint Acoustic Echo Cancelation, Speech Enhancement, and Voice Separation for Automatic Speech Recognition 有权

公开(公告)号：US20250029624A1

公开(公告)日：2025-01-23

申请号：US18906761

申请日：2024-10-04

Applicant: Google LLC

Inventor： Arun Narayanan , Tom O'malley , Quan Wang , Alex Park , James Walker , Nathan David Howard , Yanzhang He , Chung-Cheng Chiu

IPC: G10L21/0216 , G06N3/04 , G10L15/06 , G10L21/0208 , H04R3/04

Abstract: A method for automatic speech recognition using joint acoustic echo cancellation, speech enhancement, and voice separation includes receiving, at a contextual frontend processing model, input speech features corresponding to a target utterance. The method also includes receiving, at the contextual frontend processing model, at least one of a reference audio signal, a contextual noise signal including noise prior to the target utterance, or a speaker embedding including voice characteristics of a target speaker that spoke the target utterance. The method further includes processing, using the contextual frontend processing model, the input speech features and the at least one of the reference audio signal, the contextual noise signal, or the speaker embedding vector to generate enhanced speech features.

16.

发明申请
Transducer-Based Streaming Deliberation for Cascaded Encoders 有权

公开(公告)号：US20240428786A1

公开(公告)日：2024-12-26

申请号：US18826655

申请日：2024-09-06

Applicant: Google LLC

Inventor： Ke Hu , Tara N. Sainath , Arun Narayanan , Ruoming Pang , Trevor Strohman

IPC: G10L15/197 , G06F40/126 , G10L15/02 , G10L15/06 , G10L15/08 , G10L15/22

Abstract: A method includes receiving a sequence of acoustic frames and generating, by a first encoder, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by a first pass transducer decoder, a first pass speech recognition hypothesis for a corresponding first higher order feature representation and generating, by a text encoder, a text encoding for a corresponding first pass speech recognition hypothesis. The method also includes generating, by a second encoder, a second higher order feature representation for a corresponding first higher order feature representation. The method also includes generating, by a second pass transducer decoder, a second pass speech recognition hypothesis using a corresponding second higher order feature representation and a corresponding text encoding.

17.

发明公开
Microphone Array Configuration Invariant, Streaming, Multichannel Neural Enhancement Frontend for Automatic Speech Recognition 审中-公开

公开(公告)号：US20230298612A1

公开(公告)日：2023-09-21

申请号：US18171411

申请日：2023-02-20

Applicant: Google LLC

Inventor： Joseph Caroselli , Arun Narayanan , Tom O'malley

IPC: G10L21/0232 , G10L25/30 , H04S3/00 , G10L15/22 , G10L15/06 , G10L15/16 , G10L25/18

CPC classification number: G10L21/0232 , G10L25/30 , H04S3/008 , G10L15/22 , G10L15/063 , G10L15/16 , G10L25/18 , H04S2400/01 , G10L2021/02082

Abstract: A multichannel neural frontend speech enhancement model for speech recognition includes a speech cleaner, a stack of self-attention blocks each having a multi-headed self attention mechanism, and a masking layer. The speech cleaner receives, as input, a multichannel noisy input signal and a multichannel contextual noise signal, and generates, as output, a single channel cleaned input signal. The stack of self-attention blocks receives, as input, at an initial block of the stack of self-attention blocks, a stacked input including the single channel cleaned input signal and a single channel noisy input signal, and generates, as output, from a final block of the stack of self-attention blocks, an un-masked output. The masking layer receives, as input, the single channel noisy input signal and the un-masked output, and generates, as output, enhanced input speech features corresponding to a target utterance.

18.

发明公开
Generalized Automatic Speech Recognition for Joint Acoustic Echo Cancellation, Speech Enhancement, and Voice Separation 审中-公开

公开(公告)号：US20230298609A1

公开(公告)日：2023-09-21

申请号：US18171368

申请日：2023-02-19

Applicant: Google LLC

Inventor： Tom O'Malley , Quan Wang , Arun Narayanan

IPC: G10L21/0208 , G10L15/06

CPC classification number: G10L21/0208 , G10L15/063 , G10L2021/02082

Abstract: A method for training a generalized automatic speech recognition model for joint acoustic echo cancellation, speech enhancement, and voice separation includes receiving a plurality of training utterances paired with corresponding training contextual signals. The training contextual signals include a training contextual noise signal including noise prior to the corresponding training utterance, a training reference audio signal, and a training speaker vector including voice characteristics of a target speaker that spoke the corresponding training utterance. The operations also include training, using a contextual signal dropout strategy, a contextual frontend processing model on the training utterances to learn how to predict enhanced speech features. Here, the contextual signal dropout strategy uses a predetermined probability to drop out each of the training contextual signals during training of the contextual frontend processing model.

19.

发明申请
AUTOMATED CALLING SYSTEM 有权

公开(公告)号：US20230038343A1

公开(公告)日：2023-02-09

申请号：US17964141

申请日：2022-10-12

Applicant: GOOGLE LLC

Inventor： Asaf Aharoni , Arun Narayanan , Nir Shabat , Parisa Haghani , Galen Tsai Chuang , Yaniv Leviathan , Neeraj Gaur , Pedro J. Moreno Mengibar , Rohit Prakash Prabhavalkar , Zhongdi Qu , Austin Severn Waters , Tomer Amiaz , Michiel A.U. Bacchiani

IPC: G10L15/26 , H04M3/428 , H04M1/663 , G10L15/32 , H04M3/51 , H04M1/02

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for an automated calling system are disclosed. In one aspect, a method includes the actions of receiving audio data of an utterance spoken by a user who is having a telephone conversation with a bot. The actions further include determining a context of the telephone conversation. The actions further include determining a user intent of a first previous portion of the telephone conversation spoken by the user and a bot intent of a second previous portion of the telephone conversation outputted by a speech synthesizer of the bot. The actions further include, based on the audio data of the utterance, the context of the telephone conversation, the user intent, and the bot intent, generating synthesized speech of a reply by the bot to the utterance. The actions further include, providing, for output, the synthesized speech.

20.

发明授权
Automated calling system 有权

公开(公告)号：US11495233B2

公开(公告)日：2022-11-08

申请号：US17505913

申请日：2021-10-20

Applicant: GOOGLE LLC

Inventor： Asaf Aharoni , Arun Narayanan , Nir Shabat , Parisa Haghani , Galen Tsai Chuang , Yaniv Leviathan , Neeraj Gaur , Pedro J. Moreno Mengibar , Rohit Prakash Prabhavalkar , Zhongdi Qu , Austin Severn Waters , Tomer Amiaz , Michiel A.U. Bacchiani

IPC: G10L15/26 , G10L15/32 , H04M3/51 , H04M1/663 , H04M3/428 , H04M1/02

Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for an automated calling system are disclosed. In one aspect, a method includes the actions of receiving audio data of an utterance spoken by a user who is having a telephone conversation with a bot. The actions further include determining a context of the telephone conversation. The actions further include determining a user intent of a first previous portion of the telephone conversation spoken by the user and a bot intent of a second previous portion of the telephone conversation outputted by a speech synthesizer of the bot. The actions further include, based on the audio data of the utterance, the context of the telephone conversation, the user intent, and the bot intent, generating synthesized speech of a reply by the bot to the utterance. The actions further include, providing, for output, the synthesized speech.

Search Results

Country/Region

Patent validity

Application date

Publication (announcement) day

applicant

The country/region where the applicant is located

Inventor

IPC

IPC Department

IPC class

IPC subclass

IPC group

IPC team

Appearance classification