-
Publication No.: US20230186901A1
Publication Date: 2023-06-15
Application No.: US18167454
Filing Date: 2023-02-10
Applicant: Google LLC
Inventor: Tara N. Sainath , Ruoming Pang , Ron Weiss , Yanzhang He , Chung-Cheng Chiu , Trevor Strohman
IPC: G10L15/06 , G06N3/08 , G10L15/16 , G10L15/197
CPC classification number: G10L15/063 , G06N3/08 , G10L15/16 , G10L15/197 , G10L2015/0635
Abstract: A method includes receiving a training example for a listen-attend-spell (LAS) decoder of a two-pass streaming neural network model and determining whether the training example corresponds to a supervised audio-text pair or an unpaired text sequence. When the training example corresponds to an unpaired text sequence, the method also includes determining a cross entropy loss based on a log probability associated with a context vector of the training example. The method also includes updating the LAS decoder and the context vector based on the determined cross entropy loss.
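The unpaired-text branch described in the abstract reduces to a standard cross-entropy computation over the decoder's per-step log probabilities. The sketch below illustrates only that loss computation with hypothetical names; it is not the patented two-pass model, and the "context vector" conditioning is assumed to already be folded into the log probabilities:

```python
import numpy as np

def cross_entropy_from_log_probs(log_probs, target_ids):
    """Cross entropy over a text sequence given per-step log probabilities.

    log_probs: (T, V) array of log-softmax decoder outputs (assumed to be
               conditioned on the context vector); target_ids: T token ids.
    """
    steps = np.arange(len(target_ids))
    # Negative mean log probability of the reference tokens.
    return -np.mean(log_probs[steps, target_ids])

# Toy example: vocabulary of 3 tokens, 2 decoding steps.
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 1.5, 0.3]])
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = cross_entropy_from_log_probs(log_probs, np.array([0, 1]))
```

In training, the gradient of this loss with respect to the decoder parameters and the context vector would drive the update step the abstract describes.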
-
Publication No.: US20210225362A1
Publication Date: 2021-07-22
Application No.: US17155010
Filing Date: 2021-01-21
Applicant: Google LLC
Inventor: Tara N. Sainath , Ruoming Pang , Ron Weiss , Yanzhang He , Chung-Cheng Chiu , Trevor Strohman
IPC: G10L15/06 , G10L15/16 , G10L15/197 , G06N3/08
Abstract: A method includes receiving a training example for a listen-attend-spell (LAS) decoder of a two-pass streaming neural network model and determining whether the training example corresponds to a supervised audio-text pair or an unpaired text sequence. When the training example corresponds to an unpaired text sequence, the method also includes determining a cross entropy loss based on a log probability associated with a context vector of the training example. The method also includes updating the LAS decoder and the context vector based on the determined cross entropy loss.
-
Publication No.: US10210884B2
Publication Date: 2019-02-19
Application No.: US15600321
Filing Date: 2017-05-19
Applicant: Google LLC
Inventor: Richard Francis Lyon , Ron Weiss , Thomas Chadwick Walters
IPC: G10L21/02 , G10L25/51 , G10L21/028 , G10L21/0208 , G10L21/0272 , G10L21/0308 , G10L21/0356 , G10L21/0388
Abstract: Systems and methods facilitating removal of content from audio files are described. A method includes identifying a sound recording in a first audio file, identifying a reference file having at least a defined level of similarity to the sound recording, and processing the first audio file to remove the sound recording and generate a second audio file. In some embodiments, winner-take-all coding and Hough transforms are employed for determining alignment and rate adjustment of the reference file in the first audio file. After alignment, the reference file is filtered in the frequency domain to increase similarity between the reference file and the sound recording. The frequency domain representation (FR) of the filtered version is subtracted from the FR first audio and the result converted to a time representation of the second audio file. In some embodiments, spectral subtraction is also performed to generate a further improved second audio file.
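The core subtraction step in this abstract can be illustrated with a short frequency-domain sketch. This is a simplification, not the patented method: alignment and rate adjustment (which the patent derives via winner-take-all coding and a Hough transform) are assumed already done, and a single least-squares gain stands in for the patent's frequency-domain filter:

```python
import numpy as np

def remove_reference(mix, ref):
    """Subtract an aligned reference recording from a mixture in the frequency domain.

    mix and ref are assumed time-aligned and rate-matched. A scalar
    least-squares gain fits the reference to the mixture before subtraction.
    """
    n = len(mix)
    Mix = np.fft.rfft(mix)
    Ref = np.fft.rfft(ref)
    # Best real gain matching the reference spectrum to the mixture spectrum.
    gain = np.vdot(Ref, Mix).real / np.vdot(Ref, Ref).real
    # Subtract in the frequency domain, then return to the time domain.
    return np.fft.irfft(Mix - gain * Ref, n=n)

rng = np.random.default_rng(0)
speech = rng.standard_normal(256)   # stand-in for the content to keep
music = rng.standard_normal(256)    # stand-in for the identified sound recording
residual = remove_reference(speech + music, music)
```

The residual approximates the mixture with the reference removed; the patent additionally applies spectral subtraction to clean up what this linear step leaves behind.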
-
Publication No.: US20210209315A1
Publication Date: 2021-07-08
Application No.: US17056554
Filing Date: 2020-03-07
Applicant: Google LLC
Inventor: Ye Jia , Zhifeng Chen , Yonghui Wu , Melvin Johnson , Fadi Biadsy , Ron Weiss , Wolfgang Macherey
Abstract: The present disclosure provides systems and methods that train and use machine-learned models such as, for example, sequence-to-sequence models, to perform direct and text-free speech-to-speech translation. In particular, aspects of the present disclosure provide an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation.
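The attention step at the heart of such a sequence-to-sequence network can be sketched in a few lines. This is a generic dot-product attention illustration with toy data, not the patented translation network; all names and dimensions here are hypothetical:

```python
import numpy as np

def attention(query, keys, values):
    """Dot-product attention: weight encoder states by relevance to the decoder query."""
    scores = keys @ query
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    # Context vector: relevance-weighted sum of encoder values.
    return weights @ values, weights

# Toy: 4 encoder timesteps of source-speech features, dimension 3.
keys = np.eye(4, 3)
values = np.arange(12.0).reshape(4, 3)
query = np.array([10.0, 0.0, 0.0])  # strongly matches the first timestep
context, weights = attention(query, keys, values)
```

At each decoding step, a context vector like this conditions the generation of the next frame of target-language speech, which is what lets the model skip an intermediate text representation.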
-
Publication No.: US11594212B2
Publication Date: 2023-02-28
Application No.: US17155010
Filing Date: 2021-01-21
Applicant: Google LLC
Inventor: Tara N. Sainath , Ruoming Pang , Ron Weiss , Yanzhang He , Chung-Cheng Chiu , Trevor Strohman
IPC: G10L15/06 , G06N3/08 , G10L15/16 , G10L15/197
Abstract: A method includes receiving a training example for a listen-attend-spell (LAS) decoder of a two-pass streaming neural network model and determining whether the training example corresponds to a supervised audio-text pair or an unpaired text sequence. When the training example corresponds to an unpaired text sequence, the method also includes determining a cross entropy loss based on a log probability associated with a context vector of the training example. The method also includes updating the LAS decoder and the context vector based on the determined cross entropy loss.
-
Publication No.: US11756534B2
Publication Date: 2023-09-12
Application No.: US17649058
Filing Date: 2022-01-26
Applicant: Google LLC
Inventor: Bo Li , Ron Weiss , Michiel A. U. Bacchiani , Tara N. Sainath , Kevin William Wilson
IPC: G10L15/00 , G10L15/16 , G10L15/20 , G10L21/0224 , G10L15/26 , G10L21/0216
CPC classification number: G10L15/16 , G10L15/20 , G10L21/0224 , G10L15/26 , G10L2021/02166
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for neural network adaptive beamforming for multichannel speech recognition are disclosed. In one aspect, a method includes the actions of receiving a first channel of audio data corresponding to an utterance and a second channel of audio data corresponding to the utterance. The actions further include generating a first set of filter parameters for a first filter based on the first channel of audio data and the second channel of audio data and a second set of filter parameters for a second filter based on the first channel of audio data and the second channel of audio data. The actions further include generating a single combined channel of audio data. The actions further include inputting the audio data to a neural network. The actions further include providing a transcription for the utterance.
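The combination step in this abstract is a filter-and-sum operation: each channel passes through its own filter and the results are summed into a single channel. In the patent the filter parameters are predicted by a neural network from both channels; the sketch below uses fixed toy filters purely to illustrate the operation, and every name in it is hypothetical:

```python
import numpy as np

def filter_and_sum(ch1, ch2, f1, f2):
    """Filter each input channel with its own FIR filter and sum into one channel."""
    return np.convolve(ch1, f1, mode="same") + np.convolve(ch2, f2, mode="same")

# Toy steering example: channel 2 is a one-sample-delayed copy of channel 1,
# so a one-sample advance on channel 2 re-aligns the two channels.
t = np.arange(64)
ch1 = np.sin(0.3 * t)
ch2 = np.roll(ch1, 1)
f1 = np.array([0.0, 0.5, 0.0])  # identity tap, halved
f2 = np.array([0.5, 0.0, 0.0])  # one-sample advance, halved
combined = filter_and_sum(ch1, ch2, f1, f2)
```

After re-alignment the two half-weighted channels add coherently, reconstructing the source; the single combined channel is what gets fed to the recognition network.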
-
Publication No.: US20220246132A1
Publication Date: 2022-08-04
Application No.: US17163007
Filing Date: 2021-01-29
Applicant: Google LLC
Inventor: Yu Zhang , Bhuvana Ramabhadran , Andrew Rosenberg , Yonghui Wu , Byungha Chun , Ron Weiss , Yuan Cao
IPC: G10L13/047 , G10L25/18 , G10L13/10 , G10L15/06 , G06N3/08
Abstract: A method of generating diverse and natural text-to-speech (TTS) samples includes receiving a text and generating a speech sample based on the text using a TTS model. A training process trains the TTS model to generate the speech sample by receiving training samples. Each training sample includes a spectrogram and a training text corresponding to the spectrogram. For each training sample, the training process identifies speech units associated with the training text. For each speech unit, the training process generates a speech embedding, aligns the speech embedding with a portion of the spectrogram, extracts a latent feature from the aligned portion of the spectrogram, and assigns a quantized embedding to the latent feature. The training process generates the speech sample by decoding a concatenation of the speech embeddings and the quantized embeddings for the speech units associated with the training text corresponding to the spectrogram.
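The "assigns a quantized embedding to the latent feature" step is a vector-quantization lookup: each latent feature snaps to its nearest codebook entry. A minimal sketch of that assignment, with a hypothetical two-dimensional codebook standing in for the learned one:

```python
import numpy as np

def quantize(latents, codebook):
    """Assign each latent feature to its nearest codebook entry (vector quantization)."""
    # Pairwise squared distances between latents (N, D) and codes (K, D).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = d2.argmin(axis=1)
    return codebook[ids], ids

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
latents = np.array([[0.1, -0.1], [0.9, 1.2]])
quantized, ids = quantize(latents, codebook)
```

In the training process described above, the quantized embeddings produced this way are concatenated with the speech embeddings and decoded into the speech sample; the discrete codebook is one source of the sample diversity the method targets.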
-
Publication No.: US12032920B2
Publication Date: 2024-07-09
Application No.: US17056554
Filing Date: 2020-03-07
Applicant: Google LLC
Inventor: Ye Jia , Zhifeng Chen , Yonghui Wu , Melvin Johnson , Fadi Biadsy , Ron Weiss , Wolfgang Macherey
Abstract: The present disclosure provides systems and methods that train and use machine-learned models such as, for example, sequence-to-sequence models, to perform direct and text-free speech-to-speech translation. In particular, aspects of the present disclosure provide an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation.
-
Publication No.: US11475874B2
Publication Date: 2022-10-18
Application No.: US17163007
Filing Date: 2021-01-29
Applicant: Google LLC
Inventor: Yu Zhang , Bhuvana Ramabhadran , Andrew Rosenberg , Yonghui Wu , Byungha Chun , Ron Weiss , Yuan Cao
Abstract: A method of generating diverse and natural text-to-speech (TTS) samples includes receiving a text and generating a speech sample based on the text using a TTS model. A training process trains the TTS model to generate the speech sample by receiving training samples. Each training sample includes a spectrogram and a training text corresponding to the spectrogram. For each training sample, the training process identifies speech units associated with the training text. For each speech unit, the training process generates a speech embedding, aligns the speech embedding with a portion of the spectrogram, extracts a latent feature from the aligned portion of the spectrogram, and assigns a quantized embedding to the latent feature. The training process generates the speech sample by decoding a concatenation of the speech embeddings and the quantized embeddings for the speech units associated with the training text corresponding to the spectrogram.
-
Publication No.: US20220148582A1
Publication Date: 2022-05-12
Application No.: US17649058
Filing Date: 2022-01-26
Applicant: Google LLC
Inventor: Bo Li , Ron Weiss , Michiel A. U. Bacchiani , Tara N. Sainath , Kevin William Wilson
IPC: G10L15/16 , G10L15/20 , G10L21/0224
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for neural network adaptive beamforming for multichannel speech recognition are disclosed. In one aspect, a method includes the actions of receiving a first channel of audio data corresponding to an utterance and a second channel of audio data corresponding to the utterance. The actions further include generating a first set of filter parameters for a first filter based on the first channel of audio data and the second channel of audio data and a second set of filter parameters for a second filter based on the first channel of audio data and the second channel of audio data. The actions further include generating a single combined channel of audio data. The actions further include inputting the audio data to a neural network. The actions further include providing a transcription for the utterance.