-
Publication number: US20240395233A1
Publication date: 2024-11-28
Application number: US18671577
Filing date: 2024-05-22
Applicant: Google LLC
Inventor: Adam Joseph Roberts , Jesse Hart Engel , Ian Stuart Simon , Andrea Agostinelli , Neil Zeghidour , Christopher James Donahue , Antoine Caillon
IPC: G10H1/00 , G10H1/36 , G10L15/06 , G10L15/18 , G10L15/183
Abstract: Training data comprising a plurality of training pairs is obtained. Each training pair comprises instrumental audio data and vocal audio data separated from audio data of a musical work of a respective plurality of musical works. For one or more training pairs of the plurality of training pairs, the vocal audio data is processed with machine-learned model(s) of a machine-learned generative audio model grouping to obtain a vocal intermediate representation for the vocal audio data. The instrumental audio data is processed with a pre-trained encoding model to obtain an instrumental intermediate representation for the instrumental audio data. A loss function is evaluated that evaluates a difference between the vocal intermediate representation and the instrumental intermediate representation. Values of parameters of a machine-learned model of the machine-learned generative audio model grouping are modified based on the loss function.
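The training step described in this abstract can be sketched roughly as follows. This is a toy illustration only: the abstract does not specify the model or the loss, so the linear map, the mean-squared-error loss, the plain gradient step, and all names (`training_step`, `vocal_features`, `instrumental_repr`) are assumptions made for the example. The instrumental representation is treated as a fixed target, standing in for the output of the pre-trained encoding model.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def linear(weights: Matrix, x: Sequence[float]) -> List[float]:
    """Toy stand-in for the trainable model: one linear map."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared difference between two representations."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def training_step(weights: Matrix, vocal_features: Sequence[float],
                  instrumental_repr: Sequence[float],
                  lr: float = 0.1) -> Tuple[Matrix, float]:
    """Evaluate the loss and take one gradient step on the toy model only."""
    vocal_repr = linear(weights, vocal_features)
    loss = mse(vocal_repr, instrumental_repr)
    d = len(vocal_repr)
    # dL/dW[i][j] = (2 / d) * (vocal_repr[i] - target[i]) * x[j]
    new_weights = [
        [w - lr * (2.0 / d) * (vocal_repr[i] - instrumental_repr[i]) * xj
         for w, xj in zip(row, vocal_features)]
        for i, row in enumerate(weights)
    ]
    return new_weights, loss


weights = [[0.0, 0.0], [0.0, 0.0]]
vocal_features = [1.0, 2.0]          # hypothetical features of the vocal audio
instrumental_repr = [0.5, -1.0]      # hypothetical frozen-encoder output

losses = []
for _ in range(20):
    weights, loss = training_step(weights, vocal_features, instrumental_repr)
    losses.append(loss)
```

Only `weights` changes between steps; the target representation is never updated, mirroring the pre-trained (fixed) encoder in the abstract.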
-
Publication number: US20240296331A1
Publication date: 2024-09-05
Application number: US18437202
Filing date: 2024-02-08
Applicant: Google LLC
Inventor: David Wilson Romero Guzman , Neil Zeghidour
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for jointly learning the architecture of a neural network during the training of the neural network. In particular, the architecture of the neural network is learned using differentiable parametric masks.
-
Publication number: US11600282B2
Publication date: 2023-03-07
Application number: US17856856
Filing date: 2022-07-01
Applicant: Google LLC
Inventor: Neil Zeghidour , Marco Tagliasacchi , Dominik Roblek
IPC: G10L19/038 , G10L25/30 , G10L19/00 , G06N3/08 , G06N3/04
Abstract: Methods, systems and apparatus, including computer programs encoded on computer storage media. One of the methods includes receiving an audio waveform that includes a respective audio sample for each of a plurality of time steps, processing the audio waveform using an encoder neural network to generate a plurality of feature vectors representing the audio waveform, generating a respective coded representation of each of the plurality of feature vectors using a plurality of vector quantizers that are each associated with a respective codebook of code vectors, wherein the respective coded representation of each feature vector identifies a plurality of code vectors, including a respective code vector from the codebook of each vector quantizer, that define a quantized representation of the feature vector, and generating a compressed representation of the audio waveform by compressing the respective coded representation of each of the plurality of feature vectors.
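The quantization step in this abstract can be illustrated with a small sketch. It assumes the quantizers are chained residually, so that each stage quantizes the error left by the previous one; the abstract itself only states that each feature vector is represented by one code vector per codebook. The function names and the toy codebooks are hypothetical, and the final compression of the indices (for example by entropy coding) is not shown.

```python
from typing import List, Sequence, Tuple


def nearest_code(vector: Sequence[float],
                 codebook: Sequence[Sequence[float]]) -> int:
    """Index of the code vector closest to `vector` in squared L2 distance."""
    def sq_dist(code: Sequence[float]) -> float:
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(feature: Sequence[float],
                      codebooks: Sequence[Sequence[Sequence[float]]]
                      ) -> Tuple[List[int], List[float]]:
    """Return one code index per codebook and the summed quantized vector."""
    residual = list(feature)
    quantized = [0.0] * len(feature)
    indices = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        code = codebook[idx]
        indices.append(idx)
        # Accumulate the chosen code and pass the remaining error onward.
        quantized = [q + c for q, c in zip(quantized, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, quantized


# Two toy codebooks: a coarse one and a finer one for the residual.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]
indices, approx = residual_quantize([1.2, 0.8], codebooks)
```

Here `indices` is the coded representation of the feature vector (one index per quantizer) and `approx` is the quantized vector those indices define.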
-
Publication number: US20250005354A1
Publication date: 2025-01-02
Application number: US18698691
Filing date: 2022-10-05
Applicant: Google LLC
Inventor: Neil Zeghidour , Rachid Riad , Olivier Teboul , David Grangier
IPC: G06N3/08
Abstract: A method of training a machine learning model includes receiving training data for the machine learning model, wherein the training data comprises a plurality of batches. The method also includes applying a downsampling layer of the machine learning model to the plurality of batches of the training data to determine a stride comprising a learnable parameter for the downsampling layer. Applying the downsampling layer of the machine learning model to a batch of the training data includes projecting an input in a spatial domain to a Fourier domain, constructing a mask in the Fourier domain based on a current value of the stride and dimensions of the input, applying the mask as a low-pass filter to the projected input to produce a tensor in the Fourier domain, cropping the tensor based on the mask, and transforming the cropped tensor to the spatial domain.
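The five operations listed in this abstract (project, mask, filter, crop, transform back) can be sketched for a 1-D signal as below. This is a simplified illustration under stated assumptions: it uses a hard rectangular mask and a fixed stride value, whereas a learnable stride would need a mask that is differentiable with respect to the stride, which is not shown. The naive DFT helpers and the name `spectral_downsample` are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def dft(x: Sequence[complex]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: Sequence[complex]) -> List[complex]:
    """Naive inverse DFT, including the 1/n normalisation."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def signed_frequency(k: int, n: int) -> int:
    """Map bin k of an n-point DFT to a signed frequency index."""
    return k if k <= n // 2 else k - n


def spectral_downsample(signal: Sequence[float], stride: float) -> List[float]:
    if stride < 1:
        raise ValueError("stride must be >= 1")
    n = len(signal)
    m = max(1, int(round(n / stride)))   # output length implied by the stride
    cutoff = (m - 1) // 2                # highest signed frequency kept

    spectrum = dft(signal)               # spatial domain -> Fourier domain
    # Mask built from the stride (via m) and the input length n.
    mask = [1.0 if abs(signed_frequency(k, n)) <= cutoff else 0.0
            for k in range(n)]
    filtered = [s * w for s, w in zip(spectrum, mask)]   # low-pass filter

    # Crop: keep only the m bins around frequency zero, in DFT order.
    cropped = [filtered[signed_frequency(j, m) % n] for j in range(m)]

    # Back to the spatial domain; m / n rescales for the shorter transform.
    return [(v * m / n).real for v in idft(cropped)]


n = 8
wave = [math.cos(2 * math.pi * t / n) for t in range(n)]
halved = spectral_downsample(wave, stride=2.0)
```

For this band-limited cosine, `halved` has four samples that match the original signal at every second position, up to floating-point error.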
-
Publication number: US11990148B2
Publication date: 2024-05-21
Application number: US18106094
Filing date: 2023-02-06
Applicant: Google LLC
Inventor: Neil Zeghidour , Marco Tagliasacchi , Dominik Roblek
IPC: G10L19/038 , G06N3/045 , G06N3/08 , G10L19/00 , G10L25/30
CPC classification number: G10L19/038 , G06N3/045 , G06N3/08 , G10L25/30 , G10L2019/0002
Abstract: Methods, systems and apparatus, including computer programs encoded on computer storage media. One of the methods includes receiving an audio waveform that includes a respective audio sample for each of a plurality of time steps, processing the audio waveform using an encoder neural network to generate a plurality of feature vectors representing the audio waveform, generating a respective coded representation of each of the plurality of feature vectors using a plurality of vector quantizers that are each associated with a respective codebook of code vectors, wherein the respective coded representation of each feature vector identifies a plurality of code vectors, including a respective code vector from the codebook of each vector quantizer, that define a quantized representation of the feature vector, and generating a compressed representation of the audio waveform by compressing the respective coded representation of each of the plurality of feature vectors.
-
Publication number: US20240079001A1
Publication date: 2024-03-07
Application number: US18463196
Filing date: 2023-09-07
Applicant: Google LLC
Inventor: Andrea Agostinelli , Timo Immanuel Denk , Antoine Caillon , Neil Zeghidour , Jesse Engel , Mauro Verzetti , Christian Frank , Zalán Borsos , Matthew Sharifi , Adam Joseph Roberts
CPC classification number: G10L15/16 , G10H1/0008 , G10L15/063 , G10L15/1815 , G10H2210/056 , G10H2250/311
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating a prediction of an audio signal. One of the methods includes receiving a request to generate an audio signal conditioned on an input; processing the input using an embedding neural network to map the input to one or more embedding tokens; generating a semantic representation of the audio signal; generating, using one or more generative neural networks and conditioned on at least the semantic representation and the embedding tokens, an acoustic representation of the audio signal; and processing at least the acoustic representation using a decoder neural network to generate the prediction of the audio signal.
-
Publication number: US11887623B2
Publication date: 2024-01-30
Application number: US17304514
Filing date: 2021-06-22
Applicant: Google LLC
Inventor: David Grangier , Neil Zeghidour , Oliver Teboul
CPC classification number: G10L25/78 , G06N3/04 , G10L15/063 , G10L15/07 , G10L17/18 , G10L19/008
Abstract: A method includes receiving an input audio signal corresponding to utterances spoken by multiple speakers. The method also includes encoding the input audio signal into a sequence of T temporal embeddings. During each of a plurality of iterations each corresponding to a respective speaker of the multiple speakers, the method includes selecting a respective speaker embedding for the respective speaker by determining a probability that the corresponding temporal embedding includes a presence of voice activity by a single new speaker for which a speaker embedding was not previously selected during a previous iteration and selecting the respective speaker embedding for the respective speaker as the temporal embedding. The method also includes, at each time step, predicting a respective voice activity indicator for each respective speaker of the multiple speakers based on the respective speaker embeddings selected during the plurality of iterations and the temporal embedding.
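The iterative selection loop in this abstract can be sketched as follows. This is a heavily simplified illustration: in the described method the "single new speaker" probability and the voice activity predictions come from learned models, whereas here both are replaced by hand-written stand-ins based on vector norm and cosine similarity. All function names, the scoring rule, and the threshold are assumptions made for the example.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def new_speaker_score(embedding: Vector, selected: List[Vector]) -> float:
    """Stand-in for the learned probability of a single, not yet selected speaker."""
    energy = math.sqrt(sum(x * x for x in embedding))
    if not selected:
        return energy
    novelty = 1.0 - max(cosine(embedding, s) for s in selected)
    return energy * novelty


def select_speakers(temporal_embeddings: List[Vector],
                    num_speakers: int) -> List[Vector]:
    """One iteration per speaker: pick the best-scoring temporal embedding."""
    selected: List[Vector] = []
    for _ in range(num_speakers):
        scores = [new_speaker_score(e, selected) for e in temporal_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        selected.append(temporal_embeddings[best])
    return selected


def voice_activity(temporal_embeddings: List[Vector], speakers: List[Vector],
                   threshold: float = 0.5) -> List[List[bool]]:
    """Per time step, one activity flag per selected speaker."""
    return [[cosine(e, s) > threshold for s in speakers]
            for e in temporal_embeddings]


# Four time steps: speaker A, silence, speaker B, both speakers at once.
embeddings = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.9], [0.7, 0.7]]
speakers = select_speakers(embeddings, num_speakers=2)
activity = voice_activity(embeddings, speakers)
```

In this toy input the overlapped frame scores lower than either single-speaker frame, so it is not chosen as a speaker embedding, but it is flagged as active for both speakers in `activity`.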
-
Publication number: US20230377561A1
Publication date: 2023-11-23
Application number: US18029843
Filing date: 2021-10-04
Applicant: Google LLC
Inventor: Neil Zeghidour , Olivier Teboul , Félix de Chaumont Quitry , Marco Tagliasacchi
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for processing audio inputs using a learned audio frontend machine learning model that processes the audio input to generate a representation of the audio input. The representation can then be processed by an audio understanding model to generate a respective output for each of one or more audio understanding tasks.
-
Publication number: US20230112265A1
Publication date: 2023-04-13
Application number: US17967726
Filing date: 2022-10-17
Applicant: Google LLC
Inventor: Neil Zeghidour , David Grangier
IPC: G10L21/028 , G06N3/08 , G10L17/04 , G10L17/18 , G10L21/0316 , G06N3/045
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for performing speech separation. One of the methods includes obtaining a recording comprising speech from a plurality of speakers; processing the recording using a speaker neural network having speaker parameter values and configured to process the recording in accordance with the speaker parameter values to generate a plurality of per-recording speaker representations, each speaker representation representing features of a respective identified speaker in the recording; and processing the per-recording speaker representations and the recording using a separation neural network having separation parameter values and configured to process the recording and the speaker representations in accordance with the separation parameter values to generate, for each speaker representation, a respective predicted isolated audio signal that corresponds to speech of one of the speakers in the recording.
-
Publication number: US20230019128A1
Publication date: 2023-01-19
Application number: US17856856
Filing date: 2022-07-01
Applicant: Google LLC
Inventor: Neil Zeghidour , Marco Tagliasacchi , Dominik Roblek
IPC: G10L19/038 , G10L25/30 , G06N3/04 , G06N3/08
Abstract: Methods, systems and apparatus, including computer programs encoded on computer storage media. One of the methods includes receiving an audio waveform that includes a respective audio sample for each of a plurality of time steps, processing the audio waveform using an encoder neural network to generate a plurality of feature vectors representing the audio waveform, generating a respective coded representation of each of the plurality of feature vectors using a plurality of vector quantizers that are each associated with a respective codebook of code vectors, wherein the respective coded representation of each feature vector identifies a plurality of code vectors, including a respective code vector from the codebook of each vector quantizer, that define a quantized representation of the feature vector, and generating a compressed representation of the audio waveform by compressing the respective coded representation of each of the plurality of feature vectors.