-
Publication number: US20230419989A1
Publication date: 2023-12-28
Application number: US17808653
Filing date: 2022-06-24
Applicant: Google LLC
Inventor: Beat Gfeller , Kevin Ian Kilgour , Marco Tagliasacchi , Aren Jansen , Scott Thomas Wisdom , Qingqing Huang
CPC classification number: G10L25/84 , G10L15/16 , G10L15/063 , G06N3/0454
Abstract: Example methods include receiving training data comprising a plurality of audio clips and a plurality of textual descriptions of audio. The methods include generating a shared representation comprising a joint embedding. An audio embedding of a given audio clip is within a threshold distance of a text embedding of a textual description of the given audio clip. The methods include generating, based on the joint embedding, a conditioning vector and training, based on the conditioning vector, a neural network to: receive (i) an input audio waveform, and (ii) an input comprising one or more of an input textual description of a target audio source in the input audio waveform, or an audio sample of the target audio source, separate audio corresponding to the target audio source from the input audio waveform, and output the separated audio corresponding to the target audio source in response to the receiving of the input.
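The "within a threshold distance" property of the joint embedding can be illustrated with a minimal sketch. The abstract does not name a distance metric or a threshold value, so the cosine distance, the `threshold=0.2` default, and the function names below are assumptions chosen for illustration only, not the method claimed in the filing.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors (0.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def is_matched(audio_emb, text_emb, threshold=0.2):
    """True if the audio and text embeddings lie within the threshold distance.

    ASSUMPTION: the metric (cosine) and the threshold value are hypothetical;
    the abstract only states that a threshold distance exists.
    """
    return cosine_distance(audio_emb, text_emb) <= threshold

# Toy embeddings: a clip and its description point in nearly the same direction,
# while an unrelated description is orthogonal to the clip.
clip = [1.0, 0.0]
matching_text = [0.9, 0.1]
unrelated_text = [0.0, 1.0]
```

In this toy setup, `is_matched(clip, matching_text)` holds and `is_matched(clip, unrelated_text)` does not, which is the behavior a shared audio-text space is meant to provide before a conditioning vector is derived from it.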
-
Publication number: US20230308823A1
Publication date: 2023-09-28
Application number: US18042258
Filing date: 2020-08-26
Applicant: Manoj PLAKAL , Dan ELLIS , Shawn HERSHEY , Richard Channing MOORE, III , Aren JANSEN , Google LLC
Inventor: Aren Jansen , Manoj Plakal , Dan Ellis , Shawn Hershey , Richard Channing Moore, III
IPC: H04S7/00
CPC classification number: H04S7/301 , H04S2400/01
Abstract: A computer-implemented method for upmixing audiovisual data can include obtaining audiovisual data including input audio data and video data accompanying the input audio data. Each frame of the video data can depict only a portion of a larger scene. The input audio data can have a first number of audio channels. The computer-implemented method can include providing the audiovisual data as input to a machine-learned audiovisual upmixing model. The audiovisual upmixing model can include a sequence-to-sequence model configured to model a respective location of one or more audio sources within the larger scene over multiple frames of the video data. The computer-implemented method can include receiving upmixed audio data from the audiovisual upmixing model. The upmixed audio data can have a second number of audio channels. The second number of audio channels can be greater than the first number of audio channels.
-
Publication number: US11756570B2
Publication date: 2023-09-12
Application number: US17214186
Filing date: 2021-03-26
Applicant: Google LLC
Inventor: Efthymios Tzinis , Scott Wisdom , Aren Jansen , John R Hershey
IPC: G10L25/57 , G06N3/088 , G10L25/30 , G06V20/40 , G06F18/214
CPC classification number: G10L25/57 , G06F18/214 , G06N3/088 , G06V20/40 , G10L25/30
Abstract: Apparatus and methods related to separation of audio sources are provided. The method includes receiving an audio waveform associated with a plurality of video frames. The method includes estimating, by a neural network, one or more audio sources associated with the plurality of video frames. The method includes generating, by the neural network, one or more audio embeddings corresponding to the one or more estimated audio sources. The method includes determining, based on the audio embeddings and a video embedding, whether one or more audio sources of the one or more estimated audio sources correspond to objects in the plurality of video frames. The method includes predicting, by the neural network and based on the one or more audio embeddings and the video embedding, a version of the audio waveform comprising audio sources that correspond to objects in the plurality of video frames.
-
Publication number: US20220059117A1
Publication date: 2022-02-24
Application number: US17000583
Filing date: 2020-08-24
Applicant: Google LLC
Inventor: Joel Shor , Ronnie Maor , Oran Lang , Omry Tuval , Marco Tagliasacchi , Ira Shavitt , Felix de Chaumont Quitry , Dotan Emanuel , Aren Jansen
Abstract: Examples relate to on-device non-semantic representation fine-tuning for speech classification. A computing system may obtain audio data having a speech portion and train a neural network to learn a non-semantic speech representation based on the speech portion of the audio data. The computing system may evaluate performance of the non-semantic speech representation based on a set of benchmark tasks corresponding to a speech domain and perform a fine-tuning process on the non-semantic speech representation based on one or more downstream tasks. The computing system may further generate a model based on the non-semantic representation and provide the model to a mobile computing device. The model is configured to operate locally on the mobile computing device.
-
Publication number: US20240366148A1
Publication date: 2024-11-07
Application number: US18773046
Filing date: 2024-07-15
Applicant: Google LLC
Inventor: Katherine Chou , Michael Dwight Howell , Kasumi Widner , Ryan Rifkin , Henry George Wei , Daniel Ellis , Alvin Rajkomar , Aren Jansen , David Michael Parish , Michael Philip Brenner
Abstract: The present disclosure provides systems and methods for generating health diagnostic information from an audio recording. A computing system can include a machine-learned health model that includes a sound model trained to receive data descriptive of a patient audio recording and output sound description data. The computing system can include a diagnostic model trained to receive the sound description data and output a diagnostic score. The computing system can include at least one tangible, non-transitory computer-readable medium that stores instructions that, when executed, cause the processor to perform operations. The operations can include obtaining the patient audio recording; inputting data descriptive of the patient audio recording into the sound model; receiving, as an output of the sound model, the sound description data; inputting the sound description data into the diagnostic model; and receiving, as an output of the diagnostic model, the diagnostic score.
-
Publication number: US20230386502A1
Publication date: 2023-11-30
Application number: US18226545
Filing date: 2023-07-26
Applicant: Google LLC
Inventor: Efthymios Tzinis , Scott Wisdom , Aren Jansen , John R. Hershey
IPC: G10L25/57 , G06N3/088 , G10L25/30 , G06V20/40 , G06F18/214
CPC classification number: G10L25/57 , G06N3/088 , G10L25/30 , G06V20/40 , G06F18/214
Abstract: Apparatus and methods related to separation of audio sources are provided. The method includes receiving an audio waveform associated with a plurality of video frames. The method includes estimating, by a neural network, one or more audio sources associated with the plurality of video frames. The method includes generating, by the neural network, one or more audio embeddings corresponding to the one or more estimated audio sources. The method includes determining, based on the audio embeddings and a video embedding, whether one or more audio sources of the one or more estimated audio sources correspond to objects in the plurality of video frames. The method includes predicting, by the neural network and based on the one or more audio embeddings and the video embedding, a version of the audio waveform comprising audio sources that correspond to objects in the plurality of video frames.
-
Publication number: US11823439B2
Publication date: 2023-11-21
Application number: US17428659
Filing date: 2020-01-16
Applicant: Google LLC
Inventor: Aren Jansen , Malcolm Slaney
IPC: G06V10/774 , G06V40/10 , G06V10/80
CPC classification number: G06V10/774 , G06V10/811 , G06V40/15
Abstract: Generally, the present disclosure is directed to systems and methods that train machine-learned models (e.g., artificial neural networks) to perform perceptual or cognitive task(s) based on biometric data (e.g., brain wave recordings) collected from living organism(s) while the living organism(s) are performing the perceptual or cognitive task(s). In particular, aspects of the present disclosure are directed to a new supervision paradigm, by which machine-learned feature extraction models are trained using example stimuli paired with companion biometric data such as neural activity recordings (e.g., electroencephalogram data, electrocorticography data, functional near-infrared spectroscopy, and/or magnetoencephalography data) collected from a living organism (e.g., human being) while the organism perceived those examples (e.g., viewing the image, listening to the speech, etc.).
-
Publication number: US11475236B2
Publication date: 2022-10-18
Application number: US16880456
Filing date: 2020-05-21
Applicant: Google LLC
Inventor: Aren Jansen , Ryan Michael Rifkin , Daniel Ellis
Abstract: A computing system can include an embedding model and a clustering model. The computing system can input each of a plurality of inputs into the embedding model and receive respective embeddings for the plurality of inputs as outputs of the embedding model. The computing system can input the respective embeddings for the plurality of inputs into the clustering model and receive respective cluster assignments for the plurality of inputs as outputs of the clustering model. The computing system can evaluate a clustering loss function that evaluates a first average, across the plurality of inputs, of a respective first entropy of each respective probability distribution; and a second entropy of a second average of the probability distributions for the plurality of inputs. The computing system can modify parameter(s) of one or both of the clustering model and the embedding model based on the clustering loss function.
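The two quantities named in this abstract, the average of per-input entropies and the entropy of the averaged distribution, can be computed as in the sketch below. The abstract does not say how the two terms are combined, so the difference used in `clustering_loss` is an assumption (a common choice that rewards confident per-input assignments and balanced cluster usage), and all function names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy, in nats, of a discrete probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def clustering_loss_terms(assignments):
    """Return (average of per-input entropies, entropy of the average distribution).

    `assignments` is a list of per-input probability distributions over clusters.
    """
    n = len(assignments)
    k = len(assignments[0])
    mean_entropy = sum(entropy(p) for p in assignments) / n
    mean_dist = [sum(p[j] for p in assignments) / n for j in range(k)]
    return mean_entropy, entropy(mean_dist)

def clustering_loss(assignments):
    # ASSUMPTION: the terms are combined as a difference; the abstract only
    # states that the loss evaluates both quantities.
    first, second = clustering_loss_terms(assignments)
    return first - second

# Confident and balanced assignments: each input picks one cluster, both clusters used.
confident = [[1.0, 0.0], [0.0, 1.0]]
# Uninformative assignments: every input is split evenly between the two clusters.
uniform = [[0.5, 0.5], [0.5, 0.5]]
```

Under this assumed combination, the confident and balanced case scores lower (negative log 2) than the uninformative case (zero), so minimizing the loss would push the models toward the former.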
-