-
Publication No.: US20240112667A1
Publication Date: 2024-04-04
Application No.: US18525475
Filing Date: 2023-11-30
Applicant: Google LLC
Inventor: Ye Jia , Zhifeng Chen , Yonghui Wu , Jonathan Shen , Ruoming Pang , Ron J. Weiss , Ignacio Lopez Moreno , Fei Ren , Yu Zhang , Quan Wang , Patrick An Phu Nguyen
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.
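Below is a minimal sketch of the dataflow this abstract describes: a speaker encoder turns reference audio into a fixed-size speaker vector, and a spectrogram generator conditions on the input text plus that vector. The SpeakerEncoder and SpectrogramGenerator classes here are hypothetical numpy stand-ins, not the patented models.

```python
import numpy as np

class SpeakerEncoder:
    """Stand-in for a model trained to distinguish speakers from one another."""
    def __init__(self, dim: int = 256):
        self.dim = dim

    def embed(self, reference_audio: np.ndarray) -> np.ndarray:
        # A real encoder would run a neural network; here we derive a
        # deterministic unit-norm vector from the reference audio.
        rng = np.random.default_rng(abs(hash(reference_audio.tobytes())) % 2**32)
        v = rng.standard_normal(self.dim)
        return v / np.linalg.norm(v)

class SpectrogramGenerator:
    """Stand-in for a synthesis model conditioned on text and a speaker vector."""
    def __init__(self, n_mels: int = 80):
        self.n_mels = n_mels

    def generate(self, text: str, speaker_vector: np.ndarray) -> np.ndarray:
        # One frame per character, biased by the speaker vector so that
        # different speakers yield different spectrograms.
        frames = max(len(text), 1)
        base = np.linspace(0.0, 1.0, self.n_mels)
        bias = speaker_vector[: self.n_mels]
        return np.stack([base + 0.01 * i + 0.1 * bias for i in range(frames)])

def synthesize_in_voice(reference_audio: np.ndarray, text: str) -> np.ndarray:
    """Return a spectrogram of `text` spoken in the reference speaker's voice."""
    speaker_vector = SpeakerEncoder().embed(reference_audio)
    return SpectrogramGenerator().generate(text, speaker_vector)

if __name__ == "__main__":
    ref = np.sin(np.linspace(0, 100, 16000))   # 1 s of dummy reference audio
    print(synthesize_in_voice(ref, "hello world").shape)  # (frames, n_mels)
```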
-
Publication No.: US11942094B2
Publication Date: 2024-03-26
Application No.: US17211791
Filing Date: 2021-03-24
Applicant: Google LLC
Inventor: Roza Chojnacka , Jason Pelecanos , Quan Wang , Ignacio Lopez Moreno
IPC: G10L17/02 , G06F16/9032 , G10L15/08
CPC classification number: G10L17/02 , G06F16/90332 , G10L2015/088
Abstract: A speaker verification method includes receiving audio data corresponding to an utterance, processing a first portion of the audio data that characterizes a predetermined hotword to generate a text-dependent evaluation vector, and generating one or more text-dependent confidence scores. When one of the text-dependent confidence scores satisfies a threshold, the method includes identifying a speaker of the utterance as the respective enrolled user associated with the text-dependent confidence score that satisfies the threshold and initiating performance of an action without performing further speaker verification. When none of the text-dependent confidence scores satisfy the threshold, the method includes processing a second portion of the audio data that characterizes a query to generate a text-independent evaluation vector, generating one or more text-independent confidence scores, and determining whether the identity of the speaker of the utterance includes any of the enrolled users.
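A minimal sketch of the two-stage decision logic in this abstract, assuming cosine similarity as the confidence score and plain numpy vectors as the text-dependent/text-independent evaluation and enrollment embeddings. The thresholds, the 50/50 score combination, and the embedding models themselves are illustrative assumptions.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_speaker(td_eval: np.ndarray,
                   ti_eval: np.ndarray,
                   enrolled: dict[str, dict[str, np.ndarray]],
                   td_threshold: float = 0.8,
                   ti_threshold: float = 0.7) -> str | None:
    """Return the identified enrolled user, or None if nobody matches."""
    # Stage 1: text-dependent scores over the hotword portion.
    td_scores = {user: cosine(td_eval, vecs["td"]) for user, vecs in enrolled.items()}
    best_user, best_td = max(td_scores.items(), key=lambda kv: kv[1])
    if best_td >= td_threshold:
        # High-confidence hotword match: identify the speaker without
        # running the text-independent stage.
        return best_user
    # Stage 2: text-independent scores over the query portion, combined
    # with the text-dependent evidence (equal weighting is an assumption).
    combined = {
        user: 0.5 * td_scores[user] + 0.5 * cosine(ti_eval, vecs["ti"])
        for user, vecs in enrolled.items()
    }
    best_user, best_combined = max(combined.items(), key=lambda kv: kv[1])
    return best_user if best_combined >= ti_threshold else None

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alice = {"td": rng.standard_normal(128), "ti": rng.standard_normal(128)}
    print(verify_speaker(alice["td"] + 0.01, alice["ti"] + 0.01, {"alice": alice}))
```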
-
Publication No.: US11922951B2
Publication Date: 2024-03-05
Application No.: US17567590
Filing Date: 2022-01-03
Applicant: GOOGLE LLC
Inventor: Quan Wang , Prashant Sridhar , Ignacio Lopez Moreno , Hannah Muckenhirn
Abstract: Techniques are disclosed that enable processing of audio data to generate one or more refined versions of audio data, where each of the refined versions of audio data isolate one or more utterances of a single respective human speaker. Various implementations generate a refined version of audio data that isolates utterance(s) of a single human speaker by processing a spectrogram representation of the audio data (generated by processing the audio data with a frequency transformation) using a mask generated by processing the spectrogram of the audio data and a speaker embedding for the single human speaker using a trained voice filter model. Output generated over the trained voice filter model is processed using an inverse of the frequency transformation to generate the refined audio data.
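A minimal sketch of the masking pipeline the abstract describes: STFT as the frequency transformation, a mask predicted from the spectrogram and a speaker embedding, and an inverse STFT to recover refined audio. The mask model here is a trivial magnitude-threshold stand-in, not the trained voice filter model.

```python
import numpy as np
from scipy.signal import stft, istft

def predict_mask(spec: np.ndarray, speaker_embedding: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in: keep time-frequency bins whose magnitude is
    # above the median; a real voice filter model would condition on the
    # speaker embedding to keep only the target speaker's energy.
    mag = np.abs(spec)
    return (mag > np.median(mag)).astype(float)

def isolate_speaker(audio: np.ndarray, sr: int, speaker_embedding: np.ndarray) -> np.ndarray:
    """Return refined audio that (approximately) isolates the target speaker."""
    _, _, spec = stft(audio, fs=sr, nperseg=512)          # frequency transformation
    mask = predict_mask(spec, speaker_embedding)          # voice-filter stand-in
    _, refined = istft(spec * mask, fs=sr, nperseg=512)   # inverse transformation
    return refined

if __name__ == "__main__":
    sr = 16000
    audio = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s test tone
    print(isolate_speaker(audio, sr, np.zeros(256)).shape)
```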
-
Publication No.: US20230298591A1
Publication Date: 2023-09-21
Application No.: US18123060
Filing Date: 2023-03-17
Applicant: Google LLC
Inventor: Shaojin Ding , Rajeev Rikhye , Qiao Liang , Yanzhang He , Quan Wang , Arun Narayanan , Tom O'Malley , Ian McGraw
Abstract: A computer-implemented method includes receiving a sequence of acoustic frames corresponding to an utterance and generating a reference speaker embedding for the utterance. The method also includes receiving a target speaker embedding for a target speaker and generating feature-wise linear modulation (FiLM) parameters including a scaling vector and a shifting vector based on the target speaker embedding. The method also includes generating an affine transformation output that scales and shifts the reference speaker embedding based on the FiLM parameters. The method also includes generating a classification output indicating whether the utterance was spoken by the target speaker based on the affine transformation output.
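A minimal sketch of feature-wise linear modulation (FiLM) as described in this abstract: the target speaker embedding is mapped to a scaling vector and a shifting vector, which modulate the reference speaker embedding before a binary same-speaker classification. The linear projections and the final classifier here are untrained stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 128

# Hypothetical projection matrices that produce the FiLM parameters.
W_gamma = rng.standard_normal((DIM, DIM)) * 0.01
W_beta = rng.standard_normal((DIM, DIM)) * 0.01
w_cls = rng.standard_normal(DIM) * 0.01

def film_parameters(target_embedding: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    gamma = W_gamma @ target_embedding     # scaling vector
    beta = W_beta @ target_embedding       # shifting vector
    return gamma, beta

def same_speaker_score(reference_embedding: np.ndarray,
                       target_embedding: np.ndarray) -> float:
    """Probability-like score that the utterance was spoken by the target speaker."""
    gamma, beta = film_parameters(target_embedding)
    affine = gamma * reference_embedding + beta   # scale and shift (FiLM)
    logit = float(w_cls @ affine)
    return 1.0 / (1.0 + np.exp(-logit))

if __name__ == "__main__":
    ref = rng.standard_normal(DIM)
    tgt = rng.standard_normal(DIM)
    print(round(same_speaker_score(ref, tgt), 3))
```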
-
Publication No.: US20230169984A1
Publication Date: 2023-06-01
Application No.: US18103324
Filing Date: 2023-01-30
Applicant: Google LLC
Inventor: Rajeev Rikhye , Quan Wang , Yanzhang He , Qiao Liang , Ian C. McGraw
IPC: G10L17/24 , G10L17/06 , G10L21/028
CPC classification number: G10L17/24 , G10L17/06 , G10L21/028
Abstract: Techniques disclosed herein are directed towards streaming keyphrase detection which can be customized to detect one or more particular keyphrases, without requiring retraining of any model(s) for those particular keyphrase(s). Many implementations include processing audio data using a speaker separation model to generate separated audio data which isolates an utterance spoken by a human speaker from one or more additional sounds not spoken by the human speaker, and processing the separated audio data using a text independent speaker identification model to determine whether a verified and/or registered user spoke a spoken utterance captured in the audio data. Various implementations include processing the audio data and/or the separated audio data using an automatic speech recognition model to generate a text representation of the utterance. Additionally or alternatively, the text representation of the utterance can be processed to determine whether at least a portion of the text representation of the utterance captures a particular keyphrase. When the system determines the registered and/or verified user spoke the utterance and the system determines the text representation of the utterance captures the particular keyphrase, the system can cause a computing device to perform one or more actions corresponding to the particular keyphrase.
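A minimal sketch of the control flow this abstract describes: speaker separation, then text-independent speaker identification, then ASR, then a plain-text keyphrase check that triggers an action. The separation, speaker-ID, and ASR models are passed in as callables; everything shown is hypothetical orchestration, not the patented models.

```python
from typing import Callable
import numpy as np

def detect_keyphrase(audio: np.ndarray,
                     separate: Callable[[np.ndarray], np.ndarray],
                     is_registered_speaker: Callable[[np.ndarray], bool],
                     transcribe: Callable[[np.ndarray], str],
                     keyphrases: dict[str, Callable[[], None]]) -> bool:
    """Run the streaming keyphrase pipeline; return True if an action fired."""
    separated = separate(audio)                  # isolate the human speaker
    if not is_registered_speaker(separated):     # text-independent speaker ID
        return False
    text = transcribe(separated).lower()         # ASR on the separated audio
    for phrase, action in keyphrases.items():    # keyphrases are plain text,
        if phrase in text:                       # so no model retraining is needed
            action()
            return True
    return False

if __name__ == "__main__":
    triggered = detect_keyphrase(
        np.zeros(16000),
        separate=lambda a: a,
        is_registered_speaker=lambda a: True,
        transcribe=lambda a: "please stop the music",
        keyphrases={"stop the music": lambda: print("pausing playback")},
    )
    print(triggered)
```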
-
Publication No.: US20230089308A1
Publication Date: 2023-03-23
Application No.: US17644261
Filing Date: 2021-12-14
Applicant: Google LLC
Inventor: Quan Wang , Han Lu , Evan Clark , Ignacio Lopez Moreno , Hasim Sak , Wei Xia , Taral Joglekar , Anshuman Tripathi
Abstract: A method includes receiving an input audio signal that corresponds to utterances spoken by multiple speakers. The method also includes processing the input audio to generate a transcription of the utterances and a sequence of speaker turn tokens each indicating a location of a respective speaker turn. The method also includes segmenting the input audio signal into a plurality of speaker segments based on the sequence of speaker tokens. The method also includes extracting a speaker-discriminative embedding from each speaker segment and performing spectral clustering on the speaker-discriminative embeddings to cluster the plurality of speaker segments into k classes. The method also includes assigning a respective speaker label to each speaker segment clustered into the respective class that is different than the respective speaker label assigned to the speaker segments clustered into each other class of the k classes.
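A minimal sketch of the segmentation-and-clustering steps in this abstract: speaker-turn tokens mark segment boundaries, an embedding is extracted per segment, and spectral clustering groups the segments into k speaker classes. The per-segment embedding is a stand-in; the clustering uses scikit-learn's SpectralClustering.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def segment_by_turn_tokens(audio: np.ndarray, turn_times: list[float], sr: int) -> list[np.ndarray]:
    """Split audio at the times of the predicted speaker-turn tokens."""
    boundaries = [0] + [int(t * sr) for t in turn_times] + [len(audio)]
    return [audio[a:b] for a, b in zip(boundaries[:-1], boundaries[1:]) if b > a]

def embed_segment(segment: np.ndarray, dim: int = 64) -> np.ndarray:
    # Hypothetical speaker-discriminative embedding stand-in.
    rng = np.random.default_rng(abs(hash(segment.tobytes())) % 2**32)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def diarize(audio: np.ndarray, turn_times: list[float], sr: int, k: int) -> list[int]:
    """Return one speaker label per speaker segment."""
    segments = segment_by_turn_tokens(audio, turn_times, sr)
    embeddings = np.stack([embed_segment(s) for s in segments])
    clustering = SpectralClustering(n_clusters=k, random_state=0)
    return clustering.fit_predict(embeddings).tolist()

if __name__ == "__main__":
    sr = 16000
    audio = np.random.default_rng(1).standard_normal(10 * sr)
    print(diarize(audio, turn_times=[2.0, 4.5, 7.0], sr=sr, k=2))
```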
-
Publication No.: US20230038982A1
Publication Date: 2023-02-09
Application No.: US17644108
Filing Date: 2021-12-14
Applicant: Google LLC
Inventor: Arun Narayanan , Tom O'Malley , Quan Wang , Alex Park , James Walker , Nathan David Howard , Yanzhang He , Chung-Cheng Chiu
IPC: G10L21/0216 , G10L15/06 , H04R3/04 , G06N3/04
Abstract: A method for automatic speech recognition using joint acoustic echo cancellation, speech enhancement, and voice separation includes receiving, at a contextual frontend processing model, input speech features corresponding to a target utterance. The method also includes receiving, at the contextual frontend processing model, at least one of a reference audio signal, a contextual noise signal including noise prior to the target utterance, or a speaker embedding including voice characteristics of a target speaker that spoke the target utterance. The method further includes processing, using the contextual frontend processing model, the input speech features and the at least one of the reference audio signal, the contextual noise signal, or the speaker embedding to generate enhanced speech features.
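A minimal sketch of the interface this abstract describes: a contextual frontend takes input speech features plus any combination of a reference (echo) signal, a contextual-noise signal, and a speaker embedding, and returns enhanced features. The enhancement itself is a trivial subtraction-style stand-in, not the trained frontend model.

```python
import numpy as np

def contextual_frontend(speech_features: np.ndarray,
                        reference_signal: np.ndarray | None = None,
                        contextual_noise: np.ndarray | None = None,
                        speaker_embedding: np.ndarray | None = None) -> np.ndarray:
    """Return enhanced speech features with the same shape as the input."""
    enhanced = speech_features.copy()
    if reference_signal is not None:
        # Echo-cancellation stand-in: subtract the average reference profile.
        enhanced = enhanced - reference_signal.mean(axis=0, keepdims=True)
    if contextual_noise is not None:
        # Noise-suppression stand-in: subtract a noise profile estimated
        # from frames captured before the target utterance.
        enhanced = enhanced - contextual_noise.mean(axis=0, keepdims=True)
    if speaker_embedding is not None:
        # Voice-separation stand-in: a real model would attend to the
        # target speaker via the embedding; here it is only validated.
        assert speaker_embedding.ndim == 1
    return np.maximum(enhanced, 0.0)

if __name__ == "__main__":
    feats = np.abs(np.random.default_rng(0).standard_normal((100, 80)))
    noise = np.abs(np.random.default_rng(1).standard_normal((20, 80)))
    print(contextual_frontend(feats, contextual_noise=noise).shape)
```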
-
Publication No.: US20220366914A1
Publication Date: 2022-11-17
Application No.: US17302926
Filing Date: 2021-05-16
Applicant: Google LLC
Inventor: Ignacio Lopez Moreno , Quan Wang , Jason Pelecanos , Yiling Huang , Mert Saglam
IPC: G10L17/06 , G10L17/18 , G10L17/04 , G06F16/245 , G06N3/08
Abstract: A speaker verification method includes receiving audio data corresponding to an utterance and processing the audio data to generate an evaluation attentive d-vector (ad-vector) representing voice characteristics of the utterance, where the evaluation ad-vector includes nₑ style classes, each including a respective value vector concatenated with a corresponding routing vector. The method also includes generating, using a self-attention mechanism, at least one multi-condition attention score that indicates a likelihood that the evaluation ad-vector matches a respective reference ad-vector associated with a respective user. The method also includes identifying the speaker of the utterance as the respective user associated with the respective reference ad-vector based on the multi-condition attention score.
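A minimal sketch, under loose assumptions, of scoring one evaluation ad-vector against one reference ad-vector. Each ad-vector is modeled as nₑ style classes, each holding a value vector and a routing vector; routing-vector agreement is turned into attention weights that pool value-vector similarities into a single score. The exact scoring in the patent may differ from this stand-in.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_score(eval_values: np.ndarray, eval_routes: np.ndarray,
                    ref_values: np.ndarray, ref_routes: np.ndarray) -> float:
    """Multi-condition attention score between two ad-vectors.

    All arrays have shape (n_e, dim): one row per style class.
    """
    # Attention weights from routing-vector agreement across style classes.
    routing_affinity = np.einsum("id,jd->ij", eval_routes, ref_routes)
    weights = softmax(routing_affinity.ravel()).reshape(routing_affinity.shape)
    # Value-vector cosine similarities, pooled by the attention weights.
    ev = eval_values / np.linalg.norm(eval_values, axis=1, keepdims=True)
    rv = ref_values / np.linalg.norm(ref_values, axis=1, keepdims=True)
    value_sim = np.einsum("id,jd->ij", ev, rv)
    return float((weights * value_sim).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_e, dim = 4, 32
    print(round(attention_score(rng.standard_normal((n_e, dim)),
                                rng.standard_normal((n_e, dim)),
                                rng.standard_normal((n_e, dim)),
                                rng.standard_normal((n_e, dim))), 3))
```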
-
Publication No.: US20220335953A1
Publication Date: 2022-10-20
Application No.: US17233253
Filing Date: 2021-04-16
Applicant: Google LLC
Inventor: Rajeev Rikhye , Quan Wang , Yanzhang He , Qiao Liang , Ian C. McGraw
IPC: G10L17/24 , G10L21/028 , G10L17/06
Abstract: Techniques disclosed herein are directed towards streaming keyphrase detection which can be customized to detect one or more particular keyphrases, without requiring retraining of any model(s) for those particular keyphrase(s). Many implementations include processing audio data using a speaker separation model to generate separated audio data which isolates an utterance spoken by a human speaker from one or more additional sounds not spoken by the human speaker, and processing the separated audio data using a text independent speaker identification model to determine whether a verified and/or registered user spoke a spoken utterance captured in the audio data. Various implementations include processing the audio data and/or the separated audio data using an automatic speech recognition model to generate a text representation of the utterance. Additionally or alternatively, the text representation of the utterance can be processed to determine whether at least a portion of the text representation of the utterance captures a particular keyphrase. When the system determines the registered and/or verified user spoke the utterance and the system determines the text representation of the utterance captures the particular keyphrase, the system can cause a computing device to perform one or more actions corresponding to the particular keyphrase.
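To complement the pipeline sketch shown for the related application above, here is a small sketch of the text-side check this abstract mentions: deciding whether at least a portion of the ASR transcript captures one of the customizable keyphrases. The normalization rule and example phrases are illustrative; adding a phrase involves no model retraining.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so phrase matching is robust."""
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).strip()

def matched_keyphrase(transcript: str, keyphrases: list[str]) -> str | None:
    """Return the first keyphrase contained in the transcript, if any."""
    norm = normalize(transcript)
    for phrase in keyphrases:
        if normalize(phrase) in norm:
            return phrase
    return None

if __name__ == "__main__":
    phrases = ["turn on the lights", "stop the music"]
    print(matched_keyphrase("Hey, could you Turn on the Lights?", phrases))
```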
-
Publication No.: US11468900B2
Publication Date: 2022-10-11
Application No.: US17071223
Filing Date: 2020-10-15
Applicant: Google LLC
Inventor: Yeming Fang , Quan Wang , Pedro Jose Moreno Mengibar , Ignacio Lopez Moreno , Gang Feng , Fang Chu , Jin Shi , Jason William Pelecanos
Abstract: A method of generating an accurate speaker representation for an audio sample includes receiving a first audio sample from a first speaker and a second audio sample from a second speaker. The method includes dividing a respective audio sample into a plurality of audio slices. The method also includes, based on the plurality of slices, generating a set of candidate acoustic embeddings where each candidate acoustic embedding includes a vector representation of acoustic features. The method further includes removing a subset of the candidate acoustic embeddings from the set of candidate acoustic embeddings. The method additionally includes generating an aggregate acoustic embedding from the remaining candidate acoustic embeddings in the set of candidate acoustic embeddings after removing the subset of the candidate acoustic embeddings.
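A minimal sketch of the aggregation this abstract describes: slice an audio sample, embed each slice as a candidate acoustic embedding, remove a subset of candidates, and average the remainder into an aggregate embedding. The per-slice embedding and the removal rule (distance from the centroid) are stand-ins for the trained and patented components.

```python
import numpy as np

def slice_audio(audio: np.ndarray, slice_len: int) -> list[np.ndarray]:
    """Divide the audio sample into consecutive, non-overlapping slices."""
    return [audio[i:i + slice_len]
            for i in range(0, len(audio) - slice_len + 1, slice_len)]

def embed_slice(segment: np.ndarray, dim: int = 64) -> np.ndarray:
    # Hypothetical candidate acoustic embedding stand-in.
    rng = np.random.default_rng(abs(hash(segment.tobytes())) % 2**32)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def aggregate_embedding(audio: np.ndarray, slice_len: int,
                        drop_fraction: float = 0.2) -> np.ndarray:
    """Return an aggregate acoustic embedding after removing outlier candidates."""
    candidates = np.stack([embed_slice(s) for s in slice_audio(audio, slice_len)])
    centroid = candidates.mean(axis=0)
    dist = np.linalg.norm(candidates - centroid, axis=1)
    n_keep = max(1, int(round(len(candidates) * (1.0 - drop_fraction))))
    keep = np.argsort(dist)[:n_keep]        # drop the farthest candidates
    return candidates[keep].mean(axis=0)

if __name__ == "__main__":
    audio = np.random.default_rng(0).standard_normal(16000 * 5)  # 5 s dummy audio
    print(aggregate_embedding(audio, slice_len=16000).shape)
```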