Hybrid multilingual text-dependent and text-independent speaker verification

    Publication Number: US11942094B2

    Publication Date: 2024-03-26

    Application Number: US17211791

    Filing Date: 2021-03-24

    Applicant: Google LLC

    CPC classification number: G10L17/02 G06F16/90332 G10L2015/088

    Abstract: A speaker verification method includes receiving audio data corresponding to an utterance, processing a first portion of the audio data that characterizes a predetermined hotword to generate a text-dependent evaluation vector, and generating one or more text-dependent confidence scores. When one of the text-dependent confidence scores satisfies a threshold, the method includes identifying a speaker of the utterance as the respective enrolled user associated with the text-dependent confidence score that satisfies the threshold and initiating performance of an action without performing further, text-independent speaker verification. When none of the text-dependent confidence scores satisfies the threshold, the method includes processing a second portion of the audio data that characterizes a query to generate a text-independent evaluation vector, generating one or more text-independent confidence scores, and determining whether the speaker of the utterance is any of the enrolled users.
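
    The abstract describes a two-stage cascade: a cheap text-dependent check on the hotword audio, with a fallback to text-independent verification on the query audio only when the first stage is inconclusive. The following Python sketch shows that control flow; the embedding models are passed in as callables, and the cosine-similarity scoring and the 0.8/0.7 thresholds are illustrative assumptions, not values from the patent.

        import numpy as np

        def cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

        def verify_speaker(hotword_audio, query_audio, enrolled, td_model, ti_model,
                           td_threshold=0.8, ti_threshold=0.7):
            # Stage 1: text-dependent evaluation vector from the hotword portion.
            td_vec = td_model(hotword_audio)
            td_scores = {user: cosine(td_vec, refs["td"]) for user, refs in enrolled.items()}
            user, score = max(td_scores.items(), key=lambda kv: kv[1])
            if score >= td_threshold:
                return user  # confident match: skip text-independent verification
            # Stage 2: text-independent evaluation vector from the query portion.
            ti_vec = ti_model(query_audio)
            ti_scores = {user: cosine(ti_vec, refs["ti"]) for user, refs in enrolled.items()}
            user, score = max(ti_scores.items(), key=lambda kv: kv[1])
            return user if score >= ti_threshold else None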

    Optimizing Personal VAD for On-Device Speech Recognition

    Publication Number: US20230298591A1

    Publication Date: 2023-09-21

    Application Number: US18123060

    Filing Date: 2023-03-17

    Applicant: Google LLC

    CPC classification number: G10L17/06 G10L17/22

    Abstract: A computer-implemented method includes receiving a sequence of acoustic frames corresponding to an utterance and generating a reference speaker embedding for the utterance. The method also includes receiving a target speaker embedding for a target speaker and generating feature-wise linear modulation (FiLM) parameters including a scaling vector and a shifting vector based on the target speaker embedding. The method also includes generating an affine transformation output that scales and shifts the reference speaker embedding based on the FiLM parameters. The method also includes generating a classification output indicating whether the utterance was spoken by the target speaker based on the affine transformation output.
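
    Feature-wise linear modulation (FiLM) is the concrete mechanism here: two learned projections of the target speaker embedding produce a scaling vector (gamma) and a shifting vector (beta), which are applied feature-wise to the reference embedding before classification. A minimal NumPy sketch, with randomly initialized weights standing in for trained parameters:

        import numpy as np

        rng = np.random.default_rng(0)
        DIM = 256  # embedding dimensionality; illustrative

        # Learned projections (random stand-ins for trained weights).
        W_scale, b_scale = rng.normal(size=(DIM, DIM)) * 0.01, np.ones(DIM)
        W_shift, b_shift = rng.normal(size=(DIM, DIM)) * 0.01, np.zeros(DIM)
        w_cls, b_cls = rng.normal(size=DIM) * 0.01, 0.0

        def film_parameters(target_emb):
            # FiLM parameters conditioned on the target speaker embedding.
            gamma = W_scale @ target_emb + b_scale   # scaling vector
            beta = W_shift @ target_emb + b_shift    # shifting vector
            return gamma, beta

        def classify(reference_emb, target_emb):
            gamma, beta = film_parameters(target_emb)
            modulated = gamma * reference_emb + beta  # affine transformation output
            logit = w_cls @ modulated + b_cls
            return 1.0 / (1.0 + np.exp(-logit))       # P(utterance spoken by target)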

    VOICE SHORTCUT DETECTION WITH SPEAKER VERIFICATION

    Publication Number: US20230169984A1

    Publication Date: 2023-06-01

    Application Number: US18103324

    Filing Date: 2023-01-30

    Applicant: Google LLC

    CPC classification number: G10L17/24 G10L17/06 G10L21/028

    Abstract: Techniques disclosed herein are directed towards streaming keyphrase detection which can be customized to detect one or more particular keyphrases, without requiring retraining of any model(s) for those particular keyphrase(s). Many implementations include processing audio data using a speaker separation model to generate separated audio data which isolates an utterance spoken by a human speaker from one or more additional sounds not spoken by the human speaker, and processing the separated audio data using a text independent speaker identification model to determine whether a verified and/or registered user spoke a spoken utterance captured in the audio data. Various implementations include processing the audio data and/or the separated audio data using an automatic speech recognition model to generate a text representation of the utterance. Additionally or alternatively, the text representation of the utterance can be processed to determine whether at least a portion of the text representation of the utterance captures a particular keyphrase. When the system determines the registered and/or verified user spoke the utterance and the system determines the text representation of the utterance captures the particular keyphrase, the system can cause a computing device to perform one or more actions corresponding to the particular keyphrase.
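
    The components named in the abstract compose into a simple pipeline: separate, verify, transcribe, match. A sketch of that orchestration follows, assuming the three models are available as callables (their implementations are outside the abstract's scope), with cosine similarity and a 0.7 threshold as illustrative choices:

        import numpy as np

        def cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

        def handle_utterance(audio, separation_model, sid_model, asr_model,
                             enrolled_embedding, keyphrase_actions, sid_threshold=0.7):
            # 1. Isolate the human speaker's utterance from additional sounds.
            separated = separation_model(audio)
            # 2. Text-independent speaker ID: did a registered/verified user speak?
            if cosine(sid_model(separated), enrolled_embedding) < sid_threshold:
                return None
            # 3. Transcribe, then 4. check whether a configured keyphrase appears.
            text = asr_model(separated).lower()
            for phrase, action in keyphrase_actions.items():
                if phrase in text:
                    return action  # trigger the action mapped to this keyphrase
            return None

    Because keyphrase_actions is ordinary configuration rather than model weights, new keyphrases can be added without retraining any of the three models, which is the stated point of the design.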

    Speaker-Turn-Based Online Speaker Diarization with Constrained Spectral Clustering

    Publication Number: US20230089308A1

    Publication Date: 2023-03-23

    Application Number: US17644261

    Filing Date: 2021-12-14

    Applicant: Google LLC

    Abstract: A method includes receiving an input audio signal that corresponds to utterances spoken by multiple speakers. The method also includes processing the input audio signal to generate a transcription of the utterances and a sequence of speaker turn tokens, each indicating a location of a respective speaker turn. The method also includes segmenting the input audio signal into a plurality of speaker segments based on the sequence of speaker turn tokens. The method also includes extracting a speaker-discriminative embedding from each speaker segment and performing spectral clustering on the speaker-discriminative embeddings to cluster the plurality of speaker segments into k classes. The method also includes assigning, to the speaker segments clustered into each respective class, a speaker label that is different from the speaker labels assigned to the segments clustered into every other class of the k classes.
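
    Once the recognizer marks speaker turns, diarization reduces to clustering one embedding per turn segment. A sketch using scikit-learn's SpectralClustering on a cosine affinity matrix; the patent's constrained variant of spectral clustering is not reproduced here, and the turn-token format (a "<turn>" marker in the transcript) is an assumption for illustration:

        import numpy as np
        from sklearn.cluster import SpectralClustering

        def segment_by_turns(words):
            # Split a transcript into per-speaker segments at "<turn>" tokens.
            segments, current = [], []
            for w in words:
                if w == "<turn>":
                    if current:
                        segments.append(current)
                    current = []
                else:
                    current.append(w)
            if current:
                segments.append(current)
            return segments

        def diarize(segment_embeddings, k):
            # Cosine affinity between speaker-discriminative embeddings,
            # clipped to be nonnegative as spectral clustering expects.
            e = segment_embeddings / np.linalg.norm(segment_embeddings, axis=1, keepdims=True)
            affinity = np.clip(e @ e.T, 0.0, 1.0)
            # Cluster segments into k classes; each class index is a speaker label.
            return SpectralClustering(n_clusters=k, affinity="precomputed").fit_predict(affinity)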

    Attentive Scoring Function for Speaker Identification

    Publication Number: US20220366914A1

    Publication Date: 2022-11-17

    Application Number: US17302926

    Filing Date: 2021-05-16

    Applicant: Google LLC

    Abstract: A speaker verification method includes receiving audio data corresponding to an utterance and processing the audio data to generate an evaluation attentive d-vector (ad-vector) representing voice characteristics of the utterance, where the evaluation ad-vector includes nₑ style classes, each including a respective value vector concatenated with a corresponding routing vector. The method also includes generating, using a self-attention mechanism, at least one multi-condition attention score that indicates a likelihood that the evaluation ad-vector matches a respective reference ad-vector associated with a respective user. The method also includes identifying the speaker of the utterance as the respective user associated with the respective reference ad-vector based on the multi-condition attention score.
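
    One plausible reading of the attentive scoring: the routing vectors decide, via softmax attention, how much each (evaluation class, reference class) pair contributes, and the value vectors supply the similarities being weighted. The NumPy sketch below is an interpretation of the abstract, not the patent's exact formulation:

        import numpy as np

        def multi_condition_attention_score(eval_values, eval_routes, ref_values, ref_routes):
            """eval_values/eval_routes: (ne, d) arrays for the evaluation ad-vector's
            style classes; ref_values/ref_routes: (nr, d) arrays for a reference ad-vector."""
            # Softmax attention weights over all (evaluation, reference) class pairs.
            logits = eval_routes @ ref_routes.T                      # (ne, nr)
            weights = np.exp(logits - logits.max())
            weights /= weights.sum()
            # Cosine similarity between every pair of value vectors.
            en = eval_values / np.linalg.norm(eval_values, axis=1, keepdims=True)
            rn = ref_values / np.linalg.norm(ref_values, axis=1, keepdims=True)
            sims = en @ rn.T                                         # (ne, nr)
            # Score: attention-weighted aggregate similarity.
            return float((weights * sims).sum())

    Identification then reduces to computing this score against each enrolled user's reference ad-vector and taking the best-scoring user, subject to a decision threshold.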

    VOICE SHORTCUT DETECTION WITH SPEAKER VERIFICATION

    Publication Number: US20220335953A1

    Publication Date: 2022-10-20

    Application Number: US17233253

    Filing Date: 2021-04-16

    Applicant: Google LLC

    Abstract: Techniques disclosed herein are directed towards streaming keyphrase detection which can be customized to detect one or more particular keyphrases, without requiring retraining of any model(s) for those particular keyphrase(s). Many implementations include processing audio data using a speaker separation model to generate separated audio data which isolates an utterance spoken by a human speaker from one or more additional sounds not spoken by the human speaker, and processing the separated audio data using a text independent speaker identification model to determine whether a verified and/or registered user spoke a spoken utterance captured in the audio data. Various implementations include processing the audio data and/or the separated audio data using an automatic speech recognition model to generate a text representation of the utterance. Additionally or alternatively, the text representation of the utterance can be processed to determine whether at least a portion of the text representation of the utterance captures a particular keyphrase. When the system determines the registered and/or verified user spoke the utterance and the system determines the text representation of the utterance captures the particular keyphrase, the system can cause a computing device to perform one or more actions corresponding to the particular keyphrase.
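
    This earlier publication shares its disclosure with US20230169984A1 above, so the pipeline sketch there applies here as well. As a complement, here is a sketch of just the final step, matching "at least a portion of the text representation" against the configured keyphrases; the token-level contiguous matching and the normalization are illustrative choices:

        import re

        def _tokens(text):
            return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()

        def match_keyphrase(transcript, keyphrases):
            # Return the first keyphrase whose tokens appear contiguously
            # in the transcript, else None.
            words = _tokens(transcript)
            for phrase in keyphrases:
                p = _tokens(phrase)
                if any(words[i:i + len(p)] == p for i in range(len(words) - len(p) + 1)):
                    return phrase
            return None

        # e.g. match_keyphrase("please turn on the lights", ["turn on the lights"])
        #      -> "turn on the lights"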

    Speaker identification accuracy
    Invention Grant

    Publication Number: US11468900B2

    Publication Date: 2022-10-11

    Application Number: US17071223

    Filing Date: 2020-10-15

    Applicant: Google LLC

    Abstract: A method of generating an accurate speaker representation for an audio sample includes receiving a first audio sample from a first speaker and a second audio sample from a second speaker. The method includes dividing each respective audio sample into a plurality of audio slices. The method also includes, based on the plurality of audio slices, generating a set of candidate acoustic embeddings, where each candidate acoustic embedding includes a vector representation of acoustic features. The method further includes removing a subset of the candidate acoustic embeddings from the set of candidate acoustic embeddings. The method additionally includes generating an aggregate acoustic embedding from the candidate acoustic embeddings remaining in the set after the subset has been removed.
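
    A sketch of the aggregation step: embed each slice, discard the candidates least consistent with the rest, and average what remains. The abstract does not specify the removal criterion, so distance from the centroid is an assumed stand-in, and embed_fn is a placeholder for whatever acoustic embedding model is used:

        import numpy as np

        def aggregate_embedding(audio_sample, embed_fn, num_slices=8, drop_fraction=0.25):
            # Divide the audio sample into slices and embed each one.
            slices = np.array_split(audio_sample, num_slices)
            candidates = np.stack([embed_fn(s) for s in slices])
            # Remove the subset of candidates farthest from the centroid
            # (an illustrative outlier criterion, not the patent's).
            centroid = candidates.mean(axis=0)
            dists = np.linalg.norm(candidates - centroid, axis=1)
            n_keep = max(1, int(round(num_slices * (1.0 - drop_fraction))))
            keep = np.argsort(dists)[:n_keep]
            # Aggregate the remaining candidates into one speaker representation.
            return candidates[keep].mean(axis=0)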
