-
Publication No.: WO2021222678A1
Publication Date: 2021-11-04
Application No.: PCT/US2021/030049
Filing Date: 2021-04-30
Applicant: GOOGLE LLC
Inventor: TRIPATHI, Anshuman , LU, Han , SAK, Hasim
Abstract: A method (400) for training a speech recognition model (200) with a loss function (310) includes receiving an audio signal (202) including a first segment (304) corresponding to audio spoken by a first speaker (10), a second segment corresponding to audio spoken by a second speaker, and an overlapping region (306) where the first segment overlaps the second segment. The overlapping region includes a known start time and a known end time. The method also includes generating a respective masked audio embedding (254) for each of the first and second speakers. The method also includes applying a masking loss (312) after the known end time to the respective masked audio embedding for the first speaker when the first speaker was speaking prior to the known start time, or applying the masking loss prior to the known start time when the first speaker was speaking after the known end time.
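For illustration only, here is a minimal sketch of the masking-loss idea, assuming frame-indexed per-speaker masked embeddings and a simple L2 penalty toward zero; the function name, tensor shapes, and penalty form are assumptions, not the patent's implementation.

```python
# Minimal sketch of the masking loss described above; names, shapes,
# and the L2 form of the penalty are illustrative assumptions.
import torch

def masking_loss(masked_emb: torch.Tensor,
                 overlap_start: int,
                 overlap_end: int,
                 spoke_before_overlap: bool) -> torch.Tensor:
    """Penalize one speaker's masked embedding (frames x dims) outside
    the region where that speaker is known to be speaking."""
    if spoke_before_overlap:
        # Speaker was active before the known start time: suppress
        # the embedding after the known end time.
        region = masked_emb[overlap_end:]
    else:
        # Speaker was active after the known end time: suppress the
        # embedding prior to the known start time.
        region = masked_emb[:overlap_start]
    if region.numel() == 0:
        return masked_emb.new_zeros(())
    # Drive the masked embedding toward zero in the suppressed region.
    return region.pow(2).mean()
```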
-
Publication No.: WO2023048746A1
Publication Date: 2023-03-30
Application No.: PCT/US2021/063343
Filing Date: 2021-12-14
Applicant: GOOGLE LLC
Inventor: WANG, Quan , LU, Han , CLARK, Evan , MORENO, Ignacio Lopez , SAK, Hasim , XU, Wei , JOGLEKAR, Taral , TRIPATHI, Anshuman
IPC: G10L21/0272 , G10L15/16 , G10L25/30
Abstract: A method (400) includes receiving an input audio signal (122) that corresponds to utterances (120) spoken by multiple speakers (10). The method also includes processing the input audio signal to generate a transcription (120) of the utterances and a sequence of speaker turn tokens (224), each indicating a location of a respective speaker turn. The method also includes segmenting the input audio signal into a plurality of speaker segments (225) based on the sequence of speaker turn tokens. The method also includes extracting a speaker-discriminative embedding (240) from each speaker segment and performing spectral clustering on the speaker-discriminative embeddings to cluster the plurality of speaker segments into k classes (262). The method also includes assigning, to each speaker segment clustered into a respective class, a respective speaker label (250) that is different from the respective speaker labels assigned to speaker segments clustered into each other class of the k classes.
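For illustration, a minimal sketch of this turn-token diarization pipeline, assuming the speaker turn tokens have already been mapped to frame indices, an external speaker-embedding function, and scikit-learn's SpectralClustering as a stand-in for the clustering step; all names are hypothetical.

```python
# Minimal sketch of turn-token diarization; the segmentation format,
# embedding function, and clustering backend are assumptions.
import numpy as np
from sklearn.cluster import SpectralClustering

def diarize(frames: np.ndarray, turn_frames: list, k: int, embed_fn):
    """Split the signal at speaker-turn locations, embed each speaker
    segment, and spectrally cluster the segments into k classes."""
    bounds = [0] + list(turn_frames) + [len(frames)]
    segments = [frames[s:e] for s, e in zip(bounds, bounds[1:])]
    # One speaker-discriminative embedding per speaker segment.
    embeddings = np.stack([embed_fn(seg) for seg in segments])
    # Cluster embeddings into k classes; each class gets its own label.
    labels = SpectralClustering(n_clusters=k).fit_predict(embeddings)
    return list(zip(segments, labels))
```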
-
Publication No.: WO2019027531A1
Publication Date: 2019-02-07
Application No.: PCT/US2018/032681
Filing Date: 2018-05-15
Applicant: GOOGLE LLC
Inventor: SAK, Hasim , MORENO, Ignacio Lopez , PAPIR, Alan Sean , WAN, Li , WANG, Quan
Abstract: Systems, methods, devices, and other techniques for training and using a speaker verification neural network. A computing device receives data that characterizes a first utterance and provides that data to a speaker verification neural network. The computing device then obtains, from the speaker verification neural network, a speaker representation that indicates speaking characteristics of a speaker of the first utterance. The computing device determines whether the first utterance is classified as an utterance of a registered user of the computing device and, in response to determining that it is, performs an action for the registered user of the computing device.
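For illustration, a minimal sketch of the verification decision, assuming cosine scoring between the network's speaker representation and an enrolled representation, with a fixed acceptance threshold; the scoring rule and threshold value are assumptions, not the patent's exact method.

```python
# Minimal sketch of the verification decision; the cosine score and
# the threshold value are illustrative assumptions.
import numpy as np

def is_registered_user(utterance: np.ndarray,
                       enrolled_rep: np.ndarray,
                       speaker_net,
                       threshold: float = 0.7) -> bool:
    """Compare the speaker representation for a new utterance against
    the registered user's enrolled representation."""
    rep = speaker_net(utterance)  # speaker representation from the net
    score = float(np.dot(rep, enrolled_rep) /
                  (np.linalg.norm(rep) * np.linalg.norm(enrolled_rep)))
    # Accepting classifies the utterance as the registered user's,
    # after which the device may perform an action for that user.
    return score >= threshold
```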
-
Publication No.: WO2022203735A1
Publication Date: 2022-09-29
Application No.: PCT/US2021/063465
Filing Date: 2021-12-15
Applicant: GOOGLE LLC
Inventor: KIM, Jaeyoung , LU, Han , TRIPATHI, Anshuman , ZHANG, Qian , SAK, Hasim
Abstract: A streaming speech recognition model (200) includes an audio encoder (210) configured to receive a sequence of acoustic frames (110) and generate a higher order feature representation (202) for a corresponding acoustic frame in the sequence of acoustic frames. The streaming speech recognition model also includes a label encoder (220) configured to receive a sequence of non-blank symbols output (242) by a final softmax layer (240) and generate a dense representation (222). The streaming speech recognition model also includes a joint network (230) configured to receive the higher order feature representation generated by the audio encoder and the dense representation generated by the label encoder and generate a probability distribution (232) over possible speech recognition hypotheses. Here, the streaming speech recognition model is trained using self-alignment to reduce prediction delay by encouraging an alignment path that is one frame to the left of a reference forced-alignment frame.
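For illustration, a minimal PyTorch sketch of the three-component structure (audio encoder, label encoder, joint network); the module choices, dimensions, and additive join are assumptions, and the self-alignment training objective itself is not shown.

```python
# Minimal sketch of the transducer structure; module types and sizes
# are assumptions, and self-alignment training is not shown.
import torch
import torch.nn as nn

class StreamingTransducer(nn.Module):
    def __init__(self, feat_dim=80, enc_dim=512, vocab_size=1024):
        super().__init__()
        self.audio_encoder = nn.LSTM(feat_dim, enc_dim, batch_first=True)
        self.label_encoder = nn.Embedding(vocab_size, enc_dim)
        self.joint = nn.Linear(enc_dim, vocab_size)

    def forward(self, frames, prev_labels):
        # Higher order feature representation per acoustic frame.
        enc_out, _ = self.audio_encoder(frames)   # [B, T, D]
        # Dense representation of the non-blank label history.
        dense = self.label_encoder(prev_labels)   # [B, U, D]
        # Combine every (frame, label) pair and score the vocabulary,
        # yielding a distribution over speech recognition hypotheses.
        joint_in = torch.tanh(enc_out.unsqueeze(2) + dense.unsqueeze(1))
        return self.joint(joint_in).log_softmax(dim=-1)  # [B, T, U, V]
```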
-
Publication No.: WO2022076029A1
Publication Date: 2022-04-14
Application No.: PCT/US2021/023052
Filing Date: 2021-03-19
Applicant: GOOGLE LLC
Inventor: TRIPATHI, Anshuman , SAK, Hasim , LU, Han , ZHANG, Qian , KIM, Jaeyoung
Abstract: A transformer-transducer model (200) includes an audio encoder (300), a label encoder (220), and a joint network (230). The audio encoder receives a sequence of acoustic frames (110), and generates, at each of a plurality of time steps, a higher order feature representation for each acoustic frame. The label encoder receives a sequence of non-blank symbols output by a softmax layer (240), and generates, at each of the plurality of time steps, a dense representation. The joint network receives the higher order feature representation and the dense representation at each of the plurality of time steps, and generates a probability distribution over possible speech recognition hypotheses. The audio encoder of the model further includes a neural network having an initial stack (310) of transformer layers (400) trained with zero look ahead audio context, and a final stack (320) of transformer layers (400) trained with a variable look ahead audio context.
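For illustration, a minimal sketch of the two-stack look-ahead scheme, assuming self-attention masks that bound each frame's future (right) context; the mask construction and the look-ahead sizes are assumptions, not the patent's exact formulation.

```python
# Minimal sketch of attention masks for the two transformer stacks;
# the mask form and the look-ahead sizes are illustrative assumptions.
import torch

def lookahead_mask(num_frames: int, look_ahead: int) -> torch.Tensor:
    """Boolean mask where frame t may attend to frames s with
    s <= t + look_ahead; look_ahead=0 means zero future context."""
    idx = torch.arange(num_frames)
    return idx.unsqueeze(0) <= idx.unsqueeze(1) + look_ahead

initial_stack_mask = lookahead_mask(100, look_ahead=0)   # zero look ahead
final_stack_mask = lookahead_mask(100, look_ahead=30)    # variable look ahead
```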
-
Publication No.: WO2018118442A1
Publication Date: 2018-06-28
Application No.: PCT/US2017/065023
Filing Date: 2017-12-07
Applicant: GOOGLE LLC
Inventor: SOLTAU, Hagen , SAK, Hasim , LIAO, Hank
CPC classification number: G10L15/16 , G06N3/0445 , G06N3/084 , G10L15/02 , G10L15/063 , G10L15/14 , G10L15/22 , G10L21/10
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media for large vocabulary continuous speech recognition. One method includes receiving audio data representing an utterance of a speaker. Acoustic features of the audio data are provided to a recurrent neural network trained using connectionist temporal classification to estimate likelihoods of occurrence of whole words based on acoustic feature input. Output of the recurrent neural network generated in response to the acoustic features is received. The output indicates a likelihood of occurrence for each of multiple different words in a vocabulary. A transcription for the utterance is generated based on the output of the recurrent neural network. The transcription is provided as output of the automated speech recognition system. The described methods, systems, and apparatus may provide end-to-end speech recognition with neural networks.
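For illustration, a minimal sketch of greedy decoding over whole-word CTC outputs; the vocabulary layout, blank index, and greedy collapse rule are standard CTC conventions assumed here, not the patent's exact procedure.

```python
# Minimal sketch of greedy whole-word CTC decoding; the blank index
# and vocabulary layout are illustrative assumptions.
import numpy as np

def greedy_ctc_words(word_logits: np.ndarray, vocab: list, blank: int = 0):
    """Collapse repeated frame-level word predictions and drop blanks
    to form the transcription, per the standard CTC decoding rule."""
    best = word_logits.argmax(axis=-1)  # best word id per frame, [T]
    words, prev = [], blank
    for w in best:
        if w != blank and w != prev:    # collapse repeats, skip blanks
            words.append(vocab[w])
        prev = w
    return " ".join(words)
```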
-