-
Publication No.: WO2021222678A1
Publication Date: 2021-11-04
Application No.: PCT/US2021/030049
Filing Date: 2021-04-30
Applicant: GOOGLE LLC
Inventor: TRIPATHI, Anshuman , LU, Han , SAK, Hasim
Abstract: A method (400) for training a speech recognition model (200) with a loss function (310) includes receiving an audio signal (202) including a first segment (304) corresponding to audio spoken by a first speaker (10), a second segment corresponding to audio spoken by a second speaker, and an overlapping region (306) where the first segment overlaps the second segment. The overlapping region includes a known start time and a known end time. The method also includes generating a respective masked audio embedding (254) for each of the first and second speakers. The method also includes applying a masking loss (312) after the known end time to the respective masked audio embedding for the first speaker when the first speaker was speaking prior to the known start time, or applying the masking loss prior to the known start time when the first speaker was speaking after the known end time.
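For illustration only, here is a minimal sketch of the masking-loss idea, assuming frame-indexed per-speaker masked embeddings and a simple L2 penalty toward zero; the function name, tensor shapes, and penalty form are assumptions, not the patent's implementation.

```python
# Minimal sketch of the masking loss described above; names, shapes,
# and the L2 form of the penalty are illustrative assumptions.
import torch

def masking_loss(masked_emb: torch.Tensor,
                 overlap_start: int,
                 overlap_end: int,
                 spoke_before_overlap: bool) -> torch.Tensor:
    """Penalize one speaker's masked embedding (frames x dims) outside
    the region where that speaker is known to be speaking."""
    if spoke_before_overlap:
        # Speaker was active before the known start time: suppress
        # the embedding after the known end time.
        region = masked_emb[overlap_end:]
    else:
        # Speaker was active after the known end time: suppress the
        # embedding prior to the known start time.
        region = masked_emb[:overlap_start]
    if region.numel() == 0:
        return masked_emb.new_zeros(())
    # Drive the masked embedding toward zero in the suppressed region.
    return region.pow(2).mean()
```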
-
Publication No.: WO2023048746A1
Publication Date: 2023-03-30
Application No.: PCT/US2021/063343
Filing Date: 2021-12-14
Applicant: GOOGLE LLC
Inventor: WANG, Quan , LU, Han , CLARK, Evan , MORENO, Ignacio Lopez , SAK, Hasim , XU, Wei , JOGLEKAR, Taral , TRIPATHI, Anshuman
IPC: G10L21/0272 , G10L15/16 , G10L25/30
Abstract: A method (400) includes receiving an input audio signal (122) that corresponds to utterances (120) spoken by multiple speakers (10). The method also includes processing the input audio signal to generate a transcription (120) of the utterances and a sequence of speaker turn tokens (224), each indicating a location of a respective speaker turn. The method also includes segmenting the input audio signal into a plurality of speaker segments (225) based on the sequence of speaker turn tokens. The method also includes extracting a speaker-discriminative embedding (240) from each speaker segment and performing spectral clustering on the speaker-discriminative embeddings to cluster the plurality of speaker segments into k classes (262). The method also includes assigning, to each speaker segment clustered into a respective class, a respective speaker label (250) that is different from the respective speaker labels assigned to speaker segments clustered into each other class of the k classes.
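For illustration, a minimal sketch of this turn-token diarization pipeline, assuming the speaker turn tokens have already been mapped to frame indices, an external speaker-embedding function, and scikit-learn's SpectralClustering as a stand-in for the clustering step; all names are hypothetical.

```python
# Minimal sketch of turn-token diarization; the segmentation format,
# embedding function, and clustering backend are assumptions.
import numpy as np
from sklearn.cluster import SpectralClustering

def diarize(frames: np.ndarray, turn_frames: list, k: int, embed_fn):
    """Split the signal at speaker-turn locations, embed each speaker
    segment, and spectrally cluster the segments into k classes."""
    bounds = [0] + list(turn_frames) + [len(frames)]
    segments = [frames[s:e] for s, e in zip(bounds, bounds[1:])]
    # One speaker-discriminative embedding per speaker segment.
    embeddings = np.stack([embed_fn(seg) for seg in segments])
    # Cluster embeddings into k classes; each class gets its own label.
    labels = SpectralClustering(n_clusters=k).fit_predict(embeddings)
    return list(zip(segments, labels))
```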
-
Publication No.: WO2019027531A1
Publication Date: 2019-02-07
Application No.: PCT/US2018/032681
Filing Date: 2018-05-15
Applicant: GOOGLE LLC
Inventor: SAK, Hasim , MORENO, Ignacio Lopez , PAPIR, Alan Sean , WAN, Li , WANG, Quan
Abstract: Systems, methods, devices, and other techniques for training and using a speaker verification neural network. A computing device receives data that characterizes a first utterance and provides that data to a speaker verification neural network. The computing device then obtains, from the speaker verification neural network, a speaker representation that indicates speaking characteristics of a speaker of the first utterance. The computing device determines whether the first utterance is classified as an utterance of a registered user of the computing device and, in response to determining that it is, performs an action for the registered user of the computing device.
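For illustration, a minimal sketch of the verification decision, assuming cosine scoring between the network's speaker representation and an enrolled representation, with a fixed acceptance threshold; the scoring rule and threshold value are assumptions, not the patent's exact method.

```python
# Minimal sketch of the verification decision; the cosine score and
# the threshold value are illustrative assumptions.
import numpy as np

def is_registered_user(utterance: np.ndarray,
                       enrolled_rep: np.ndarray,
                       speaker_net,
                       threshold: float = 0.7) -> bool:
    """Compare the speaker representation for a new utterance against
    the registered user's enrolled representation."""
    rep = speaker_net(utterance)  # speaker representation from the net
    score = float(np.dot(rep, enrolled_rep) /
                  (np.linalg.norm(rep) * np.linalg.norm(enrolled_rep)))
    # Accepting classifies the utterance as the registered user's,
    # after which the device may perform an action for that user.
    return score >= threshold
```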
-
Publication No.: WO2022203735A1
Publication Date: 2022-09-29
Application No.: PCT/US2021/063465
Filing Date: 2021-12-15
Applicant: GOOGLE LLC
Inventor: KIM, Jaeyoung , LU, Han , TRIPATHI, Anshuman , ZHANG, Qian , SAK, Hasim
Abstract: A streaming speech recognition model (200) includes an audio encoder (210) configured to receive a sequence of acoustic frames (110) and generate a higher order feature representation (202) for a corresponding acoustic frame in the sequence of acoustic frames. The streaming speech recognition model also includes a label encoder (220) configured to receive a sequence of non-blank symbols output (242) by a final softmax layer (240) and generate a dense representation (222). The streaming speech recognition model also includes a joint network (230) configured to receive the higher order feature representation generated by the audio encoder and the dense representation generated by the label encoder and generate a probability distribution (232) over possible speech recognition hypotheses. Here, the streaming speech recognition model is trained using self-alignment to reduce prediction delay by encouraging an alignment path that is one frame to the left of a reference forced-alignment frame.
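For illustration, a minimal PyTorch sketch of the three-component structure (audio encoder, label encoder, joint network); the module choices, dimensions, and additive join are assumptions, and the self-alignment training objective itself is not shown.

```python
# Minimal sketch of the transducer structure; module types and sizes
# are assumptions, and self-alignment training is not shown.
import torch
import torch.nn as nn

class StreamingTransducer(nn.Module):
    def __init__(self, feat_dim=80, enc_dim=512, vocab_size=1024):
        super().__init__()
        self.audio_encoder = nn.LSTM(feat_dim, enc_dim, batch_first=True)
        self.label_encoder = nn.Embedding(vocab_size, enc_dim)
        self.joint = nn.Linear(enc_dim, vocab_size)

    def forward(self, frames, prev_labels):
        # Higher order feature representation per acoustic frame.
        enc_out, _ = self.audio_encoder(frames)   # [B, T, D]
        # Dense representation of the non-blank label history.
        dense = self.label_encoder(prev_labels)   # [B, U, D]
        # Combine every (frame, label) pair and score the vocabulary,
        # yielding a distribution over speech recognition hypotheses.
        joint_in = torch.tanh(enc_out.unsqueeze(2) + dense.unsqueeze(1))
        return self.joint(joint_in).log_softmax(dim=-1)  # [B, T, U, V]
```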
-
Publication No.: WO2022076029A1
Publication Date: 2022-04-14
Application No.: PCT/US2021/023052
Filing Date: 2021-03-19
Applicant: GOOGLE LLC
Inventor: TRIPATHI, Anshuman , SAK, Hasim , LU, Han , ZHANG, Qian , KIM, Jaeyoung
Abstract: A transformer-transducer model (200) includes an audio encoder (300), a label encoder (220), and a joint network (230). The audio encoder receives a sequence of acoustic frames (110), and generates, at each of a plurality of time steps, a higher order feature representation for each acoustic frame. The label encoder receives a sequence of non-blank symbols output by a softmax layer (240), and generates, at each of the plurality of time steps, a dense representation. The joint network receives the higher order feature representation and the dense representation at each of the plurality of time steps, and generates a probability distribution over possible speech recognition hypotheses. The audio encoder of the model further includes a neural network having an initial stack (310) of transformer layers (400) trained with zero look ahead audio context, and a final stack (320) of transformer layers (400) trained with a variable look ahead audio context.
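For illustration, a minimal sketch of the two-stack look-ahead scheme, assuming self-attention masks that bound each frame's future (right) context; the mask construction and the look-ahead sizes are assumptions, not the patent's exact formulation.

```python
# Minimal sketch of attention masks for the two transformer stacks;
# the mask form and the look-ahead sizes are illustrative assumptions.
import torch

def lookahead_mask(num_frames: int, look_ahead: int) -> torch.Tensor:
    """Boolean mask where frame t may attend to frames s with
    s <= t + look_ahead; look_ahead=0 means zero future context."""
    idx = torch.arange(num_frames)
    return idx.unsqueeze(0) <= idx.unsqueeze(1) + look_ahead

initial_stack_mask = lookahead_mask(100, look_ahead=0)   # zero look ahead
final_stack_mask = lookahead_mask(100, look_ahead=30)    # variable look ahead
```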
-
Publication No.: WO2018118442A1
Publication Date: 2018-06-28
Application No.: PCT/US2017/065023
Filing Date: 2017-12-07
Applicant: GOOGLE LLC
Inventor: SOLTAU, Hagen , SAK, Hasim , LIAO, Hank
CPC classification number: G10L15/16 , G06N3/0445 , G06N3/084 , G10L15/02 , G10L15/063 , G10L15/14 , G10L15/22 , G10L21/10
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media for large vocabulary continuous speech recognition. One method includes receiving audio data representing an utterance of a speaker. Acoustic features of the audio data are provided to a recurrent neural network trained using connectionist temporal classification to estimate likelihoods of occurrence of whole words based on acoustic feature input. Output of the recurrent neural network generated in response to the acoustic features is received. The output indicates a likelihood of occurrence for each of multiple different words in a vocabulary. A transcription for the utterance is generated based on the output of the recurrent neural network. The transcription is provided as output of the automated speech recognition system. The described methods, systems, and apparatus may provide end-to-end speech recognition with neural networks.
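For illustration, a minimal sketch of greedy decoding over whole-word CTC outputs; the vocabulary layout, blank index, and greedy collapse rule are standard CTC conventions assumed here, not the patent's exact procedure.

```python
# Minimal sketch of greedy whole-word CTC decoding; the blank index
# and vocabulary layout are illustrative assumptions.
import numpy as np

def greedy_ctc_words(word_logits: np.ndarray, vocab: list, blank: int = 0):
    """Collapse repeated frame-level word predictions and drop blanks
    to form the transcription, per the standard CTC decoding rule."""
    best = word_logits.argmax(axis=-1)  # best word id per frame, [T]
    words, prev = [], blank
    for w in best:
        if w != blank and w != prev:    # collapse repeats, skip blanks
            words.append(vocab[w])
        prev = w
    return " ".join(words)
```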
-