REDUCING STREAMING ASR MODEL DELAY WITH SELF ALIGNMENT

    Publication No.: WO2022203735A1

    Publication Date: 2022-09-29

    Application No.: PCT/US2021/063465

    Filing Date: 2021-12-15

    Applicant: GOOGLE LLC

    Abstract: A streaming speech recognition model (200) includes an audio encoder (210) configured to receive a sequence of acoustic frames (110) and generate a higher order feature representation (202) for a corresponding acoustic frame in the sequence of acoustic frames. The streaming speech recognition model also includes a label encoder (220) configured to receive a sequence of non-blank symbols (242) output by a final softmax layer (240) and generate a dense representation (222). The streaming speech recognition model also includes a joint network (230) configured to receive the higher order feature representation generated by the audio encoder and the dense representation generated by the label encoder and generate a probability distribution (232) over possible speech recognition hypotheses. Here, the streaming speech recognition model is trained using self-alignment to reduce prediction delay by encouraging an alignment path that is one frame to the left of a reference forced-alignment frame.
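
    Read as a transducer (RNN-T) layout, the three components fit together as in the minimal PyTorch sketch below; the module names, dimensions, and tanh combination are illustrative assumptions, not details from the patent. The self-alignment objective would then add a term raising the probability of the path one frame to the left of the forced-alignment path, nudging the model to emit labels earlier.

        import torch
        import torch.nn as nn

        class TransducerJoint(nn.Module):
            """Sketch of the joint network: combines the audio encoder's
            higher-order feature representation with the label encoder's
            dense representation (names and sizes are assumptions)."""

            def __init__(self, enc_dim=512, pred_dim=512, joint_dim=640, vocab_size=4096):
                super().__init__()
                self.audio_proj = nn.Linear(enc_dim, joint_dim)
                self.label_proj = nn.Linear(pred_dim, joint_dim)
                self.output = nn.Linear(joint_dim, vocab_size + 1)  # +1 for the blank symbol

            def forward(self, audio_feats, label_feats):
                # audio_feats: (B, T, enc_dim); label_feats: (B, U, pred_dim)
                combined = torch.tanh(self.audio_proj(audio_feats).unsqueeze(2)
                                      + self.label_proj(label_feats).unsqueeze(1))
                # Log-probability distribution at every (frame, label) pair.
                return self.output(combined).log_softmax(dim=-1)  # (B, T, U, V+1)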

    TRANSFORMER TRANSDUCER: ONE MODEL UNIFYING STREAMING AND NON-STREAMING SPEECH RECOGNITION

    Publication No.: WO2022076029A1

    Publication Date: 2022-04-14

    Application No.: PCT/US2021/023052

    Filing Date: 2021-03-19

    Applicant: GOOGLE LLC

    Abstract: A transformer-transducer model (200) includes an audio encoder (300), a label encoder (220), and a joint network (230). The audio encoder receives a sequence of acoustic frames (110) and generates, at each of a plurality of time steps, a higher order feature representation for each acoustic frame. The label encoder receives a sequence of non-blank symbols output by a softmax layer (240) and generates, at each of the plurality of time steps, a dense representation. The joint network receives the higher order feature representation and the dense representation at each of the plurality of time steps, and generates a probability distribution over possible speech recognition hypotheses. The audio encoder of the model further includes a neural network having an initial stack (310) of transformer layers (400) trained with zero look-ahead audio context, and a final stack (320) of transformer layers (400) trained with variable look-ahead audio context.
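
    The zero look-ahead versus variable look-ahead split can be pictured as two self-attention mask regimes, as in the hedged sketch below; the boolean mask convention (True = may attend) and the frame counts are assumptions:

        import torch

        def look_ahead_mask(num_frames: int, look_ahead: int) -> torch.Tensor:
            """Boolean (T, T) mask letting frame i attend to frame j whenever
            j <= i + look_ahead; look_ahead=0 is fully causal (streaming)."""
            idx = torch.arange(num_frames)
            return idx.unsqueeze(0) <= idx.unsqueeze(1) + look_ahead

        initial_stack_mask = look_ahead_mask(8, look_ahead=0)  # initial stack: no future context
        final_stack_mask = look_ahead_mask(8, look_ahead=2)    # final stack: e.g. 2 future frames

    Varying the final stack's look-ahead during training is what lets a single model serve both streaming (zero look-ahead) and non-streaming (long look-ahead) recognition at inference time.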

    SPEAKER-TURN-BASED ONLINE SPEAKER DIARIZATION WITH CONSTRAINED SPECTRAL CLUSTERING

    Publication No.: WO2023048746A1

    Publication Date: 2023-03-30

    Application No.: PCT/US2021/063343

    Filing Date: 2021-12-14

    Applicant: GOOGLE LLC

    Abstract: A method (400) includes receiving an input audio signal (122) that corresponds to utterances (120) spoken by multiple speakers (10). The method also includes processing the input audio signal to generate a transcription (120) of the utterances and a sequence of speaker turn tokens (224), each indicating the location of a respective speaker turn. The method also includes segmenting the input audio signal into a plurality of speaker segments (225) based on the sequence of speaker turn tokens. The method also includes extracting a speaker-discriminative embedding (240) from each speaker segment and performing spectral clustering on the speaker-discriminative embeddings to cluster the plurality of speaker segments into k classes (262). The method also includes assigning, to the speaker segments clustered into each respective class, a respective speaker label (250) that is different from the speaker labels assigned to the speaker segments clustered into every other class of the k classes.
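
    After the turn tokens have produced segments and one embedding per segment, the clustering step reduces to grouping those embeddings into k speakers. A minimal sketch using scikit-learn's spectral clustering over a cosine-similarity affinity; the affinity choice is an assumption, and the patent's constrained-clustering details are not reproduced:

        import numpy as np
        from sklearn.cluster import SpectralClustering

        def cluster_segments(embeddings: np.ndarray, k: int) -> np.ndarray:
            """embeddings: (num_segments, dim) speaker-discriminative vectors.
            Returns one of k speaker labels per segment."""
            # Cosine-similarity affinity between per-segment embeddings.
            normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
            affinity = np.clip(normed @ normed.T, 0.0, 1.0)
            return SpectralClustering(n_clusters=k,
                                      affinity="precomputed").fit_predict(affinity)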

    END-TO-END MULTI-TALKER OVERLAPPING SPEECH RECOGNITION

    Publication No.: WO2021222678A1

    Publication Date: 2021-11-04

    Application No.: PCT/US2021/030049

    Filing Date: 2021-04-30

    Applicant: GOOGLE LLC

    Abstract: A method (400) for training a speech recognition model (200) with a loss function (310) includes receiving an audio signal (202) including a first segment (304) corresponding to audio spoken by a first speaker (10), a second segment corresponding to audio spoken by a second speaker, and an overlapping region (306) where the first segment overlaps the second segment. The overlapping region includes a known start time and a known end time. The method also includes generating a respective masked audio embedding (254) for each of the first and second speakers. The method also includes applying a masking loss (312) after the known end time to the respective masked audio embedding for the first speaker when the first speaker was speaking prior to the known start time, or applying the masking loss prior to the known start time when the first speaker was speaking after the known end time.
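
    The masking loss can be read as an energy penalty on a speaker's masked audio embedding over the frames where that speaker is known to be silent. A minimal sketch under assumed shapes and an assumed L2 penalty, neither taken from the patent:

        import torch

        def masking_loss(masked_emb: torch.Tensor, start_frame: int, end_frame: int,
                         spoke_before_overlap: bool) -> torch.Tensor:
            """masked_emb: (T, D) masked audio embedding for one speaker.
            Penalizes non-zero embeddings outside the speaker's active region:
            after the known end time if the speaker spoke before the overlap,
            before the known start time otherwise."""
            silent = masked_emb[end_frame:] if spoke_before_overlap else masked_emb[:start_frame]
            return silent.pow(2).mean() if silent.numel() else masked_emb.new_zeros(())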
