REDUCING STREAMING ASR MODEL DELAY WITH SELF ALIGNMENT

    Publication No.: WO2022203735A1

    Publication Date: 2022-09-29

    Application No.: PCT/US2021/063465

    Filing Date: 2021-12-15

    Applicant: GOOGLE LLC

    Abstract: A streaming speech recognition model (200) includes an audio encoder (210) configured to receive a sequence of acoustic frames (110) and generate a higher order feature representation (202) for a corresponding acoustic frame in the sequence of acoustic frames. The streaming speech recognition model also includes a label encoder (220) configured to receive a sequence of non-blank symbols (242) output by a final softmax layer (240) and generate a dense representation (222). The streaming speech recognition model also includes a joint network (230) configured to receive the higher order feature representation generated by the audio encoder and the dense representation generated by the label encoder and generate a probability distribution (232) over possible speech recognition hypotheses. Here, the streaming speech recognition model is trained using self-alignment to reduce prediction delay by encouraging an alignment path that is one frame to the left of a reference forced-alignment frame.
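
    Read as a transducer (RNN-T) layout, the three components fit together as in the minimal PyTorch sketch below; the module names, dimensions, and tanh combination are illustrative assumptions, not details from the patent. The self-alignment objective would then add a term raising the probability of the path one frame to the left of the forced-alignment path, nudging the model to emit labels earlier.

        import torch
        import torch.nn as nn

        class TransducerJoint(nn.Module):
            """Sketch of the joint network: combines the audio encoder's
            higher-order feature representation with the label encoder's
            dense representation (names and sizes are assumptions)."""

            def __init__(self, enc_dim=512, pred_dim=512, joint_dim=640, vocab_size=4096):
                super().__init__()
                self.audio_proj = nn.Linear(enc_dim, joint_dim)
                self.label_proj = nn.Linear(pred_dim, joint_dim)
                self.output = nn.Linear(joint_dim, vocab_size + 1)  # +1 for the blank symbol

            def forward(self, audio_feats, label_feats):
                # audio_feats: (B, T, enc_dim); label_feats: (B, U, pred_dim)
                combined = torch.tanh(self.audio_proj(audio_feats).unsqueeze(2)
                                      + self.label_proj(label_feats).unsqueeze(1))
                # Log-probability distribution at every (frame, label) pair.
                return self.output(combined).log_softmax(dim=-1)  # (B, T, U, V+1)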

    TRANSFORMER TRANSDUCER: ONE MODEL UNIFYING STREAMING AND NON-STREAMING SPEECH RECOGNITION

    Publication No.: WO2022076029A1

    Publication Date: 2022-04-14

    Application No.: PCT/US2021/023052

    Filing Date: 2021-03-19

    Applicant: GOOGLE LLC

    Abstract: A transformer-transducer model (200) includes an audio encoder (300), a label encoder (220), and a joint network (230). The audio encoder receives a sequence of acoustic frames (110) and generates, at each of a plurality of time steps, a higher order feature representation for each acoustic frame. The label encoder receives a sequence of non-blank symbols output by a softmax layer (240) and generates, at each of the plurality of time steps, a dense representation. The joint network receives the higher order feature representation and the dense representation at each of the plurality of time steps, and generates a probability distribution over possible speech recognition hypotheses. The audio encoder of the model further includes a neural network having an initial stack (310) of transformer layers (400) trained with zero look-ahead audio context, and a final stack (320) of transformer layers (400) trained with variable look-ahead audio context.
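
    The zero look-ahead versus variable look-ahead split can be pictured as two self-attention mask regimes, as in the hedged sketch below; the boolean mask convention (True = may attend) and the frame counts are assumptions:

        import torch

        def look_ahead_mask(num_frames: int, look_ahead: int) -> torch.Tensor:
            """Boolean (T, T) mask letting frame i attend to frame j whenever
            j <= i + look_ahead; look_ahead=0 is fully causal (streaming)."""
            idx = torch.arange(num_frames)
            return idx.unsqueeze(0) <= idx.unsqueeze(1) + look_ahead

        initial_stack_mask = look_ahead_mask(8, look_ahead=0)  # initial stack: no future context
        final_stack_mask = look_ahead_mask(8, look_ahead=2)    # final stack: e.g. 2 future frames

    Varying the final stack's look-ahead during training is what lets a single model serve both streaming (zero look-ahead) and non-streaming (long look-ahead) recognition at inference time.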

    SPEAKER-TURN-BASED ONLINE SPEAKER DIARIZATION WITH CONSTRAINED SPECTRAL CLUSTERING

    Publication No.: WO2023048746A1

    Publication Date: 2023-03-30

    Application No.: PCT/US2021/063343

    Filing Date: 2021-12-14

    Applicant: GOOGLE LLC

    Abstract: A method (400) includes receiving an input audio signal (122) that corresponds to utterances (120) spoken by multiple speakers (10). The method also includes processing the input audio signal to generate a transcription (120) of the utterances and a sequence of speaker turn tokens (224), each indicating the location of a respective speaker turn. The method also includes segmenting the input audio signal into a plurality of speaker segments (225) based on the sequence of speaker turn tokens. The method also includes extracting a speaker-discriminative embedding (240) from each speaker segment and performing spectral clustering on the speaker-discriminative embeddings to cluster the plurality of speaker segments into k classes (262). The method also includes assigning, to the speaker segments clustered into each respective class, a respective speaker label (250) that is different from the speaker labels assigned to the speaker segments clustered into every other class of the k classes.
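
    After the turn tokens have produced segments and one embedding per segment, the clustering step reduces to grouping those embeddings into k speakers. A minimal sketch using scikit-learn's spectral clustering over a cosine-similarity affinity; the affinity choice is an assumption, and the patent's constrained-clustering details are not reproduced:

        import numpy as np
        from sklearn.cluster import SpectralClustering

        def cluster_segments(embeddings: np.ndarray, k: int) -> np.ndarray:
            """embeddings: (num_segments, dim) speaker-discriminative vectors.
            Returns one of k speaker labels per segment."""
            # Cosine-similarity affinity between per-segment embeddings.
            normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
            affinity = np.clip(normed @ normed.T, 0.0, 1.0)
            return SpectralClustering(n_clusters=k,
                                      affinity="precomputed").fit_predict(affinity)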

    END-TO-END MULTI-TALKER OVERLAPPING SPEECH RECOGNITION

    Publication No.: WO2021222678A1

    Publication Date: 2021-11-04

    Application No.: PCT/US2021/030049

    Filing Date: 2021-04-30

    Applicant: GOOGLE LLC

    Abstract: A method (400) for training a speech recognition model (200) with a loss function (310) includes receiving an audio signal (202) including a first segment (304) corresponding to audio spoken by a first speaker (10), a second segment corresponding to audio spoken by a second speaker, and an overlapping region (306) where the first segment overlaps the second segment. The overlapping region includes a known start time and a known end time. The method also includes generating a respective masked audio embedding (254) for each of the first and second speakers. The method also includes applying a masking loss (312) after the known end time to the respective masked audio embedding for the first speaker when the first speaker was speaking prior to the known start time, or applying the masking loss prior to the known start time when the first speaker was speaking after the known end time.
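
    The masking loss can be read as an energy penalty on a speaker's masked audio embedding over the frames where that speaker is known to be silent. A minimal sketch under assumed shapes and an assumed L2 penalty, neither taken from the patent:

        import torch

        def masking_loss(masked_emb: torch.Tensor, start_frame: int, end_frame: int,
                         spoke_before_overlap: bool) -> torch.Tensor:
            """masked_emb: (T, D) masked audio embedding for one speaker.
            Penalizes non-zero embeddings outside the speaker's active region:
            after the known end time if the speaker spoke before the overlap,
            before the known start time otherwise."""
            silent = masked_emb[end_frame:] if spoke_before_overlap else masked_emb[:start_frame]
            return silent.pow(2).mean() if silent.numel() else masked_emb.new_zeros(())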
