CASCADED ENCODERS FOR SIMPLIFIED STREAMING AND NON-STREAMING SPEECH RECOGNITION

    Publication No.: WO2022086589A1

    Publication Date: 2022-04-28

    Application No.: PCT/US2021/030364

    Application Date: 2021-05-01

    Applicant: GOOGLE LLC

    Abstract: An automated speech recognition (ASR) model (200) includes a first encoder (210), a second encoder (220), and a decoder (204). The first encoder receives, as input, a sequence of acoustic frames (110), and generates, at each of a plurality of output steps, a first higher order feature representation (203) for a corresponding acoustic frame. The second encoder receives, as input, the first higher order feature representation generated by the first encoder at each of the plurality of output steps, and generates, at each of the plurality of output steps, a second higher order feature representation (205) for a corresponding first higher order feature frame. The decoder receives, as input, the second higher order feature representation generated by the second encoder at each of the plurality of output steps, and generates, at each of the plurality of output steps, a first probability distribution over possible speech recognition hypotheses.
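
    A minimal sketch of the cascaded-encoder layout described in the abstract, assuming a PyTorch-style implementation. The module choices (LSTM layers, a linear projection standing in for the decoder) and all dimensions are illustrative assumptions, not the architecture claimed in the patent.

        import torch
        import torch.nn as nn

        class CascadedEncoderASR(nn.Module):
            """First encoder -> second encoder -> decoder, as in the abstract."""
            def __init__(self, feat_dim=80, hidden_dim=512, vocab_size=4096):
                super().__init__()
                # First encoder consumes the raw sequence of acoustic frames.
                self.first_encoder = nn.LSTM(feat_dim, hidden_dim,
                                             num_layers=2, batch_first=True)
                # Second encoder consumes the first encoder's outputs.
                self.second_encoder = nn.LSTM(hidden_dim, hidden_dim,
                                              num_layers=2, batch_first=True)
                # Decoder reduced here to a projection onto vocabulary tokens.
                self.decoder = nn.Linear(hidden_dim, vocab_size)

            def forward(self, acoustic_frames):
                # acoustic_frames: (batch, time, feat_dim)
                first_repr, _ = self.first_encoder(acoustic_frames)
                second_repr, _ = self.second_encoder(first_repr)
                logits = self.decoder(second_repr)
                # Per-output-step distribution over speech recognition hypotheses.
                return torch.log_softmax(logits, dim=-1)

    In a streaming configuration the first encoder alone could feed the decoder, while the cascade through the second encoder could serve the non-streaming path; that single-model split is the simplification the title refers to.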

    FAST EMIT LOW-LATENCY STREAMING ASR WITH SEQUENCE-LEVEL EMISSION REGULARIZATION

    Publication No.: WO2022086640A1

    Publication Date: 2022-04-28

    Application No.: PCT/US2021/049738

    Application Date: 2021-09-09

    Applicant: GOOGLE LLC

    Abstract: A computer-implemented method (400) of training a streaming speech recognition model (200) that includes receiving, as input to the streaming speech recognition model, a sequence of acoustic frames (122). The streaming speech recognition model is configured to learn an alignment probability (206) between the sequence of acoustic frames and an output sequence of vocabulary tokens (204). The vocabulary tokens include a plurality of label tokens and a blank token. At each output step, the method includes determining a first probability (264) of emitting one of the label tokens and determining a second probability (266) of emitting the blank token. The method also includes generating the alignment probability at a sequence level based on the first probability and the second probability. The method also includes applying a tuning parameter (282) to the alignment probability at the sequence level to maximize the first probability of emitting one of the label tokens.
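
    The sketch below illustrates, under simplifying assumptions, how a tuning parameter could up-weight the label-emission term in a sequence-level alignment probability, in the spirit of the method above. The single-alignment factorization, the function name, and the default parameter value are assumptions made for exposition; the actual training objective operates over the full alignment lattice of a streaming transducer model.

        import torch

        def regularized_alignment_log_prob(label_log_probs, blank_log_probs, lam=0.01):
            # label_log_probs / blank_log_probs: (steps,) log-probabilities of
            # emitting a label token vs. the blank token at each output step
            # along one alignment path.
            label_term = label_log_probs.sum()
            blank_term = blank_log_probs.sum()
            # Sequence-level alignment log-probability from the per-step terms.
            alignment_log_prob = label_term + blank_term
            # The tuning parameter up-weights label emission, pushing the model
            # to emit label tokens earlier instead of deferring behind blanks.
            return alignment_log_prob + lam * label_term

        # Training would minimize the negative of this quantity.
        steps = 6
        loss = -regularized_alignment_log_prob(torch.rand(steps).log(),
                                               torch.rand(steps).log())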

    SYSTEMS AND METHODS FOR TRAINING DUAL-MODE MACHINE-LEARNED SPEECH RECOGNITION MODELS

    Publication No.: WO2022072801A2

    Publication Date: 2022-04-07

    Application No.: PCT/US2021/053128

    Application Date: 2021-10-01

    Applicant: GOOGLE LLC

    Abstract: Systems and methods of the present disclosure are directed to a computing system, including one or more processors and a machine-learned multi-mode speech recognition model configured to operate in a streaming recognition mode or a contextual recognition mode. The computing system can perform operations including obtaining speech data and a ground truth label and processing the speech data using the contextual recognition mode to obtain contextual prediction data. The operations can include evaluating a difference between the contextual prediction data and the ground truth label and processing the speech data using the streaming recognition mode to obtain streaming prediction data. The operations can include evaluating a difference between the streaming prediction data and the ground truth label, and a difference between the contextual and streaming prediction data. The operations can include adjusting parameters of the speech recognition model.
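
    A sketch of one training step consistent with the operations listed above, assuming a dual-mode model exposed as model(speech, streaming=...) that returns per-frame logits. That interface, the cross-entropy losses, and the KL-divergence term used for the streaming-versus-contextual difference are assumptions chosen to illustrate the flow, not the patented training procedure.

        import torch
        import torch.nn.functional as F

        def dual_mode_training_step(model, optimizer, speech, ground_truth):
            # Contextual (full-context) recognition pass over the speech data.
            contextual_logits = model(speech, streaming=False)
            contextual_loss = F.cross_entropy(
                contextual_logits.transpose(1, 2), ground_truth)

            # Streaming recognition pass over the same speech data.
            streaming_logits = model(speech, streaming=True)
            streaming_loss = F.cross_entropy(
                streaming_logits.transpose(1, 2), ground_truth)

            # Difference between streaming and contextual predictions, used as an
            # in-place distillation signal from the stronger contextual mode.
            distill_loss = F.kl_div(
                F.log_softmax(streaming_logits, dim=-1),
                F.softmax(contextual_logits.detach(), dim=-1),
                reduction="batchmean")

            # Adjust the shared parameters of the speech recognition model.
            loss = contextual_loss + streaming_loss + distill_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            return loss.item()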
