Systems and Methods for Training Dual-Mode Machine-Learned Speech Recognition Models

    Publication number: US20230237993A1

    Publication date: 2023-07-27

    Application number: US18011571

    Application date: 2021-10-01

    Applicant: Google LLC

    CPC classification number: G10L15/16 G10L15/32 G10L15/22

    Abstract: Systems and methods of the present disclosure are directed to a computing system that includes one or more processors and a machine-learned multi-mode speech recognition model configured to operate in a streaming recognition mode or a contextual recognition mode. The computing system can perform operations including obtaining speech data and a ground truth label and processing the speech data using the contextual recognition mode to obtain contextual prediction data. The operations can include evaluating a difference between the contextual prediction data and the ground truth label and processing the speech data using the streaming recognition mode to obtain streaming prediction data. The operations can include evaluating a difference between the streaming prediction data and the ground truth label, and a difference between the contextual prediction data and the streaming prediction data. The operations can include adjusting parameters of the speech recognition model.
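The training objective described in this abstract can be sketched as a sum of three terms. This is a minimal illustration, not the patented method: the abstract only says that differences are "evaluated", so the use of cross-entropy for the label terms, KL divergence for the contextual-versus-streaming term, and the names `dual_mode_loss` and `distill_weight` are all assumptions made for this example.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability the model assigns to the ground truth label."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dual_mode_loss(contextual_probs, streaming_probs, label, distill_weight=1.0):
    """Combine the three differences named in the abstract (assumed loss forms).

    contextual_probs: prediction from the contextual recognition mode.
    streaming_probs:  prediction from the streaming recognition mode.
    label:            index of the ground truth class.
    distill_weight:   hypothetical weight on the contextual-vs-streaming term.
    """
    # Contextual prediction vs. ground truth label.
    contextual_loss = cross_entropy(contextual_probs, label)
    # Streaming prediction vs. ground truth label.
    streaming_loss = cross_entropy(streaming_probs, label)
    # Contextual prediction vs. streaming prediction (assumed: KL divergence
    # with the contextual output treated as the reference distribution).
    agreement_loss = kl_divergence(contextual_probs, streaming_probs)
    return contextual_loss + streaming_loss + distill_weight * agreement_loss


contextual = [0.5, 0.5]
streaming = [0.9, 0.1]
loss = dual_mode_loss(contextual, streaming, label=0)
```

In a real training loop the combined value would be backpropagated to adjust the shared model parameters; that step depends on the framework and is left out here.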

    Speech recognition with sequence-to-sequence models

    Publication number: US11335333B2

    Publication date: 2022-05-17

    Application number: US16717746

    Application date: 2019-12-17

    Applicant: Google LLC

    Abstract: A method includes obtaining audio data for a long-form utterance and segmenting the audio data for the long-form utterance into a plurality of overlapping segments. The method also includes, for each overlapping segment of the plurality of overlapping segments: providing features indicative of acoustic characteristics of the long-form utterance represented by the corresponding overlapping segment as input to an encoder neural network; processing an output of the encoder neural network using an attender neural network to generate a context vector; and generating word elements using the context vector and a decoder neural network. The method also includes generating a transcription for the long-form utterance by merging the word elements from the plurality of overlapping segments and providing the transcription as an output of the automated speech recognition system.
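The segmentation step in this abstract can be illustrated with a short sketch that computes overlapping frame ranges for a long utterance. The abstract does not give segment lengths, overlap sizes, or the merging rule, so the function name `overlapping_segments`, its parameters, and the fixed-stride policy are assumptions; merging the decoded word elements is not shown.

```python
def overlapping_segments(num_frames, segment_len, overlap):
    """Return (start, end) frame ranges that cover num_frames with overlap.

    Consecutive segments share `overlap` frames. The final segment is
    truncated at the end of the utterance. All parameters are hypothetical;
    the source does not specify how segment boundaries are chosen.
    """
    if segment_len <= 0 or not 0 <= overlap < segment_len:
        raise ValueError("need segment_len > 0 and 0 <= overlap < segment_len")
    if num_frames <= 0:
        return []

    step = segment_len - overlap  # how far each new segment advances
    segments = []
    start = 0
    while True:
        end = min(start + segment_len, num_frames)
        segments.append((start, end))
        if end >= num_frames:
            break
        start += step
    return segments


segments = overlapping_segments(num_frames=10, segment_len=4, overlap=2)
```

Each range would then be passed through the encoder, attender, and decoder independently, with the per-segment outputs merged into one transcription.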

    Enhanced attention mechanisms
    Granted invention patent

    Publication number: US11210475B2

    Publication date: 2021-12-28

    Application number: US16518518

    Application date: 2019-07-22

    Applicant: Google LLC

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for enhanced attention mechanisms. In some implementations, data indicating an input sequence is received. The data is processed using an encoder neural network to generate a sequence of encodings. A series of attention outputs is determined using one or more attender modules. Determining each attention output can include (i) selecting an encoding from the sequence of encodings and (ii) determining attention over a proper subset of the sequence of encodings, where the proper subset of encodings is determined based on a position of the selected encoding in the sequence of encodings. The selections of encodings are also monotonic through the sequence of encodings. An output sequence is generated by processing the attention outputs using a decoder neural network. An output is provided that indicates a language sequence determined from the output sequence.
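The attention scheme in this abstract, where each attention output is computed over a proper subset of encodings chosen from the position of a monotonically advancing selected encoding, can be sketched as follows. This is an illustrative assumption rather than the patented design: the window is taken to be the `window` encodings ending at the selected position, the scores are supplied by the caller, and the names `window_attention` and `attend_monotonic` are hypothetical.

```python
import math


def window_attention(encodings, scores, position, window):
    """Softmax attention restricted to encodings[position-window+1 : position+1]."""
    start = max(0, position - window + 1)
    indices = list(range(start, position + 1))
    # Subtract the maximum score for numerical stability before exponentiating.
    top = max(scores[i] for i in indices)
    weights = [math.exp(scores[i] - top) for i in indices]
    total = sum(weights)
    dim = len(encodings[0])
    context = [0.0] * dim
    for weight, i in zip(weights, indices):
        for d in range(dim):
            context[d] += (weight / total) * encodings[i][d]
    return context


def attend_monotonic(encodings, step_scores, positions, window):
    """Compute one context vector per output step.

    positions[t] is the encoding selected at step t; selections must never
    move backwards. window must be smaller than the number of encodings so
    that attention always covers a proper subset.
    """
    if not 0 < window < len(encodings):
        raise ValueError("window must cover a proper, non-empty subset")
    contexts = []
    previous = 0
    for scores, position in zip(step_scores, positions):
        if position < previous:
            raise ValueError("selected positions must be monotonic")
        contexts.append(window_attention(encodings, scores, position, window))
        previous = position
    return contexts


encodings = [[1.0], [2.0], [3.0], [4.0]]
uniform = [0.0, 0.0, 0.0, 0.0]
contexts = attend_monotonic(encodings, [uniform, uniform], positions=[1, 2], window=2)
```

With uniform scores each context vector is the plain average of the encodings inside its window; a decoder would consume these vectors to produce the output sequence.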
