Transformer-based automatic speech recognition system incorporating time-reduction layer

    Publication No.: US11715461B2

    Publication date: 2023-08-01

    Application No.: US17076794

    Filing date: 2020-10-21

    IPC classification: G10L15/16 G10L15/06

    CPC classification: G10L15/16 G10L15/063

    Abstract: Computer-implemented method and system for automatic speech recognition. A first speech sequence is processed, using a time reduction operation of an encoder NN, into a second speech sequence comprising a second set of speech frame feature vectors. Each vector in the second set concatenates information from a respective plurality of speech frame feature vectors included in the first set, and the second speech sequence includes fewer speech frame feature vectors than the first speech sequence. The second speech sequence is transformed, using a self-attention operation of the encoder NN, into a third speech sequence comprising a third set of speech frame feature vectors. The third speech sequence is processed using a probability operation of the encoder NN to predict a sequence of first labels corresponding to the third set of speech frame feature vectors, and using a decoder NN to predict a sequence of second labels corresponding to the third set of speech frame feature vectors.
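
    The time reduction operation described in the abstract can be illustrated with a minimal sketch. It assumes a fixed reduction factor, zero-padding of the final group, and plain Python lists as feature vectors; the name `time_reduce` and all of these choices are illustrative assumptions, not details taken from the patent.

```python
from typing import List

Frame = List[float]


def time_reduce(frames: List[Frame], factor: int = 2) -> List[Frame]:
    """Concatenate every `factor` consecutive frames into one longer frame.

    Hypothetical helper: the reduction factor and the zero-padding of the
    last group are assumptions made for illustration only.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    if not frames:
        return []
    dim = len(frames[0])
    # Pad with zero frames so the frame count is a multiple of `factor`.
    pad = (-len(frames)) % factor
    padded = frames + [[0.0] * dim for _ in range(pad)]
    # Each output frame is the concatenation of `factor` input frames,
    # so the output has fewer, wider frames than the input.
    return [
        [value for frame in padded[start:start + factor] for value in frame]
        for start in range(0, len(padded), factor)
    ]


first_sequence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
second_sequence = time_reduce(first_sequence, factor=2)
print(len(first_sequence), "->", len(second_sequence))
```

    With a factor of 2, three 2-dimensional frames become two 4-dimensional frames, which shortens the sequence seen by the subsequent self-attention operation.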

    TRANSFORMER-BASED AUTOMATIC SPEECH RECOGNITION SYSTEM INCORPORATING TIME-REDUCTION LAYER

    Publication No.: US20220122590A1

    Publication date: 2022-04-21

    Application No.: US17076794

    Filing date: 2020-10-21

    IPC classification: G10L15/16 G10L15/06

    Abstract: Computer-implemented method and system for automatic speech recognition. A first speech sequence is processed, using a time reduction operation of an encoder NN, into a second speech sequence that comprises a second set of speech frame feature vectors that each concatenate information from a respective plurality of speech frame feature vectors included in the first set, wherein the second speech sequence includes fewer speech frame feature vectors than the first speech sequence. The second speech sequence is transformed, using a self-attention operation of the encoder NN, into a third speech sequence that comprises a third set of speech frame feature vectors. The third speech sequence is processed, using a probability operation of the encoder NN, to predict a sequence of first labels corresponding to the third set of speech frame feature vectors. The third speech sequence is also processed using a decoder NN to predict a sequence of second labels corresponding to the third set of speech frame feature vectors.

    METHODS AND SYSTEMS FOR STREAMABLE MULTIMODAL LANGUAGE UNDERSTANDING

    Publication No.: US20230223018A1

    Publication date: 2023-07-13

    Application No.: US17571425

    Filing date: 2022-01-07

    Abstract: The present disclosure describes methods and systems for generating semantic predictions from an input speech signal representing a speaker's speech, and for mapping the semantic predictions to a command action that represents the speaker's intent. A streamable multimodal language understanding (MLU) system includes a machine learning-based model, such as an RNN model, that is trained to convert speech chunks and corresponding text predictions of the input speech signal into semantic predictions that represent a speaker's intent. A semantic prediction is generated and updated over a series of time steps. In each time step, a new speech chunk and corresponding text prediction of the input speech signal are obtained, encoded and fused to generate an audio-textual representation. A semantic prediction is generated by a sequence classifier by processing the audio-textual representation, and the semantic prediction is updated as new speech chunks and corresponding text predictions are obtained. Extracted semantic information contained within a sequence of semantic predictions representing a speaker's speech is acted upon through a command action performed by another computing device or computer application.
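
    The per-time-step loop described in the abstract can be sketched as follows. This is a rough illustration only: the function `stream_semantic_predictions`, fusion by concatenation, and the toy encoders and classifier are all hypothetical stand-ins, not components disclosed in the application, which describes trained models such as an RNN.

```python
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

Vector = List[float]


def stream_semantic_predictions(
    steps: Iterable[Tuple[Sequence[float], str]],
    encode_audio: Callable[[Sequence[float]], Vector],
    encode_text: Callable[[str], Vector],
    classify: Callable[[List[Vector]], str],
) -> Iterator[str]:
    """Yield an updated semantic prediction after every time step."""
    history: List[Vector] = []
    for speech_chunk, text_prediction in steps:
        # Encode both modalities, then fuse them. Concatenation is an
        # assumed fusion method chosen for simplicity.
        fused = encode_audio(speech_chunk) + encode_text(text_prediction)
        history.append(fused)
        # The sequence classifier sees every fused representation so far.
        yield classify(history)


# Toy stand-ins for the trained components (assumptions for illustration).
def toy_audio_encoder(chunk: Sequence[float]) -> Vector:
    return [sum(chunk) / len(chunk)]


def toy_text_encoder(text: str) -> Vector:
    words = text.split()
    return [float("light" in words), float("on" in words)]


def toy_classifier(history: List[Vector]) -> str:
    latest = history[-1]
    return "turn_on_light" if latest[1] and latest[2] else "unknown"


steps = [
    ([0.1, 0.2], "turn"),
    ([0.3, 0.1], "turn on"),
    ([0.2, 0.2], "turn on the light"),
]
predictions = list(
    stream_semantic_predictions(steps, toy_audio_encoder, toy_text_encoder, toy_classifier)
)
print(predictions)
```

    The prediction stays "unknown" until enough of the utterance has arrived, then updates to the intent label; a downstream application could map that label to a command action.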