-
Publication No.: US11715461B2
Publication Date: 2023-08-01
Application No.: US17076794
Filing Date: 2020-10-21
Applicant: Md Akmal Haidar, Chao Xing
Inventor: Md Akmal Haidar, Chao Xing
CPC Classification: G10L15/16, G10L15/063
Abstract: Computer-implemented method and system for automatic speech recognition. A first speech sequence is processed, using a time reduction operation of an encoder NN, into a second speech sequence comprising a second set of speech frame feature vectors that each concatenate information from a respective plurality of speech frame feature vectors included in the first set and includes fewer speech frame feature vectors than the first speech sequence. The second speech sequence is transformed, using a self-attention operation of the encoder NN, into a third speech sequence comprising a third set of speech frame feature vectors. The third speech sequence is processed using a probability operation of the encoder NN, to predict a sequence of first labels corresponding to the third set of speech frame feature vectors, and using a decoder NN to predict a sequence of second labels corresponding to the third set of speech frame feature vectors.
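
Illustrative note (not part of the patent record): the time reduction step in the abstract, in which each output vector concatenates several consecutive input frame vectors so the sequence becomes shorter, can be sketched as below. The function name `time_reduce`, the `factor` parameter, and the choice to drop trailing frames that do not fill a complete group are assumptions made for illustration; the patent may handle grouping and padding differently.

```python
from typing import List


def time_reduce(frames: List[List[float]], factor: int = 2) -> List[List[float]]:
    """Concatenate each run of `factor` consecutive frame vectors into one vector.

    Hypothetical helper: trailing frames that do not fill a complete group
    are dropped (an assumption, not something the abstract specifies).
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    usable = len(frames) - len(frames) % factor
    reduced: List[List[float]] = []
    for start in range(0, usable, factor):
        merged: List[float] = []
        for frame in frames[start:start + factor]:
            merged.extend(frame)  # concatenate along the feature axis
        reduced.append(merged)
    return reduced


if __name__ == "__main__":
    seq = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    print(time_reduce(seq, factor=2))
```

With `factor=2`, five 2-dimensional frames become two 4-dimensional frames, which is the "fewer, wider vectors" shape that the self-attention operation would then consume.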
-
Publication No.: US11698926B2
Publication Date: 2023-07-11
Application No.: US17524862
Filing Date: 2021-11-12
Applicant: Arnab Kumar Mondal, Deepak Sridhar, Niamul Quader, Juwei Lu, Peng Dai, Chao Xing
Inventor: Arnab Kumar Mondal, Deepak Sridhar, Niamul Quader, Juwei Lu, Peng Dai, Chao Xing
IPC Classification: G06F16/30, G06F16/732, G06N3/04, G06F16/783, G06V20/40
CPC Classification: G06F16/7343, G06F16/783, G06N3/04, G06V20/40
Abstract: Methods and systems are described for performing video retrieval together with video grounding. A word-based query for a video is received and encoded into a query representation using a trained query encoder. One or more similar video representations are identified, from a plurality of video representations, that are similar to the query representation. Each similar video representation represents a respective relevant video. A grounding is generated for each relevant video by forward propagating each respective similar video representation together with the query representation through a trained grounding module. The relevant videos or identifiers of the relevant videos are outputted together with the grounding generated for each relevant video.
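
Illustrative note (not part of the patent record): the retrieval step in the abstract, identifying stored video representations that are similar to the query representation, can be sketched as a nearest-neighbour lookup. Cosine similarity, the `retrieve` function, and the dictionary-based `video_index` are assumptions chosen for illustration; the abstract does not name a similarity measure, and the trained query encoder and grounding module are not modelled here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query_vec: Sequence[float],
    video_index: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the `top_k` (video_id, score) pairs most similar to the query."""
    scored = [(vid, cosine(query_vec, vec)) for vid, vec in video_index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical precomputed video representations.
    index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(retrieve([1.0, 0.0], index, top_k=2))
```

In the flow the abstract describes, each returned representation would then be passed, together with the query representation, through the grounding module to produce a grounding for that video.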
-
Publication No.: US20230153352A1
Publication Date: 2023-05-18
Application No.: US17524862
Filing Date: 2021-11-12
Applicant: Arnab Kumar MONDAL, Deepak SRIDHAR, Niamul QUADER, Juwei LU, Peng DAI, Chao XING
Inventor: Arnab Kumar MONDAL, Deepak SRIDHAR, Niamul QUADER, Juwei LU, Peng DAI, Chao XING
IPC Classification: G06F16/732, G06F16/783, G06K9/00, G06N3/04
CPC Classification: G06F16/7343, G06F16/783, G06K9/00711, G06N3/04
Abstract: Methods and systems are described for performing video retrieval together with video grounding. A word-based query for a video is received and encoded into a query representation using a trained query encoder. One or more similar video representations are identified, from a plurality of video representations, that are similar to the query representation. Each similar video representation represents a respective relevant video. A grounding is generated for each relevant video by forward propagating each respective similar video representation together with the query representation through a trained grounding module. The relevant videos or identifiers of the relevant videos are outputted together with the grounding generated for each relevant video.
-
Publication No.: US20220122590A1
Publication Date: 2022-04-21
Application No.: US17076794
Filing Date: 2020-10-21
Applicant: Md Akmal HAIDAR, Chao XING
Inventor: Md Akmal HAIDAR, Chao XING
Abstract: Computer-implemented method and system for automatic speech recognition. A first speech sequence is processed, using a time reduction operation of an encoder NN, into a second speech sequence that comprises a second set of speech frame feature vectors that each concatenate information from a respective plurality of speech frame feature vectors included in the first set, wherein the second speech sequence includes fewer speech frame feature vectors than the first speech sequence. The second speech sequence is transformed, using a self-attention operation of the encoder NN, into a third speech sequence that comprises a third set of speech frame feature vectors. The third speech sequence is processed, using a probability operation of the encoder NN, to predict a sequence of first labels corresponding to the third set of speech frame feature vectors. The third speech sequence is also processed using a decoder NN to predict a sequence of second labels corresponding to the third set of speech frame feature vectors.
-
Publication No.: US20230223018A1
Publication Date: 2023-07-13
Application No.: US17571425
Filing Date: 2022-01-07
Applicant: Chao XING, Anderson AVILA
Inventor: Chao XING, Anderson AVILA
IPC Classification: G10L15/197, G10L15/22, G10L15/18, G10L15/16, G10L19/00
CPC Classification: G10L15/197, G10L15/22, G10L15/1815, G10L15/16, G10L19/00, G10L2015/223
Abstract: The present disclosure describes methods and systems for generating semantic predictions from an input speech signal representing a speaker's speech, and for mapping the semantic predictions to a command action that represents the speaker's intent. A streamable multimodal language understanding (MLU) system includes a machine learning-based model, such as an RNN model, that is trained to convert speech chunks and corresponding text predictions of the input speech signal into semantic predictions that represent a speaker's intent. A semantic prediction is generated and updated over a series of time steps. In each time step, a new speech chunk and corresponding text prediction of the input speech signal are obtained, encoded and fused to generate an audio-textual representation. A semantic prediction is generated by a sequence classifier by processing the audio-textual representation, and the semantic prediction is updated as new speech chunks and corresponding text predictions are obtained. Extracted semantic information contained within a sequence of semantic predictions representing a speaker's speech is acted upon through a command action performed by another computing device or computer application.
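
Illustrative note (not part of the patent record): the per-time-step control flow in the abstract (obtain a chunk and its text prediction, fuse them, update the semantic prediction) can be sketched as a simple loop. Everything concrete here is an assumption for illustration: inputs are taken to be already-encoded feature vectors, fusion is plain concatenation, the recurrent state is a running mean standing in for the trained RNN, and `classify` is a hypothetical caller-supplied sequence classifier.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Vector = List[float]


def stream_predictions(
    chunks: Iterable[Tuple[Vector, Vector]],
    classify: Callable[[Vector], str],
) -> List[str]:
    """Emit one updated semantic prediction per incoming (audio, text) pair.

    The running-mean state is a stand-in for a trained recurrent model,
    not the method claimed in the patent.
    """
    state: Optional[Vector] = None
    predictions: List[str] = []
    for step, (audio_vec, text_vec) in enumerate(chunks, start=1):
        fused = list(audio_vec) + list(text_vec)  # audio-textual representation
        if state is None:
            state = fused
        else:
            # Incremental mean over all fused vectors seen so far.
            state = [s + (f - s) / step for s, f in zip(state, fused)]
        predictions.append(classify(state))
    return predictions


if __name__ == "__main__":
    def toy_classifier(vec: Vector) -> str:
        # Hypothetical rule standing in for a trained sequence classifier.
        return "lights_on" if vec[0] > 0.5 else "unknown"

    stream = [([0.0], [0.0]), ([2.0], [1.0])]
    print(stream_predictions(stream, toy_classifier))
```

The point of the sketch is the streaming shape: a prediction is available after every chunk and is revised as more speech arrives, rather than being produced once at the end of the utterance.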