-
Publication No.: US20160275947A1
Publication Date: 2016-09-22
Application No.: US14414621
Filing Date: 2014-09-09
Inventors: Jinyu LI, Rui ZHAO, Yifan GONG
Abstract: Systems and methods for speech recognition incorporating environmental variables are provided. The systems and methods capture speech to be recognized. The speech is then recognized using a variable-component deep neural network (DNN), which processes the captured speech by incorporating an environment variable. The environment variable may be any variable that depends on environmental conditions or on the relation of the user, the client device, and the environment; for example, it may be based on the noise of the environment and represented as a signal-to-noise ratio. The variable-component DNN may incorporate the environment variable in different ways: into the weighting matrices and biases of the DNN, into the outputs of its hidden layers, or into the activation functions of its nodes.
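The abstract above lists where the environment variable can enter the network. A minimal sketch of the first option, environment-dependent weights and biases: the linear parameterization W(v) = W0 + v·W1, the sigmoid activation, and all dimensions below are illustrative assumptions rather than details taken from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class VariableComponentLayer:
    """Hidden layer whose weights and biases are linear functions of an
    environment variable v (e.g. the utterance signal-to-noise ratio):
    W(v) = W0 + v*W1, b(v) = b0 + v*b1."""
    def __init__(self, in_dim, out_dim, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W0 = rng.standard_normal((out_dim, in_dim)) * 0.1
        self.W1 = rng.standard_normal((out_dim, in_dim)) * 0.01
        self.b0 = np.zeros(out_dim)
        self.b1 = np.zeros(out_dim)

    def forward(self, x, v):
        W = self.W0 + v * self.W1   # environment-dependent weights
        b = self.b0 + v * self.b1   # environment-dependent biases
        return sigmoid(W @ x + b)

layer = VariableComponentLayer(40, 16)
frame = np.random.default_rng(1).standard_normal(40)  # one feature frame
clean = layer.forward(frame, v=30.0)   # high-SNR condition
noisy = layer.forward(frame, v=5.0)    # low-SNR condition
```

The same frame yields different hidden activations under different SNR values, which is the point of the variable component.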
-
Publication No.: US20220165290A1
Publication Date: 2022-05-26
Application No.: US17537831
Filing Date: 2021-11-30
Inventors: Zhong MENG, Yong ZHAO, Jinyu LI, Yifan GONG
Abstract: To generate substantially condition-invariant and speaker-discriminative features, embodiments are associated with a feature extractor capable of extracting features from speech frames based on first parameters, a speaker classifier capable of identifying a speaker based on the features and on second parameters, and a condition classifier capable of identifying a noise condition based on the features and on third parameters. The first parameters of the feature extractor and the second parameters of the speaker classifier are trained to minimize a speaker classification loss; the first parameters of the feature extractor are further trained to maximize a condition classification loss; and the third parameters of the condition classifier are trained to minimize the condition classification loss.
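The opposing objectives above (the extractor maximizes the loss that the condition classifier minimizes) are commonly realized with a gradient reversal layer between the two; that realization is an assumption here, not stated in the abstract. A tiny sketch of the reversal itself:

```python
import numpy as np

def grad_reversal(grad, lam=1.0):
    """Gradient reversal layer, backward pass only: the forward pass is
    the identity, while the gradient flowing back into the feature
    extractor is sign-flipped and scaled by lam. Updating the extractor
    with this reversed gradient *maximizes* the condition classification
    loss that the condition classifier itself is minimizing."""
    return -lam * grad

# gradient of the condition loss w.r.t. the extracted features
g = np.array([0.5, -0.2, 0.1])
g_rev = grad_reversal(g, lam=0.5)  # what the extractor actually receives
```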
-
Publication No.: US20200335108A1
Publication Date: 2020-10-22
Application No.: US16523517
Filing Date: 2019-07-26
Inventors: Zhong MENG, Jinyu LI, Yifan GONG
Abstract: To generate substantially domain-invariant and speaker-discriminative features, embodiments are associated with: a feature extractor that receives speech frames and extracts features from them based on a first set of parameters; a senone classifier that identifies a senone based on the extracted features and a second set of parameters; an attention network that determines, based on a third set of parameters, the relative importance to domain classification of the features extracted by the feature extractor; a domain classifier that classifies a domain based on the features, their relative importances, and a fourth set of parameters; and a training platform. The training platform trains the first and second sets of parameters to minimize the senone classification loss, further trains the first set of parameters to maximize the domain classification loss, and trains the third and fourth sets of parameters to minimize the domain classification loss.
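The attention network's role, weighting extracted features by their relative importance to domain classification, can be sketched as softmax attention pooling over frame-level features. The query-vector formulation and all shapes below are hypothetical, not the patent's specification:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_pool(features, u):
    """Score each frame-level feature vector against a learned query u,
    then pool the frames into a single utterance-level vector weighted
    by their relative importance (here, importance to the downstream
    domain classifier)."""
    scores = features @ u           # (T,) relevance score per frame
    alpha = softmax(scores)         # relative importances, sum to 1
    return alpha @ features, alpha  # weighted sum over frames, (D,)

rng = np.random.default_rng(0)
feats = rng.standard_normal((10, 8))  # T=10 frames, D=8 features each
query = rng.standard_normal(8)        # hypothetical attention query
pooled, alpha = attention_pool(feats, query)
```

The domain classifier would then consume `pooled` rather than the raw per-frame features.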
-
Publication No.: US20200335085A1
Publication Date: 2020-10-22
Application No.: US16460027
Filing Date: 2019-07-02
Inventors: Zhong MENG, Jinyu LI, Yifan GONG
Abstract: Embodiments are associated with a speaker-independent (SI) acoustic model that classifies senones based on input speech frames and on first parameters, a speaker-dependent (SD) acoustic model that classifies senones based on input speech frames and on second parameters, and a discriminator that receives data from both models and, based on third parameters, predicts whether the received data was generated by the SD model. The second parameters are initialized from the first parameters and trained on input frames of a target speaker to minimize a senone classification loss; a portion of the second parameters is further trained on those frames to maximize a discrimination loss associated with the discriminator; and the third parameters are trained on the same frames to minimize the discrimination loss.
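A minimal sketch of the setup above: the SD model starts as a copy of the SI model, and a logistic discriminator scores hidden features as SD-generated versus SI-generated. The single-layer models, tanh hidden features, and shapes below are stand-ins, not the patent's architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
# Speaker-independent (SI) model weights; the speaker-dependent (SD)
# model is initialized as a copy before adaptation begins.
W_si = rng.standard_normal((16, 40)) * 0.1
W_sd = W_si.copy()

def disc_loss(h, w_d, label):
    """Binary cross-entropy of the discriminator's prediction that a
    hidden feature vector h came from the SD model (label=1) rather
    than the SI model (label=0)."""
    p = sigmoid(w_d @ h)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

w_d = rng.standard_normal(16) * 0.1     # discriminator parameters
frame = rng.standard_normal(40)          # one target-speaker frame
l_sd = disc_loss(np.tanh(W_sd @ frame), w_d, 1)  # discriminator minimizes
l_si = disc_loss(np.tanh(W_si @ frame), w_d, 0)  # these; SD maximizes them
```

Training would then alternate: update `w_d` to shrink these losses, and update (part of) `W_sd` to grow them, so SD features stay close to the SI feature distribution.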
-
Publication No.: US20220139380A1
Publication Date: 2022-05-05
Application No.: US17154956
Filing Date: 2021-01-21
Inventors: Zhong MENG, Sarangarajan PARTHASARATHY, Xie SUN, Yashesh GAUR, Naoyuki KANDA, Liang LU, Xie CHEN, Rui ZHAO, Jinyu LI, Yifan GONG
IPC Classes: G10L15/16, G06N3/04, G10L15/06, G10L15/01, G10L15/183
Abstract: A computer device is provided that includes one or more processors configured to receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source domain, and to receive an external language model that has been trained with training data from a target domain. The one or more processors are configured to perform inference of the probability of an output token sequence given a sequence of input speech features. Performing the inference includes computing an E2E model score, an external language model score, and an estimated internal language model score for the E2E model; the internal language model score is estimated by removing the contribution of an intrinsic acoustic model. The one or more processors are further configured to compute an integrated score based at least on the E2E model score, the external language model score, and the estimated internal language model score.
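In log-probability terms, the score integration described above amounts to adding a weighted external LM score and subtracting a weighted estimate of the E2E model's internal LM score. A sketch with hypothetical interpolation weights (the weight values are tuning assumptions, not from the patent):

```python
import math

def integrated_score(log_p_e2e, log_p_ext_lm, log_p_int_lm,
                     lm_weight=0.5, ilm_weight=0.3):
    """Combine the three scores from the abstract in the log domain:
    keep the E2E score, add the weighted external (target-domain) LM
    score, and subtract the weighted estimate of the E2E model's
    internal (source-domain) LM score."""
    return log_p_e2e + lm_weight * log_p_ext_lm - ilm_weight * log_p_int_lm

# toy token-sequence probabilities under each model
s = integrated_score(math.log(0.6), math.log(0.2), math.log(0.4))
```

Subtracting the internal LM term removes the source-domain linguistic bias baked into the E2E model before the target-domain LM is applied.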
-
Publication No.: US20210217410A1
Publication Date: 2021-07-15
Application No.: US16773205
Filing Date: 2020-01-27
Inventors: Hosam A. KHALIL, Emilian Y. STOIMENOV, Yifan GONG, Chaojun LIU, Christopher H. BASOGLU, Amit K. AGARWAL, Naveen PARIHAR, Sayan PATHAK
IPC Classes: G10L15/197, G10L15/30, G10L15/02, G10L15/05, G10L15/22
Abstract: Embodiments may include: collection of a first batch of acoustic feature frames of an audio signal, the number of frames equal to a first batch size; input of the first batch to a speech recognition network; in response to detection of a word hypothesis output by the network, collection of a second batch of acoustic feature frames of the audio signal, the number of frames equal to a second batch size greater than the first batch size; and input of the second batch to the speech recognition network.
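The two-stage batching above can be sketched as a schedule that switches from the small first batch size to the larger second one once a word hypothesis appears. Since this sketch has no real recognizer, the hypothesis event is faked by a fixed frame index, which is purely an illustrative stand-in:

```python
def batch_schedule(frames, first_batch, second_batch, hypothesis_at):
    """Partition a stream of acoustic feature frames into batches.
    Use the small first batch size (low latency to the first word)
    until the recognizer emits its first word hypothesis, stubbed here
    as the frame index `hypothesis_at`; then switch to the larger
    second batch size (better throughput)."""
    batches, i, size = [], 0, first_batch
    while i < len(frames):
        if i >= hypothesis_at:
            size = second_batch  # word hypothesis seen: grow the batch
        batches.append(frames[i:i + size])
        i += size
    return batches

# 20 frames; hypothesis appears at frame 6
b = batch_schedule(list(range(20)), first_batch=2, second_batch=5,
                   hypothesis_at=6)
```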
-
Publication No.: US20200335122A1
Publication Date: 2020-10-22
Application No.: US16434665
Filing Date: 2019-06-07
Inventors: Zhong MENG, Yong ZHAO, Jinyu LI, Yifan GONG
Abstract: To generate substantially condition-invariant and speaker-discriminative features, embodiments are associated with a feature extractor capable of extracting features from speech frames based on first parameters, a speaker classifier capable of identifying a speaker based on the features and on second parameters, and a condition classifier capable of identifying a noise condition based on the features and on third parameters. The first parameters of the feature extractor and the second parameters of the speaker classifier are trained to minimize a speaker classification loss; the first parameters of the feature extractor are further trained to maximize a condition classification loss; and the third parameters of the condition classifier are trained to minimize the condition classification loss.
-
Publication No.: US20200334527A1
Publication Date: 2020-10-22
Application No.: US16414378
Filing Date: 2019-05-16
Inventors: Amit DAS, Jinyu LI, Changliang LIU, Yifan GONG
Abstract: According to some embodiments, a universal modeling system may include a plurality of domain expert models that each receive raw input data (e.g., a stream of audio frames containing speech utterances) and provide a domain expert output based on that data. A neural mixture component may then generate a weight for each domain expert model based on information created by the domain expert models (e.g., hidden features and/or row convolution). The weights might be, for example, constrained scalar numbers, unconstrained scalar numbers, vectors, or matrices. An output layer may provide the universal modeling system's output (e.g., an automatic speech recognition result) based on each domain expert output after it is multiplied by the corresponding weight for that domain expert model.
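For the constrained-scalar-weights case above, the neural mixture component can be sketched as a softmax over per-expert scores derived from hidden features. The single linear scoring layer and all shapes below are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mixture_output(expert_outputs, hidden_feats, W_mix):
    """Neural mixture component: map the experts' pooled hidden
    features to one scalar score per domain expert, constrain the
    scores to a simplex with softmax, and combine the expert outputs
    as the weighted sum."""
    weights = softmax(W_mix @ hidden_feats)   # one weight per expert
    return weights @ expert_outputs, weights

rng = np.random.default_rng(0)
experts = rng.standard_normal((3, 10))  # 3 domain experts, 10-dim outputs
hidden = rng.standard_normal(6)         # pooled hidden features
W_mix = rng.standard_normal((3, 6))     # mixture component's parameters
out, w = mixture_output(experts, hidden, W_mix)
```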
-
Publication No.: US20170256254A1
Publication Date: 2017-09-07
Application No.: US15199346
Filing Date: 2016-06-30
IPC Classes: G10L15/16, G10L15/06, G10L15/02, G10L15/183, G10L15/28
CPC Classes: G10L15/16, G06N3/04, G10L15/02, G10L15/063, G10L15/065, G10L15/183, G10L15/28
Abstract: The technology described herein uses a modular model to process speech. A deep-learning-based acoustic model comprises a stack of different types of neural network layers. The sub-modules of such a model can represent distinct non-phonetic acoustic factors, such as accent origin (e.g., native, non-native), speech channel (e.g., mobile, Bluetooth, desktop), speech application scenario (e.g., voice search, short-message dictation), and speaker variation (e.g., individual or clustered speakers). The technology uses certain sub-modules in a first context and a second group of sub-modules in a second context.
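Per-context module selection can be sketched as a lookup that swaps one sub-module per non-phonetic factor into an otherwise shared layer stack. The registry, module names, and fallback rule below are hypothetical, chosen only to mirror the factors named in the abstract:

```python
# Registry of factor-specific sub-modules (hypothetical names).
MODULES = {
    "accent":   {"native": "accent_native", "non-native": "accent_nonnative"},
    "channel":  {"mobile": "chan_mobile", "bluetooth": "chan_bt",
                 "desktop": "chan_desktop"},
    "scenario": {"voice_search": "scen_vs", "dictation": "scen_smd"},
}

def assemble_model(context):
    """Build the layer stack for one context: shared layers stay fixed,
    and one sub-module per non-phonetic factor is swapped in. When a
    factor is unspecified, fall back to its first registered option."""
    stack = ["shared_bottom"]
    for factor, options in MODULES.items():
        choice = context.get(factor, next(iter(options)))
        stack.append(options[choice])
    stack.append("shared_top")
    return stack

# A non-native speaker on a Bluetooth headset (scenario left to default).
m = assemble_model({"accent": "non-native", "channel": "bluetooth"})
```

A second context (say, a native desktop dictation user) would reuse the shared layers and swap only the factor-specific entries.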
-
Publication No.: US20230186919A1
Publication Date: 2023-06-15
Application No.: US18108316
Filing Date: 2023-02-10
Inventors: Guoli YE, Yan HUANG, Wenning WEI, Lei HE, Eva SHARMA, Jian WU, Yao TIAN, Edward C. LIN, Yifan GONG, Rui ZHAO, Jinyu LI, William Maxwell GALE
CPC Classes: G10L15/26, G10L13/08, G10L15/063, G10L15/16
Abstract: Systems, methods, and devices are provided for generating and using text-to-speech (TTS) data to improve speech recognition models. A main model is trained with keyword-independent baseline training data. In some instances, acoustic and language model sub-components of the main model are modified with new TTS training data. In some instances, the new TTS training data is obtained from a multi-speaker neural TTS system for a keyword that is underrepresented in the baseline training data. In some instances, the new TTS training data is used for pronunciation learning and for normalization of keyword-dependent confidence scores in keyword spotting (KWS) applications. In some instances, the new TTS training data is used for rapid speaker adaptation of speech recognition models.
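One use named above is normalization of keyword-dependent confidence scores using TTS data. A sketch of one plausible realization, z-scoring a raw KWS confidence against confidences collected from synthesized utterances of the same keyword; the z-score form and the numbers are assumptions, not the patent's method:

```python
import statistics

def normalize_confidence(raw_score, tts_scores):
    """Keyword-dependent confidence normalization: z-score the raw KWS
    confidence against the confidences the model produced on
    multi-speaker TTS renderings of the same keyword, so one detection
    threshold can serve keywords that are easier or harder to spot."""
    mu = statistics.mean(tts_scores)
    sigma = statistics.stdev(tts_scores)
    return (raw_score - mu) / sigma

# hypothetical confidences from five synthesized speakers of one keyword
tts = [0.62, 0.70, 0.66, 0.74, 0.58]
z = normalize_confidence(0.80, tts)  # a live detection, well above typical
```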
-