VARIABLE-COMPONENT DEEP NEURAL NETWORK FOR ROBUST SPEECH RECOGNITION
    1. Invention Application (Granted)

    Publication No.: US20160275947A1

    Publication Date: 2016-09-22

    Application No.: US14414621

    Filing Date: 2014-09-09

    Abstract: Systems and methods for speech recognition incorporating environmental variables are provided. The systems and methods capture speech to be recognized. The speech is then recognized utilizing a variable component deep neural network (DNN). The variable component DNN processes the captured speech by incorporating an environment variable. The environment variable may be any variable that is dependent on environmental conditions or the relationship among the user, the client device, and the environment. For example, the environment variable may be based on noise of the environment and represented as a signal-to-noise ratio. The variable component DNN may incorporate the environment variable in different ways. For instance, the environment variable may be incorporated into the weighting matrices and biases of the DNN, the outputs of the hidden layers of the DNN, or the activation functions of the nodes of the DNN.

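A minimal numpy sketch of the idea of weighting matrices and biases that depend on an environment variable: the layer's effective weights are a linear function of the signal-to-noise ratio. All names and the linear parameterization are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def variable_layer(x, W0, W1, b0, b1, snr):
    """Hidden layer whose weight matrix and bias are functions of an
    environment variable (here, an SNR value in dB): W(v) = W0 + v*W1."""
    W = W0 + snr * W1          # environment-dependent weight matrix
    b = b0 + snr * b1          # environment-dependent bias
    return np.tanh(x @ W + b)  # nonlinearity applied as in an ordinary layer

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4))          # one 4-dim input frame
W0 = rng.standard_normal((4, 3))         # environment-independent component
W1 = 0.01 * rng.standard_normal((4, 3))  # environment-dependent component
b0 = np.zeros(3)
b1 = 0.001 * np.ones(3)

clean = variable_layer(x, W0, W1, b0, b1, snr=30.0)
noisy = variable_layer(x, W0, W1, b0, b1, snr=5.0)
```

The same frame thus passes through different effective weights depending on the measured SNR, which is the distinguishing feature of the variable-component DNN.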

    CONDITION-INVARIANT FEATURE EXTRACTION NETWORK

    Publication No.: US20220165290A1

    Publication Date: 2022-05-26

    Application No.: US17537831

    Filing Date: 2021-11-30

    Abstract: To generate substantially condition-invariant and speaker-discriminative features, embodiments are associated with a feature extractor capable of extracting features from speech frames based on first parameters, a speaker classifier capable of identifying a speaker based on the features and on second parameters, and a condition classifier capable of identifying a noise condition based on the features and on third parameters. The first parameters of the feature extractor and the second parameters of the speaker classifier are trained to minimize a speaker classification loss, the first parameters of the feature extractor are further trained to maximize a condition classification loss, and the third parameters of the condition classifier are trained to minimize the condition classification loss.
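The opposing objectives above amount to a gradient-reversal update on the feature extractor: descend the speaker-classification loss while ascending the condition-classification loss. A toy numpy sketch of one such parameter step, with illustrative gradients and learning rate (not from the patent):

```python
import numpy as np

def adversarial_update(theta_f, grad_spk, grad_cond, lr=0.1, lam=1.0):
    """One gradient step on feature-extractor parameters theta_f:
    minimize the speaker loss (follow grad_spk) while maximizing the
    condition loss (follow -grad_cond, i.e. reversed gradient)."""
    return theta_f - lr * (grad_spk - lam * grad_cond)

theta = np.array([1.0, -2.0])
g_spk = np.array([0.5, 0.5])    # improves speaker discrimination
g_cond = np.array([0.2, -0.4])  # reversed: pushes toward condition invariance

theta_new = adversarial_update(theta, g_spk, g_cond)
```

The condition classifier itself is updated with the ordinary (non-reversed) gradient of the condition loss, so the two networks play a minimax game.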

    ATTENTIVE ADVERSARIAL DOMAIN-INVARIANT TRAINING

    Publication No.: US20200335108A1

    Publication Date: 2020-10-22

    Application No.: US16523517

    Filing Date: 2019-07-26

    Abstract: To generate substantially domain-invariant and speaker-discriminative features, embodiments are associated with a feature extractor to receive speech frames and extract features from the speech frames based on a first set of parameters of the feature extractor, a senone classifier to identify a senone based on the received features and on a second set of parameters of the senone classifier, an attention network capable of determining a relative importance of features extracted by the feature extractor to domain classification based on a third set of parameters of the attention network, a domain classifier capable of classifying a domain based on the features and the relative importances and on a fourth set of parameters of the domain classifier, and a training platform to train the first set of parameters of the feature extractor and the second set of parameters of the senone classifier to minimize the senone classification loss, train the first set of parameters of the feature extractor to maximize the domain classification loss, and train the third set of parameters of the attention network and the fourth set of parameters of the domain classifier to minimize the domain classification loss.
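The attention network's role can be sketched as soft attention over frame-level features: it scores each frame's relevance to domain classification, and the domain classifier consumes the attention-weighted combination. A minimal numpy illustration under assumed shapes (a dot-product scorer; the patent does not specify this form):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attended_domain_input(features, attn_w):
    """Soft attention over T frame features: score each frame, normalize
    the scores into relative importances, and return the weighted sum
    that the domain classifier would receive."""
    scores = features @ attn_w   # (T,) relevance score per frame
    alpha = softmax(scores)      # relative importances, nonnegative, sum to 1
    return alpha, alpha @ features

rng = np.random.default_rng(1)
feats = rng.standard_normal((5, 8))   # T=5 frames, D=8 features each
alpha, ctx = attended_domain_input(feats, rng.standard_normal(8))
```

Because the attention parameters are trained with the domain classifier (to minimize domain loss) while the extractor fights them (to maximize it), the adversary focuses its capacity on the most domain-revealing frames.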

    ADVERSARIAL SPEAKER ADAPTATION
    4. Invention Application

    Publication No.: US20200335085A1

    Publication Date: 2020-10-22

    Application No.: US16460027

    Filing Date: 2019-07-02

    IPC Classes: G10L15/06 G10L15/02 G10L15/22

    Abstract: Embodiments are associated with a speaker-independent acoustic model capable of classifying senones based on input speech frames and on first parameters of the speaker-independent acoustic model, a speaker-dependent acoustic model capable of classifying senones based on input speech frames and on second parameters of the speaker-dependent acoustic model, and a discriminator capable of receiving data from the speaker-dependent acoustic model and data from the speaker-independent acoustic model and outputting a prediction of whether received data was generated by the speaker-dependent acoustic model based on third parameters. The second parameters are initialized based on the first parameters, the second parameters are trained based on input frames of a target speaker to minimize a senone classification loss associated with the second parameters, a portion of the second parameters are trained based on the input frames of the target speaker to maximize a discrimination loss associated with the discriminator, and the third parameters are trained based on the input frames of the target speaker to minimize the discrimination loss.
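Two pieces of this procedure can be sketched directly: the speaker-dependent (SD) model starts as a copy of the speaker-independent (SI) model, and its objective combines minimizing senone loss with maximizing the discriminator's loss. The parameter dicts, loss values, and the trade-off weight `lam` below are purely illustrative assumptions:

```python
import copy

# Hypothetical SI-model parameters; the SD model is initialized from them.
si_params = {"W1": [[0.2, -0.1], [0.4, 0.3]], "b1": [0.0, 0.1]}
sd_params = copy.deepcopy(si_params)   # "second parameters initialized based on the first"

def adaptation_objective(senone_loss, disc_loss, lam=0.5):
    """SD-model objective: minimize senone classification loss while
    maximizing the discriminator's loss (subtracting it), so the adapted
    model's features stay indistinguishable from the SI model's."""
    return senone_loss - lam * disc_loss

loss = adaptation_objective(senone_loss=1.2, disc_loss=0.8)
```

The discriminator is trained separately to minimize its own loss, giving the usual adversarial pairing.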

    INTERNAL LANGUAGE MODEL FOR E2E MODELS

    Publication No.: US20220139380A1

    Publication Date: 2022-05-05

    Application No.: US17154956

    Filing Date: 2021-01-21

    Abstract: A computer device is provided that includes one or more processors configured to receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source-domain, and receive an external language model that has been trained with training data from a target-domain. The one or more processors are configured to perform an inference of the probability of an output token sequence given a sequence of input speech features. Performing the inference includes computing an E2E model score, computing an external language model score, and computing an estimated internal language model score for the E2E model. The estimated internal language model score is computed by removing a contribution of an intrinsic acoustic model. The one or more processors are further configured to compute an integrated score based at least on the E2E model score, the external language model score, and the estimated internal language model score.
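The score integration is a log-linear combination: add the external LM score and subtract the estimated internal LM score, so the source-domain language prior baked into the E2E model is not counted twice. A minimal sketch with assumed interpolation weights (the specific values and names are illustrative):

```python
def integrated_score(e2e_logp, ext_lm_logp, ilm_logp,
                     lam_ext=0.6, lam_ilm=0.4):
    """Log-linear score for one hypothesis during decoding:
    E2E score + weighted external-LM score - weighted estimated
    internal-LM score (all in log-probability space)."""
    return e2e_logp + lam_ext * ext_lm_logp - lam_ilm * ilm_logp

# Hypothetical log-probabilities for a single token-sequence hypothesis.
score = integrated_score(e2e_logp=-2.0, ext_lm_logp=-1.5, ilm_logp=-3.0)
```

A hypothesis the internal LM strongly favors (high `ilm_logp`) is penalized, letting the target-domain external LM steer decoding instead.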

    CONDITION-INVARIANT FEATURE EXTRACTION NETWORK

    Publication No.: US20200335122A1

    Publication Date: 2020-10-22

    Application No.: US16434665

    Filing Date: 2019-06-07

    Abstract: To generate substantially condition-invariant and speaker-discriminative features, embodiments are associated with a feature extractor capable of extracting features from speech frames based on first parameters, a speaker classifier capable of identifying a speaker based on the features and on second parameters, and a condition classifier capable of identifying a noise condition based on the features and on third parameters. The first parameters of the feature extractor and the second parameters of the speaker classifier are trained to minimize a speaker classification loss, the first parameters of the feature extractor are further trained to maximize a condition classification loss, and the third parameters of the condition classifier are trained to minimize the condition classification loss.

    UNIVERSAL ACOUSTIC MODELING USING NEURAL MIXTURE MODELS

    Publication No.: US20200334527A1

    Publication Date: 2020-10-22

    Application No.: US16414378

    Filing Date: 2019-05-16

    Abstract: According to some embodiments, a universal modeling system may include a plurality of domain expert models to each receive raw input data (e.g., a stream of audio frames containing speech utterances) and provide a domain expert output based on the raw input data. A neural mixture component may then generate a weight corresponding to each domain expert model based on information created by the plurality of domain expert models (e.g., hidden features and/or row convolution). The weights might be associated with, for example, constrained scalar numbers, unconstrained scalar numbers, vectors, matrices, etc. An output layer may provide a universal modeling system output (e.g., an automatic speech recognition result) based on each domain expert output after being multiplied by the corresponding weight for that domain expert model.
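The constrained-scalar case can be sketched as a softmax gate over the experts' hidden features: the mixture component scores each expert, normalizes the scores into weights, and the output layer sums the weighted expert outputs. The dot-product gate and all shapes below are illustrative assumptions, not the patent's specific parameterization:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mixture_output(expert_outputs, hidden_features, gate_w):
    """Neural mixture component: derive one constrained scalar weight per
    domain-expert model from that expert's hidden features, then return
    the weighted combination of the expert outputs."""
    scores = np.array([h @ gate_w for h in hidden_features])
    weights = softmax(scores)   # one weight per expert, sums to 1
    combined = sum(w * o for w, o in zip(weights, expert_outputs))
    return weights, combined

rng = np.random.default_rng(2)
experts = [rng.standard_normal(10) for _ in range(3)]  # e.g. 3 experts' posteriors
hidden = [rng.standard_normal(6) for _ in range(3)]    # their hidden features
w, out = mixture_output(experts, hidden, rng.standard_normal(6))
```

Because the weights come from the experts' own hidden features rather than from an external domain label, the system can adapt its mixing per utterance without knowing the input's domain in advance.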