FAST AND EFFICIENT TEXT ONLY ADAPTATION FOR FACTORIZED NEURAL TRANSDUCER

    Publication No.: US20240135919A1

    Publication Date: 2024-04-25

    Application No.: US17983660

    Filing Date: 2022-11-09

    Abstract: Systems and methods are provided for accessing a factorized neural transducer comprising a first set of layers for predicting blank tokens and a second set of layers for predicting vocabulary tokens. The second set of layers comprises a language model that includes a vocabulary predictor, which is a separate predictor from the blank predictor, wherein the vocabulary predictor output and the encoder output are used for predicting a vocabulary token. The second set of layers is selectively modified to improve the accuracy of the factorized neural transducer in performing automatic speech recognition, the selective modification comprising applying a particular modification to the second set of layers while refraining from applying that modification to the first set of layers.
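The text-only adaptation idea in this abstract can be sketched in a few lines: the vocabulary predictor (a standalone language model) is updated from text data while the blank-prediction layers are frozen. This is a minimal illustrative sketch; the class and parameter names are hypothetical, not taken from the patent.

```python
# Hypothetical sketch of factorized-transducer text-only adaptation:
# only the vocabulary-predictor (LM) parameters receive updates.

class FactorizedTransducer:
    def __init__(self):
        # first set of layers: blank prediction (frozen during adaptation)
        self.blank_params = {"w": 0.5}
        # second set of layers: vocabulary predictor, a standalone LM
        self.vocab_params = {"w": 0.5}

    def adapt_text_only(self, grad, lr=0.1):
        """Apply a gradient step to the vocabulary predictor only."""
        self.vocab_params["w"] -= lr * grad
        # blank_params are intentionally untouched: the modification is
        # applied to the second set of layers but not the first

model = FactorizedTransducer()
model.adapt_text_only(grad=1.0)
```

Because the blank predictor never sees the adaptation gradient, text-only data suffices: no paired audio is needed to shift the model toward a new vocabulary domain.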

    INTERNAL LANGUAGE MODEL FOR E2E MODELS

    Publication No.: US20220139380A1

    Publication Date: 2022-05-05

    Application No.: US17154956

    Filing Date: 2021-01-21

    Abstract: A computing device is provided that includes one or more processors configured to receive an end-to-end (E2E) model trained for automatic speech recognition with training data from a source domain, and to receive an external language model trained with training data from a target domain. The one or more processors are configured to perform an inference of the probability of an output token sequence given a sequence of input speech features. Performing the inference includes computing an E2E model score, computing an external language model score, and computing an estimated internal language model score for the E2E model. The estimated internal language model score is computed by removing the contribution of an intrinsic acoustic model. The processor is further configured to compute an integrated score based at least on the E2E model score, the external language model score, and the estimated internal language model score.
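The score integration described here is, in log-probability form, a weighted combination in which the estimated internal LM score is subtracted so that the target-domain external LM can take its place. A minimal sketch, with illustrative weights (the exact interpolation form and weight values are assumptions, not from the abstract):

```python
# Sketch of internal-LM-corrected score integration in the log domain:
# integrated = log P_E2E + w_lm * log P_extLM - w_ilm * log P_ILM

def integrated_score(log_p_e2e, log_p_ext_lm, log_p_ilm,
                     lm_weight=0.3, ilm_weight=0.2):
    # subtracting the estimated internal LM score removes the source-domain
    # linguistic prior baked into the E2E model
    return log_p_e2e + lm_weight * log_p_ext_lm - ilm_weight * log_p_ilm

s = integrated_score(log_p_e2e=-5.0, log_p_ext_lm=-2.0, log_p_ilm=-3.0)
```

During beam search, each hypothesis would be rescored with this integrated score instead of the raw E2E score.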

    CONDITION-INVARIANT FEATURE EXTRACTION NETWORK

    Publication No.: US20200335122A1

    Publication Date: 2020-10-22

    Application No.: US16434665

    Filing Date: 2019-06-07

    Abstract: To generate substantially condition-invariant and speaker-discriminative features, embodiments are associated with a feature extractor capable of extracting features from speech frames based on first parameters, a speaker classifier capable of identifying a speaker based on the features and on second parameters, and a condition classifier capable of identifying a noise condition based on the features and on third parameters. The first parameters of the feature extractor and the second parameters of the speaker classifier are trained to minimize a speaker classification loss, the first parameters of the feature extractor are further trained to maximize a condition classification loss, and the third parameters of the condition classifier are trained to minimize the condition classification loss.
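The minimax training of the feature extractor — descend the speaker loss, ascend the condition loss — is commonly realized with a gradient-reversal coefficient. A toy scalar sketch of one update step (all values and names are illustrative; the abstract does not specify the optimizer):

```python
# Sketch of the adversarial update for the feature-extractor parameters:
# minimize speaker classification loss, maximize condition classification
# loss, implemented by negating the condition-loss gradient.

def update_feature_extractor(theta, grad_speaker, grad_condition,
                             lr=0.1, reversal=1.0):
    # descend grad_speaker, ascend grad_condition (gradient reversal)
    return theta - lr * (grad_speaker - reversal * grad_condition)

# when both gradients agree, the adversarial term cancels the update
theta = update_feature_extractor(1.0, grad_speaker=0.5, grad_condition=0.5)
```

The condition classifier itself is updated normally (gradient descent on its own loss); only the feature extractor sees the reversed condition gradient, which pushes its features toward condition invariance.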

    UNIVERSAL ACOUSTIC MODELING USING NEURAL MIXTURE MODELS

    Publication No.: US20200334527A1

    Publication Date: 2020-10-22

    Application No.: US16414378

    Filing Date: 2019-05-16

    Abstract: According to some embodiments, a universal modeling system may include a plurality of domain expert models to each receive raw input data (e.g., a stream of audio frames containing speech utterances) and provide a domain expert output based on the raw input data. A neural mixture component may then generate a weight corresponding to each domain expert model based on information created by the plurality of domain expert models (e.g., hidden features and/or row convolution). The weights might be associated with, for example, constrained scalar numbers, unconstrained scalar numbers, vectors, matrices, etc. An output layer may provide a universal modeling system output (e.g., an automatic speech recognition result) based on each domain expert output after being multiplied by the corresponding weight for that domain expert model.
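The combination step reads as a mixture-of-experts: each expert emits an output, the mixture component emits one weight per expert, and the system output is the weighted sum. A minimal sketch using a softmax over scalar mixture scores (the softmax choice is an assumption; the abstract also allows vector- or matrix-valued weights):

```python
import math

# Sketch of the neural-mixture combination: softmax the per-expert
# mixture scores into weights, then sum the weight-multiplied outputs.

def mixture_output(expert_outputs, mixture_scores):
    exps = [math.exp(s) for s in mixture_scores]
    z = sum(exps)
    weights = [e / z for e in exps]          # constrained: sum to 1
    return sum(w * o for w, o in zip(weights, expert_outputs))

# two domain experts with equal mixture scores contribute equally
y = mixture_output(expert_outputs=[1.0, 3.0], mixture_scores=[0.0, 0.0])
```

In the full system the mixture scores would themselves be produced by a network from the experts' hidden features rather than supplied by hand.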

    EFFICIENCY ADJUSTABLE SPEECH RECOGNITION SYSTEM

    Publication No.: US20220351718A1

    Publication Date: 2022-11-03

    Application No.: US17244891

    Filing Date: 2021-04-29

    Abstract: A computing system is configured to generate a transformer-transducer-based deep neural network. The transformer-transducer-based deep neural network comprises a transformer encoder network and a transducer predictor network. The transformer encoder network has a plurality of layers, each of which includes a multi-head attention network sublayer and a feed-forward network sublayer. The computing system trains an end-to-end (E2E) automatic speech recognition (ASR) model using the transformer-transducer-based deep neural network. The E2E ASR model has one or more adjustable hyperparameters that are configured to dynamically adjust the efficiency or performance of the E2E ASR model when it is deployed onto a device or executed by the device.
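One plausible form of such a deployment-time hyperparameter is the number of encoder layers actually executed on the device. This is a hypothetical configuration sketch; the class, field names, and the layer-count knob are illustrative assumptions, not details from the patent.

```python
# Illustrative sketch: an adjustable hyperparameter (executed encoder
# layers) trades accuracy for efficiency at deployment time.

class EncoderConfig:
    def __init__(self, num_layers=18, attention_heads=8):
        self.num_layers = num_layers          # layers available after training
        self.attention_heads = attention_heads

    def layers_for_budget(self, budget):
        """Return a reduced layer count under a relative compute budget
        in (0, 1]; at least one layer always runs."""
        return max(1, int(self.num_layers * budget))

cfg = EncoderConfig()
layers_on_device = cfg.layers_for_budget(budget=0.5)
```

A low-power device might run with `budget=0.5` while a server runs the full stack, without retraining the underlying model.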

    CONDITION-INVARIANT FEATURE EXTRACTION NETWORK

    Publication No.: US20220165290A1

    Publication Date: 2022-05-26

    Application No.: US17537831

    Filing Date: 2021-11-30

    Abstract: To generate substantially condition-invariant and speaker-discriminative features, embodiments are associated with a feature extractor capable of extracting features from speech frames based on first parameters, a speaker classifier capable of identifying a speaker based on the features and on second parameters, and a condition classifier capable of identifying a noise condition based on the features and on third parameters. The first parameters of the feature extractor and the second parameters of the speaker classifier are trained to minimize a speaker classification loss, the first parameters of the feature extractor are further trained to maximize a condition classification loss, and the third parameters of the condition classifier are trained to minimize the condition classification loss.

    ATTENTIVE ADVERSARIAL DOMAIN-INVARIANT TRAINING

    Publication No.: US20200335108A1

    Publication Date: 2020-10-22

    Application No.: US16523517

    Filing Date: 2019-07-26

    Abstract: To generate substantially domain-invariant and speaker-discriminative features, embodiments are associated with: a feature extractor to receive speech frames and extract features from the speech frames based on a first set of parameters of the feature extractor; a senone classifier to identify a senone based on the received features and on a second set of parameters of the senone classifier; an attention network capable of determining a relative importance of features extracted by the feature extractor to domain classification, based on a third set of parameters of the attention network; a domain classifier capable of classifying a domain based on the features and the relative importances, and on a fourth set of parameters of the domain classifier; and a training platform to train the first set of parameters of the feature extractor and the second set of parameters of the senone classifier to minimize the senone classification loss, train the first set of parameters of the feature extractor to maximize the domain classification loss, and train the third set of parameters of the attention network and the fourth set of parameters of the domain classifier to minimize the domain classification loss.
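The attention network's role — weighting features by their relevance to domain classification before the domain classifier consumes them — can be sketched as a softmax over per-feature scores followed by a weighted linear classifier. All parameter values below are illustrative; the real networks are learned, not hand-set.

```python
import math

# Sketch of attention-weighted domain classification: the attention
# scores become normalized importances, which scale each feature before
# the (here linear) domain classifier combines them.

def attention_weights(scores):
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def domain_logit(features, attn_scores, classifier_w):
    weights = attention_weights(attn_scores)
    weighted = [w * f for w, f in zip(weights, features)]
    return sum(c * x for c, x in zip(classifier_w, weighted))

# equal attention scores give each feature equal importance
logit = domain_logit(features=[2.0, 4.0], attn_scores=[0.0, 0.0],
                     classifier_w=[1.0, 1.0])
```

During adversarial training, the feature extractor ascends this domain loss through the attention weights, so the most domain-indicative features are the ones most strongly pushed toward invariance.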

    ADVERSARIAL SPEAKER ADAPTATION

    Publication No.: US20200335085A1

    Publication Date: 2020-10-22

    Application No.: US16460027

    Filing Date: 2019-07-02

    IPC Classification: G10L15/06 G10L15/02 G10L15/22

    Abstract: Embodiments are associated with a speaker-independent acoustic model capable of classifying senones based on input speech frames and on first parameters of the speaker-independent acoustic model, a speaker-dependent acoustic model capable of classifying senones based on input speech frames and on second parameters of the speaker-dependent acoustic model, and a discriminator capable of receiving data from the speaker-dependent acoustic model and data from the speaker-independent acoustic model and outputting a prediction of whether received data was generated by the speaker-dependent acoustic model, based on third parameters. The second parameters are initialized based on the first parameters; the second parameters are trained based on input frames of a target speaker to minimize a senone classification loss associated with the second parameters; a portion of the second parameters is trained based on the input frames of the target speaker to maximize a discrimination loss associated with the discriminator; and the third parameters are trained based on the input frames of the target speaker to minimize the discrimination loss.
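The speaker-dependent (SD) model's training signal can be summarized as one combined objective: minimize the senone classification loss while maximizing the discriminator's loss, so the SD model's internal representations stay close to the speaker-independent (SI) model's. A toy sketch; the weighting scheme and variable names are assumptions.

```python
# Sketch of the adversarial speaker-adaptation objective for the SD
# model: senone loss minus a weighted discrimination loss, so gradient
# descent on this objective maximizes the discriminator's loss.

def sd_total_objective(senone_loss, discrimination_loss, adv_weight=0.5):
    return senone_loss - adv_weight * discrimination_loss

si_w = 1.0      # a speaker-independent parameter (illustrative)
sd_w = si_w     # SD parameters are initialized from the SI parameters
obj = sd_total_objective(senone_loss=2.0, discrimination_loss=1.0)
```

The discriminator itself is trained in the opposite direction, minimizing the discrimination loss, which is the usual adversarial push-pull: it learns to tell SD from SI features while the SD model learns to make them indistinguishable.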

    OBJECT DATA STORAGE

    Publication No.: US20240212394A1

    Publication Date: 2024-06-27

    Application No.: US18599139

    Filing Date: 2024-03-07

    Abstract: The disclosure herein describes systems and methods for object data storage. In some examples, the method includes: generating a profile for an object in a directory, the profile including a first feature vector corresponding to the object and a globally unique identifier (GUID) corresponding to the first feature vector in the profile; generating a search scope, the search scope including at least the GUID corresponding to the profile; generating a second feature vector from a live image scan; matching the generated second feature vector from the live image scan to the first feature vector using the generated search scope; identifying the GUID corresponding to the first feature vector that matches the second feature vector; and outputting information corresponding to the object of the profile identified by the GUID corresponding to the first feature vector.
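The matching flow — live-scan vector against stored profile vectors within a search scope, resolving to a GUID — can be sketched with a simple similarity search. The cosine-similarity metric and the threshold are assumptions for illustration; the abstract does not specify how vectors are compared.

```python
import math

# Sketch of GUID resolution: each profile stores (guid, feature_vector);
# the live-scan vector is matched against the search scope by cosine
# similarity, and the best match's GUID is returned if similar enough.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def match_guid(live_vector, search_scope, threshold=0.9):
    """search_scope: list of (guid, feature_vector) pairs."""
    guid, vec = max(search_scope, key=lambda p: cosine(live_vector, p[1]))
    return guid if cosine(live_vector, vec) >= threshold else None

scope = [("guid-1", [1.0, 0.0]), ("guid-2", [0.0, 1.0])]
guid = match_guid([0.9, 0.1], scope)
```

Restricting the comparison to the search scope, rather than the whole directory, is what keeps the lookup fast as the directory grows.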

    CONTEXTUAL SPELLING CORRECTION (CSC) FOR AUTOMATIC SPEECH RECOGNITION (ASR)

    Publication No.: US20220415314A1

    Publication Date: 2022-12-29

    Application No.: US17823887

    Filing Date: 2022-08-31

    IPC Classification: G10L15/19 G10L15/16

    Abstract: Novel solutions for speech recognition provide contextual spelling correction (CSC) for automatic speech recognition (ASR). Disclosed examples include receiving an audio stream; performing an ASR process on the audio stream to produce an ASR hypothesis; receiving a context list; and, based on at least the ASR hypothesis and the context list, performing spelling correction to produce an output text sequence. A contextual spelling correction (CSC) model is used on top of an ASR model, precluding the need to change the original ASR model. This permits run-time user customization based on contextual data, even for large context lists. Some examples include filtering ASR hypotheses for the audio stream and, based on at least that filtering, determining whether to trigger spelling correction for the ASR hypothesis. Some examples include generating text-to-speech (TTS) audio using preprocessed transcriptions with context phrases to train the CSC model.
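The hypothesis-filtering trigger — run the (comparatively expensive) CSC model only when the ASR output looks close to something in the context list — can be sketched with a simple edit-distance heuristic. The edit-distance criterion and threshold are illustrative assumptions; the patent's actual filtering mechanism is not specified in the abstract.

```python
# Sketch of a CSC trigger: spelling correction fires only when some
# hypothesis token is within a small edit distance of a context entry.

def edit_distance(a, b):
    """Levenshtein distance via a rolling single-row DP table."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def should_trigger(hypothesis, context_list, max_dist=1):
    return any(edit_distance(w, c) <= max_dist
               for w in hypothesis.split() for c in context_list)

# "jon" is one edit away from the context entry "john", so CSC triggers
trigger = should_trigger("call jon smith", ["john"])
```

Keeping the trigger cheap means the CSC model runs only on the utterances where the context list could plausibly change the transcription, which is what makes large personalized context lists practical at run time.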