-
Publication No.: US20240135919A1
Publication Date: 2024-04-25
Application No.: US17983660
Filing Date: 2022-11-09
Inventors: Rui ZHAO, Jian XUE, Sarangarajan PARTHASARATHY, Jinyu LI
IPC Classification: G10L15/16, G10L15/06, G10L15/197, G10L15/22
CPC Classification: G10L15/16, G10L15/063, G10L15/197, G10L15/22
Abstract: Systems and methods are provided for accessing a factorized neural transducer comprising a first set of layers for predicting blank tokens and a second set of layers for predicting vocabulary tokens. The second set of layers comprises a language model that includes a vocabulary predictor separate from the blank predictor, wherein the vocabulary predictor output and the encoder output are used to predict a vocabulary token. The second set of layers is selectively modified to improve the accuracy of the factorized neural transducer in performing automatic speech recognition, the selective modification comprising applying a particular modification to the second set of layers while refraining from applying it to the first set of layers.
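The factorized structure described in the abstract can be sketched as follows. This is a toy illustration only: the dimensions, weight matrices, and the simple linear "predictors" are invented for demonstration, and the patent does not specify any particular architecture. The key point is that blank and vocabulary scores come from separate parameter sets, so one can be modified (e.g. adapted on text) while the other is left untouched.

```python
import numpy as np

# Hypothetical toy dimensions; not taken from the patent.
rng = np.random.default_rng(0)
D, V = 8, 5  # feature size, vocabulary size

# First set of layers: blank predictor (scores only the blank token).
W_blank = rng.standard_normal((2 * D, 1))
# Second set of layers: a separate vocabulary predictor acting as a language model.
W_vocab = rng.standard_normal((2 * D, V))

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def factorized_step(enc_out, pred_state):
    """One decoding step: blank and vocabulary scores come from
    separate stacks, so the vocabulary stack can be fine-tuned
    while W_blank stays frozen."""
    joint_in = np.concatenate([enc_out, pred_state])
    blank_logit = joint_in @ W_blank    # first set of layers
    vocab_logits = joint_in @ W_vocab   # second set of layers (LM-like)
    return blank_logit, log_softmax(vocab_logits)

enc_out = rng.standard_normal(D)
pred_state = rng.standard_normal(D)
blank_logit, vocab_logp = factorized_step(enc_out, pred_state)
```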
-
Publication No.: US20220139380A1
Publication Date: 2022-05-05
Application No.: US17154956
Filing Date: 2021-01-21
Inventors: Zhong MENG, Sarangarajan PARTHASARATHY, Xie SUN, Yashesh GAUR, Naoyuki KANDA, Liang LU, Xie CHEN, Rui ZHAO, Jinyu LI, Yifan GONG
IPC Classification: G10L15/16, G06N3/04, G10L15/06, G10L15/01, G10L15/183
Abstract: A computer device is provided that includes one or more processors configured to receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source domain, and to receive an external language model that has been trained with training data from a target domain. The one or more processors are configured to perform an inference of the probability of an output token sequence given a sequence of input speech features. Performing the inference includes computing an E2E model score, computing an external language model score, and computing an estimated internal language model score for the E2E model. The estimated internal language model score is computed by removing a contribution of an intrinsic acoustic model. The one or more processors are further configured to compute an integrated score based at least on the E2E model score, the external language model score, and the estimated internal language model score.
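The score integration described above amounts to an interpolation in log-probability space: the external language model score is added while the estimated internal language model score is subtracted, so the external LM can take the internal LM's place. A minimal sketch, with interpolation weights that are hypothetical (the abstract leaves them unspecified):

```python
def integrated_score(e2e_logp, ext_lm_logp, ilm_logp,
                     ext_weight=0.6, ilm_weight=0.4):
    """Combine log-probability scores for one hypothesis.
    The internal LM contribution is subtracted so the
    target-domain external LM can replace it."""
    return e2e_logp + ext_weight * ext_lm_logp - ilm_weight * ilm_logp

# -2.0 + 0.6 * (-1.5) - 0.4 * (-1.0) = -2.5
score = integrated_score(-2.0, -1.5, -1.0)
```

In decoding, this score would be computed per hypothesis and the highest-scoring token sequence kept.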
-
Publication No.: US20200335122A1
Publication Date: 2020-10-22
Application No.: US16434665
Filing Date: 2019-06-07
Inventors: Zhong MENG, Yong ZHAO, Jinyu LI, Yifan GONG
Abstract: To generate substantially condition-invariant and speaker-discriminative features, embodiments are associated with a feature extractor capable of extracting features from speech frames based on first parameters, a speaker classifier capable of identifying a speaker based on the features and on second parameters, and a condition classifier capable of identifying a noise condition based on the features and on third parameters. The first parameters of the feature extractor and the second parameters of the speaker classifier are trained to minimize a speaker classification loss, the first parameters of the feature extractor are further trained to maximize a condition classification loss, and the third parameters of the condition classifier are trained to minimize the condition classification loss.
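The "minimize one loss while maximizing another" training scheme above is commonly realized with a gradient-reversal layer: the condition classifier minimizes its loss as usual, while the reversed gradient pushes the feature extractor to maximize it, yielding condition-invariant features. A minimal sketch (the class name and scale value are illustrative choices, not from the patent):

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; negates (and scales) gradients in
    the backward pass. Layers upstream of this node are therefore
    trained to *maximize* the loss that downstream layers minimize."""
    def __init__(self, scale=1.0):
        self.scale = scale  # hypothetical adversarial weight

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_out):
        return -self.scale * grad_out  # reversed, scaled gradient

grl = GradientReversal(scale=0.5)
features = np.array([1.0, -2.0])
passed = grl.forward(features)
reversed_grad = grl.backward(np.array([0.4, 0.4]))
```

Inserted between the feature extractor and the condition classifier, this single node implements the opposing objectives without any change to the optimizer.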
-
Publication No.: US20200334527A1
Publication Date: 2020-10-22
Application No.: US16414378
Filing Date: 2019-05-16
Inventors: Amit DAS, Jinyu LI, Changliang LIU, Yifan GONG
Abstract: According to some embodiments, a universal modeling system may include a plurality of domain expert models, each of which receives raw input data (e.g., a stream of audio frames containing speech utterances) and provides a domain expert output based on the raw input data. A neural mixture component may then generate a weight corresponding to each domain expert model based on information created by the plurality of domain expert models (e.g., hidden features and/or row convolution). The weights might be associated with, for example, constrained scalar numbers, unconstrained scalar numbers, vectors, matrices, etc. An output layer may provide a universal modeling system output (e.g., an automatic speech recognition result) based on each domain expert output after it is multiplied by the corresponding weight for that domain expert model.
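For the constrained-scalar case above, the neural mixture component can be pictured as a softmax gate: each expert's hidden features produce one logit, the softmax constrains the per-expert weights to sum to one, and the output layer combines the weighted expert outputs. All shapes and names below are illustrative; the patent does not fix an architecture.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def universal_output(expert_outputs, gate_features, gate_weights):
    """expert_outputs: (n_experts, n_classes) scores from the experts.
    gate_features: (n_experts, d) hidden features the experts expose.
    gate_weights: (d,) parameters of the neural mixture component.
    Each expert receives one constrained scalar weight via softmax."""
    gate_logits = gate_features @ gate_weights  # one logit per expert
    w = softmax(gate_logits)                    # weights sum to 1
    return w @ expert_outputs                   # weighted combination

rng = np.random.default_rng(1)
outputs = rng.standard_normal((3, 4))   # 3 domain experts, 4 output classes
feats = rng.standard_normal((3, 6))
gate_w = rng.standard_normal(6)
y = universal_output(outputs, feats, gate_w)
```

The unconstrained-scalar, vector, and matrix variants mentioned in the abstract would drop the softmax or broaden the weight shape, respectively.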
-
Publication No.: US20220351718A1
Publication Date: 2022-11-03
Application No.: US17244891
Filing Date: 2021-04-29
Inventors: Yu WU, Jinyu LI, Shujie LIU, Xie CHEN, Chengyi WANG
Abstract: A computing system is configured to generate a transformer-transducer-based deep neural network comprising a transformer encoder network and a transducer predictor network. The transformer encoder network has a plurality of layers, each of which includes a multi-head attention network sublayer and a feed-forward network sublayer. The computing system trains an end-to-end (E2E) automatic speech recognition (ASR) model using the transformer-transducer-based deep neural network. The E2E ASR model has one or more adjustable hyperparameters that are configured to dynamically adjust the efficiency or performance of the E2E ASR model when it is deployed onto, or executed by, a device.
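The two sublayers named in the abstract, multi-head attention followed by a feed-forward network, can be sketched for one encoder layer as below. For brevity this sketch is single-head and uses plain numpy with random weights; the actual multi-head, trained implementation is not specified by the abstract.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def encoder_layer(x, Wq, Wk, Wv, W1, W2):
    """One transformer encoder layer: a self-attention sublayer and a
    feed-forward sublayer, each wrapped in a residual connection."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v  # scaled dot-product
    x = layer_norm(x + attn)                 # attention sublayer + residual
    ffn = np.maximum(x @ W1, 0.0) @ W2       # ReLU feed-forward sublayer
    return layer_norm(x + ffn)               # FFN sublayer + residual

rng = np.random.default_rng(2)
T, D, H = 5, 8, 16  # frames, model width, FFN width (toy values)
x = rng.standard_normal((T, D))
y = encoder_layer(
    x,
    rng.standard_normal((D, D)), rng.standard_normal((D, D)),
    rng.standard_normal((D, D)),
    rng.standard_normal((D, H)), rng.standard_normal((H, D)),
)
```

Stacking several such layers gives the encoder; the transducer predictor network consumes previously emitted tokens and is combined with the encoder output for decoding.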
-
Publication No.: US20220165290A1
Publication Date: 2022-05-26
Application No.: US17537831
Filing Date: 2021-11-30
Inventors: Zhong MENG, Yong ZHAO, Jinyu LI, Yifan GONG
Abstract: To generate substantially condition-invariant and speaker-discriminative features, embodiments are associated with a feature extractor capable of extracting features from speech frames based on first parameters, a speaker classifier capable of identifying a speaker based on the features and on second parameters, and a condition classifier capable of identifying a noise condition based on the features and on third parameters. The first parameters of the feature extractor and the second parameters of the speaker classifier are trained to minimize a speaker classification loss, the first parameters of the feature extractor are further trained to maximize a condition classification loss, and the third parameters of the condition classifier are trained to minimize the condition classification loss.
-
Publication No.: US20200335108A1
Publication Date: 2020-10-22
Application No.: US16523517
Filing Date: 2019-07-26
Inventors: Zhong MENG, Jinyu LI, Yifan GONG
Abstract: To generate substantially domain-invariant and speaker-discriminative features, embodiments are associated with: a feature extractor to receive speech frames and extract features from them based on a first set of parameters; a senone classifier to identify a senone based on the received features and on a second set of parameters; an attention network capable of determining, based on a third set of parameters, the relative importance to domain classification of the features extracted by the feature extractor; a domain classifier capable of classifying a domain based on the features, the relative importances, and a fourth set of parameters; and a training platform. The training platform trains the first set of parameters of the feature extractor and the second set of parameters of the senone classifier to minimize the senone classification loss, trains the first set of parameters of the feature extractor to maximize the domain classification loss, and trains the third set of parameters of the attention network and the fourth set of parameters of the domain classifier to minimize the domain classification loss.
-
Publication No.: US20200335085A1
Publication Date: 2020-10-22
Application No.: US16460027
Filing Date: 2019-07-02
Inventors: Zhong MENG, Jinyu LI, Yifan GONG
Abstract: Embodiments are associated with a speaker-independent acoustic model capable of classifying senones based on input speech frames and on first parameters of the speaker-independent acoustic model, a speaker-dependent acoustic model capable of classifying senones based on input speech frames and on second parameters of the speaker-dependent acoustic model, and a discriminator capable of receiving data from both acoustic models and outputting, based on third parameters, a prediction of whether the received data was generated by the speaker-dependent acoustic model. The second parameters are initialized based on the first parameters and are trained on input frames of a target speaker to minimize a senone classification loss; a portion of the second parameters is further trained on the input frames of the target speaker to maximize a discrimination loss associated with the discriminator, while the third parameters are trained on the same frames to minimize the discrimination loss.
-
Publication No.: US20240212394A1
Publication Date: 2024-06-27
Application No.: US18599139
Filing Date: 2024-03-07
Inventors: William Louis THOMAS, Jinyu LI, Yang CHEN, Youyou HAN OPPENLANDER, Steven John BOWLES, Qingfen LIN
CPC Classification: G06V40/50, G06F16/285, G06V10/751, G06V40/168, G06V40/172
Abstract: The disclosure herein describes systems and methods for object data storage. In some examples, the method includes generating a profile for an object in a directory, the profile including a first feature vector corresponding to the object and a global unique identifier (GUID) corresponding to the first feature vector; generating a search scope that includes at least the GUID corresponding to the profile; generating a second feature vector from a live image scan; matching the second feature vector to the first feature vector using the generated search scope; identifying the GUID corresponding to the first feature vector that matches the second feature vector; and outputting information corresponding to the object of the profile identified by that GUID.
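The matching step described above can be pictured as a nearest-neighbor search over the feature vectors in the search scope, keyed by GUID. The sketch below uses cosine similarity with a hypothetical threshold; the patent does not state which similarity measure or threshold is used.

```python
import numpy as np

def match_guid(live_vec, profiles, threshold=0.8):
    """profiles: {guid: first_feature_vector} forming the search scope.
    Returns the GUID whose stored vector best matches the vector from
    the live image scan, or None if nothing clears the threshold."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    best_guid, best_sim = None, threshold
    for guid, vec in profiles.items():
        sim = cos(live_vec, vec)
        if sim > best_sim:
            best_guid, best_sim = guid, sim
    return best_guid

# Toy search scope with two profiles (GUIDs and vectors are invented).
scope = {
    "guid-001": np.array([1.0, 0.0, 0.0]),
    "guid-002": np.array([0.0, 1.0, 0.0]),
}
live = np.array([0.9, 0.1, 0.0])  # second feature vector from a live scan
```

Here `match_guid(live, scope)` identifies `"guid-001"`, whose profile information would then be output.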
-
Publication No.: US20220415314A1
Publication Date: 2022-12-29
Application No.: US17823887
Filing Date: 2022-08-31
Inventors: Xiaoqiang WANG, Yanqing LIU, Sheng ZHAO, Jinyu LI
Abstract: Novel solutions for speech recognition provide contextual spelling correction (CSC) for automatic speech recognition (ASR). Disclosed examples include receiving an audio stream; performing an ASR process on the audio stream to produce an ASR hypothesis; receiving a context list; and, based on at least the ASR hypothesis and the context list, performing spelling correction to produce an output text sequence. The CSC model is used on top of an ASR model, precluding the need to change the original ASR model. This permits run-time user customization based on contextual data, even for large context lists. Some examples include filtering ASR hypotheses for the audio stream and, based on at least that filtering, determining whether to trigger spelling correction for the ASR hypothesis. Some examples include generating text-to-speech (TTS) audio using preprocessed transcriptions with context phrases to train the CSC model.
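The effect of running correction on top of an unmodified ASR output can be illustrated with a toy string-similarity pass: each hypothesis word is compared against the user's context list and replaced when the similarity clears a trigger threshold. This sketch is only an intuition aid; the patent's CSC model is a trained neural model, and the threshold and matching scheme here are invented.

```python
from difflib import SequenceMatcher

def correct_with_context(hypothesis, context_list, trigger=0.75):
    """Replace hypothesis words with close matches from the context
    list; words below the trigger threshold are left untouched, so
    the original ASR model's output is never modified wholesale."""
    corrected = []
    for word in hypothesis.split():
        best, best_ratio = word, trigger
        for phrase in context_list:
            ratio = SequenceMatcher(None, word.lower(), phrase.lower()).ratio()
            if ratio > best_ratio:
                best, best_ratio = phrase, ratio
        corrected.append(best)
    return " ".join(corrected)

# "jon" is close to the context entry "John", so it is corrected;
# "call" matches nothing in the context list and is kept as-is.
out = correct_with_context("call jon smith", ["John", "Smith"])
```

Because the context list is supplied at run time, a user's contact names can steer the correction without retraining or touching the ASR model itself.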