-
Publication No.: US11907845B2
Publication Date: 2024-02-20
Application No.: US16994656
Filing Date: 2020-08-17
Inventors: Takashi Fukuda, Samuel Thomas
Abstract: Some embodiments of the present invention are directed to techniques for training teacher neural networks (TNNs) and student neural networks (SNNs). A training data set is received with a lossless set of data and a corresponding lossy set of data. Two branches of a TNN are established: one trained on the lossless data (the lossless branch) and one trained on the lossy data (the lossy branch). The weights of the two branches are tied together. The lossy branch, now isolated from the lossless branch, generates a set of soft targets for initializing an SNN. Because the branch weights were tied during training, these soft targets benefit from the lossless branch's training even though the lossless branch is isolated from the lossy branch during soft-target generation.
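The tying idea can be sketched in a few lines of pure Python. This is a minimal illustration, not the patented method: all class names and the linear-model branches are hypothetical, and the key point is only that both branches reference one shared weight list, so training the lossless branch also moves the weights the lossy branch later uses alone.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, used to produce soft targets."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

class TiedTeacher:
    """Two linear branches that share (tie) one weight matrix."""
    def __init__(self, dim, n_classes):
        # A single weight matrix serves both branches.
        self.w = [[0.0] * dim for _ in range(n_classes)]

    def logits(self, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]

    def train_step(self, x_lossless, label, lr=0.1):
        """One SGD step driven by the lossless branch; the lossy
        branch sees the update too because the weights are tied."""
        probs = softmax(self.logits(x_lossless))
        for c, row in enumerate(self.w):
            err = probs[c] - (1.0 if c == label else 0.0)
            for j, xj in enumerate(x_lossless):
                row[j] -= lr * err * xj

    def soft_targets(self, x_lossy, temperature=2.0):
        """Soft targets from the lossy branch alone; the lossless
        branch is not consulted at generation time."""
        return softmax(self.logits(x_lossy), temperature)

teacher = TiedTeacher(dim=3, n_classes=2)
for _ in range(50):
    teacher.train_step([1.0, 0.0, 0.5], label=0)
targets = teacher.soft_targets([0.9, 0.1, 0.4])
```

Even though only lossless inputs drove the updates, the lossy branch's soft targets reflect them, because there is only one set of weights.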
-
Publication No.: US20240038221A1
Publication Date: 2024-02-01
Application No.: US17815798
Filing Date: 2022-07-28
Inventors: Sashi Novitasari, Takashi Fukuda, Gakuto Kurata
CPC Classification: G10L15/16, G10L25/78, G10L15/22, G10L15/063, G10L15/20
Abstract: Systems, computer-implemented methods, and computer program products are provided that facilitate multi-task training of a recurrent neural network transducer (RNN-T) using automatic speech recognition (ASR) information. According to an embodiment, a system can comprise a memory that stores computer-executable components and a processor that executes the components stored in the memory. The components can include an RNN-T that receives ASR information and a voice activity detection (VAD) model that trains the RNN-T using that information, where the RNN-T further comprises an encoder and a joint network. One or more outputs of the encoder can be integrated with the joint network together with one or more outputs of the VAD model.
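One plausible reading of the integration step is that the joint network combines encoder and prediction-network outputs as usual, while the VAD posterior biases probability mass toward the blank symbol on non-speech frames. The sketch below assumes that reading; the toy VAD, the additive combination, and the blank-bias term are all illustrative choices, not the claimed architecture.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def vad_posterior(frame):
    """Toy VAD: frame energy mapped through a sigmoid to a speech
    probability (purely illustrative)."""
    energy = sum(x * x for x in frame)
    return 1.0 / (1.0 + math.exp(-(energy - 1.0)))

def joint(enc_out, pred_out, speech_prob, blank_bias=2.0):
    """Joint network: add encoder and prediction outputs, then let
    the VAD output push mass toward the blank symbol (index 0) when
    the frame looks like non-speech."""
    logits = [e + p for e, p in zip(enc_out, pred_out)]
    logits[0] += blank_bias * (1.0 - speech_prob)
    return softmax(logits)

silence = [0.01, 0.02]   # low-energy frame
speech = [1.5, 1.2]      # high-energy frame
enc = [0.2, 0.5, 0.3]    # toy encoder output (blank, 'a', 'b')
pred = [0.1, 0.4, 0.2]   # toy prediction-network output
p_sil = joint(enc, pred, vad_posterior(silence))
p_sp = joint(enc, pred, vad_posterior(speech))
```

On the silent frame the blank symbol absorbs more probability, which is the behavior a VAD-informed joint network would be after.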
-
Publication No.: US20230237987A1
Publication Date: 2023-07-27
Application No.: US17580846
Filing Date: 2022-01-21
Inventors: Takashi Fukuda, Tohru Nagano
CPC Classification: G10L15/02, G10L15/063, G06F7/24, G10L2015/025
Abstract: A computer-implemented method for preparing training data for a speech recognition model is provided. The method includes obtaining a plurality of sentences from a corpus, dividing each phoneme in each sentence into three hidden states, calculating for each sentence a score based on the variation in duration of the three hidden states of each of its phonemes, and sorting the sentences by the calculated scores.
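A minimal sketch of the scoring-and-sorting step, with two assumptions the abstract leaves open: "variation" is taken to be the variance of the three state durations, and sentences are ranked with the lowest (most even) variation first. Both choices are illustrative.

```python
from statistics import pvariance

def sentence_score(phoneme_durations):
    """phoneme_durations: per phoneme, the frame counts of its three
    hidden states, e.g. [(3, 5, 2), (4, 4, 4)].  The score is the
    mean variance of the state durations: low values mean the three
    states are evenly occupied."""
    variances = [pvariance(states) for states in phoneme_durations]
    return sum(variances) / len(variances)

def sort_sentences(sentences):
    """Sort (sentence, durations) pairs by ascending score."""
    return sorted(sentences, key=lambda s: sentence_score(s[1]))

corpus = [
    ("uneven", [(10, 1, 1), (1, 12, 1)]),
    ("even", [(4, 4, 5), (3, 4, 3)]),
]
ranked = sort_sentences(corpus)
```

Sorting by such a score lets the pipeline pick, say, the sentences with the most stable state alignments as training data.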
-
Publication No.: US20230153601A1
Publication Date: 2023-05-18
Application No.: US17526350
Filing Date: 2021-11-15
Inventors: Takashi Fukuda, Samuel Thomas
CPC Classification: G06N3/08, G06N3/0454, G10L15/00
Abstract: A computer-implemented method for training a neural transducer for speech recognition is provided. The method includes initializing the neural transducer, which has a prediction network, an encoder network, and a joint network. The method further includes expanding the prediction network into a plurality of prediction-net branches, each of which is a prediction network for a respective specific sub-task. The method also includes training, by a hardware processor, the entire neural transducer using training data sets for all of the specific sub-tasks, and obtaining a trained neural transducer by fusing the prediction-net branches.
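The expand-then-fuse flow can be sketched with the prediction network reduced to a flat dict of parameter vectors. The abstract does not fix the fusion operator; parameter averaging is assumed here purely for illustration.

```python
import copy

def expand(prediction_net, n_subtasks):
    """Clone the prediction network into one branch per sub-task."""
    return [copy.deepcopy(prediction_net) for _ in range(n_subtasks)]

def fuse(branches):
    """Fuse trained branches by parameter averaging (one plausible
    reading of 'fusing'; the patent does not specify the operator)."""
    n = len(branches)
    fused = {}
    for name in branches[0]:
        fused[name] = [sum(b[name][i] for b in branches) / n
                       for i in range(len(branches[0][name]))]
    return fused

# A prediction network reduced to named parameter vectors.
base = {"embed": [0.0, 0.0], "out": [0.0]}
branches = expand(base, 3)
# Pretend each branch was trained on its own sub-task:
branches[0]["embed"] = [3.0, 0.0]
branches[1]["embed"] = [0.0, 3.0]
branches[2]["embed"] = [3.0, 3.0]
fused = fuse(branches)
```

The fused network has a single prediction branch again, so inference cost matches the original transducer.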
-
Publication No.: US11416741B2
Publication Date: 2022-08-16
Application No.: US16003790
Filing Date: 2018-06-08
Abstract: A technique for constructing a model supporting a plurality of domains is disclosed. A plurality of teacher models is prepared, each specialized for a different one of the domains, and a plurality of training data collections is obtained, each collected for a different one of the domains. A plurality of soft label sets is generated by inputting each training data item in the training data collections into the corresponding teacher model. A student model is then trained using the soft label sets.
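The soft-label generation step can be sketched as follows. The fixed lambda "teachers", the domain names, and the distillation temperature are all stand-ins; the only structural point taken from the abstract is that each domain's data goes through that domain's own teacher.

```python
import math

def softmax(logits, t=1.0):
    m = max(logits)
    e = [math.exp((x - m) / t) for x in logits]
    s = sum(e)
    return [x / s for x in e]

def make_soft_labels(teachers, data_by_domain, temperature=2.0):
    """Run each domain's data only through that domain's teacher and
    collect (input, soft_label) pairs for training a single student."""
    pairs = []
    for domain, samples in data_by_domain.items():
        teacher = teachers[domain]  # domain-specialized teacher
        for x in samples:
            pairs.append((x, softmax(teacher(x), temperature)))
    return pairs

# Toy teachers: fixed linear scorers standing in for trained models.
teachers = {
    "near_field": lambda x: [2.0 * x[0], x[1]],
    "far_field": lambda x: [x[0], 2.0 * x[1]],
}
data = {"near_field": [[1.0, 0.2]], "far_field": [[0.3, 1.0]]}
soft = make_soft_labels(teachers, data)
```

The student then sees one pooled collection of soft labels, so a single model absorbs all of the per-domain expertise.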
-
Publication No.: US11410029B2
Publication Date: 2022-08-09
Application No.: US15860097
Filing Date: 2018-01-02
IPC Classification: G06N3/08
Abstract: A technique for generating soft labels for training is disclosed. A teacher model having a teacher-side class set is prepared, and a collection of class pairs for respective data units is obtained. Each class pair includes the classes labelled to a corresponding data unit from the teacher-side class set and from a student-side class set different from the teacher-side class set. A training input is fed into the teacher model to obtain a set of outputs for the teacher-side class set. From these outputs, a set of soft labels for the student-side class set is calculated using at least an output obtained for a class within a subset of the teacher-side class set that has relevance to a member of the student-side class set, based at least in part on observations in the collection of class pairs.
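One way to read the calculation is: the class-pair observations define, for each student-side class, the subset of teacher-side classes relevant to it, and the teacher's posteriors are aggregated over that subset. The sketch below assumes summation plus renormalization as the aggregation rule, and uses hypothetical state names.

```python
from collections import defaultdict

def relevance_map(class_pairs):
    """From observed (teacher_class, student_class) pairs, collect
    the teacher-side classes relevant to each student-side class."""
    rel = defaultdict(set)
    for t_cls, s_cls in class_pairs:
        rel[s_cls].add(t_cls)
    return rel

def student_soft_labels(teacher_output, rel, student_classes):
    """Aggregate teacher posteriors over each student class's
    relevant subset, then renormalize (an assumed rule)."""
    raw = [sum(teacher_output[t] for t in rel[s])
           for s in student_classes]
    total = sum(raw)
    return [r / total for r in raw]

pairs = [("AA_1", "a"), ("AA_2", "a"), ("IY_1", "i")]
teacher_out = {"AA_1": 0.5, "AA_2": 0.3, "IY_1": 0.2}
labels = student_soft_labels(teacher_out, relevance_map(pairs),
                             ["a", "i"])
```

This lets a student with a different (e.g. coarser or differently named) class inventory still learn from the teacher's full posterior.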
-
Publication No.: US11227579B2
Publication Date: 2022-01-18
Application No.: US16535829
Filing Date: 2019-08-08
Inventors: Toru Nagano, Takashi Fukuda, Masayuki Suzuki, Gakuto Kurata
IPC Classification: G10L13/033, G10L15/18, G06F40/205, G06F40/284
Abstract: A technique for data augmentation of speech data is disclosed. Original speech data including a sequence of feature frames is obtained. A partially prolonged copy of the original speech data is generated by inserting one or more new frames into the sequence of feature frames. The partially prolonged copy is output as augmented speech data for training an acoustic model.
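The frame-insertion step is simple to sketch. The abstract only requires inserting new frames; generating them by linear interpolation between the two neighboring frames is an assumed choice here.

```python
def prolong(frames, insert_after, n_copies=1):
    """Return a partially prolonged copy of a feature-frame sequence
    by inserting interpolated frames after index `insert_after`."""
    left, right = frames[insert_after], frames[insert_after + 1]
    mid = [(a + b) / 2.0 for a, b in zip(left, right)]
    return (frames[:insert_after + 1]
            + [list(mid) for _ in range(n_copies)]
            + frames[insert_after + 1:])

original = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
augmented = prolong(original, insert_after=0, n_copies=2)
```

The copy is only partially prolonged: frames outside the insertion point are untouched, so the augmented utterance mimics locally slower speech.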
-
Publication No.: US11037583B2
Publication Date: 2021-06-15
Application No.: US16116042
Filing Date: 2018-08-29
Inventors: Masayuki Suzuki, Takashi Fukuda, Toru Nagano
Abstract: A technique for detecting a music segment in an audio signal is disclosed. A time window is set for each section of the audio signal, and a maximum and a statistic of the signal within the window are calculated. A density index, which measures the statistic relative to the maximum, is computed for the section. The section is estimated to be a music segment based, at least in part, on a condition with respect to the density index.
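A minimal sketch of the density index, assuming the statistic is the mean absolute amplitude and the condition is a simple threshold (the abstract fixes neither). The intuition: music tends to fill a window densely with sustained energy, so its mean sits close to its peak, while bursty speech leaves a lower mean-to-peak ratio.

```python
def density_index(window):
    """Statistic (here: mean absolute amplitude) relative to the
    window maximum."""
    peak = max(abs(x) for x in window)
    if peak == 0.0:
        return 0.0
    mean = sum(abs(x) for x in window) / len(window)
    return mean / peak

def is_music(window, threshold=0.5):
    """Estimate a section as music when the density index is high."""
    return density_index(window) >= threshold

music_like = [0.8, -0.7, 0.9, -0.8, 0.85, -0.9]   # dense, sustained
speech_like = [0.9, 0.05, 0.02, 0.8, 0.03, 0.01]  # bursty with gaps
```

Because the index is a ratio, it is insensitive to overall volume, which is what makes it usable across recordings with different gain.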
-
Publication No.: US11003983B2
Publication Date: 2021-05-11
Application No.: US16670201
Filing Date: 2019-10-31
Inventor: Takashi Fukuda
IPC Classification: G06N3/04, G06N3/063, G10L21/0232, G10L15/20, G10L21/0208, G06N3/08, G10L15/16
Abstract: A computer-implemented method for training a front-end neural network ("front-end NN") and a back-end neural network ("back-end NN") is provided. The method includes combining the back-end NN with the front-end NN through a joint layer to generate a combined neural network, and training the combined network for speech recognition with a set of utterances as training data. The joint layer has a plurality of frames, each frame having a plurality of bins. During training, one or more specific units in each frame are dropped, each selected randomly or based on the bin number assigned to it within its frame, with the dropped units corresponding to one or more common frequency bands.
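The band-level dropping can be sketched as a mask applied at the joint layer. The assumption made here is that "common frequency bands" means the same bin indices are zeroed in every frame of the utterance; the function name and the fixed seed are illustrative.

```python
import random

def drop_frequency_bands(frames, n_bins, n_drop, rng=None):
    """Zero out the same `n_drop` randomly chosen bins in every
    frame, so the dropped units line up with common frequency bands
    across the whole utterance."""
    rng = rng or random.Random(0)
    dropped = set(rng.sample(range(n_bins), n_drop))
    masked = [[0.0 if b in dropped else v
               for b, v in enumerate(frame)]
              for frame in frames]
    return masked, dropped

frames = [[1.0] * 8 for _ in range(3)]
masked, dropped_bins = drop_frequency_bands(frames, n_bins=8,
                                            n_drop=2)
```

Dropping whole bands rather than independent units forces the back-end NN to cope with missing spectral regions, much as it would with band-limited or noisy input.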
-
Publication No.: US10832129B2
Publication Date: 2020-11-10
Application No.: US15288515
Filing Date: 2016-10-07
Abstract: A method for transferring the acoustic knowledge of a trained acoustic model (AM) to a neural network (NN) includes reading into memory the NN, the AM (trained with target-domain data), and a set of training data that includes phoneme data and was obtained from a domain different from the target domain. Training data from the set is input into the AM; one or more posterior probabilities are calculated for the context-dependent states corresponding to phonemes in the phoneme class of the phoneme to which each frame of the training data belongs; and a posterior probability vector is generated from these probabilities as a soft label for the NN. The training data is then input into the NN, and the NN is updated using the soft label.
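The soft-label construction can be sketched as a class-restricted posterior: states outside the frame's phoneme class are zeroed and the remainder is renormalized. The renormalization step, the state names, and the dict-based interface are assumptions for the sketch.

```python
import math

def class_restricted_soft_label(am_logits, state_to_phone,
                                phone_class):
    """Keep only context-dependent states whose phoneme falls in the
    frame's phoneme class, renormalize their posteriors, and zero
    the rest, yielding the soft-label vector for the NN."""
    exps = {s: math.exp(l) for s, l in am_logits.items()}
    total = sum(exps.values())
    post = {s: e / total for s, e in exps.items()}
    in_class = sum(p for s, p in post.items()
                   if state_to_phone[s] in phone_class)
    return {s: (post[s] / in_class
                if state_to_phone[s] in phone_class else 0.0)
            for s in post}

logits = {"AA_s1": 1.0, "AA_s2": 0.5, "IY_s1": 0.2}
s2p = {"AA_s1": "AA", "AA_s2": "AA", "IY_s1": "IY"}
label = class_restricted_soft_label(logits, s2p, {"AA"})
```

Restricting the posterior to the phoneme class keeps the soft label informative about within-class state ambiguity while discarding cross-class confusion from the out-of-domain data.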