-
Publication No.: US11900922B2
Publication Date: 2024-02-13
Application No.: US17093673
Filing Date: 2020-11-10
IPC Classes: G10L15/16, G10L15/08, G06F40/295, G06N3/04, G06F18/214
CPC Classes: G10L15/16, G06F18/2148, G06N3/04, G06F40/295, G10L2015/088
Abstract: Embodiments of the present invention provide computer-implemented methods, computer program products, and computer systems. For example, embodiments of the present invention can access one or more intents and associated entities from a limited amount of speech-to-text training data in a single language. Embodiments of the present invention can use the accessed intents and associated entities to locate speech-to-text training data in one or more other languages different from the single language. Embodiments of the present invention can then train a neural network based on the limited amount of speech-to-text training data in the single language and the located speech-to-text training data in the one or more other languages.
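The data-location step above can be sketched as a label-matching search: collect the intent/entity labels present in the small single-language seed set, then pull examples carrying the same labels from pools in other languages. All data, field names, and functions here are illustrative assumptions, not the patent's actual implementation.

```python
# Hypothetical sketch of the cross-lingual data-location step. The record
# layout ("text", "intent", "entities", "lang") is an assumption.

def extract_label_set(examples):
    """Collect the (intent, entity) pairs seen in the seed training data."""
    labels = set()
    for ex in examples:
        for entity in ex["entities"]:
            labels.add((ex["intent"], entity))
    return labels

def locate_matching_data(label_set, multilingual_pool):
    """Return pool examples (any language) sharing an intent/entity pair."""
    return [ex for ex in multilingual_pool
            if any((ex["intent"], e) in label_set for e in ex["entities"])]

seed = [{"text": "book a flight to Paris", "intent": "book_flight",
         "entities": ["destination"]}]
pool = [
    {"text": "reserva un vuelo a Madrid", "lang": "es",
     "intent": "book_flight", "entities": ["destination"]},
    {"text": "quel temps fait-il", "lang": "fr",
     "intent": "get_weather", "entities": ["location"]},
]

labels = extract_label_set(seed)
located = locate_matching_data(labels, pool)
combined_training_set = seed + located  # the network trains on both
```

The combined set then feeds a single training run, so the scarce single-language data is augmented rather than replaced.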
-
Publication No.: US11610108B2
Publication Date: 2023-03-21
Application No.: US16047287
Filing Date: 2018-07-27
Inventors: Takashi Fukuda, Masayuki Suzuki, Osamu Ichikawa, Gakuto Kurata, Samuel Thomas, Bhuvana Ramabhadran
Abstract: A student neural network may be trained by a computer-implemented method including: selecting a teacher neural network from among a plurality of teacher neural networks, inputting input data to the selected teacher neural network to obtain a soft-label output generated by the selected teacher neural network, and training the student neural network with at least the input data and the soft-label output from the selected teacher neural network.
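The select-then-distill loop can be illustrated with toy teachers. The selection criterion used here (lowest output entropy, i.e. the most confident teacher on this input) is an assumption for demonstration only; the abstract covers teacher selection generally, not this specific rule, and the "update step" below is a stand-in for a real gradient step toward the soft label.

```python
import math

def entropy(dist):
    """Shannon entropy of a probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def select_teacher(teachers, x):
    """Assumed rule: pick the teacher whose soft label is most confident."""
    return min(teachers, key=lambda t: entropy(t(x)))

# Two toy "teacher networks": each maps an input to a class distribution.
teacher_a = lambda x: [0.9, 0.1]  # confident
teacher_b = lambda x: [0.5, 0.5]  # uncertain

x = 0.0
teacher = select_teacher([teacher_a, teacher_b], x)
soft_label = teacher(x)  # soft label from the selected teacher

# One student update: nudge the student's output toward the soft label
# (a stand-in for a cross-entropy gradient step on a real network).
student_out = [0.6, 0.4]
lr = 0.5
student_out = [s + lr * (t - s) for s, t in zip(student_out, soft_label)]
```

Training on soft labels rather than hard ones lets the student absorb the selected teacher's full output distribution, which is the usual motivation for distillation.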
-
Publication No.: US20220319494A1
Publication Date: 2022-10-06
Application No.: US17218618
Filing Date: 2021-03-31
Abstract: An approach to training an end-to-end spoken language understanding model may be provided. A pre-trained general automatic speech recognition model may be adapted into a domain-specific spoken language understanding model. The pre-trained general automatic speech recognition model may be a recurrent neural network transducer model. The adaptation may use transcription data annotated with spoken language understanding labels. Audio data may also be provided in addition to the verbatim transcripts annotated with spoken language understanding labels. The spoken language understanding labels may be entity- and/or intent-based, with values associated with each label.
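The adaptation data described above can be pictured as audio paired with a verbatim transcript plus intent/entity labels carrying values. The record layout and the tag-sequence flattening below are hypothetical illustrations, not the patent's actual format.

```python
# Assumed shape of one adaptation example: paired audio, verbatim
# transcript, and SLU labels (intent plus entities with values).
adaptation_example = {
    "audio": "call_0001.wav",
    "transcript": "i want to fly to boston on tuesday",
    "labels": {
        "intent": "book_flight",
        "entities": [
            {"name": "destination", "value": "boston"},
            {"name": "travel_date", "value": "tuesday"},
        ],
    },
}

def slu_target_sequence(example):
    """Flatten the annotations into a target token sequence that an
    RNN-T-style model could be fine-tuned to emit with the transcript."""
    tags = ["<intent:%s>" % example["labels"]["intent"]]
    tags += ["<%s=%s>" % (e["name"], e["value"])
             for e in example["labels"]["entities"]]
    return example["transcript"].split() + tags

target = slu_target_sequence(adaptation_example)
```

Emitting semantic tags inline with the transcript tokens is one common way to turn a transducer-based ASR model into an end-to-end SLU model.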
-
Publication No.: US20220148581A1
Publication Date: 2022-05-12
Application No.: US17093673
Filing Date: 2020-11-10
Abstract: Embodiments of the present invention provide computer-implemented methods, computer program products, and computer systems. For example, embodiments of the present invention can access one or more intents and associated entities from a limited amount of speech-to-text training data in a single language. Embodiments of the present invention can use the accessed intents and associated entities to locate speech-to-text training data in one or more other languages different from the single language. Embodiments of the present invention can then train a neural network based on the limited amount of speech-to-text training data in the single language and the located speech-to-text training data in the one or more other languages.
-
55.
Publication No.: US11250872B2
Publication Date: 2022-02-15
Application No.: US16714719
Filing Date: 2019-12-14
Inventors: Samuel Thomas, Yinghui Huang, Masayuki Suzuki, Zoltan Tueske, Laurence P. Sansone, Michael A. Picheny
Abstract: A method, apparatus, and computer program product are provided for customizing an automatic closed captioning system. In some embodiments, at a data use (DU) location, an automatic closed captioning system that includes a base model is provided; search criteria are defined for requests to one or more data collection (DC) locations; a search request based on the search criteria is sent to the one or more DC locations; relevant closed caption data are received from the one or more DC locations responsive to the search request; the received relevant closed caption data are processed by computing a confidence score for each of a plurality of data sub-sets of the received data and selecting one or more of the sub-sets based on the confidence scores; and the automatic closed captioning system is customized by using the selected sub-sets to train the base model.
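The score-and-select step can be sketched as filtering caption sub-sets by a confidence threshold. The scoring heuristic and threshold below are illustrative assumptions; a real system would use model-based confidence rather than a word-count proxy.

```python
# Hedged sketch of selecting closed-caption data sub-sets by confidence.

def confidence_score(subset):
    """Toy proxy: fraction of caption lines with more than one word.
    (A stand-in for a real model-based confidence measure.)"""
    good = sum(1 for line in subset if len(line.split()) > 1)
    return good / len(subset) if subset else 0.0

def select_subsets(subsets, threshold=0.5):
    """Keep only sub-sets whose score clears the (assumed) threshold."""
    return [s for s in subsets if confidence_score(s) >= threshold]

received = [
    ["hello and welcome back", "today we discuss training"],  # clean
    ["uh", "", "ok"],                                         # noisy
]
selected = select_subsets(received)
# The selected sub-sets would then be used to train the base model.
```

Filtering before fine-tuning keeps low-quality captions from the DC locations out of the customization data.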
-
56.
Publication No.: US20210312906A1
Publication Date: 2021-10-07
Application No.: US16841787
Filing Date: 2020-04-07
Abstract: An illustrative embodiment includes a method for training an end-to-end (E2E) spoken language understanding (SLU) system. The method includes receiving a training corpus comprising a set of text classified using one or more sets of semantic labels but unpaired with speech, and using the set of unpaired text to train the E2E SLU system to classify speech using at least one of the one or more sets of semantic labels. The method may include training a text-to-intent model using the set of unpaired text and training a speech-to-intent model using the text-to-intent model. Alternatively or additionally, the method may include using a text-to-speech (TTS) system to generate synthetic speech from the unpaired text and training the E2E SLU system using the synthetic speech.
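The synthetic-speech path can be sketched in a few lines: labeled but unpaired text is run through a TTS stand-in to manufacture (speech, label) pairs for E2E SLU training. `fake_tts` is a placeholder, not a real text-to-speech system, and the data is invented for illustration.

```python
# Unpaired text classified with semantic (intent) labels but no audio.
unpaired_text = [
    ("turn on the lights", "lights_on"),
    ("what's the weather", "get_weather"),
]

def fake_tts(text):
    """Placeholder TTS: returns a dummy 'waveform' as a token list.
    A real system would synthesize actual audio here."""
    return ["<frame:%s>" % w for w in text.split()]

# Manufacture paired training data: synthetic speech + existing label.
synthetic_pairs = [(fake_tts(t), intent) for t, intent in unpaired_text]
# An E2E SLU model would now train on these speech-like inputs and labels.
```

This is the key trick of the abstract: text-only annotated corpora, which are far cheaper than transcribed audio, become usable for speech-input training.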
-
Publication No.: US10546575B2
Publication Date: 2020-01-28
Application No.: US15379038
Filing Date: 2016-12-14
Abstract: Audio features, such as perceptual linear prediction (PLP) features and time derivatives thereof, are extracted from frames of training audio data that includes speech by multiple speakers and silence, such as by using linear discriminant analysis (LDA). The frames are clustered into k-means clusters using distance measures, such as Mahalanobis distance measures, of the means and variances of the extracted audio features of the frames. A recurrent neural network (RNN) is trained on the extracted audio features of the frames and the cluster identifiers of the k-means clusters into which the frames have been clustered. The RNN is applied to audio data to divide it into segments that each correspond to one of the cluster identifiers. Each segment can be assigned a label corresponding to one of the cluster identifiers. Speech recognition can then be performed on the segments.
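A minimal sketch of the clustering step: frame-level feature vectors are grouped into k clusters and each frame receives the identifier of its nearest centroid. Plain Euclidean distance and a deterministic farthest-point initialization stand in for the Mahalanobis distance measures the abstract mentions; the RNN trained on (feature, cluster-id) pairs is not shown.

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(frames, k=2, iters=10):
    """Toy k-means: deterministic farthest-point init, then Lloyd updates."""
    centroids = [frames[0]]
    while len(centroids) < k:
        centroids.append(max(frames,
                             key=lambda f: min(dist2(f, c) for c in centroids)))
    for _ in range(iters):
        # Assign each frame the id of its nearest centroid.
        assign = [min(range(k), key=lambda c: dist2(f, centroids[c]))
                  for f in frames]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [f for f, a in zip(frames, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assign

# Two well-separated groups of toy 2-D "audio feature" frames.
frames = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
cluster_ids = kmeans(frames, k=2)
```

The resulting cluster identifiers serve as the training targets for the RNN, which then segments new audio by predicting a cluster id per frame.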
-
58.
Publication No.: US20190205431A1
Publication Date: 2019-07-04
Application No.: US15856505
Filing Date: 2017-12-28
Inventors: Anne E. Gattiker, Sujatha Kashyap, Minh Ngoc Binh Nguyen, Samuel Thomas, Kaipeng Li, Thomas Hubregtsen
IPC Classes: G06F17/30
CPC Classes: G06F16/583, G06F16/2425, G06F16/24578, G06F16/9535
Abstract: Examples of techniques for constructing, evaluating, and improving a search string for retrieving images are disclosed. In one example implementation according to aspects of the present disclosure, a computer-implemented method includes receiving, by a processing device, a plurality of images as search results returned based at least in part on a search string for an item in the form of a tuple including an item class, an action, and an actor. The method further includes determining, by the processing device, whether the search string is effective at indicating a common use of the item based on image similarity. The method further includes, based at least in part on determining that the search string is ineffective at indicating the item's use, generating, by the processing device, an alternative search string.
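The evaluate-and-refine loop can be sketched as: build a query from the (item class, action, actor) tuple, check whether the returned images look alike, and rewrite the query when they do not. The similarity measure (most-common tag fraction), the threshold, and the rewrite rule (generalizing the actor) are all illustrative assumptions, not the patent's actual method.

```python
from collections import Counter

def build_search_string(item_class, action, actor):
    """Assemble a query from the (item class, action, actor) tuple."""
    return "%s %s %s" % (actor, action, item_class)

def image_similarity(images):
    """Toy stand-in: fraction of images sharing the most common tag.
    A real system would compare visual features."""
    tags = [img["tag"] for img in images]
    return Counter(tags).most_common(1)[0][1] / len(tags)

def evaluate_and_refine(tuple_query, images, threshold=0.6):
    """Keep the query if results are similar enough; otherwise rewrite it
    (here: swap in a generic actor as the assumed alternative)."""
    query = build_search_string(*tuple_query)
    if image_similarity(images) >= threshold:
        return query  # effective: results indicate a common item use
    return build_search_string(tuple_query[0], tuple_query[1], "person")

results = [{"tag": "cutting"}, {"tag": "cutting"}, {"tag": "cooking"}]
query = evaluate_and_refine(("knife", "cutting", "chef"), results)
```

High similarity among results is taken as evidence the string captures one common use of the item; scattered results trigger the alternative string.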
-
59.
Publication No.: US10249292B2
Publication Date: 2019-04-02
Application No.: US15379010
Filing Date: 2016-12-14
Abstract: Speaker diarization is performed on audio data including speech by a first speaker, speech by a second speaker, and silence. The speaker diarization includes segmenting the audio data using a long short-term memory (LSTM) recurrent neural network (RNN) to identify change points that divide the audio data into segments. The speaker diarization includes assigning each segment a label selected from a group of labels using the LSTM RNN, the group of labels including labels corresponding to the first speaker, the second speaker, and the silence. Each change point is a transition from one of the first speaker, the second speaker, and the silence to a different one of the three. Speech recognition can be performed on the segments that correspond to either the first speaker or the second speaker.
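The change-point logic can be shown with a frame-level classifier standing in for the LSTM RNN: each frame gets one of the labels {speaker1, speaker2, silence}, and a change point is any frame where the label switches, which divides the audio into labeled segments. The threshold rule below is a toy stand-in, not the patent's model.

```python
def classify_frame(energy):
    """Toy stand-in for the LSTM RNN: low energy -> silence,
    otherwise one of two speakers by an (assumed) energy threshold."""
    if energy < 0.1:
        return "silence"
    return "speaker1" if energy < 0.5 else "speaker2"

def segment(frame_energies):
    """Cut the label sequence at change points into (start, end, label)."""
    labels = [classify_frame(e) for e in frame_energies]
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:  # change point
            segments.append((start, i, labels[start]))
            start = i
    return segments

energies = [0.3, 0.3, 0.05, 0.8, 0.8, 0.8]
segments = segment(energies)
# -> [(0, 2, 'speaker1'), (2, 3, 'silence'), (3, 6, 'speaker2')]
```

Downstream, only the speaker-labeled segments (not the silence) would be passed to speech recognition.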
-
60.
Publication No.: US20180166067A1
Publication Date: 2018-06-14
Application No.: US15379038
Filing Date: 2016-12-14
Abstract: Audio features, such as perceptual linear prediction (PLP) features and time derivatives thereof, are extracted from frames of training audio data that includes speech by multiple speakers and silence, such as by using linear discriminant analysis (LDA). The frames are clustered into k-means clusters using distance measures, such as Mahalanobis distance measures, of the means and variances of the extracted audio features of the frames. A recurrent neural network (RNN) is trained on the extracted audio features of the frames and the cluster identifiers of the k-means clusters into which the frames have been clustered. The RNN is applied to audio data to divide it into segments that each correspond to one of the cluster identifiers. Each segment can be assigned a label corresponding to one of the cluster identifiers. Speech recognition can then be performed on the segments.
-