-
Publication No.: US11776548B2
Publication date: 2023-10-03
Application No.: US17665862
Filing date: 2022-02-07
Applicant: Microsoft Technology Licensing, LLC
Inventor: Yong Zhao, Tianyan Zhou, Jinyu Li, Yifan Gong, Jian Wu, Zhuo Chen
Abstract: Embodiments may include determination, for each of a plurality of speech frames associated with an acoustic feature, of a phonetic feature based on the associated acoustic feature, generation of one or more two-dimensional feature maps based on the plurality of phonetic features, input of the one or more two-dimensional feature maps to a trained neural network to generate a plurality of speaker embeddings, and aggregation of the plurality of speaker embeddings into a speaker embedding based on respective weights determined for each of the plurality of speaker embeddings, wherein the speaker embedding is associated with an identity of the speaker.
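The aggregation step in this abstract (combining several speaker embeddings into one using per-embedding weights) can be illustrated with a minimal sketch. The abstract does not say how the weights are determined, so the softmax over scalar relevance scores used here is an assumption, and the names `softmax`, `aggregate_embeddings`, and `scores` are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_embeddings(embeddings, scores):
    """Weighted sum of equal-length embeddings.

    `scores` holds one scalar per embedding. Deriving weights from them
    with a softmax is an assumption made for this sketch, not something
    the abstract specifies.
    """
    if not embeddings or len(embeddings) != len(scores):
        raise ValueError("need one score per embedding")
    weights = softmax(scores)
    pooled = [0.0] * len(embeddings[0])
    for weight, embedding in zip(weights, embeddings):
        for i, value in enumerate(embedding):
            pooled[i] += weight * value
    return pooled, weights


if __name__ == "__main__":
    pooled, weights = aggregate_embeddings([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.5])
    print(weights, pooled)
```

With equal scores the result reduces to the plain mean of the embeddings; a much larger score lets one embedding dominate the aggregate.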
-
Publication No.: US20210304769A1
Publication date: 2021-09-30
Application No.: US15931788
Filing date: 2020-05-14
Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventor: Guoli Ye, Yan Huang, Wenning Wei, Lei He, Eva Sharma, Jian Wu, Yao Tian, Edward C. Lin, Yifan Gong, Rui Zhao, Jinyu Li, William Maxwell Gale
Abstract: Systems, methods, and devices are provided for generating and using text-to-speech (TTS) data for improved speech recognition models. A main model is trained with keyword independent baseline training data. In some instances, acoustic and language model sub-components of the main model are modified with new TTS training data. In some instances, the new TTS training is obtained from a multi-speaker neural TTS system for a keyword that is underrepresented in the baseline training data. In some instances, the new TTS training data is used for pronunciation learning and normalization of keyword dependent confidence scores in keyword spotting (KWS) applications. In some instances, the new TTS training data is used for rapid speaker adaptation in speech recognition models.
-
Publication No.: US12205596B2
Publication date: 2025-01-21
Application No.: US18108316
Filing date: 2023-02-10
Applicant: Microsoft Technology Licensing, LLC
Inventor: Guoli Ye, Yan Huang, Wenning Wei, Lei He, Eva Sharma, Jian Wu, Yao Tian, Edward C. Lin, Yifan Gong, Rui Zhao, Jinyu Li, William Maxwell Gale
Abstract: Systems, methods, and devices are provided for generating and using text-to-speech (TTS) data for improved speech recognition models. A main model is trained with keyword independent baseline training data. In some instances, acoustic and language model sub-components of the main model are modified with new TTS training data. In some instances, the new TTS training is obtained from a multi-speaker neural TTS system for a keyword that is underrepresented in the baseline training data. In some instances, the new TTS training data is used for pronunciation learning and normalization of keyword dependent confidence scores in keyword spotting (KWS) applications. In some instances, the new TTS training data is used for rapid speaker adaptation in speech recognition models.
-
Publication No.: US11929076B2
Publication date: 2024-03-12
Application No.: US18060949
Filing date: 2022-12-01
Applicant: Microsoft Technology Licensing, LLC
Inventor: Hosam Adel Khalil, Emilian Stoimenov, Christopher Hakan Basoglu, Kshitiz Kumar, Jian Wu
CPC classification number: G10L15/32, G10L15/16, G10L15/30, G10L19/167, G10L25/51, G10L2015/088
Abstract: Disclosed speech recognition techniques improve user-perceived latency while maintaining accuracy by: receiving an audio stream, in parallel, by a primary (e.g., accurate) speech recognition engine (SRE) and a secondary (e.g., fast) SRE; generating, with the primary SRE, a primary result; generating, with the secondary SRE, a secondary result; appending the secondary result to a word list; and merging the primary result into the secondary result in the word list. Combining output from the primary and secondary SREs into a single decoder as described herein improves user-perceived latency while maintaining or improving accuracy, among other advantages.
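The append-then-merge flow described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the word-list layout or the merge rule, so the `Word` record, its timestamps, and the rule "primary words replace any secondary words inside the time span they cover" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the stream (assumed field)
    end: float
    source: str   # "secondary" (fast) or "primary" (accurate)


def append_secondary(word_list, secondary_words):
    """Show the fast engine's words immediately by appending them."""
    word_list.extend(secondary_words)
    return word_list


def merge_primary(word_list, primary_words):
    """Replace entries covered by the primary result's time span.

    Assumed merge rule: any existing word that overlaps the span
    [first primary start, last primary end) is dropped in favour of
    the primary words; everything else is kept.
    """
    if not primary_words:
        return list(word_list)
    span_start = min(w.start for w in primary_words)
    span_end = max(w.end for w in primary_words)
    kept = [w for w in word_list if w.end <= span_start or w.start >= span_end]
    return sorted(kept + list(primary_words), key=lambda w: w.start)


if __name__ == "__main__":
    words = append_secondary([], [
        Word("recognise", 0.0, 0.5, "secondary"),
        Word("peach", 0.5, 1.0, "secondary"),
        Word("now", 1.0, 1.3, "secondary"),
    ])
    words = merge_primary(words, [
        Word("recognize", 0.0, 0.5, "primary"),
        Word("speech", 0.5, 1.0, "primary"),
    ])
    print([(w.text, w.source) for w in words])
```

In this sketch the user sees the secondary words first, and the later primary result overwrites only the portion of the list it covers, which is one way the perceived latency could stay low without giving up the accurate result.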
-
Publication No.: US11587569B2
Publication date: 2023-02-21
Application No.: US15931788
Filing date: 2020-05-14
Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventor: Guoli Ye, Yan Huang, Wenning Wei, Lei He, Eva Sharma, Jian Wu, Yao Tian, Edward C. Lin, Yifan Gong, Rui Zhao, Jinyu Li, William Maxwell Gale
Abstract: Systems, methods, and devices are provided for generating and using text-to-speech (TTS) data for improved speech recognition models. A main model is trained with keyword independent baseline training data. In some instances, acoustic and language model sub-components of the main model are modified with new TTS training data. In some instances, the new TTS training is obtained from a multi-speaker neural TTS system for a keyword that is underrepresented in the baseline training data. In some instances, the new TTS training data is used for pronunciation learning and normalization of keyword dependent confidence scores in keyword spotting (KWS) applications. In some instances, the new TTS training data is used for rapid speaker adaptation in speech recognition models.
-
Publication No.: US11276410B2
Publication date: 2022-03-15
Application No.: US16682921
Filing date: 2019-11-13
Applicant: Microsoft Technology Licensing, LLC
Inventor: Yong Zhao, Tianyan Zhou, Jinyu Li, Yifan Gong, Jian Wu, Zhuo Chen
Abstract: Embodiments may include reception of a plurality of speech frames, determination of a multi-dimensional acoustic feature associated with each of the plurality of speech frames, determination of a plurality of multi-dimensional phonetic features, each of the plurality of multi-dimensional phonetic features determined based on a respective one of the plurality of speech frames, generation of a plurality of two-dimensional feature maps based on the phonetic features, input of the feature maps and the plurality of acoustic features to a convolutional neural network, the convolutional neural network to generate a plurality of speaker embeddings based on the plurality of feature maps and the plurality of acoustic features, aggregation of the plurality of speaker embeddings into a first speaker embedding based on respective weights determined for each of the plurality of speaker embeddings, and determination of a speaker associated with the plurality of speech frames based on the first speaker embedding.
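One possible reading of "generation of a plurality of two-dimensional feature maps based on the phonetic features" is to slice the per-frame feature vectors into overlapping frames-by-dimensions windows. The abstract does not state how the maps are built, so the sliding-window scheme and the names `make_feature_maps`, `window`, and `hop` below are assumptions for illustration only.

```python
def make_feature_maps(frame_features, window, hop):
    """Slice T per-frame vectors into 2-D maps of shape (window, dim).

    frame_features: list of equal-length feature vectors, one per frame.
    window: number of consecutive frames per map (assumed parameter).
    hop: number of frames to advance between maps (assumed parameter).
    Trailing frames that do not fill a whole window are ignored.
    """
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    maps = []
    for start in range(0, len(frame_features) - window + 1, hop):
        maps.append([list(frame) for frame in frame_features[start:start + window]])
    return maps


if __name__ == "__main__":
    frames = [[float(t), float(t) * 10.0] for t in range(5)]  # 5 frames, 2 dims
    for feature_map in make_feature_maps(frames, window=3, hop=1):
        print(feature_map)
```

Each resulting map is a small time-by-feature grid, the kind of input a convolutional network could consume alongside the acoustic features.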
-
Publication No.: US12230255B2
Publication date: 2025-02-18
Application No.: US17726465
Filing date: 2022-04-21
Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventor: Venkata Naga Vijaya Swetha Machanavajhala, Ryan Graham Williams, Sanghee Oh, Ikuyo Tsunoda, William D. Lewis, Jian Wu, Daniel Charles Tompkins
Abstract: The techniques disclosed herein provide intelligent display of auditory world experiences. Specialized AI models are configured to display integrated visualizations for different aspects of the auditory signals that may be communicated during an event, such as a meeting, chat session, etc. For instance, a system can use a sentiment recognition model to identify specific characteristics of a speech input, such as volume or tone, provided by a participant. The system can also use a speech recognition model to identify keywords that can be used to distinguish portions of a transcript that are displayed. The system can also utilize an audio recognition model that is configured to analyze non-speech audio sounds for the purposes of identifying non-speech events. The system can then integrate the user interface attributes, distinguished portions of the transcript, and visual indicators describing the non-speech events.
-
Publication No.: US11532312B2
Publication date: 2022-12-20
Application No.: US17123087
Filing date: 2020-12-15
Applicant: Microsoft Technology Licensing, LLC
Inventor: Hosam Adel Khalil, Emilian Stoimenov, Christopher Hakan Basoglu, Kshitiz Kumar, Jian Wu
Abstract: Disclosed speech recognition techniques improve user-perceived latency while maintaining accuracy by: receiving an audio stream, in parallel, by a primary (e.g., accurate) speech recognition engine (SRE) and a secondary (e.g., fast) SRE; generating, with the primary SRE, a primary result; generating, with the secondary SRE, a secondary result; appending the secondary result to a word list; and merging the primary result into the secondary result in the word list. Combining output from the primary and secondary SREs into a single decoder as described herein improves user-perceived latency while maintaining or improving accuracy, among other advantages.
-
Publication No.: US11289086B2
Publication date: 2022-03-29
Application No.: US16724096
Filing date: 2019-12-20
Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventor: Nicholas David Burton, Arash Ghanaie-Sichanie, Qi Liu, Senthil Kumar Velayutham, Jian Wu
Abstract: A system and method for selecting a target device out of a larger group of candidate devices for rendering a response from a virtual assistant to an end-user is disclosed. The system determines that a same trigger phrase included in an utterance has been received by multiple devices that are in proximity to one another at around the same time. These candidate devices can collect attention data, such as user gaze toward a device, to select the device that was most likely the intended recipient of the utterance. The system is configured to control the virtual assistant to render a response solely via the selected device.
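The selection logic in this abstract (several nearby devices hear the same trigger phrase at about the same time, and attention data decides which one responds) might look like the sketch below. The `Candidate` record, the numeric `gaze_score`, and the one-second grouping window are hypothetical; the abstract names user gaze as an example of attention data but gives no scoring scheme.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    device_id: str
    heard_at: float    # time the trigger phrase was detected, in seconds
    gaze_score: float  # hypothetical attention score, higher = more gaze


def select_target(candidates, window_s=1.0):
    """Pick the device that should render the assistant's response.

    Devices whose detection time lies within `window_s` seconds of the
    earliest detection are treated as having heard the same utterance
    (an assumed grouping rule). Among those, the device with the highest
    attention score wins. Returns None when there are no candidates.
    """
    if not candidates:
        return None
    earliest = min(c.heard_at for c in candidates)
    contenders = [c for c in candidates if c.heard_at - earliest <= window_s]
    return max(contenders, key=lambda c: c.gaze_score).device_id


if __name__ == "__main__":
    print(select_target([
        Candidate("speaker", 10.0, 0.1),
        Candidate("laptop", 10.2, 0.9),
        Candidate("phone", 15.0, 1.0),  # too late to be the same utterance
    ]))
```

Only the returned device would render the response; the other candidates stay silent.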