Patent search ap:("Microsoft Technology Licensing, LLC") AND inv:"Jian Wu"

1.

Granted invention patent
Convolutional neural network with phonetic attention for speaker verification (in force)

Publication (announcement) No.: US11776548B2

Publication (announcement) date: 2023-10-03

Application No.: US17665862

Application date: 2022-02-07

Applicant: Microsoft Technology Licensing, LLC

Inventor: Yong Zhao, Tianyan Zhou, Jinyu Li, Yifan Gong, Jian Wu, Zhuo Chen

IPC: G10L17/14, G10L17/18, G06N3/08, G10L17/02

CPC classification number: G10L17/18, G06N3/08, G10L17/02, G10L17/14

Abstract: Embodiments may include determination, for each of a plurality of speech frames associated with an acoustic feature, of a phonetic feature based on the associated acoustic feature, generation of one or more two-dimensional feature maps based on the plurality of phonetic features, input of the one or more two-dimensional feature maps to a trained neural network to generate a plurality of speaker embeddings, and aggregation of the plurality of speaker embeddings into a speaker embedding based on respective weights determined for each of the plurality of speaker embeddings, wherein the speaker embedding is associated with an identity of the speaker.

2.

Invention patent application
GENERATING AND USING TEXT-TO-SPEECH DATA FOR SPEECH RECOGNITION MODELS (in force)

Publication (announcement) No.: US20210304769A1

Publication (announcement) date: 2021-09-30

Application No.: US15931788

Application date: 2020-05-14

Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC

Inventor: Guoli Ye, Yan Huang, Wenning Wei, Lei He, Eva Sharma, Jian Wu, Yao Tian, Edward C. Lin, Yifan Gong, Rui Zhao, Jinyu Li, William Maxwell Gale

IPC: G10L15/26, G10L15/16, G10L15/06, G10L13/08

Abstract: Systems, methods, and devices are provided for generating and using text-to-speech (TTS) data for improved speech recognition models. A main model is trained with keyword independent baseline training data. In some instances, acoustic and language model sub-components of the main model are modified with new TTS training data. In some instances, the new TTS training is obtained from a multi-speaker neural TTS system for a keyword that is underrepresented in the baseline training data. In some instances, the new TTS training data is used for pronunciation learning and normalization of keyword dependent confidence scores in keyword spotting (KWS) applications. In some instances, the new TTS training data is used for rapid speaker adaptation in speech recognition models.

3.

Granted invention patent
Generating and using text-to-speech data for speech recognition models (in force)

Publication (announcement) No.: US12205596B2

Publication (announcement) date: 2025-01-21

Application No.: US18108316

Application date: 2023-02-10

Applicant: Microsoft Technology Licensing, LLC

Inventor: Guoli Ye, Yan Huang, Wenning Wei, Lei He, Eva Sharma, Jian Wu, Yao Tian, Edward C. Lin, Yifan Gong, Rui Zhao, Jinyu Li, William Maxwell Gale

IPC: G10L15/26, G10L13/08, G10L15/06, G10L15/16

Abstract: Systems, methods, and devices are provided for generating and using text-to-speech (TTS) data for improved speech recognition models. A main model is trained with keyword independent baseline training data. In some instances, acoustic and language model sub-components of the main model are modified with new TTS training data. In some instances, the new TTS training is obtained from a multi-speaker neural TTS system for a keyword that is underrepresented in the baseline training data. In some instances, the new TTS training data is used for pronunciation learning and normalization of keyword dependent confidence scores in keyword spotting (KWS) applications. In some instances, the new TTS training data is used for rapid speaker adaptation in speech recognition models.

4.

Granted invention patent
User-perceived latency while maintaining accuracy (in force)

Publication (announcement) No.: US11929076B2

Publication (announcement) date: 2024-03-12

Application No.: US18060949

Application date: 2022-12-01

Applicant: Microsoft Technology Licensing, LLC

Inventor: Hosam Adel Khalil, Emilian Stoimenov, Christopher Hakan Basoglu, Kshitiz Kumar, Jian Wu

IPC: G10L15/32, G10L15/16, G10L15/30, G10L19/16, G10L25/51, G10L15/08

CPC classification number: G10L15/32, G10L15/16, G10L15/30, G10L19/167, G10L25/51, G10L2015/088

Abstract: Disclosed speech recognition techniques improve user-perceived latency while maintaining accuracy by: receiving an audio stream, in parallel, by a primary (e.g., accurate) speech recognition engine (SRE) and a secondary (e.g., fast) SRE; generating, with the primary SRE, a primary result; generating, with the secondary SRE, a secondary result; appending the secondary result to a word list; and merging the primary result into the secondary result in the word list. Combining output from the primary and secondary SREs into a single decoder as described herein improves user-perceived latency while maintaining or improving accuracy, among other advantages.

5.

Granted invention patent
Generating and using text-to-speech data for speech recognition models (in force)

Publication (announcement) No.: US11587569B2

Publication (announcement) date: 2023-02-21

Application No.: US15931788

Application date: 2020-05-14

Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC

Inventor: Guoli Ye, Yan Huang, Wenning Wei, Lei He, Eva Sharma, Jian Wu, Yao Tian, Edward C. Lin, Yifan Gong, Rui Zhao, Jinyu Li, William Maxwell Gale

IPC: G10L15/26, G10L13/08, G10L15/06, G10L15/16

Abstract: Systems, methods, and devices are provided for generating and using text-to-speech (TTS) data for improved speech recognition models. A main model is trained with keyword independent baseline training data. In some instances, acoustic and language model sub-components of the main model are modified with new TTS training data. In some instances, the new TTS training is obtained from a multi-speaker neural TTS system for a keyword that is underrepresented in the baseline training data. In some instances, the new TTS training data is used for pronunciation learning and normalization of keyword dependent confidence scores in keyword spotting (KWS) applications. In some instances, the new TTS training data is used for rapid speaker adaptation in speech recognition models.

6.

Granted invention patent
Convolutional neural network with phonetic attention for speaker verification (in force)

Publication (announcement) No.: US11276410B2

Publication (announcement) date: 2022-03-15

Application No.: US16682921

Application date: 2019-11-13

Applicant: Microsoft Technology Licensing, LLC

Inventor: Yong Zhao, Tianyan Zhou, Jinyu Li, Yifan Gong, Jian Wu, Zhuo Chen

IPC: G10L17/14, G10L17/18, G06N3/08, G10L17/02

Abstract: Embodiments may include reception of a plurality of speech frames, determination of a multi-dimensional acoustic feature associated with each of the plurality of speech frames, determination of a plurality of multi-dimensional phonetic features, each of the plurality of multi-dimensional phonetic features determined based on a respective one of the plurality of speech frames, generation of a plurality of two-dimensional feature maps based on the phonetic features, input of the feature maps and the plurality of acoustic features to a convolutional neural network, the convolutional neural network to generate a plurality of speaker embeddings based on the plurality of feature maps and the plurality of acoustic features, aggregation of the plurality of speaker embeddings into a first speaker embedding based on respective weights determined for each of the plurality of speaker embeddings, and determination of a speaker associated with the plurality of speech frames based on the first speaker embedding.

7.

Granted invention patent
Intelligent display of auditory world experiences (in force)

Publication (announcement) No.: US12230255B2

Publication (announcement) date: 2025-02-18

Application No.: US17726465

Application date: 2022-04-21

Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC

Inventor: Venkata Naga Vijaya Swetha Machanavajhala, Ryan Graham Williams, Sanghee Oh, Ikuyo Tsunoda, William D. Lewis, Jian Wu, Daniel Charles Tompkins

IPC: G10L15/18, G06F3/14, G06F3/16, G10L15/22, G10L25/51, G10L25/78, G10L15/08

Abstract: The techniques disclosed herein provide intelligent display of auditory world experiences. Specialized AI models are configured to display integrated visualizations for different aspects of the auditory signals that may be communicated during an event, such as a meeting, chat session, etc. For instance, a system can use a sentiment recognition model to identify specific characteristics of a speech input, such as volume or tone, provided by a participant. The system can also use a speech recognition model to identify keywords that can be used to distinguish portions of a transcript that are displayed. The system can also utilize an audio recognition model that is configured to analyze non-speech audio sounds for the purposes of identifying non-speech events. The system can then integrate the user interface attributes, distinguished portions of the transcript, and visual indicators describing the non-speech events.

8.

Granted invention patent
User-perceived latency while maintaining accuracy (in force)

Publication (announcement) No.: US11532312B2

Publication (announcement) date: 2022-12-20

Application No.: US17123087

Application date: 2020-12-15

Applicant: Microsoft Technology Licensing, LLC

Inventor: Hosam Adel Khalil, Emilian Stoimenov, Christopher Hakan Basoglu, Kshitiz Kumar, Jian Wu

IPC: G10L15/30, G10L15/16, G10L19/16, G10L25/51, G10L15/08

Abstract: Disclosed speech recognition techniques improve user-perceived latency while maintaining accuracy by: receiving an audio stream, in parallel, by a primary (e.g., accurate) speech recognition engine (SRE) and a secondary (e.g., fast) SRE; generating, with the primary SRE, a primary result; generating, with the secondary SRE, a secondary result; appending the secondary result to a word list; and merging the primary result into the secondary result in the word list. Combining output from the primary and secondary SREs into a single decoder as described herein improves user-perceived latency while maintaining or improving accuracy, among other advantages.

9.

Granted invention patent
Selective response rendering for virtual assistants (in force)

Publication (announcement) No.: US11289086B2

Publication (announcement) date: 2022-03-29

Application No.: US16724096

Application date: 2019-12-20

Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC

Inventor: Nicholas David Burton, Arash Ghanaie-Sichanie, Qi Liu, Senthil Kumar Velayutham, Jian Wu

IPC: G10L15/22, G06F9/451, G06F3/01, G06F3/16, G10L15/25, G10L25/51, G10L15/08

Abstract: A system and method for selecting a target device out of a larger group of candidate devices for rendering a response from a virtual assistant to an end-user is disclosed. The system determines that a same trigger phrase included in an utterance has been received by multiple devices that are in proximity to one another at around the same time. These candidate devices can collect attention data, such as user gaze toward a device, to select the device that was most likely the intended recipient of the utterance. The system is configured to control the virtual assistant to render a response solely via the selected device.
