MULTI-MICROPHONE SPEECH SEPARATION
    Invention Application

    Publication Number: WO2019199554A1

    Publication Date: 2019-10-17

    Application Number: PCT/US2019/025686

    Application Date: 2019-04-04

    Abstract: This document relates to separation of audio signals into speaker-specific signals. One example obtains features reflecting mixed speech signals captured by multiple microphones. The features can be input to a neural network, and masks can be obtained from the neural network. The masks can be applied to one or more of the mixed speech signals captured by one or more of the microphones to obtain two or more separate speaker-specific speech signals, which can then be output.
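
The mask-application step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented method: the neural network that estimates the masks is out of scope, the spectrogram is a toy magnitude array, and `apply_masks` is a hypothetical helper name.

```python
# Hypothetical sketch of mask-based speaker separation: a neural
# network (not shown) would emit one time-frequency (T x F) mask per
# speaker; multiplying each mask element-wise with the mixture
# spectrogram yields a speaker-specific spectrogram.

def apply_masks(mixture, masks):
    """mixture: T x F magnitude spectrogram (list of lists).
    masks: one T x F mask per speaker, values in [0, 1].
    Returns one separated T x F spectrogram per speaker."""
    separated = []
    for mask in masks:
        separated.append([
            [m * x for m, x in zip(mask_row, mix_row)]
            for mask_row, mix_row in zip(mask, mixture)
        ])
    return separated

# Toy example: 2 time frames, 3 frequency bins, 2 speakers.
mixture = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0]]
mask_a = [[0.9, 0.1, 0.5],
          [0.2, 0.8, 0.5]]
mask_b = [[0.1, 0.9, 0.5],
          [0.8, 0.2, 0.5]]
sep_a, sep_b = apply_masks(mixture, [mask_a, mask_b])
```

In a real system the separated spectrograms would be converted back to waveforms (e.g. via an inverse STFT using the mixture phase); that reconstruction step is omitted here.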

    ARRAY GEOMETRY AGNOSTIC MULTI-CHANNEL PERSONALIZED SPEECH ENHANCEMENT

    Publication Number: WO2023059402A1

    Publication Date: 2023-04-13

    Application Number: PCT/US2022/040979

    Application Date: 2022-08-22

    Abstract: Examples of array geometry agnostic multi-channel personalized speech enhancement (PSE) extract speaker embeddings, which represent acoustic characteristics of one or more target speakers, from target speaker enrollment data. Spatial features (e.g., inter-channel phase difference) are extracted from input audio captured by a microphone array. The input audio includes a mixture of speech data of the target speaker(s) and one or more interfering speaker(s). The input audio, the extracted speaker embeddings, and the extracted spatial features are provided to a trained geometry-agnostic PSE model. Output data is produced, which comprises estimated clean speech data of the target speaker(s) that has a reduction (or elimination) of speech data of the interfering speaker(s), without the trained PSE model requiring geometry information for the microphone array.
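
The inter-channel phase difference (IPD) feature named in the abstract can be sketched as below. This is a generic illustration of IPD computation, not the patent's implementation; `ipd_features` is a hypothetical helper, and the (cos, sin) encoding is one common way to represent the phase difference.

```python
import math

# Hypothetical sketch of the IPD spatial feature: for each
# time-frequency bin, the IPD is the phase of one microphone channel
# relative to a reference channel. It is computed directly from the
# captured signals, without needing the microphone array geometry,
# which is what allows a geometry-agnostic PSE model to consume it.

def ipd_features(ref_phase, other_phase):
    """Phase values (radians) for two channels at matching T-F bins.
    Returns (cos, sin) of the phase difference per bin."""
    feats = []
    for p_ref, p_other in zip(ref_phase, other_phase):
        diff = p_other - p_ref
        feats.append((math.cos(diff), math.sin(diff)))
    return feats

# Toy example: three bins with phase offsets of 0, pi/2, and pi.
feats = ipd_features([0.0, 0.0, 0.0], [0.0, math.pi / 2, math.pi])
```

The (cos, sin) pair avoids the 2π wrap-around discontinuity of raw phase differences, which is why it is a popular encoding for neural-network inputs.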

    SYSTEMS AND METHODS FOR HUMAN LISTENING AND LIVE CAPTIONING

    Publication Number: WO2022250849A1

    Publication Date: 2022-12-01

    Application Number: PCT/US2022/026868

    Application Date: 2022-04-29

    Abstract: Systems and methods are provided for generating and operating a speech enhancement model optimized for generating noise-suppressed speech outputs for improved human listening and live captioning. A computing system obtains a speech enhancement model trained on a first training dataset to generate noise-suppressed speech outputs and an automatic speech recognition model trained on a second training dataset to generate transcription labels for spoken language utterances. A third training dataset comprising a set of spoken language utterances is applied to the speech enhancement model to obtain a first noise-suppressed speech output which is applied to the automatic speech recognition model to generate a noise-suppressed transcription output for the set of spoken language utterances. Speech enhancement model parameters are updated to optimize the speech enhancement model to generate optimized noise-suppressed speech outputs based on a comparison of the noise-suppressed transcription output and ground truth transcription labels.
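
The optimization idea in the abstract — tuning the speech enhancement model by how well a downstream ASR model transcribes its output, rather than by a signal-level loss — can be illustrated with the toy sketch below. Both models here are stand-ins (`toy_asr`, `tune_gain`, and the single-gain "model" are all hypothetical); a real system would backpropagate an ASR loss through the enhancement network's parameters.

```python
# Toy illustration of ASR-guided enhancement tuning: the "SE model" is
# a single gain parameter, the "ASR model" emits a token per frame
# above an energy threshold, and the loss compares the resulting
# transcription to ground-truth labels.

def toy_asr(signal, threshold=0.5):
    """Stand-in ASR: emits a token for each frame above the threshold."""
    return ["tok" if abs(x) > threshold else "" for x in signal]

def transcription_loss(hypothesis, reference):
    """Fraction of frames whose token disagrees with the reference."""
    wrong = sum(h != r for h, r in zip(hypothesis, reference))
    return wrong / len(reference)

def tune_gain(noisy, reference, gains):
    """Pick the SE gain whose enhanced output transcribes best."""
    return min(gains, key=lambda g: transcription_loss(
        toy_asr([g * x for x in noisy]), reference))

noisy = [0.3, 0.1, 0.4, 0.05]        # attenuated speech frames
reference = ["tok", "", "tok", ""]   # ground-truth transcription labels
gain = tune_gain(noisy, reference, [0.5, 1.0, 2.0])
```

Only the gain of 2.0 restores the speech frames above the toy ASR's threshold, so it minimizes the transcription loss — mirroring how the patent's parameter update is driven by the transcription comparison rather than by waveform fidelity.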

    HYPOTHESIS STITCHER FOR SPEECH RECOGNITION OF LONG-FORM AUDIO

    Publication Number: WO2022132405A1

    Publication Date: 2022-06-23

    Application Number: PCT/US2021/060423

    Application Date: 2021-11-23

    Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific stitchers or multi-speaker stitchers, and may further support multiple options for differing hypothesis configurations.
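
The merge-and-stitch input format described in the abstract can be sketched as follows. This is a simplified illustration: the window-change symbol matches the abstract's description, but the consolidation rule here is a trivial stand-in (dropping a word repeated across a window boundary) for the network-based stitcher, and all function names are hypothetical.

```python
# Hypothetical sketch of hypothesis stitching: per-segment ASR
# hypotheses are merged into one sequence with a window-change symbol
# at each segment boundary, then consolidated into a single hypothesis.

WC = "<wc>"  # window change symbol

def merge_hypotheses(segment_hyps):
    """Join per-segment word lists, inserting <wc> at each boundary."""
    merged = []
    for i, hyp in enumerate(segment_hyps):
        if i > 0:
            merged.append(WC)
        merged.extend(hyp)
    return merged

def consolidate(merged):
    """Toy stand-in for the network-based stitcher: drop <wc> markers
    and any word that appears on both sides of a window boundary
    (overlapping windows often re-decode the same word)."""
    out = []
    for i, word in enumerate(merged):
        if word == WC:
            continue
        if i >= 2 and merged[i - 1] == WC and merged[i - 2] == word:
            continue  # duplicate across the boundary
        out.append(word)
    return out

segments = [["the", "cat", "sat"], ["sat", "on", "the", "mat"]]
merged = merge_hypotheses(segments)
final = consolidate(merged)
```

The disclosed alignment-based and serialized stitcher variants would replace `consolidate` with a trained network that reads the merged sequence, including the stitching symbols, and emits the consolidated hypothesis.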

    LOW-LATENCY SPEECH SEPARATION
    Invention Application

    Publication Number: WO2020205097A1

    Publication Date: 2020-10-08

    Application Number: PCT/US2020/019851

    Application Date: 2020-02-26

    Abstract: A system and method include reception of a first plurality of audio signals, generation of a second plurality of beamformed audio signals based on the first plurality of audio signals, each of the second plurality of beamformed audio signals associated with a respective one of a second plurality of beamformer directions, generation of a first TF mask for a first output channel based on the first plurality of audio signals, determination of a first beamformer direction associated with a first target sound source based on the first TF mask, generation of first features based on the first beamformer direction and the first plurality of audio signals, determination of a second TF mask based on the first features, and application of the second TF mask to one of the second plurality of beamformed audio signals associated with the first beamformer direction.
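
The final step of the pipeline above — applying the second TF mask to the beamformed signal for the selected direction — can be sketched as follows. This is a minimal illustration under stated assumptions: the beamformer bank, the direction estimation from the first mask, and the mask estimator are all out of scope, and `select_and_mask` is a hypothetical helper.

```python
# Hypothetical sketch of the beamform-then-mask step: given one
# beamformed signal per look direction and an estimated target
# direction, the second TF mask is applied to the beamformed signal
# for that direction to isolate the target source.

def select_and_mask(beamformed, direction_index, tf_mask):
    """beamformed: list of per-direction signals (lists of floats).
    tf_mask: per-sample mask values in [0, 1] for the chosen signal."""
    chosen = beamformed[direction_index]
    return [m * x for m, x in zip(tf_mask, chosen)]

# Toy example: three beam directions, four samples each; direction 1
# is assumed to have been identified as the target direction.
beams = [[0.1, 0.2, 0.1, 0.0],
         [1.0, 0.8, 0.9, 1.1],
         [0.3, 0.2, 0.4, 0.3]]
mask = [1.0, 0.5, 1.0, 0.0]
output = select_and_mask(beams, 1, mask)
```

Masking the already-beamformed signal, rather than the raw microphone signals, combines the spatial selectivity of the beamformer with the time-frequency selectivity of the mask, which is what enables low-latency separation.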
