-
Publication No.: WO2019199554A1
Publication Date: 2019-10-17
Application No.: PCT/US2019/025686
Application Date: 2019-04-04
Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventor: CHEN, Zhuo; ERDOGAN, Hakan; YOSHIOKA, Takuya; ALLEVA, Fileno A.; XIAO, Xiong
IPC: G10L15/16
Abstract: This document relates to separation of audio signals into speaker-specific signals. One example obtains features reflecting mixed speech signals captured by multiple microphones. The features can be input to a neural network, and masks can be obtained from the neural network. The masks can be applied to one or more of the mixed speech signals captured by one or more of the microphones to obtain two or more separate speaker-specific speech signals, which can then be output.
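A minimal sketch of this mask-based separation flow, assuming a pre-trained mask-estimation network and precomputed STFT features; the BLSTM architecture, tensor shapes, and function names are illustrative, not the patent's exact model.

```python
import torch

class MaskEstimator(torch.nn.Module):
    """Illustrative mask network: features in, one mask per speaker out."""
    def __init__(self, n_bins=257, n_speakers=2, hidden=300):
        super().__init__()
        self.blstm = torch.nn.LSTM(n_bins, hidden, num_layers=2,
                                   batch_first=True, bidirectional=True)
        self.proj = torch.nn.Linear(2 * hidden, n_bins * n_speakers)
        self.n_bins, self.n_speakers = n_bins, n_speakers

    def forward(self, features):              # (batch, frames, n_bins)
        h, _ = self.blstm(features)
        masks = torch.sigmoid(self.proj(h))   # (batch, frames, n_spk * n_bins)
        return masks.view(features.shape[0], -1, self.n_speakers, self.n_bins)

def separate(model, mixture_stft, features):
    """mixture_stft: complex STFT of one reference microphone, (frames, n_bins).
    features: e.g. log-magnitude spectra stacked across microphones."""
    with torch.no_grad():
        masks = model(features.unsqueeze(0))[0]   # (frames, n_spk, n_bins)
    # Apply each speaker's mask to the mixed signal's STFT.
    return [masks[:, s, :] * mixture_stft for s in range(masks.shape[1])]
```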
-
Publication No.: WO2023059402A1
Publication Date: 2023-04-13
Application No.: PCT/US2022/040979
Application Date: 2022-08-22
Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC.
Inventor: ESKIMEZ, Sefik Emre; YOSHIOKA, Takuya; WANG, Huaming; TAHERIAN, Hassan; CHEN, Zhuo; HUANG, Xuedong
IPC: G10L21/0208 , G10L21/0272 , G10L2021/02082 , G10L2021/02087
Abstract: Examples of array geometry agnostic multi-channel personalized speech enhancement (PSE) extract speaker embeddings, which represent acoustic characteristics of one or more target speakers, from target speaker enrollment data. Spatial features (e.g., inter-channel phase difference) are extracted from input audio captured by a microphone array. The input audio includes a mixture of speech data of the target speaker(s) and one or more interfering speaker(s). The input audio, the extracted speaker embeddings, and the extracted spatial features are provided to a trained geometry-agnostic PSE model. Output data is produced, which comprises estimated clean speech data of the target speaker(s) that has a reduction (or elimination) of speech data of the interfering speaker(s), without the trained PSE model requiring geometry information for the microphone array.
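A hedged sketch of the spatial feature the abstract names, inter-channel phase difference (IPD), assuming multi-channel STFTs are already available; the function and variable names are illustrative, and the final call shows only how such features would feed a geometry-agnostic PSE model.

```python
import numpy as np

def inter_channel_phase_difference(stfts, ref_channel=0):
    """stfts: complex array (channels, frames, bins). Returns cos/sin of the
    phase difference of every non-reference channel against the reference."""
    ref = stfts[ref_channel]
    ipds = []
    for ch in range(stfts.shape[0]):
        if ch == ref_channel:
            continue
        phase_diff = np.angle(stfts[ch]) - np.angle(ref)
        ipds.append(np.cos(phase_diff))
        ipds.append(np.sin(phase_diff))
    return np.stack(ipds, axis=-1)   # (frames, bins, 2 * (channels - 1))

# A geometry-agnostic PSE model would then consume the reference-channel
# spectrum, the target-speaker embedding, and these IPD features, e.g.:
# clean_est = pse_model(ref_spectrum, speaker_embedding, ipd_features)
```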
-
Publication No.: WO2019089486A1
Publication Date: 2019-05-09
Application No.: PCT/US2018/058067
Application Date: 2018-10-30
Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventor: CHEN, Zhuo; LI, Jinyu; XIAO, Xiong; YOSHIOKA, Takuya; WANG, Huaming; WANG, Zhenghao; GONG, Yifan
IPC: G10L21/0272 , G10L25/30 , G10L21/0308 , G10L21/0216
CPC classification number: G10L21/0216 , G06N3/0445 , G06N3/0454 , G06N3/084 , G10L21/0272 , G10L21/0308 , G10L25/30 , G10L2021/02087 , G10L2021/02166 , H04R3/005 , H04R2430/20
Abstract: Representative embodiments disclose mechanisms to separate and recognize multiple audio sources (e.g., picking out individual speakers) in an environment where they overlap and interfere with each other. The architecture uses a microphone array to spatially separate the audio signals. The spatially filtered signals are then input into a plurality of separators, so that each signal is input into a corresponding separator. The separators use neural networks to separate out audio sources. The separators typically produce multiple output signals for a single input signal. A post-selection processor then assesses the separator outputs to pick the signals with the highest output quality. These signals can be used in a variety of systems such as speech recognition, meeting transcription and enhancement, hearing aids, music information retrieval, speech enhancement, and so forth.
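An illustrative sketch of the separate-then-select flow described above, assuming beamformer, separator, and quality-scoring callables already exist; all names are placeholders, not the patent's actual components.

```python
def separate_and_select(mic_signals, beamformers, separators, quality_score,
                        n_outputs=2):
    """Run each spatially filtered signal through its separator, then keep
    the n_outputs streams the post-selection scorer rates highest."""
    candidates = []
    for beamform, separator in zip(beamformers, separators):
        spatial = beamform(mic_signals)     # spatially filtered signal
        for stream in separator(spatial):   # a separator may emit several streams
            candidates.append(stream)
    # Post-selection: rank all candidate streams by estimated output quality.
    candidates.sort(key=quality_score, reverse=True)
    return candidates[:n_outputs]
```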
-
Publication No.: WO2022250849A1
Publication Date: 2022-12-01
Application No.: PCT/US2022/026868
Application Date: 2022-04-29
Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventor: WANG, Xiaofei; ESKIMEZ, Sefik Emre; TANG, Min; YANG, Hemin; ZHU, Zirun; CHEN, Zhuo; WANG, Huaming; YOSHIOKA, Takuya
IPC: G10L15/06 , G10L21/0208 , G10L15/16 , G10L15/26
Abstract: Systems and methods are provided for generating and operating a speech enhancement model optimized to generate noise-suppressed speech outputs for improved human listening and live captioning. A computing system obtains a speech enhancement model trained on a first training dataset to generate noise-suppressed speech outputs, and an automatic speech recognition model trained on a second training dataset to generate transcription labels for spoken language utterances. A third training dataset comprising a set of spoken language utterances is applied to the speech enhancement model to obtain a first noise-suppressed speech output, which is in turn applied to the automatic speech recognition model to generate a noise-suppressed transcription output for the set of spoken language utterances. Based on a comparison of the noise-suppressed transcription output and ground-truth transcription labels, the speech enhancement model parameters are updated to optimize the model's noise-suppressed speech outputs.
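A rough sketch of the described optimization loop, assuming a differentiable speech-enhancement model and a pre-trained ASR model that can return a loss against ground-truth transcripts; the component names and the `transcription_loss` method are hypothetical stand-ins for the patent's actual training objective.

```python
import torch

def fine_tune_se_with_asr(se_model, asr_model, loader, epochs=1, lr=1e-4):
    """Update only the speech-enhancement parameters, using the recognizer's
    transcription loss on the enhanced audio as the training signal."""
    asr_model.eval()                      # ASR model stays fixed
    optimizer = torch.optim.Adam(se_model.parameters(), lr=lr)
    for _ in range(epochs):
        for noisy_speech, transcript in loader:
            enhanced = se_model(noisy_speech)
            # Hypothetical: ASR loss (e.g. CTC) between the recognizer's output
            # on the enhanced audio and the ground-truth transcription labels.
            loss = asr_model.transcription_loss(enhanced, transcript)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return se_model
```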
-
Publication No.: WO2022132405A1
Publication Date: 2022-06-23
Application No.: PCT/US2021/060423
Application Date: 2021-11-23
Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventor: KANDA, Naoyuki; CHANG, Xuankai; GAUR, Yashesh; WANG, Xiaofei; MENG, Zhong; YOSHIOKA, Takuya
Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific stitchers or multi-speaker stitchers, and may further support multiple options for differing hypothesis configurations.
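A toy sketch of the segment-recognize-stitch idea, with a window-change symbol inserted between per-segment hypotheses; in the patent the stitcher is a trained network, which is replaced here by trivial concatenation purely for illustration, and the window/hop values and symbol are assumptions.

```python
WC = "<wc>"   # window-change stitching symbol

def segment(audio, window, hop):
    """Split a long-form audio array into overlapping fixed-length windows."""
    return [audio[start:start + window] for start in range(0, len(audio), hop)]

def stitch(asr, audio, window=30_000, hop=15_000):
    # Short-segment hypotheses from running ASR on each window.
    hypotheses = [asr(chunk) for chunk in segment(audio, window, hop)]
    # First merged hypothesis set, with WC symbols marking window boundaries.
    merged = f" {WC} ".join(hypotheses)
    # A network-based stitcher would consolidate `merged` into a final
    # hypothesis; this trivial stand-in just drops the stitching symbols.
    return merged.replace(f" {WC} ", " ")
```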
-
Publication No.: WO2020214297A1
Publication Date: 2020-10-22
Application No.: PCT/US2020/022874
Application Date: 2020-03-16
Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventor: XIAO, Xiong; CHEN, Zhuo; YOSHIOKA, Takuya; LIU, Changliang; ERDOGAN, Hakan; DIMITRIADIS, Dimitrios Basile; GONG, Yifan; DROPPO, James Garnet, III
IPC: G10L21/028 , G10L17/18
Abstract: Embodiments are associated with determination of a first plurality of multi-dimensional vectors, each of the first plurality of multi-dimensional vectors representing speech of a target speaker, determination of a multi-dimensional vector representing a speech signal of two or more speakers, determination of a weighted vector representing speech of the target speaker based on the first plurality of multi-dimensional vectors and on similarities between the multi-dimensional vector and each of the first plurality of multi-dimensional vectors, and extraction of speech of the target speaker from the speech signal based on the weighted vector and the speech signal.
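A minimal sketch of the weighted target-speaker vector described above: enrollment embeddings are combined using their similarity to an embedding of the mixed signal; the cosine-similarity/softmax weighting and all names are illustrative assumptions, not the patent's exact formulation.

```python
import numpy as np

def weighted_speaker_vector(enrollment_vecs, mixture_vec):
    """enrollment_vecs: (n_utterances, dim), one embedding per target-speaker
    enrollment utterance; mixture_vec: (dim,), embedding of the mixed signal."""
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    sims = normalize(enrollment_vecs) @ normalize(mixture_vec)   # cosine similarity
    weights = np.exp(sims) / np.exp(sims).sum()                  # softmax weights
    return weights @ enrollment_vecs                             # weighted vector, (dim,)

# The resulting vector would condition a speech-extraction network, e.g.:
# target_speech = extractor(mixture_signal, weighted_speaker_vector(enroll, mix))
```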
-
Publication No.: WO2020205097A1
Publication Date: 2020-10-08
Application No.: PCT/US2020/019851
Application Date: 2020-02-26
Applicant: MICROSOFT TECHNOLOGY LICENSING, LLC
Inventor: CHEN, Zhuo; LIU, Changliang; YOSHIOKA, Takuya; XIAO, Xiong; ERDOGAN, Hakan; DIMITRIADIS, Dimitrios Basile
IPC: G10L21/0272 , G10L25/30 , G10L21/0216 , G10L21/0208
Abstract: A system and method include reception of a first plurality of audio signals, generation of a second plurality of beamformed audio signals based on the first plurality of audio signals, each of the second plurality of beamformed audio signals associated with a respective one of a second plurality of beamformer directions, generation of a first TF mask for a first output channel based on the first plurality of audio signals, determination of a first beamformer direction associated with a first target sound source based on the first TF mask, generation of first features based on the first beamformer direction and the first plurality of audio signals, determination of a second TF mask based on the first features, and application of the second TF mask to one of the second plurality of beamformed audio signals associated with the first beamformer direction.
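A sketch of the two-stage mask-and-beamform flow in the abstract, assuming beamformed STFTs for a set of candidate directions are already computed; the mask-weighted direction selection and the stand-in mask network are illustrative assumptions rather than the patent's trained models.

```python
import numpy as np

def select_direction(beamformed_stfts, first_tf_mask):
    """beamformed_stfts: (directions, frames, bins); first_tf_mask: (frames, bins).
    Picks the beamformer direction whose output carries the most mask-weighted
    energy, i.e. the direction associated with the target sound source."""
    energy = np.abs(beamformed_stfts) ** 2
    scores = (energy * first_tf_mask[None]).sum(axis=(1, 2))
    return int(np.argmax(scores))

def extract_target(beamformed_stfts, first_tf_mask, second_mask_net, features):
    direction = select_direction(beamformed_stfts, first_tf_mask)
    second_tf_mask = second_mask_net(features)   # second TF mask from the features
    # Apply the second mask to the beamformed signal for the selected direction.
    return second_tf_mask * beamformed_stfts[direction]
```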