MULTI-MICROPHONE SPEECH SEPARATION
    Invention Application

    Publication Number: WO2019199554A1

    Publication Date: 2019-10-17

    Application Number: PCT/US2019/025686

    Application Date: 2019-04-04

    Abstract: This document relates to separation of audio signals into speaker-specific signals. One example obtains features reflecting mixed speech signals captured by multiple microphones. The features can be input to a neural network, and masks can be obtained from the neural network. The masks can be applied to one or more of the mixed speech signals captured by one or more of the microphones to obtain two or more separate speaker-specific speech signals, which can then be output.
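
The mask-application step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented method: the neural network that estimates the masks is out of scope, the spectrogram is a toy magnitude array, and `apply_masks` is a hypothetical helper name.

```python
# Hypothetical sketch of mask-based speaker separation: a neural
# network (not shown) would emit one time-frequency (T x F) mask per
# speaker; multiplying each mask element-wise with the mixture
# spectrogram yields a speaker-specific spectrogram.

def apply_masks(mixture, masks):
    """mixture: T x F magnitude spectrogram (list of lists).
    masks: one T x F mask per speaker, values in [0, 1].
    Returns one separated T x F spectrogram per speaker."""
    separated = []
    for mask in masks:
        separated.append([
            [m * x for m, x in zip(mask_row, mix_row)]
            for mask_row, mix_row in zip(mask, mixture)
        ])
    return separated

# Toy example: 2 time frames, 3 frequency bins, 2 speakers.
mixture = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0]]
mask_a = [[0.9, 0.1, 0.5],
          [0.2, 0.8, 0.5]]
mask_b = [[0.1, 0.9, 0.5],
          [0.8, 0.2, 0.5]]
sep_a, sep_b = apply_masks(mixture, [mask_a, mask_b])
```

In a real system the separated spectrograms would be converted back to waveforms (e.g. via an inverse STFT using the mixture phase); that reconstruction step is omitted here.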

    ARRAY GEOMETRY AGNOSTIC MULTI-CHANNEL PERSONALIZED SPEECH ENHANCEMENT

    Publication Number: WO2023059402A1

    Publication Date: 2023-04-13

    Application Number: PCT/US2022/040979

    Application Date: 2022-08-22

    Abstract: Examples of array geometry agnostic multi-channel personalized speech enhancement (PSE) extract speaker embeddings, which represent acoustic characteristics of one or more target speakers, from target speaker enrollment data. Spatial features (e.g., inter-channel phase difference) are extracted from input audio captured by a microphone array. The input audio includes a mixture of speech data of the target speaker(s) and one or more interfering speaker(s). The input audio, the extracted speaker embeddings, and the extracted spatial features are provided to a trained geometry-agnostic PSE model. Output data is produced, which comprises estimated clean speech data of the target speaker(s) that has a reduction (or elimination) of speech data of the interfering speaker(s), without the trained PSE model requiring geometry information for the microphone array.
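
The inter-channel phase difference (IPD) feature named in the abstract can be sketched as below. This is a generic illustration of IPD computation, not the patent's implementation; `ipd_features` is a hypothetical helper, and the (cos, sin) encoding is one common way to represent the phase difference.

```python
import math

# Hypothetical sketch of the IPD spatial feature: for each
# time-frequency bin, the IPD is the phase of one microphone channel
# relative to a reference channel. It is computed directly from the
# captured signals, without needing the microphone array geometry,
# which is what allows a geometry-agnostic PSE model to consume it.

def ipd_features(ref_phase, other_phase):
    """Phase values (radians) for two channels at matching T-F bins.
    Returns (cos, sin) of the phase difference per bin."""
    feats = []
    for p_ref, p_other in zip(ref_phase, other_phase):
        diff = p_other - p_ref
        feats.append((math.cos(diff), math.sin(diff)))
    return feats

# Toy example: three bins with phase offsets of 0, pi/2, and pi.
feats = ipd_features([0.0, 0.0, 0.0], [0.0, math.pi / 2, math.pi])
```

The (cos, sin) pair avoids the 2π wrap-around discontinuity of raw phase differences, which is why it is a popular encoding for neural-network inputs.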

    SYSTEMS AND METHODS FOR HUMAN LISTENING AND LIVE CAPTIONING

    Publication Number: WO2022250849A1

    Publication Date: 2022-12-01

    Application Number: PCT/US2022/026868

    Application Date: 2022-04-29

    Abstract: Systems and methods are provided for generating and operating a speech enhancement model optimized for generating noise-suppressed speech outputs for improved human listening and live captioning. A computing system obtains a speech enhancement model trained on a first training dataset to generate noise-suppressed speech outputs and an automatic speech recognition model trained on a second training dataset to generate transcription labels for spoken language utterances. A third training dataset comprising a set of spoken language utterances is applied to the speech enhancement model to obtain a first noise-suppressed speech output which is applied to the automatic speech recognition model to generate a noise-suppressed transcription output for the set of spoken language utterances. Speech enhancement model parameters are updated to optimize the speech enhancement model to generate optimized noise-suppressed speech outputs based on a comparison of the noise-suppressed transcription output and ground truth transcription labels.
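
The optimization idea in the abstract — tuning the speech enhancement model by how well a downstream ASR model transcribes its output, rather than by a signal-level loss — can be illustrated with the toy sketch below. Both models here are stand-ins (`toy_asr`, `tune_gain`, and the single-gain "model" are all hypothetical); a real system would backpropagate an ASR loss through the enhancement network's parameters.

```python
# Toy illustration of ASR-guided enhancement tuning: the "SE model" is
# a single gain parameter, the "ASR model" emits a token per frame
# above an energy threshold, and the loss compares the resulting
# transcription to ground-truth labels.

def toy_asr(signal, threshold=0.5):
    """Stand-in ASR: emits a token for each frame above the threshold."""
    return ["tok" if abs(x) > threshold else "" for x in signal]

def transcription_loss(hypothesis, reference):
    """Fraction of frames whose token disagrees with the reference."""
    wrong = sum(h != r for h, r in zip(hypothesis, reference))
    return wrong / len(reference)

def tune_gain(noisy, reference, gains):
    """Pick the SE gain whose enhanced output transcribes best."""
    return min(gains, key=lambda g: transcription_loss(
        toy_asr([g * x for x in noisy]), reference))

noisy = [0.3, 0.1, 0.4, 0.05]        # attenuated speech frames
reference = ["tok", "", "tok", ""]   # ground-truth transcription labels
gain = tune_gain(noisy, reference, [0.5, 1.0, 2.0])
```

Only the gain of 2.0 restores the speech frames above the toy ASR's threshold, so it minimizes the transcription loss — mirroring how the patent's parameter update is driven by the transcription comparison rather than by waveform fidelity.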

    HYPOTHESIS STITCHER FOR SPEECH RECOGNITION OF LONG-FORM AUDIO

    Publication Number: WO2022132405A1

    Publication Date: 2022-06-23

    Application Number: PCT/US2021/060423

    Application Date: 2021-11-23

    Abstract: A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific stitchers or multi-speaker stitchers, and may further support multiple options for differing hypothesis configurations.
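
The merge-and-stitch input format described in the abstract can be sketched as follows. This is a simplified illustration: the window-change symbol matches the abstract's description, but the consolidation rule here is a trivial stand-in (dropping a word repeated across a window boundary) for the network-based stitcher, and all function names are hypothetical.

```python
# Hypothetical sketch of hypothesis stitching: per-segment ASR
# hypotheses are merged into one sequence with a window-change symbol
# at each segment boundary, then consolidated into a single hypothesis.

WC = "<wc>"  # window change symbol

def merge_hypotheses(segment_hyps):
    """Join per-segment word lists, inserting <wc> at each boundary."""
    merged = []
    for i, hyp in enumerate(segment_hyps):
        if i > 0:
            merged.append(WC)
        merged.extend(hyp)
    return merged

def consolidate(merged):
    """Toy stand-in for the network-based stitcher: drop <wc> markers
    and any word that appears on both sides of a window boundary
    (overlapping windows often re-decode the same word)."""
    out = []
    for i, word in enumerate(merged):
        if word == WC:
            continue
        if i >= 2 and merged[i - 1] == WC and merged[i - 2] == word:
            continue  # duplicate across the boundary
        out.append(word)
    return out

segments = [["the", "cat", "sat"], ["sat", "on", "the", "mat"]]
merged = merge_hypotheses(segments)
final = consolidate(merged)
```

The disclosed alignment-based and serialized stitcher variants would replace `consolidate` with a trained network that reads the merged sequence, including the stitching symbols, and emits the consolidated hypothesis.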

    LOW-LATENCY SPEECH SEPARATION
    Invention Application

    Publication Number: WO2020205097A1

    Publication Date: 2020-10-08

    Application Number: PCT/US2020/019851

    Application Date: 2020-02-26

    Abstract: A system and method include reception of a first plurality of audio signals, generation of a second plurality of beamformed audio signals based on the first plurality of audio signals, each of the second plurality of beamformed audio signals associated with a respective one of a second plurality of beamformer directions, generation of a first TF mask for a first output channel based on the first plurality of audio signals, determination of a first beamformer direction associated with a first target sound source based on the first TF mask, generation of first features based on the first beamformer direction and the first plurality of audio signals, determination of a second TF mask based on the first features, and application of the second TF mask to one of the second plurality of beamformed audio signals associated with the first beamformer direction.
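
The final step of the pipeline above — applying the second TF mask to the beamformed signal for the selected direction — can be sketched as follows. This is a minimal illustration under stated assumptions: the beamformer bank, the direction estimation from the first mask, and the mask estimator are all out of scope, and `select_and_mask` is a hypothetical helper.

```python
# Hypothetical sketch of the beamform-then-mask step: given one
# beamformed signal per look direction and an estimated target
# direction, the second TF mask is applied to the beamformed signal
# for that direction to isolate the target source.

def select_and_mask(beamformed, direction_index, tf_mask):
    """beamformed: list of per-direction signals (lists of floats).
    tf_mask: per-sample mask values in [0, 1] for the chosen signal."""
    chosen = beamformed[direction_index]
    return [m * x for m, x in zip(tf_mask, chosen)]

# Toy example: three beam directions, four samples each; direction 1
# is assumed to have been identified as the target direction.
beams = [[0.1, 0.2, 0.1, 0.0],
         [1.0, 0.8, 0.9, 1.1],
         [0.3, 0.2, 0.4, 0.3]]
mask = [1.0, 0.5, 1.0, 0.0]
output = select_and_mask(beams, 1, mask)
```

Masking the already-beamformed signal, rather than the raw microphone signals, combines the spatial selectivity of the beamformer with the time-frequency selectivity of the mask, which is what enables low-latency separation.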
