TARGETED VOICE SEPARATION BY SPEAKER FOR SPEECH RECOGNITION
Abstract:
Processing of acoustic features of audio data to generate one or more revised versions of the acoustic features, where each of the revised versions of the acoustic features isolates one or more utterances of a single respective human speaker. Various implementations generate the acoustic features by processing audio data using portion(s) of an automatic speech recognition system. Various implementations generate the revised acoustic features by processing the acoustic features using a mask generated by processing the acoustic features and a speaker embedding for the single human speaker using a trained voice filter model. Output generated over the trained voice filter model is processed using the automatic speech recognition system to generate a predicted text representation of the utterance(s) of the single human speaker without reconstructing the audio data.
Public/Granted literature
Information query
Patent Agency Ranking
0/0