- Title: AUDIO-VISUAL SPEECH SEPARATION
- Application number: US16761707
- Filing date: 2018-11-21
- Publication number: US20200335121A1
- Publication date: 2020-10-22
- Inventors: Inbar Mosseri, Michael Rubinstein, Ariel Ephrat, William Freeman, Oran Lang, Kevin William Wilson, Tali Dekel, Avinatan Hassidim
- Applicant: GOOGLE LLC
- International application: PCT/US2018/062330 WO 20181121
- Main classification: G10L21/10
- IPC classification: G10L21/10; G10L21/18; G10L15/16; G06K9/00; G06K9/62
Abstract:
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for audio-visual speech separation. A method includes: obtaining, for each frame in a stream of frames from a video in which faces of one or more speakers have been detected, a respective per-frame face embedding of the face of each speaker; processing, for each speaker, the per-frame face embeddings of the face of the speaker to generate visual features for the face of the speaker; obtaining a spectrogram of an audio soundtrack for the video; processing the spectrogram to generate an audio embedding for the audio soundtrack; combining the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate an audio-visual embedding for the video; determining a respective spectrogram mask for each of the one or more speakers; and determining a respective isolated speech spectrogram for each speaker.
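The data flow in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the learned networks are replaced by a hypothetical random linear map (`linear`), the combination step is assumed to be concatenation, video frames and spectrogram frames are assumed to be aligned one to one, and each mask is assumed to be applied by element-wise multiplication. None of these choices is stated in the abstract, and all function names, parameter names, and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)


def linear(x, out_dim):
    """Hypothetical stand-in for a learned network: random linear map + tanh."""
    w = rng.standard_normal((x.shape[-1], out_dim)) / np.sqrt(x.shape[-1])
    return np.tanh(x @ w)


def separate(face_embeddings, spectrogram, feat_dim=16):
    """Sketch of the pipeline in the abstract.

    face_embeddings: (speakers, frames, embed_dim) per-frame face embeddings.
    spectrogram:     (frames, freq_bins) magnitude spectrogram of the soundtrack.
    Returns (masks, isolated), each of shape (speakers, frames, freq_bins).
    """
    num_speakers = face_embeddings.shape[0]
    freq_bins = spectrogram.shape[-1]

    # Visual features for each speaker from that speaker's face embeddings.
    visual = [linear(face_embeddings[s], feat_dim) for s in range(num_speakers)]

    # Audio embedding from the soundtrack spectrogram.
    audio = linear(spectrogram, feat_dim)

    # Audio-visual embedding: concatenation along the feature axis (assumption).
    av = np.concatenate(visual + [audio], axis=-1)

    # One spectrogram mask per speaker, squashed into [0, 1] with a sigmoid.
    masks = []
    for _ in range(num_speakers):
        w = rng.standard_normal((av.shape[-1], freq_bins)) / np.sqrt(av.shape[-1])
        masks.append(1.0 / (1.0 + np.exp(-(av @ w))))
    masks = np.stack(masks)

    # Isolated speech spectrogram: mask times mixture spectrogram (assumption).
    isolated = masks * spectrogram[None, :, :]
    return masks, isolated


# Usage with made-up shapes: 2 speakers, 10 frames, 8-dim embeddings, 5 bins.
faces = rng.standard_normal((2, 10, 8))
mixture = np.abs(rng.standard_normal((10, 5)))
masks, isolated = separate(faces, mixture)
```

With random weights the masks carry no meaning. The sketch only shows how the shapes line up: one mask and one isolated spectrogram per detected speaker, each the same size as the mixture spectrogram.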
Published/granted documents:
- US11456005B2 Audio-visual speech separation (publication/grant date: 2022-09-27)