Audio-visual speech separation

Invention Grant

US11894014B2 Audio-visual speech separation 有权

Please log in to see more content

Patent Title: Audio-visual speech separation
Application No.: US17951002

Application Date: 2022-09-22
Publication No.: US11894014B2

Publication Date: 2024-02-06
Inventor: Inbar Mosseri , Michael Rubinstein , Ariel Ephrat , William Freeman , Oran Lang , Kevin William Wilson , Tali Dekel , Avinatan Hassidim
Applicant: Google LLC
Applicant Address: US CA Mountain View
Assignee: Google LLC
Current Assignee: Google LLC
Current Assignee Address: US CA Mountain View
Agency: Fish & Richardson P.C.
Main IPC: G10L25/57
IPC: G10L25/57 ; G10L15/16 ; G10L21/10 ; G10L21/18 ; G06V20/40 ; G06V40/16 ; G10L15/25 ; G06F18/214 ; G10L17/18

Abstract:

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for audio-visual speech separation. A method includes: obtaining, for each frame in a stream of frames from a video in which faces of one or more speakers have been detected, a respective per-frame face embedding of the face of each speaker; processing, for each speaker, the per-frame face embeddings of the face of the speaker to generate visual features for the face of the speaker; obtaining a spectrogram of an audio soundtrack for the video; processing the spectrogram to generate an audio embedding for the audio soundtrack; combining the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate an audio-visual embedding for the video; determining a respective spectrogram mask for each of the one or more speakers; and determining a respective isolated speech spectrogram for each speaker.

Public/Granted literature

US20230122905A1 AUDIO-VISUAL SPEECH SEPARATION Public/Granted day:2023-04-20

Information query

Espacenet

IPC分类:

G	物理
G10	乐器；声学
G10L	语音分析或合成；语音识别；语音或声音处理；语音或音频编码或解码
G10L25/00	不限于组G10L 15/00-G10L 21/00的语言或者声音分析技术(当利用语音检测器来感知一些信号特殊特征的基于半导体的静噪放大器，如无信号时的感知入H03G3/34)
G10L25/48	.专门适用于特定用途
G10L25/51	..比较或判别
G10L25/57	...用于处理视频信号