-
公开(公告)号:US12073844B2
公开(公告)日:2024-08-27
申请号:US17601042
申请日:2020-10-01
Applicant: Google LLC
Inventor: Anatoly Efros , Noam Etzion-Rosenberg , Tal Remez , Oran Lang , Inbar Mosseri , Israel Or Weinstein , Benjamin Schlesinger , Michael Rubinstein , Ariel Ephrat , Yukun Zhu , Stella Laurenzo , Amit Pitaru , Yossi Matias
IPC: G10L21/0208 , G10L17/00 , G10L21/0272 , G10L25/57
CPC classification number: G10L21/0208 , G10L17/00 , G10L21/0272 , G10L25/57 , G10L2021/02087
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for audio-visual speech separation. A method includes: receiving, by a user device, a first indication of one or more first speakers visible in a current view recorded by a camera of the user device, in response, generating a respective isolated speech signal for each of the one or more first speakers that isolates speech of the first speaker in the current view and sending the isolated speech signals for each of the one or more first speakers to a listening device operatively coupled to the user device, receiving, by the user device, a second indication of one or more second speakers visible in the current view recorded by the camera of the user device, and in response generating and sending a respective isolated speech signal for each of the one or more second speakers to the listening device.
-
公开(公告)号:US20240428816A1
公开(公告)日:2024-12-26
申请号:US18797400
申请日:2024-08-07
Applicant: Google LLC
Inventor: Anatoly Efros , Noam Etzion-Rosenberg , Tal Remez , Oran Lang , Inbar Mosseri , Israel Or Weinstein , Benjamin Schlesinger , Michael Rubinstein , Ariel Ephrat , Yukun Zhu , Stella Laurenzo , Amit Pitaru , Yossi Matias
IPC: G10L21/0208 , G10L17/00 , G10L21/0272 , G10L25/57
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for audio-visual speech separation. A method includes: receiving, by a user device, a first indication of one or more first speakers visible in a current view recorded by a camera of the user device, in response, generating a respective isolated speech signal for each of the one or more first speakers that isolates speech of the first speaker in the current view and sending the isolated speech signals for each of the one or more first speakers to a listening device operatively coupled to the user device, receiving, by the user device, a second indication of one or more second speakers visible in the current view recorded by the camera of the user device, and in response generating and sending a respective isolated speech signal for each of the one or more second speakers to the listening device.
-
公开(公告)号:US11894014B2
公开(公告)日:2024-02-06
申请号:US17951002
申请日:2022-09-22
Applicant: Google LLC
Inventor: Inbar Mosseri , Michael Rubinstein , Ariel Ephrat , William Freeman , Oran Lang , Kevin William Wilson , Tali Dekel , Avinatan Hassidim
IPC: G10L25/57 , G10L15/16 , G10L21/10 , G10L21/18 , G06V20/40 , G06V40/16 , G10L15/25 , G06F18/214 , G10L17/18
CPC classification number: G10L25/57 , G06F18/214 , G06V20/41 , G06V40/161 , G10L15/16 , G10L15/25 , G10L17/18 , G10L21/10 , G10L21/18
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for audio-visual speech separation. A method includes: obtaining, for each frame in a stream of frames from a video in which faces of one or more speakers have been detected, a respective per-frame face embedding of the face of each speaker; processing, for each speaker, the per-frame face embeddings of the face of the speaker to generate visual features for the face of the speaker; obtaining a spectrogram of an audio soundtrack for the video; processing the spectrogram to generate an audio embedding for the audio soundtrack; combining the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate an audio-visual embedding for the video; determining a respective spectrogram mask for each of the one or more speakers; and determining a respective isolated speech spectrogram for each speaker.
-
公开(公告)号:US20230122905A1
公开(公告)日:2023-04-20
申请号:US17951002
申请日:2022-09-22
Applicant: Google LLC
Inventor: Inbar Mosseri , Michael Rubinstein , Ariel Ephrat , William Freeman , Oran Lang , Kevin William Wilson , Tali Dekel , Avinatan Hassidim
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for audio-visual speech separation. A method includes: obtaining, for each frame in a stream of frames from a video in which faces of one or more speakers have been detected, a respective per-frame face embedding of the face of each speaker; processing, for each speaker, the per-frame face embeddings of the face of the speaker to generate visual features for the face of the speaker; obtaining a spectrogram of an audio soundtrack for the video; processing the spectrogram to generate an audio embedding for the audio soundtrack; combining the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate an audio-visual embedding for the video; determining a respective spectrogram mask for each of the one or more speakers; and determining a respective isolated speech spectrogram for each speaker.
-
公开(公告)号:US20230267942A1
公开(公告)日:2023-08-24
申请号:US17601042
申请日:2020-10-01
Applicant: Google LLC
Inventor: Anatoly Efros , Noam Etzion-Rosenberg , Tal Remez , Oran Lang , Inbar Mosseri , Israel Or Weinstein , Benjamin Schlesinger , Michael Rubinstein , Ariel Ephrat , Yukun Zhu , Stella Laurenzo , Amit Pitaru , Yossi Matias
IPC: G10L21/0208 , G10L25/57
CPC classification number: G10L21/0208 , G10L25/57 , G10L2021/02087
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for audio-visual speech separation. A method includes: receiving, by a user device, a first indication of one or more first speakers visible in a current view recorded by a camera of the user device, in response, generating a respective isolated speech signal for each of the one or more first speakers that isolates speech of the first speaker in the current view and sending the isolated speech signals for each of the one or more first speakers to a listening device operatively coupled to the user device, receiving, by the user device, a second indication of one or more second speakers visible in the current view recorded by the camera of the user device, and in response generating and sending a respective isolated speech signal for each of the one or more second speakers to the listening device.
-
公开(公告)号:US11456005B2
公开(公告)日:2022-09-27
申请号:US16761707
申请日:2018-11-21
Applicant: GOOGLE LLC
Inventor: Inbar Mosseri , Michael Rubinstein , Ariel Ephrat , William Freeman , Oran Lang , Kevin William Wilson , Tali Dekel , Avinatan Hassidim
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for audio-visual speech separation. A method includes: obtaining, for each frame in a stream of frames from a video in which faces of one or more speakers have been detected, a respective per-frame face embedding of the face of each speaker; processing, for each speaker, the per-frame face embeddings of the face of the speaker to generate visual features for the face of the speaker; obtaining a spectrogram of an audio soundtrack for the video; processing the spectrogram to generate an audio embedding for the audio soundtrack; combining the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate an audio-visual embedding for the video; determining a respective spectrogram mask for each of the one or more speakers; and determining a respective isolated speech spectrogram for each speaker.
-
公开(公告)号:US20200335121A1
公开(公告)日:2020-10-22
申请号:US16761707
申请日:2018-11-21
Applicant: GOOGLE LLC
Inventor: Inbar Mosseri , Michael Rubinstein , Ariel Ephrat , William Freeman , Oran Lang , Kevin William Wilson , Tali Dekel , Avinatan Hassidim
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for audio-visual speech separation. A method includes: obtaining, for each frame in a stream of frames from a video in which faces of one or more speakers have been detected, a respective per-frame face embedding of the face of each speaker; processing, for each speaker, the per-frame face embeddings of the face of the speaker to generate visual features for the face of the speaker; obtaining a spectrogram of an audio soundtrack for the video; processing the spectrogram to generate an audio embedding for the audio soundtrack; combining the visual features for the one or more speakers and the audio embedding for the audio soundtrack to generate an audio-visual embedding for the video; determining a respective spectrogram mask for each of the one or more speakers; and determining a respective isolated speech spectrogram for each speaker.
-
-
-
-
-
-