-
11.
Publication number: US20230252993A1
Publication date: 2023-08-10
Application number: US17650020
Application date: 2022-02-04
Applicant: Adobe Inc.
Inventor: Yaman Kumar, Balaji Krishnamurthy
IPC: G10L15/25, G06T9/00, G06K9/62, G06V20/40, G06V10/82, G10L15/16, G10L13/02, G10L15/22, G10L25/57, G06N3/02
CPC classification number: G10L15/25, G06T9/002, G06K9/6223, G06V20/49, G06V10/82, G10L15/16, G10L13/02, G10L15/22, G10L25/57, G06N3/02
Abstract: This disclosure describes one or more implementations of systems, non-transitory computer-readable media, and methods that recognize speech from a digital video utilizing an unsupervised machine learning model, such as a generative adversarial neural network (GAN) model. In one or more implementations, the disclosed systems utilize an image encoder to generate self-supervised deep visual speech representations from frames of an unlabeled (or unannotated) digital video. Subsequently, in one or more embodiments, the disclosed systems generate viseme sequences from the deep visual speech representations (e.g., via segmented visemic speech representations from clusters of the deep visual speech representations) utilizing the adversarially trained GAN model. Indeed, in some instances, the disclosed systems decode the viseme sequences belonging to the digital video to generate an electronic transcription and/or digital audio for the digital video.
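Illustration (not part of the patent record): the abstract mentions segmented visemic speech representations derived from clusters of the deep visual speech representations. The sketch below shows one plausible reading of that step, assuming per-frame feature vectors and cluster centroids are already available. The function names `assign_clusters` and `segment_by_cluster`, the nearest-centroid assignment, and the mean pooling are all assumptions made for illustration; the abstract does not specify the clustering or pooling method.

```python
from itertools import groupby
from math import dist


def assign_clusters(frames, centroids):
    """Label each frame vector with the index of its nearest centroid.

    Assumption: Euclidean nearest-centroid assignment (the k-means
    assignment step); the abstract does not name a clustering method.
    """
    return [
        min(range(len(centroids)), key=lambda k: dist(frame, centroids[k]))
        for frame in frames
    ]


def segment_by_cluster(frames, labels):
    """Merge runs of consecutive frames that share a cluster label.

    Each run becomes one segment, represented by the mean of its frame
    vectors (mean pooling is an illustrative choice, not from the source).
    Returns a list of (label, pooled_vector) tuples in temporal order.
    """
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        chunk = frames[start:start + length]
        pooled = [sum(column) / length for column in zip(*chunk)]
        segments.append((label, pooled))
        start += length
    return segments
```

Calling `assign_clusters` on a list of frame vectors and passing the resulting labels to `segment_by_cluster` yields one pooled vector per run of identical labels. In the pipeline the abstract describes, such segments would be the input from which the GAN model generates viseme sequences.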
-
Publication number: US10937428B2
Publication date: 2021-03-02
Application number: US16298933
Application date: 2019-03-11
Applicant: Adobe Inc.
Inventor: Yaman Kumar
Abstract: A pose-invariant visual speech recognition system obtains a single view input of a speaker, such as a single video stream captured by a single camera. The single view input provides a particular pose of the speaker, which refers to a view angle, relative to the lens or image capture component of the camera that captured the video of the speaker, at which the speaker's face is captured. The pose of the speaker is used to select a visual speech recognition model to use to generate a text label that is the words spoken by the speaker. One or more additional view angles of the speaker are also generated from the single view input of the speaker. These one or more additional view angles, along with the single view input of the speaker, are used by the selected visual speech recognition model to generate the text label for the speaker.
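Illustration (not part of the patent record): the abstract states that the speaker's pose is used to select which visual speech recognition model to run. A minimal sketch of such a selection step follows, assuming one model exists per supported view angle and that the estimated pose is a single angle in degrees. The function name `select_model_pose`, the nearest-angle rule, and the example angles are hypothetical; the abstract does not say how the selection is made.

```python
def select_model_pose(pose_deg, supported_poses):
    """Return the supported view angle closest to the estimated pose.

    Assumption: each supported angle corresponds to one trained model,
    and "closest angle" is the selection rule (illustrative only).
    Ties resolve to the angle listed first in supported_poses.
    """
    if not supported_poses:
        raise ValueError("supported_poses must not be empty")
    return min(supported_poses, key=lambda angle: abs(angle - pose_deg))


# Hypothetical set of view angles, chosen only for the example.
EXAMPLE_POSES = [0, 30, 45, 60, 90]
chosen = select_model_pose(40.0, EXAMPLE_POSES)
```

The selected angle would then key into a lookup of models. Per the abstract, the chosen model receives the original single view together with the additionally generated view angles.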
-