Patent search ap:("ADOBE INC.") AND inv:"Yaman Kumar" Page 2

11.

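Invention publication
VISUAL SPEECH RECOGNITION FOR DIGITAL VIDEOS UTILIZING GENERATIVE ADVERSARIAL LEARNING (Under examination, published)

Publication (announcement) number: US20230252993A1

Publication (announcement) date: 2023-08-10

Application number: US17650020

Application date: 2022-02-04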

Applicant: Adobe Inc.

Inventor: Yaman Kumar, Balaji Krishnamurthy

IPC: G10L15/25, G06T9/00, G06K9/62, G06V20/40, G06V10/82, G10L15/16, G10L13/02, G10L15/22, G10L25/57, G06N3/02

CPC classification number: G10L15/25, G06T9/002, G06K9/6223, G06V20/49, G06V10/82, G10L15/16, G10L13/02, G10L15/22, G10L25/57, G06N3/02

Abstract: This disclosure describes one or more implementations of systems, non-transitory computer-readable media, and methods that recognize speech from a digital video utilizing an unsupervised machine learning model, such as a generative adversarial neural network (GAN) model. In one or more implementations, the disclosed systems utilize an image encoder to generate self-supervised deep visual speech representations from frames of an unlabeled (or unannotated) digital video. Subsequently, in one or more embodiments, the disclosed systems generate viseme sequences from the deep visual speech representations (e.g., via segmented visemic speech representations from clusters of the deep visual speech representations) utilizing the adversarially trained GAN model. Indeed, in some instances, the disclosed systems decode the viseme sequences belonging to the digital video to generate an electronic transcription and/or digital audio for the digital video.
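
The abstract mentions segmented visemic speech representations built from clusters of the frame-level representations. A minimal sketch of one way such segmentation could work is shown here, assuming each frame has already been assigned a cluster ID and that consecutive frames with the same ID are merged by averaging their feature vectors. The function name `segment_by_cluster` and the mean-pooling rule are illustrative assumptions, not details taken from the patent.

```python
from itertools import groupby


def segment_by_cluster(cluster_ids, features):
    """Merge runs of consecutive frames that share a cluster ID.

    cluster_ids: one integer per frame (assumed output of a clustering step).
    features:    one feature vector (list of floats) per frame.
    Returns a list of (cluster_id, mean_feature_vector) pairs, one per run.
    Mean pooling is an assumption made for illustration only.
    """
    if len(cluster_ids) != len(features):
        raise ValueError("one cluster ID is required per frame feature")

    segments = []
    start = 0
    for cluster_id, run in groupby(cluster_ids):
        length = len(list(run))
        frames = features[start:start + length]
        dim = len(frames[0])
        # Average each feature dimension over the frames in this run.
        mean = [sum(f[d] for f in frames) / length for d in range(dim)]
        segments.append((cluster_id, mean))
        start += length
    return segments
```

In this sketch a video whose frames fall into clusters `[0, 0, 1, 1, 1, 0]` yields three segments, which a downstream generator could then map to a viseme sequence.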

12.

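Granted invention patent
Pose-invariant visual speech recognition using a single view input (In force)

Publication (announcement) number: US10937428B2

Publication (announcement) date: 2021-03-02

Application number: US16298933

Application date: 2019-03-11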

Applicant: Adobe Inc.

Inventor: Yaman Kumar

IPC: G10L15/22, G10L15/25, G06N3/08, G06N3/04, G06K9/00

Abstract: A pose-invariant visual speech recognition system obtains a single view input of a speaker, such as a single video stream captured by a single camera. The single view input provides a particular pose of the speaker, which refers to a view angle, relative to the lens or image capture component of the camera that captured the video of the speaker, at which the speaker's face is captured. The pose of the speaker is used to select a visual speech recognition model to use to generate a text label that is the words spoken by the speaker. One or more additional view angles of the speaker are also generated from the single view input of the speaker. These one or more additional view angles, along with the single view input of the speaker, are used by the selected visual speech recognition model to generate the text label for the speaker.
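
The flow described in this abstract (use the pose to select a model, generate additional view angles, then pass all views to the selected model) can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `nearest_pose` and `recognize`, the dictionary of per-pose models, and the `synthesize_view` callable are hypothetical and do not come from the patent.

```python
def nearest_pose(angle, supported):
    """Return the supported pose angle (in degrees) closest to the input angle."""
    return min(supported, key=lambda pose: abs(pose - angle))


def recognize(frames, angle, models, synthesize_view):
    """Pick a pose-specific model and run it on the real plus generated views.

    frames:          the single view input (any object the models accept).
    angle:           estimated view angle of the speaker, in degrees.
    models:          dict mapping a pose angle to a callable that takes a
                     list of views and returns a text label (hypothetical).
    synthesize_view: callable (frames, pose) -> generated view (hypothetical).
    """
    pose = nearest_pose(angle, tuple(models))
    # Generate one additional view for every other pose the system knows.
    extra_views = [synthesize_view(frames, p) for p in models if p != pose]
    # The selected model sees the original view together with the generated ones.
    return models[pose]([frames] + extra_views)
```

With stub models keyed by 0 and 45 degrees, an input captured at roughly 40 degrees would be routed to the 45-degree model together with one generated frontal view.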
