PHOTO-REALISTIC SYNTHESIS OF IMAGE SEQUENCES WITH LIP MOVEMENTS SYNCHRONIZED WITH SPEECH
    7.
    Type: Patent application
    Status: Granted

    Publication No.: US20120284029A1

    Publication date: 2012-11-08

    Application No.: US13098488

    Filing date: 2011-05-02

    IPC classification: G10L21/00

    CPC classification: G10L21/10, G10L2021/105

    Abstract: Audiovisual data of an individual reading a known script is obtained and stored in an audio library and an image library. The audiovisual data is processed to extract feature vectors used to train a statistical model. An input audio feature vector is then provided, corresponding to the desired speech with which a synthesized image sequence will be synchronized. The statistical model is used to generate a trajectory of visual feature vectors that corresponds to the input audio feature vector. These visual feature vectors are used to identify a matching image sequence from the image library. The resulting sequence of images, concatenated from the image library, provides a photorealistic image sequence with lip movements synchronized with the desired speech.
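The final matching step of the abstract (visual feature trajectory to image sequence) can be illustrated with a minimal sketch. It assumes the trajectory has already been generated by the statistical model and that every library image has a visual feature vector. The function name `select_image_sequence`, the Euclidean target cost, the smoothness-based concatenation cost, and the `concat_weight` parameter are all hypothetical choices made for illustration; the abstract does not specify how matching is performed.

```python
from math import dist  # Euclidean distance, Python 3.8+


def select_image_sequence(trajectory, library, concat_weight=1.0):
    """Pick one library image index per trajectory step (dynamic programming).

    trajectory: visual feature vectors produced by the model (assumed given).
    library:    one visual feature vector per stored image.
    Total cost = sum of distances to the targets
               + concat_weight * sum of distances between consecutive picks.
    Both cost terms are illustrative assumptions, not the patented method.
    """
    n = len(library)
    # Hypothetical concatenation cost: feature distance between two library images.
    concat = [[dist(a, b) for b in library] for a in library]
    # best[j] = minimal cost of any path that ends at library image j.
    best = [dist(trajectory[0], frame) for frame in library]
    back = []  # back[t][j] = best predecessor of image j at step t + 1
    for target in trajectory[1:]:
        prev_choice, new_best = [], []
        for j in range(n):
            i = min(range(n), key=lambda k: best[k] + concat_weight * concat[k][j])
            prev_choice.append(i)
            new_best.append(best[i] + concat_weight * concat[i][j]
                            + dist(target, library[j]))
        back.append(prev_choice)
        best = new_best
    # Trace the cheapest path back from the last step to the first.
    j = min(range(n), key=lambda k: best[k])
    path = [j]
    for prev_choice in reversed(back):
        j = prev_choice[j]
        path.append(j)
    path.reverse()
    return path
```

With `concat_weight=0` the sketch reduces to nearest-neighbour lookup per frame; larger values trade target accuracy for smoother transitions between the concatenated images.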


    Minimum Converted Trajectory Error (MCTE) Audio-to-Video Engine
    8.
    Type: Patent application
    Status: Granted

    Publication No.: US20120116761A1

    Publication date: 2012-05-10

    Application No.: US12939528

    Filing date: 2010-11-04

    IPC classification: G10L15/00

    Abstract: Embodiments of an audio-to-video engine are disclosed. In operation, the audio-to-video engine generates facial movement (e.g., a virtual talking head) based on input speech. The engine receives the input speech and recognizes it as a source feature vector. It then determines a Maximum a Posteriori (MAP) mixture sequence based on the source feature vector. The MAP mixture sequence may be a function of a refined Gaussian Mixture Model (GMM). The engine may then use the MAP mixture sequence to estimate video feature parameters, which are then interpreted as facial movement. The facial movement may be stored as data in a storage module and/or displayed as video on a display device.
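The two conversion steps named in the abstract (choosing a MAP mixture sequence, then estimating video feature parameters from it) can be sketched frame by frame. This is a simplified illustration under stated assumptions: the GMM is a joint audio/video model with diagonal audio covariance, each component is a plain dictionary with the hypothetical keys `weight`, `mean_x`, `var_x`, `mean_y`, and `cov_yx`, and the estimate is the per-component conditional mean. It does not implement the GMM refinement (the minimum converted trajectory error training) and it ignores any trajectory-level smoothing.

```python
from math import log, pi


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(-0.5 * (log(2 * pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def map_mixture_sequence(audio_frames, gmm):
    """For each audio frame, return the index of the most probable component.

    The posterior is proportional to weight * N(x | mean_x, var_x), so the
    argmax of the log joint equals the argmax of the posterior.
    """
    return [max(range(len(gmm)),
                key=lambda m: log(gmm[m]["weight"])
                + log_gauss_diag(x, gmm[m]["mean_x"], gmm[m]["var_x"]))
            for x in audio_frames]


def convert(audio_frames, gmm):
    """Estimate one video feature vector per audio frame (hypothetical helper)."""
    video = []
    for x, m in zip(audio_frames, map_mixture_sequence(audio_frames, gmm)):
        c = gmm[m]
        dx = [xi - mu for xi, mu in zip(x, c["mean_x"])]
        # Conditional mean: mean_y + cov_yx * inv(diag(var_x)) * (x - mean_x)
        video.append([my + sum(row[k] * dx[k] / c["var_x"][k]
                               for k in range(len(dx)))
                      for my, row in zip(c["mean_y"], c["cov_yx"])])
    return video
```

A usage example with a made-up two-component model: `convert([[1.0]], gmm)` selects the component whose audio mean is closest (after weighting) and shifts that component's video mean by the scaled audio offset.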
