SYSTEM AND METHOD FOR EFFICIENT TEXT-GUIDED GENERATION OF HIGH-RESOLUTION VIDEOS

    Publication (Announcement) No.: US20250111552A1

    Publication Date: 2025-04-03

    Application No.: US18819064

    Filing Date: 2024-08-29

    Abstract: Systems and methods are disclosed that train a content-frame motion-latent diffusion model (CMD) and use the CMD to generate requested videos. The CMD may be a two-stage framework that first compresses videos into a compact latent space and then learns the video distribution in that latent space. For instance, the CMD may include an autoencoder and two diffusion models. In the first stage, the autoencoder learns a low-dimensional latent decomposition of a video into a content frame and a latent motion representation. In the second stage, the content frame distribution may be learned by fine-tuning a pretrained image diffusion model without adding any new parameters, which allows the CMD to leverage the rich visual knowledge in pretrained image diffusion models. In addition, a new lightweight diffusion model may be used to generate motion latent representations conditioned on the given content frame.
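The generation-time data flow described in the abstract can be sketched as three steps: sample a content frame, sample a motion latent conditioned on that frame, then decode both into a video. The following is a minimal illustration only, not the claimed implementation. All function names, shapes, and default values are hypothetical, and each model is replaced by a toy stand-in built on random numbers so that the sketch runs with the Python standard library alone.

```python
import random


def sample_content_frame(prompt, height, width, rng):
    """Hypothetical stand-in for the fine-tuned image diffusion model.

    A real model would denoise toward an image matching the prompt;
    this toy version ignores the prompt and returns random pixel values.
    """
    return [[rng.random() for _ in range(width)] for _ in range(height)]


def sample_motion_latent(content_frame, prompt, motion_dim, rng):
    """Hypothetical stand-in for the lightweight motion diffusion model.

    The conditioning on the content frame is mimicked by centering the
    sampled latent on the frame's mean pixel value.
    """
    pixel_count = len(content_frame) * len(content_frame[0])
    mean_value = sum(sum(row) for row in content_frame) / pixel_count
    return [mean_value + rng.gauss(0.0, 1.0) for _ in range(motion_dim)]


def decode(content_frame, motion_latent, num_frames):
    """Hypothetical stand-in for the autoencoder's decoder.

    Maps (content frame, motion latent) to a list of frames. Here each
    frame is the content frame shifted by a small motion-dependent offset.
    """
    frames = []
    for t in range(num_frames):
        shift = 0.01 * motion_latent[t % len(motion_latent)]
        frames.append([[px + shift for px in row] for row in content_frame])
    return frames


def generate_video(prompt, num_frames=8, height=4, width=4, motion_dim=3, seed=0):
    """Content frame first, then motion latent, then decode to frames."""
    rng = random.Random(seed)
    content = sample_content_frame(prompt, height, width, rng)
    motion = sample_motion_latent(content, prompt, motion_dim, rng)
    return decode(content, motion, num_frames)


video = generate_video("a dog running on a beach", num_frames=5, height=2, width=3)
print(len(video), len(video[0]), len(video[0][0]))
```

The point of the sketch is the ordering: the motion latent is sampled only after the content frame exists, so the second model can take that frame as conditioning input. The toy functions say nothing about how the actual autoencoder or diffusion models are built or trained.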
