Multi-scale text filter conditioned generative adversarial networks
Abstract:
Systems and methods for processing video are provided. The method includes receiving a text-based description of active scenes and representing the text-based description as a word embedding matrix. The method includes using a text encoder, implemented by a neural network, to output a frame-level textual representation and a video-level representation of the word embedding matrix. The method also includes generating, by a shared generator, video frame by frame based on the frame-level textual representation, the video-level representation, and noise vectors. A frame-level and a video-level convolutional filter of a video discriminator are generated to classify the frames and the video of the frame-by-frame video as real or fake. The method also includes training a conditional video generator, which includes the text encoder, the video discriminator, and the shared generator, in a generative adversarial network to convergence.
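The pipeline in the abstract can be sketched as a minimal NumPy toy: a text encoder producing frame-level and video-level representations, a shared generator that synthesizes frames from text features plus noise, and a discriminator whose convolutional filters are generated from the text features. All dimensions, weight matrices, and function names below are illustrative assumptions, not details disclosed in the patent; real convolutions are reduced to dot products for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (illustrative only; not specified in the abstract)
embed_dim, n_frames, noise_dim, frame_dim = 8, 4, 6, 16

# Word-embedding matrix for the text description; in this toy setup
# each word conditions one generated frame.
word_embeddings = rng.normal(size=(n_frames, embed_dim))

# --- Text encoder: frame-level and video-level textual representations ---
def text_encoder(E):
    frame_repr = np.tanh(E)               # one text vector per frame
    video_repr = frame_repr.mean(axis=0)  # pooled clip-level vector
    return frame_repr, video_repr

# --- Shared generator: frame-by-frame synthesis from text + noise ---
W_g = rng.normal(size=(2 * embed_dim + noise_dim, frame_dim)) * 0.1

def shared_generator(frame_repr, video_repr, noise):
    frames = []
    for t in range(len(frame_repr)):
        z = np.concatenate([frame_repr[t], video_repr, noise[t]])
        frames.append(np.tanh(z @ W_g))
    return np.stack(frames)               # shape (n_frames, frame_dim)

# --- Discriminator with text-conditioned (generated) filters ---
W_f = rng.normal(size=(embed_dim, frame_dim)) * 0.1  # text -> frame-level filter
W_v = rng.normal(size=(embed_dim, frame_dim)) * 0.1  # text -> video-level filter

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator(frames, frame_repr, video_repr):
    # Frame-level filters are generated from frame-level text features,
    # then correlated with each frame (a 1x1 "convolution" here).
    frame_filters = frame_repr @ W_f                 # (n_frames, frame_dim)
    frame_scores = sigmoid((frames * frame_filters).sum(axis=1))
    # Video-level filter from the pooled text vector, applied to the clip mean.
    video_filter = video_repr @ W_v
    video_score = sigmoid(frames.mean(axis=0) @ video_filter)
    return frame_scores, video_score

frame_repr, video_repr = text_encoder(word_embeddings)
noise = rng.normal(size=(n_frames, noise_dim))
fake_video = shared_generator(frame_repr, video_repr, noise)
frame_scores, video_score = discriminator(fake_video, frame_repr, video_repr)
print(fake_video.shape, frame_scores.shape)  # (4, 16) (4,)
```

In the full method, both weight sets would be trained adversarially to convergence, with the generator fooling the text-conditioned frame-level and video-level classifiers.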