Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation

Anonymous Institute
* All generated video examples shown on this page are post-processed with frame interpolation and super-resolution.

Abstract

Inspired by the remarkable success of Latent Diffusion Models (LDMs) for image synthesis, we study LDMs for text-to-video generation, a formidable challenge due to the computational and memory constraints of both model training and inference. A single LDM is usually capable of generating only a very limited number of video frames. Some existing works rely on separate prediction models to generate more frames, but these suffer from additional training cost and frame-level jittering. In this paper, we propose a framework called "Reuse and Diffuse", dubbed VidRD, to produce more frames following the frames already generated by an LDM. Conditioned on an initial video clip with a small number of frames, additional frames are iteratively generated by reusing the original latent features and following the previous diffusion process. In addition, for the autoencoder that translates between pixel space and latent space, we inject temporal layers into its decoder and fine-tune them for higher temporal consistency. We also propose a set of strategies for composing video-text data with diverse content from multiple existing datasets, including video datasets for action recognition and image-text datasets. Extensive experiments show that our method achieves strong results in both quantitative and qualitative evaluations. Our project page is available here.

Model Architecture

The architecture of our model for text-guided video synthesis.

The architecture of our video diffusion model is derived from a text-to-image diffusion model. Modules marked with snowflakes are frozen, while those marked with flames are trainable. Modules in dashed boxes are newly added on top of the original 2D diffusion model. Most network layers are initialized with the pre-trained weights of Stable Diffusion, including the VAE and the spatial layers of the U-Net. Only the parameters of the two temporal modules in dashed boxes, Temp-Conv (a 3D convolution layer) and Temp-Attn (a temporal attention layer), are newly added and randomly initialized.
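
To make the temporal modules concrete, below is a minimal PyTorch-style sketch of how a 3D temporal convolution and a temporal attention layer can be attached to a frozen spatial block of a pre-trained 2D U-Net. This is an illustration, not the released code: the class names (TempConv, TempAttn, VideoBlock), tensor layouts, and initialization choices are assumptions.

# Minimal sketch of temporal modules added to a frozen 2D spatial block.
# All names, shapes, and signatures are illustrative assumptions.
import torch
import torch.nn as nn


class TempConv(nn.Module):
    """3D convolution over the time axis, initialized as a near-identity residual."""

    def __init__(self, channels: int):
        super().__init__()
        # Kernel (3, 1, 1): mixes information only across neighboring frames.
        self.conv = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        nn.init.zeros_(self.conv.weight)  # start as a no-op residual branch
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):  # x: (B, C, T, H, W)
        return x + self.conv(x)


class TempAttn(nn.Module):
    """Self-attention along the temporal dimension for each spatial location."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):  # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        # Treat every (batch, spatial position) pair as a sequence of T tokens.
        tokens = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        normed = self.norm(tokens)
        attended, _ = self.attn(normed, normed, normed)
        tokens = tokens + attended
        return tokens.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)


class VideoBlock(nn.Module):
    """Wraps a frozen spatial block (snowflake) with trainable temporal layers (flame)."""

    def __init__(self, spatial_block: nn.Module, channels: int):
        super().__init__()
        self.spatial = spatial_block.requires_grad_(False)  # frozen pre-trained weights
        self.temp_conv = TempConv(channels)                 # newly added, trainable
        self.temp_attn = TempAttn(channels)                 # newly added, trainable

    def forward(self, x):  # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        # Spatial layers see frames independently: fold time into the batch axis.
        y = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        y = y.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        return self.temp_attn(self.temp_conv(y))

Initializing the temporal branches to zero keeps the wrapped block equivalent to the original 2D model at the start of fine-tuning, so training only has to learn the temporal corrections.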

Iterative Video Generation

Iterative video generation with a single LDM.

Videos can be generated clip by clip, iteratively, with a single LDM. After each iteration, N frames are generated and the last M frames are used as prompt frames for the next iteration. Three key strategies are proposed for generating natural and smooth videos, as sketched in the code below. Frame-level Noise Reversion (FNR) reuses the initial noise of the previous video clip in reversed frame order. Past-dependent Noise Sampling (PNS) introduces new random noise for the last several video frames. Temporal consistency between video clips is further refined by Denoising with Staged Guidance (DSG).
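
The clip-by-clip loop can be summarized in a short sketch. Everything below is an assumption-laden illustration rather than the actual implementation: denoise_clip, the latent shape, and the way fresh noise replaces the reversed noise are placeholders, and DSG is only indicated by a comment since it acts inside the denoising steps.

# Hypothetical sketch of iterative clip-by-clip generation with one LDM.
# `denoise_clip(noise, text_emb, prompt_frames)` is a placeholder that runs the
# diffusion sampling loop (where Denoising with Staged Guidance would apply).
import torch


def generate_video(denoise_clip, text_emb, num_clips, N=8, M=2, latent_shape=(4, 32, 32)):
    """Generate `num_clips` clips of N latent frames each, reusing the last M frames."""
    clips = []
    noise = torch.randn(N, *latent_shape)  # initial noise for the first clip
    prompt_frames = None                   # last M latents of the previous clip

    for _ in range(num_clips):
        if prompt_frames is not None:
            # Frame-level Noise Reversion (FNR): reuse the previous clip's initial
            # noise in reversed frame order.
            noise = noise.flip(0)
            # Past-dependent Noise Sampling (PNS): fresh random noise for the last
            # few frames (the exact number and blending are assumptions here).
            noise[-M:] = torch.randn(M, *latent_shape)

        clip = denoise_clip(noise, text_emb, prompt_frames)  # (N, *latent_shape)
        clips.append(clip)
        prompt_frames = clip[-M:]  # condition the next iteration on these frames

    return torch.cat(clips, dim=0)

Because the same LDM is reused at every iteration, no separate prediction model is needed; only the initial noise and the prompt frames change between clips.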

Quantitative Evaluation

Quantitative evaluation on UCF-101.

Following previous works such as Make-A-Video and Video LDM, we use UCF-101, a video recognition dataset, to evaluate FVD (Fréchet Video Distance) and IS (Inception Score). Since UCF-101 provides only 101 brief class names such as "knitting" and "diving", we devise a descriptive prompt for each class for video synthesis in our experiments. Following Make-A-Video, 10K videos are generated by VidRD following the same class distribution as the training set. It is worth noting that the experimental settings of VideoFactory differ slightly from ours: VideoFactory generates 100 samples for each class. The quantitative results show that VidRD achieves the best FVD and IS while using far fewer videos for training. Meanwhile, fine-tuning the VAE decoder improves VidRD further, because a temporal-aware decoder restores pixels from latent features more accurately.
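
For reference, the sketch below shows one way to draw 10K evaluation prompts so that the class distribution matches the UCF-101 training split. The function and argument names (build_eval_prompts, class_counts, class_prompts) are hypothetical and not part of any official evaluation code.

# Hypothetical helper: sample evaluation prompts proportionally to the
# per-class video counts of the UCF-101 training split.
import random


def build_eval_prompts(class_counts, class_prompts, num_samples=10_000, seed=0):
    """class_counts: {class name: #training videos}; class_prompts: {class name: prompt}."""
    rng = random.Random(seed)
    classes = list(class_counts)
    weights = [class_counts[c] for c in classes]
    # Draw classes with probability proportional to their training-set frequency.
    sampled = rng.choices(classes, weights=weights, k=num_samples)
    return [class_prompts[c] for c in sampled]


# Example usage with hand-written descriptive prompts replacing terse class names:
# prompts = build_eval_prompts(ucf101_train_counts,
#                              {"Knitting": "a person knitting a scarf with wool", ...})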

BibTeX

@article{reuse2023,
  title     = {Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation},
  journal   = {arXiv preprint arXiv:2309.03549},
  year      = {2023}
}