Tsinghua alumni unveil a brand-new AI "video-to-video" method! A single A100 generates a blockbuster where a man turns into an ape in seconds

巴比特

Article Source: 新智元 (Xinzhiyuan)

Could this be the year of AI video generation models? UT Austin, in collaboration with the Meta team, has proposed a new V2V model, FlowVid, which can generate a highly consistent 4-second video in 1.5 minutes.

Image Source: Generated by Unbounded AI

NVIDIA's senior scientist Jim Fan believes that 2024 will be the year of AI video.

We have witnessed tremendous changes in the AI video generation field over the past year, with tools such as Runway's Gen-2 and Pika's Pika 1.0 achieving high fidelity and consistency.

At the same time, diffusion models have completely transformed image-to-image (I2I) synthesis and are gradually making their way into video-to-video (V2V) synthesis.

However, the challenge faced by V2V synthesis is how to maintain temporal coherence between video frames.

Researchers from the University of Texas at Austin and the Meta GenAI team have proposed FlowVid, a V2V synthesis framework that maintains consistency.

It achieves highly consistent synthesis by using spatial conditions and temporal optical flow information from the source video.

Paper link: https://arxiv.org/abs/2312.17681

The researchers encode optical flow by warping from the first frame and use it as a supplementary reference in the diffusion model.

This way, the model can edit the first frame with any popular I2I model and propagate those edits to the subsequent frames, achieving video synthesis.
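For a concrete sense of what editing the first frame with a popular I2I model looks like, here is a minimal sketch using the open-source diffusers library with InstructPix2Pix. The model choice, prompt, file paths, and parameters are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: edit a video's first frame with an off-the-shelf I2I model
# (InstructPix2Pix via diffusers). FlowVid is agnostic to which I2I model is
# used; this particular model and its parameters are assumptions.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

first_frame = Image.open("frame_000.png").convert("RGB")  # hypothetical path
edited_first_frame = pipe(
    "turn the man into a gorilla",       # edit instruction
    image=first_frame,
    num_inference_steps=20,
    image_guidance_scale=1.5,
).images[0]
edited_first_frame.save("edited_frame_000.png")
```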

It is worth mentioning that the latest method can generate a 4-second video, with 30 frames per second and a resolution of 512×512, in just 1.5 minutes.

At the same time, FlowVid seamlessly integrates with existing I2I models, supporting various modification methods, including stylization, object replacement, and local editing.

Netizens have dubbed it a game-changing paper.

Let's take a look at the powerful effects of FlowVid on video-to-video synthesis.

Demo

Original Video

Prompt: a woman wearing headphones, in flat 2d anime

Prompt: a Greek statue wearing headphones

Original Video

Prompt: a Chinese ink painting of a panda eating bamboo

Prompt: a koala eating bamboo

Original Video

Prompt: A pixel art of an artist's rendering of an earth in space

Prompt: An artist's rendering of a Mars in space

Original Video

Prompt: Ukiyo-e Art a man is pulling a rope in a gym

Prompt: A gorilla is pulling a rope in a gym

Prompt: a shirtless man is doing a workout in a park, with the Egyptian pyramids visible in the distance

Prompt: Batman is doing a workout in a park

Imperfect "Optical Flow" Control to Achieve Video Synthesis Consistency

Video-to-video (V2V) synthesis remains a challenging task. Compared to static images, videos have an additional time dimension.

Because text prompts are ambiguous, there are countless ways to edit frames to match a target prompt. However, directly applying I2I models to videos frame by frame often produces noticeable pixel flickering between frames, leading to inconsistencies.

To improve coherence between video frames, researchers have tried editing multiple video frames jointly with a spatiotemporal attention mechanism.

While this method has shown some improvement, it has not fully achieved the smooth transition between frames that we desire. The issue lies in the fact that the motion in the video is only implicitly preserved in the attention module.

Additionally, other research has utilized explicit optical flow guidance in videos.

Specifically, optical flow is used to determine pixel correspondences between video frames, enabling a pixel-level mapping from one frame to another. This correspondence is then used to generate occlusion masks for image inpainting, or to construct a reference frame.
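To make the pixel-level mapping concrete, below is a minimal sketch of backward-warping a frame with a dense optical flow field using PyTorch's grid_sample. The function name and tensor layout are illustrative and not taken from the FlowVid codebase.

```python
# Minimal sketch of flow-based warping: map pixels of a source frame onto a
# target frame using a dense optical flow field.
import torch
import torch.nn.functional as F

def warp_with_flow(src: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `src` (B, C, H, W) with `flow` (B, 2, H, W).

    flow[:, 0] is the horizontal displacement in pixels, flow[:, 1] the vertical.
    """
    b, _, h, w = src.shape
    # Build a pixel-coordinate grid, then displace it by the flow.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=src.device, dtype=src.dtype),
        torch.arange(w, device=src.device, dtype=src.dtype),
        indexing="ij",
    )
    x_new = xs.unsqueeze(0) + flow[:, 0]
    y_new = ys.unsqueeze(0) + flow[:, 1]
    # Normalize coordinates to [-1, 1] as expected by grid_sample.
    grid = torch.stack(
        (2.0 * x_new / (w - 1) - 1.0, 2.0 * y_new / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(src, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```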

However, if the optical flow estimation is inaccurate, this strict correspondence can lead to various issues.

In this latest paper, researchers attempt to leverage the advantages of optical flow technology while addressing the shortcomings in optical flow estimation.

Specifically, FlowVid warps the first frame onto subsequent frames using optical flow. These flow-warped frames keep the structure of the original frames but contain some occluded regions (shown in gray), as in Figure 2(b).

If optical flow is used as a hard constraint, for example to inpaint the occluded regions, an inaccurately estimated leg position simply persists in the result.

The researchers instead combine an additional spatial condition (such as the depth map in Figure 2(c)) with the temporal condition, since the leg position is correct in the spatial condition.

Therefore, the spatial-temporal conditions can correct imperfect optical flow, resulting in consistent and accurate results as shown in Figure 2(d).
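The occluded regions marked in gray in the flow-warped frames must be detected in the first place. One common heuristic is a forward-backward flow consistency check, sketched below; this is a generic technique and not necessarily the exact occlusion criterion used in the paper (it reuses warp_with_flow from the sketch above).

```python
# Sketch of a common occlusion test: a pixel is flagged as occluded when the
# forward and backward flows do not cancel out. Generic heuristic, not
# necessarily the criterion used in the FlowVid paper.
import torch

def occlusion_mask(flow_fwd: torch.Tensor, flow_bwd: torch.Tensor,
                   alpha: float = 0.01, beta: float = 0.5) -> torch.Tensor:
    """Return a boolean mask (B, 1, H, W); True marks occluded pixels.

    flow_fwd maps frame 1 -> frame 2, flow_bwd maps frame 2 -> frame 1.
    """
    # Bring the backward flow into frame 1's coordinates, then check whether
    # the round trip returns (approximately) to the starting pixel.
    flow_bwd_warped = warp_with_flow(flow_bwd, flow_fwd)  # defined in the sketch above
    diff = flow_fwd + flow_bwd_warped
    sq_diff = (diff ** 2).sum(dim=1, keepdim=True)
    sq_norm = (flow_fwd ** 2).sum(dim=1, keepdim=True) + \
              (flow_bwd_warped ** 2).sum(dim=1, keepdim=True)
    return sq_diff > alpha * sq_norm + beta
```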

Video Diffusion Model FlowVid

In the paper, the researchers build a video diffusion model on top of an inflated, spatially controlled I2I model.

The model is trained to predict the input video from spatial conditions (such as depth maps) and temporal conditions (the flow-warped video).
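As a rough mental model of how the two conditions could enter the denoiser, the sketch below simply concatenates the noisy latent, the depth map, and the flow-warped frames along the channel axis. This is a simplifying assumption for illustration only; the actual FlowVid architecture is built on an inflated, spatially controlled I2I model, and timestep embeddings and attention layers are omitted here for brevity.

```python
# Toy sketch of joint spatial/temporal conditioning via channel concatenation.
# Illustrative assumption, not the FlowVid architecture; all conditions are
# assumed to be resized/encoded to the latent resolution.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    def __init__(self, latent_ch=4, depth_ch=1, warp_ch=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_ch + depth_ch + warp_ch, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, latent_ch, 3, padding=1),  # predicts the noise
        )

    def forward(self, noisy_latent, depth_map, warped_latent):
        x = torch.cat([noisy_latent, depth_map, warped_latent], dim=1)
        return self.net(x)
```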

During the generation process, researchers use an edit-propagate process:

  • Edit the first frame using a popular I2I model.
  • Propagate the edits to the rest of the video using the trained model.

This decoupled design allows researchers to use an autoregressive mechanism: the last frame of the current batch can be the first frame of the next batch, enabling the generation of longer videos.
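A pseudocode-level sketch of this edit-propagate, autoregressive loop is shown below. edit_first_frame and propagate_batch are placeholder names standing in for an off-the-shelf I2I model and the trained video diffusion model; the batching details are assumptions.

```python
# Hedged sketch of edit-propagate generation with autoregressive batching.
# The helper functions are hypothetical placeholders, not FlowVid API calls.
def generate_long_video(frames, prompt, batch_size=16):
    # Step 1: edit the first frame with any popular I2I model.
    edited = [edit_first_frame(frames[0], prompt)]
    anchor = edited[0]
    # Step 2: propagate the edit batch by batch; the last edited frame of the
    # current batch serves as the "first frame" of the next batch.
    for start in range(1, len(frames), batch_size - 1):
        batch = frames[start:start + batch_size - 1]
        if not batch:
            break
        edited.extend(propagate_batch(anchor, batch, prompt))
        anchor = edited[-1]
    return edited
```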

The overall process of FlowVid is as follows:

(a) Training: First, obtain spatial conditions (predicted depth maps) from the input video and estimate its optical flow. Use the optical flow to warp the first frame onto all frames. The flow-warped video is expected to be structurally similar to the input video but contains some occluded regions (marked in gray, better visible when enlarged). The researchers train the video diffusion model on both the spatial conditions (c) and the optical flow information (f).

(b) Generation: Edit the first frame using an existing I2I model, then obtain the flow-warped edited video using the optical flow from the input video. The flow condition and the spatial condition jointly guide the synthesis of the output video.

Crushing SOTA Results

The researchers conducted a user study on 25 videos from the DAVIS dataset with 115 manually designed prompts.

Preference rate refers to how often a method was chosen in the human evaluation. Runtime refers to the time required to synthesize a 512×512, 4-second video on a machine equipped with an A100 80GB GPU. Cost is normalized relative to FlowVid.

The following is a qualitative comparison with representative V2V models.

FlowVid stands out in both temporal consistency and overall video quality.

It is evident that directly applying ControlNet to each frame still produces noticeable flickering, for example on the pirate's clothing and the tiger's fur.

CoDeF produces significantly blurred output results when there is a large amount of motion in the input video, such as on the person's hand and the tiger's face, which are quite noticeable.

Rerender often fails to capture large movements, such as the motion of the oar. Additionally, the color of the edited tiger's legs often blends into the background.

A pirate rowing on the lake

An oil painting of a walking tiger

A girl in Santa Claus costume standing in a snowy scene, 2D animation

In the quantitative comparison, researchers compared FlowVid with three models: CoDeF, Rerender, and TokenFlow.

As shown in the table below, FlowVid achieved a preference rate of 45.7%, significantly outperforming CoDeF (3.5%), Rerender (10.2%), and TokenFlow (40.4%).

Additionally, researchers compared the efficiency of the method with existing methods in Table 1. Due to different video lengths, the processing time varies.

Here, a 120-frame video (4 seconds, 30 FPS), with a resolution of 512×512, was used. Researchers generated 31 keyframes through two autoregressive evaluations and then used RIFE for interpolation of non-keyframes.
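The keyframe-plus-interpolation schedule can be pictured with the sketch below, where interpolate_pair stands in for a frame interpolator such as RIFE; the exact keyframe spacing used in the paper may differ.

```python
# Hedged sketch: fill the gaps between diffusion-generated keyframes with a
# frame interpolator. interpolate_pair is a placeholder for, e.g., a RIFE call.
def assemble_video(keyframes, frames_per_gap=3):
    out = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        out.append(a)
        for i in range(1, frames_per_gap + 1):
            t = i / (frames_per_gap + 1)      # interpolation timestep in (0, 1)
            out.append(interpolate_pair(a, b, t))
    out.append(keyframes[-1])
    # 31 keyframes with 3 in-betweens per gap -> 121 frames, roughly 4 s at 30 FPS.
    return out
```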

The total runtime, including image processing, model operations, and frame interpolation, was approximately 1.5 minutes.

This is significantly faster than CoDeF (4.6 minutes), Rerender (10.8 minutes), and TokenFlow (15.8 minutes), being 3.1 times, 7.2 times, and 10.5 times faster, respectively.

Ablation Experiments

In addition, researchers conducted color calibration and condition type ablation experiments.

As autoregressive generation proceeded from the first batch to the seventh, the results without color calibration drifted toward gray (left). With FlowVid's color calibration applied, the results remained stable (right).

A man running on Mars
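The article does not spell out the calibration procedure itself. A simple fix that counteracts this kind of gray drift is to match each generated frame's per-channel statistics to the edited first frame, as in the sketch below; this is an assumption for illustration, not necessarily FlowVid's method.

```python
# Hedged sketch of a simple color calibration: match per-channel mean and std
# of each generated frame to a reference (e.g., the edited first frame).
# Generic technique; FlowVid's actual calibration may differ.
import torch

def calibrate_colors(frame: torch.Tensor, reference: torch.Tensor,
                     eps: float = 1e-6) -> torch.Tensor:
    """frame, reference: (C, H, W) tensors with values in [0, 1]."""
    f_mean = frame.mean(dim=(1, 2), keepdim=True)
    f_std = frame.std(dim=(1, 2), keepdim=True)
    r_mean = reference.mean(dim=(1, 2), keepdim=True)
    r_std = reference.std(dim=(1, 2), keepdim=True)
    return ((frame - f_mean) / (f_std + eps) * r_std + r_mean).clamp(0, 1)
```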

Canny edge detection provides more detailed control (suitable for stylized processing), while the depth map provides higher editing flexibility (suitable for object replacement).

Limitations

Of course, FlowVid still has certain limitations, including:

the edited first frame failing to structurally match the original first frame (as in the elephant video above), and severe occlusions caused by fast motion (as in the ballet dancer video below).

Author Information

The first author, Feng Liang, is a doctoral student at the University of Texas at Austin.

Previously, he received his master's degree from Tsinghua University in 2019 and his bachelor's degree from Huazhong University of Science and Technology in 2016.

His research interests focus on efficient machine learning, multimodal learning, and their applications.

The corresponding author, Bichen Wu, is a researcher at Meta GenAI.

Prior to this, he received his doctoral degree from the University of California, Berkeley in 2019 and his bachelor's degree from Tsinghua University in 2013.

Reference:

https://huggingface.co/papers/2312.17681

https://arxiv.org/abs/2312.17681
