Original Source: SenseAI
Image Source: Generated by Wujie AI
In the global race among the new generation of AI unicorns, video generation is one of the most closely watched areas. Google recently introduced VideoPoet, a large language model that can generate video from text and images and also supports capabilities such as style transfer and video-to-audio generation. The richness and fluency of the motion it produces are impressive, and it is widely regarded as a groundbreaking zero-shot video generation tool. In this issue, we have the honor of interviewing Yu Lijun, a core author of VideoPoet, together with Google machine learning engineer Yishuai, to explore the technical thinking and applications behind it with SenseAI.
On the technical side, it is still unclear whether video generation will see further innovation within the existing technical framework or whether entirely new architectures will emerge. What seems certain is that video generation technology will go through another round of iteration this year, mature further, and ultimately trigger a new wave of application explosion.
On the application side, short videos will take the lead. On the demand side, consumers favor shorter durations and are more tolerant of variable quality; on the supply side, constrained by current algorithm architectures and compute costs, the return on investment (ROI) is not yet commercially viable. Among content categories, anime, natural scenery, and educational content look most promising.
The future of video will be a mix of filming and generation: filming will not be replaced and will remain an important source of material, while generation will serve as a strong supplement, an extension, and a way to visualize imagination.
Models are products: AI should adapt to and assist humans at every step of the creation and visualization process. The prerequisites are that the model has multi-modal input as its minimal unit of capability along with downstream generation and editing capabilities, and that interaction with the model is simple and dynamic, so that at any point in time and in any generation state, users can flexibly provide input and make edits, and the model will understand and generate on its own.
01 Background and Research Direction
The guest of this podcast, Yu Lijun, is currently a doctoral student in artificial intelligence at Carnegie Mellon University. His academic journey began at Peking University, where he majored in computer science and economics. At CMU his research focuses on multimedia, in collaboration with Dr. Alexander Hauptmann. Their team started with multimedia retrieval, gradually moved into video understanding, and ultimately focused on video generation. Yu is particularly dedicated to research on multi-modal large models, with an emphasis on multi-task generation. He also has a long-term collaboration with Google, where his mentor is Jiang Lu, a CMU graduate and currently a research scientist at Google working on video generation; much of their important research at Google revolves around this theme.
02 Technical Architecture Q&A
SenseAI: In the long term, will video generation models based on LLMs have more potential and advantages than Diffusion-style models? Will we see a trend where LLM-based architectures stand out in content quality and logical consistency while the visual quality of Diffusion-generated content converges to a similar level? Or will other trends emerge?
Dr. Yu: This is a very good question, and I generally agree with the points raised in it. LLMs have developed very well in the language domain; they have strong logical and reasoning abilities and now also show very good multi-modal generalization. So I believe that using an LLM as the backbone for video generation will be better than Diffusion models in terms of scalability and logical consistency. Of course, this is based on our current observations, and the Diffusion model may well make further progress. In terms of visual quality, however, it may gradually saturate, and we already see some products reaching a fairly usable level. In the future the focus may shift more toward advances in content, and these two technical paths are not mutually exclusive: we can use the LLM as the latent model to leverage its multi-modal zero-shot ability and logic, and then combine it with Diffusion's strength in visual quality for the final step of mapping from latent space back to pixel space. The future may well be a hybrid architecture.
SenseAI: Please introduce the unique architecture design of VideoPoet.
Dr. Yu: It is a conceptually very simple model. We use the causal Transformer structure of a large language model. This Transformer operates entirely in token space, covering image and video tokens, audio tokens, and text embeddings. We unify these modalities into token space using a dedicated tokenizer for each modality. For images and videos we use the MAGVIT-v2 tokenizer I designed previously, which can tokenize images and videos of any length into the same space with high reconstruction quality, ensuring the quality of our video generation. For audio we use the mature SoundStream tokenizer, and for text we use existing T5 embeddings. These modalities are mixed together, and we conduct a large amount of multi-modal, multi-task pre-training, enabling text-to-video, image-to-video, video-to-audio, style transfer, video editing, and various other applications.
(Reference: https://blog.research.google/2023/12/videopoet-large-language-model-for-zero.html)
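To make the pipeline above concrete, here is a minimal, hypothetical Python sketch of how modality-specific tokenizers might feed a single causal Transformer. All functions, vocabulary sizes, and compression ratios below are illustrative stand-ins, not the actual VideoPoet, MAGVIT-v2, SoundStream, or T5 APIs.

```python
# Hypothetical sketch: unifying text, video, and audio into one token
# sequence for a causal (decoder-only) transformer. The tokenizers here
# are stand-in stubs; VideoPoet uses MAGVIT-v2 for video, SoundStream
# for audio, and T5 embeddings for text.
import numpy as np

rng = np.random.default_rng(0)

VIDEO_VOCAB = 262_144      # assumed: a MAGVIT-v2-style large visual vocabulary
AUDIO_VOCAB = 4_096        # assumed: a SoundStream-style codebook size

def tokenize_video(frames: np.ndarray) -> np.ndarray:
    """Placeholder for a MAGVIT-v2-style tokenizer: (T, H, W, 3) -> token ids."""
    t, h, w, _ = frames.shape
    n_tokens = (1 + (t - 1) // 4) * (h // 8) * (w // 8)   # assumed compression ratios
    return rng.integers(0, VIDEO_VOCAB, size=n_tokens)

def tokenize_audio(waveform: np.ndarray) -> np.ndarray:
    """Placeholder for a SoundStream-style neural audio codec."""
    return rng.integers(0, AUDIO_VOCAB, size=len(waveform) // 320)

def embed_text(prompt: str) -> np.ndarray:
    """Placeholder for frozen T5 text embeddings (continuous, not discrete)."""
    return rng.normal(size=(len(prompt.split()), 512))

# Build one training example: text prefix, then video tokens, then audio tokens.
video = rng.random((17, 128, 128, 3))
audio = rng.random(32_000)
text_emb = embed_text("a raccoon dancing in times square")

video_tokens = tokenize_video(video)
audio_tokens = tokenize_audio(audio) + VIDEO_VOCAB     # shift into a shared id space
sequence = np.concatenate([video_tokens, audio_tokens])

print(f"text embeddings: {text_emb.shape}, discrete tokens: {sequence.shape}")
# A causal transformer is then trained to predict sequence[i+1] from the text
# prefix plus sequence[:i+1], exactly like next-token prediction in a language model.
```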
SenseAI: You train this LLM over a multi-modal vocabulary, and it can generate high-fidelity, long-duration videos with highly complex motion, as you mentioned. What value does the language model provide here? Are there high requirements on the choice of language model?
Dr. Yu: Here the language model is the crucial component. Of course, the choice of tokenizers also matters, since each tokenizer compresses its modality to a certain degree so that the language model can learn better. Ultimately we put all modalities into tokens, and every generation task is learned by the language model; after extensive pre-training, it generalizes and transfers well. There are high requirements on model scale: currently we need a large number of parameters in the language model to exhibit the capabilities we demonstrate. But in terms of the specific architecture, whether LLaMA, GPT, PaLM, or Gemini, I don't think it makes a significant difference at the moment. What matters most is that it is a causal language model.
SenseAI: This also ensures, or at least helps, that the model can continue to evolve as the backbone evolves, right?
Dr. Yu: Yes, we can always use the latest research in the language space to improve the quality of our video generation and multi-modal generation.
SenseAI: Understood. Since you mentioned the MAGVIT tokenizer, we would like to understand which performance aspects matter most when selecting this tokenizer, how it helps with the stability of video generation, and whether you might also try other types of tokenizers in the future.
Dr. Yu: In a language-model-style video generation model, the tokenizer is a very important module alongside the Transformer backbone. We started this line of work quite early, around last summer, when we developed the first version of the 3D tokenizer, which was the best available at the time. We then tried to scale things up, and once the Transformer model reached a certain size we found that it was still bottlenecked by the tokenizer. So around this summer we began work on the MAGVIT-v2 tokenizer, with two main goals. One was to significantly improve visual quality. The other was to use as large a vocabulary as possible, because previous visual tokenizers usually had vocabularies of only 1,000 to 8,000 entries, which is tiny for a language model and does not fully exploit its large parameter capacity; common language models usually have around 200K entries. In MAGVIT-v2, through an innovative quantization method, we can scale the vocabulary to 200K or even larger. We also made another change in MAGVIT-v2: we no longer use a purely 3D model. Full 3D modeling is already much better than 2D, but an even better variant is causal 3D modeling, which respects the natural structure of video along the time axis by always depending only on previous frames. This makes the first frame independent, which allows joint tokenization of images and videos as well as tokenization of arbitrarily long videos, and when combined with a causal LLM it makes predicting subsequent tokens much simpler, since the dependency is always one-directional.
Looking ahead, I think there is still a lot of room for improvement in tokenizers. They are still relatively small models, with only a few hundred million parameters, which is tiny compared with the VideoPoet Transformer, so scalability may remain a bottleneck. We will explore how to scale this model up and how to modify some of its current training objectives, for example replacing the GAN loss with a diffusion or consistency objective, which is also worth researching.
(Reference: https://magvit.cs.cmu.edu/v2/)
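As an illustration of the causal 3D idea described above, here is a small PyTorch sketch in which temporal padding is applied only on the past side, so each output frame depends only on the current and earlier frames. This is a hypothetical re-implementation for intuition, not the actual MAGVIT-v2 code.

```python
# Illustrative causal 3D convolution (not the actual MAGVIT-v2 implementation):
# the temporal axis is padded only on the "past" side, so the output at
# frame t depends on frames <= t, and frame 0 depends only on itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel
        self.pad_t = kt - 1                            # all temporal padding goes to the past
        self.pad_hw = (kw // 2, kw // 2, kh // 2, kh // 2)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel)

    def forward(self, x):                              # x: (B, C, T, H, W)
        x = F.pad(x, self.pad_hw + (self.pad_t, 0))    # (W_l, W_r, H_t, H_b, T_past, T_future=0)
        return self.conv(x)

# Because frame 0 only ever sees padding for its "past", a single image (T=1)
# can go through the same network, which is what allows joint tokenization of
# images and videos and streaming tokenization of arbitrarily long videos.
x_video = torch.randn(1, 3, 17, 64, 64)
x_image = torch.randn(1, 3, 1, 64, 64)
layer = CausalConv3d(3, 8)
print(layer(x_video).shape, layer(x_image).shape)
```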
SenseAI: So in the future, anywhere an image encoder is used, this tokenizer could be used instead. In that case, is it possible that diffusion-style models could also adopt the MAGVIT tokenizer?
Dr. Yu: This is a very good question. We do hope that in the future, anywhere an image encoder is needed, this tokenizer can be used. When we designed MAGVIT-v2, we conducted a comprehensive evaluation. First, it achieved strong results on standard benchmarks. Second, we used it for video compression and found that at the same bandwidth, its reconstruction quality is better than the H.265 codecs vendors currently use, and it can even compete with the next-generation H.266/VVC standard, although compression and decompression currently require more GPU or CPU resources. Third, we evaluated it for video understanding and found that its tokens are helpful for self-supervised learning and applications such as action recognition. As for diffusion models, that is also a very good question: there is recent work that uses the MAGVIT-v2 encoder and decoder and runs latent diffusion in that encoder-decoder latent space. So diffusion models can indeed use our tokenizer, or more precisely, this type of tokenizer has been paired with several different backbones and has performed very well in all of them.
SenseAI: The richness and plausibility of motion has always been a problem in video generation. In our discussions with other teams, they mentioned concepts similar to world models, which involve understanding how objects interact with the environment, as a way to address this. What is your view, and is there ongoing optimization in this area?
Dr. Yu: I think motion richness improved a lot in 2023, as everyone has seen. The key point is that at the beginning, most work started from an image model like Stable Diffusion and added a bit of temporal attention or temporal convolution to turn it into a video model. In that process the modeling of time is relatively weak, which is why the outputs do not move much and the motion is relatively poor. The MAGVIT series, along with other recent work, uses native 3D modeling, which learns transformations in time and space jointly, so the resulting videos have a greater range of motion and better coherence. As for how to make local motion more coherent, richer, and more plausible at a larger scale, that may depend on the intermediate large model: as its parameter capacity grows, its ability improves and, as you mentioned, it may develop a deeper understanding of the world, perhaps learning its physical laws, so that what it generates follows human common sense and looks reasonable. I believe this still needs some time, but I do believe in its potential. This may be a direction for future research on large multi-modal models: language may not be essential, and learning these laws purely from observing the natural world places higher demands on the model, which is also worth exploring.
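To illustrate the contrast drawn above, the sketch below shows the common "factorized" pattern of bolting a temporal attention layer onto per-frame spatial attention; the class and dimensions are hypothetical. Native 3D modeling would instead mix space and time jointly in every layer rather than confining cross-frame interaction to one light temporal layer.

```python
# Hypothetical factorized space/time block, illustrating the pattern of
# adding temporal attention on top of a pretrained image model.
import torch
import torch.nn as nn

class FactorizedSpaceTimeBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, T, H*W, C)
        b, t, n, c = x.shape
        # Spatial attention: each frame attends only within itself.
        xs = x.reshape(b * t, n, c)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = x + xs.reshape(b, t, n, c)
        # Temporal attention: each spatial position attends only across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, c)
        xt, _ = self.temporal_attn(xt, xt, xt)
        x = x + xt.reshape(b, n, t, c).permute(0, 2, 1, 3)
        return x

block = FactorizedSpaceTimeBlock()
video_latents = torch.randn(2, 8, 16 * 16, 64)   # 8 frames of 16x16 latents
print(block(video_latents).shape)                # (2, 8, 256, 64)
```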
SenseAI: A small detail mentioned toward the end of the paper is super-resolution. You note that other papers use it as well, so we wonder whether every video generation model needs it. How much room for exploration is there here, and is the architecture relatively universal, especially across LLM and Diffusion approaches?
Dr. Yu: The need for super-resolution ultimately arises because the native model outputs relatively low resolution and short duration, so we graft another model onto it to achieve a better-looking result. Once the native model can output higher resolution, as some image work already does, super-resolution may no longer be needed. At this stage, however, video is a high-dimensional domain with high efficiency requirements, so super-resolution will likely still be used for some time. As for architectural universality, the architecture we currently use is a masked transformer, which is faster than diffusion for super-resolution; common diffusion models instead learn super-resolution with their own diffusion objective. So at least so far there is not much sharing in the technical route for super-resolution, though that may change in the future. There is also a problem with super-resolution training: currently everyone uses teacher forcing, learning from pairs of original low-resolution and high-resolution video. A better approach may be student forcing, learning to go from the previous stage's actual model output to high resolution, which reduces distribution shift, but it requires training a dedicated super-resolution model for each base model, which may reduce universality.
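Below is a minimal sketch of the teacher-forcing versus student-forcing distinction for a super-resolution stage. All models are placeholder stubs; the only point being illustrated is where the low-resolution training input comes from.

```python
# Illustrative contrast between teacher forcing and student forcing when
# training a video super-resolution stage. All models here are stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def downsample(video_hr):
    """Ground-truth low-res input: subsample the high-res clip (teacher input)."""
    return video_hr[:, ::4, ::4]

def base_model_output(prompt):
    """Stand-in for the first-stage generator's own low-res output."""
    return rng.random((17, 32, 32))            # imperfect, distribution-shifted

def train_step(sr_model, low_res, high_res):
    """Placeholder for one gradient step pushing sr_model(low_res) toward high_res."""
    pass

high_res = rng.random((17, 128, 128))          # ground-truth high-res clip

# Teacher forcing: the SR model only ever sees clean downsampled ground truth,
# so at inference it faces a distribution shift when fed generated video.
train_step(sr_model=None, low_res=downsample(high_res), high_res=high_res)

# Student forcing: the SR model trains on the base model's actual outputs,
# shrinking the train/inference gap, but the SR stage must then be retrained
# for every base model, reducing reusability.
train_step(sr_model=None, low_res=base_model_output("a surfing raccoon"), high_res=high_res)
```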
SenseAI: We are also curious about data. Data collection and processing have always been crucial for video generation. Can you tell us about the data selection work done in VideoPoet? Do you have methods or tools for large-scale data processing? And given the current shortage of video data, have you observed any type of data that would greatly help generation quality but has not yet been collected?
Dr. Yu: Data is indeed very important for these models, as many works have shown. However, our main focus in this work was on the model, so there was relatively less emphasis on data processing. From other research we have seen that the selection, curation, and labeling of data matter a great deal, especially for aesthetics-related generation quality.
SenseAI: Beyond data, the paper also includes some analysis of social responsibility and fairness, which is very forward-thinking. Can you talk about how you specifically address and balance this?
Dr. Yu: Some of our collaborators conducted an analysis of the model's social fairness, with some very interesting findings. For a given set of prompts, the model's outputs tend to favor younger people, roughly aged 18 to 35, as well as males and people with lighter skin tones. We can then design prompts to bring the output distribution closer to the real distribution. This research is still at an early stage, and we hope these observations can later be used to optimize the model from the data side, making it more responsible and minimizing bias as much as possible.
SenseAI: We are also curious where VideoPoet will continue to be optimized, including the combination of multi-modal inputs and the internal design of the language model. Are there still innovations to be made? In terms of performance, where do you hope to keep improving?
Dr. Yu: Indeed, since we were the first to build language-model-style video generation, there is still a lot of room for improvement. The language model opens up a large space for imagination, such as strong zero-shot capability and even in-context learning. One direction is whether we can scale the model further so that pre-training needs no task-specific design at all; instead, at inference time, a small amount of instruction tuning or even just a few examples could teach it new tasks. For example, we might teach it video segmentation at very low cost. This would indicate that the model already has a strong understanding of the world and can, at relatively low cost, learn something like Newton's laws. These are very interesting questions from a research perspective. From an application perspective, customized generation at very low cost is also a very interesting possibility.
In terms of performance, for our model and other video generation models alike, the biggest bottleneck is the length of video that can be generated. In addition, at a fixed duration, we want the native resolution to be as high as possible. This brings us back to the LLM backbone: we hope it can provide stronger support for long context, for example several hundred thousand tokens, without efficiency dropping too much, so that within a reasonable cost we can support generating longer and higher-resolution videos. Furthermore, a single model should learn all of these tasks together, which will greatly improve the logical coherence of combined multi-modal content.
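To see why long context matters, here is a back-of-the-envelope token budget, assuming roughly 8x spatial and 4x temporal compression (ballpark ratios for a MAGVIT-v2-style tokenizer; the exact numbers here are assumptions, not figures from the paper).

```python
# Rough token-budget arithmetic for a causal video LM. The compression
# ratios and frame rate below are assumptions for illustration.
SPATIAL_DOWNSAMPLE = 8      # assumed: 8x8 pixels -> 1 token spatially
TEMPORAL_DOWNSAMPLE = 4     # assumed: ~4 frames -> 1 latent frame

def video_tokens(frames: int, height: int, width: int) -> int:
    latent_frames = 1 + (frames - 1) // TEMPORAL_DOWNSAMPLE   # first frame kept separate
    return latent_frames * (height // SPATIAL_DOWNSAMPLE) * (width // SPATIAL_DOWNSAMPLE)

for frames, h, w in [(17, 128, 128), (17, 512, 512), (241, 512, 512)]:
    seconds = frames / 8                                       # assuming 8 fps
    print(f"{seconds:5.1f}s at {h}x{w}: {video_tokens(frames, h, w):,} tokens")
# Under these assumptions, ~2 s at 128x128 already costs ~1.3K tokens, and a
# 30 s clip at 512x512 runs into the hundreds of thousands, which is why
# long-context, efficient backbones matter so much for video.
```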
SenseAI: Recently we have seen many excellent video generation papers. Do you feel that a renaissance of video generation is coming? In the short term, are there still opportunities for disruptive new architectures, or mainly scaling of existing ones?
Dr. Yu: We have seen a lot of work recently, and the field of video generation is thriving. I am very confident that in 2024 video generation may truly move toward practical applications. As for whether there will be new architectures or just scaling within existing ones, I think we may still see another round of technical iteration within a few months, and the technology may move toward maturity by the end of the year.