Source: Synced
AI cannot understand videos by predicting in pixel space.
Image Source: Generated by Wujie AI
With internet text data close to running out, many AI researchers have turned their attention to video. But how to get AI to understand video data has become a new challenge.
During a panel at the 2024 World Economic Forum, Turing Award winner and Meta Chief AI Scientist Yann LeCun was asked about this. In his view, although there is no definitive answer yet, the models suited to processing video are not the generative models in wide use today; new models should instead learn to predict in an abstract representation space rather than in pixel space.
Also participating in the discussion was Daphne Koller, a professor at Stanford University and co-founder of Coursera. Her research focuses on artificial intelligence and its applications in biomedical sciences. She emphasized the importance of understanding causal relationships in building future AI systems.
The following is a transcript of the video:
Host: In some of the discussions I have joined at the World Economic Forum, people have said that our data is running out. Is that true? Is there not much left on the internet?
Daphne Koller: It's true.
Host: But autonomous vehicles may provide more data. Yann, what do you think?
Yann LeCun: I completely agree with Daphne's point. If we focus on LLMs, or autoregressive LLMs, we can see that they are being pushed toward their limits. There is no doubt that data is becoming increasingly scarce; we have essentially used up all the public text on the internet. These LLMs are trained on roughly ten trillion tokens. At about two bytes per token, that comes to about 2×10¹³ bytes of training data, which, at an average human reading speed, would take 150,000 to 200,000 years to get through.
Imagine how much a child sees through their eyes. Take a four-year-old and try to quantify the amount of visual information they have taken in: the optic nerve carries roughly 20 megabytes per second, and in the first four years of life a child is awake for about 16,000 hours, at 3,600 seconds per hour. That works out to roughly 10¹⁵ bytes of information. So the total amount of information a four-year-old has seen is about 50 times the data consumed by the largest LLMs we have.
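As a rough illustration, the back-of-envelope arithmetic behind these figures can be reproduced as follows; the reading pace is an assumed value chosen only to show how the quoted range can arise, while the other numbers are the ones given above.

```python
# Rough check of the back-of-envelope figures quoted above.
TOKENS = 10e12                       # ~10 trillion training tokens
BYTES_PER_TOKEN = 2
text_bytes = TOKENS * BYTES_PER_TOKEN                            # ≈ 2e13 bytes

OPTIC_NERVE_BYTES_PER_SEC = 20e6                                 # ~20 MB/s
WAKING_HOURS = 16_000                                            # first four years
visual_bytes = OPTIC_NERVE_BYTES_PER_SEC * WAKING_HOURS * 3_600  # ≈ 1.15e15 bytes

print(f"LLM training data:     {text_bytes:.1e} bytes")
print(f"Child's visual intake: {visual_bytes:.1e} bytes")
print(f"Ratio:                 {visual_bytes / text_bytes:.0f}x")  # ~58x, i.e. roughly 50x

# Reading time under an assumed pace of ~300 tokens per minute,
# 8 hours a day, 365 days a year: on the order of 190,000 years.
years = TOKENS / 300 / 60 / 8 / 365
print(f"Reading time:          ~{years:,.0f} years")
```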
And a four-year-old is far smarter than the largest LLM we have. The child's accumulated knowledge may look smaller, but that is because it is in a different form: the child has a rich understanding of how the world works, which we cannot get from LLMs today. We still need to invent new scientific methods and techniques so that future AI systems can use what they see the way a child does. That will require scientific and technological breakthroughs, which may come in one, three, five, or ten years; it is hard to say exactly when, because it is a hard problem.
Host: Let me make sure I understand your point. The amount of available text data may still grow, but it is not unlimited, while the amount of visual data we could feed into these machines is enormous, far exceeding the text data.
Yann LeCun: The 16,000 hours of video I mentioned is roughly what gets uploaded to YouTube in 30 minutes. So we have far more data than we can handle. The problem is how to get machines to learn from video, and we don't know how to do that.
Host: So if the next step is to process video input, what kind of new architecture is needed? Obviously large language models are not a good fit; their architecture is not suited to processing video. What do we need to build now?
Yann LeCun: Large language models, and NLP systems in general, are usually trained like this: take a piece of text, deliberately remove some of it, and train a huge neural network to reconstruct it, that is, to predict the deleted words. In other words, you corrupt the text by removing some words. Models like ChatGPT and Llama are trained this way; for them, you only remove the last word. Technically it is a bit more complicated, but that is the general idea: train the system to reconstruct the missing information in its input.
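A minimal sketch of this objective, assuming a toy model and random token IDs standing in for real text; it illustrates the idea of deleting the last token and training the network to predict it, not how ChatGPT or Llama is actually implemented.

```python
import torch
import torch.nn as nn

VOCAB = 1000
# Toy "language model": embed 7 context tokens and predict the 8th.
model = nn.Sequential(
    nn.Embedding(VOCAB, 64),
    nn.Flatten(),
    nn.Linear(64 * 7, VOCAB),
)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

tokens = torch.randint(0, VOCAB, (32, 8))       # stand-in for real token IDs
context, target = tokens[:, :7], tokens[:, 7]   # "delete" the last word
loss = loss_fn(model(context), target)          # predict the missing word
loss.backward()
opt.step()
```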
An obvious idea follows: why not try the same thing with images? Take a picture, corrupt it by removing a small part, and train a large neural network to restore it. But this doesn't work, or at least it doesn't work well. There have been many attempts, and none has been very successful. The same goes for video.
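For contrast, a toy version of the pixel-space objective LeCun says has not worked well might look like this; the tiny network and random tensors stand in for real images and are purely illustrative.

```python
import torch
import torch.nn as nn

# Tiny reconstruction network; real attempts use far larger models.
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 256),
    nn.ReLU(),
    nn.Linear(256, 3 * 32 * 32),
)

images = torch.randn(16, 3, 32, 32)       # stand-ins for real images
corrupted = images.clone()
corrupted[:, :, 8:24, 8:24] = 0.0         # "remove" a patch of each image

# Train the network to restore the original pixels from the corrupted input.
recon = net(corrupted).view_as(images)
loss = nn.functional.mse_loss(recon, images)
loss.backward()
```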
I have been working on video prediction for nine years: show the system a video and train it to predict what happens next. If a system could do this, it would probably have understood some basic rules of how the world works, just as a text system that predicts the next word has to understand the meaning of the sentence. But for video, it simply doesn't work.
Host: Are you saying that if you shoot a video of someone holding up a pen and then letting go, I should be able to predict that the pen will fall, but machines cannot do that yet?
Yann LeCun: The main problem is that the pen is held in a particular position, and when you let go it falls along a particular trajectory. Most of us cannot predict that trajectory precisely, but we can predict that the object will fall. It takes a baby about nine months to learn that unsupported objects fall. A baby can learn this intuitive physics in nine months, so how do we get machines to do the same?
Host: Wait a minute, I want to ask a possibly silly question. If these technologies are to keep working and keep improving, they need to be able to understand video, since that is where the data is. But we ourselves don't fully understand video, so how can this contradiction be resolved?
Yann LeCun: There is no real solution yet. But the most promising approach at the moment, and this may come as a surprise, is not generative.
The most effective models do not generate images, do not reconstruct, and do not predict directly. They predict in an abstract representation space. It is like the pen: I cannot predict exactly how the pen in your hand will fall, but I can predict that it will fall. At some abstract level I can make that prediction without knowing the pen's exact position, how it is held, or other specific details.
So we need to predict in an abstract representation space, not in the concrete pixel space. That is why pixel-space prediction has consistently failed: it is simply too complex.
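A minimal sketch of what predicting in a representation space rather than in pixels can look like, loosely in the spirit of joint-embedding predictive architectures; the encoders, sizes, and stop-gradient step below are illustrative assumptions, and the collapse-prevention machinery a real system needs is omitted.

```python
import torch
import torch.nn as nn

def make_encoder():
    # Illustrative encoder mapping a frame to a 128-dimensional representation.
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))

enc_context = make_encoder()       # encodes the observed frames
enc_target = make_encoder()        # encodes the future frames
predictor = nn.Linear(128, 128)    # predicts the future *representation*

frames_now = torch.randn(16, 3, 32, 32)    # stand-ins for video frames
frames_next = torch.randn(16, 3, 32, 32)

s_now = enc_context(frames_now)
with torch.no_grad():                      # target representation is held fixed
    s_next = enc_target(frames_next)

# The loss compares representations, never pixels.
loss = nn.functional.mse_loss(predictor(s_now), s_next)
loss.backward()
```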
Daphne Koller: But this is not just about video. I think another thing babies learn is the concept of causality: they learn by intervening in the world and observing what happens. Our LLMs have not done that yet. They are purely predictive engines that build associations without truly understanding causality. Understanding causality is essential for interacting with the physical world, especially when we try to connect digital information to it. That ability is missing from today's models: it is missing from practical applications, it is missing from computers' common-sense reasoning, and it is missing whenever we try to apply these systems to manufacturing, biology, or any other field that interacts with the physical world.
Yann LeCun: In embodied systems this actually works. Some systems are built on a model of the world: here is the state of the world at time t, and here are the actions I might take; what will the state of the world be at time t+1? That is what we call a world model. If you have such a world model, you can plan a sequence of actions to reach a particular goal.
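A toy sketch of that idea, with hand-written dynamics standing in for a learned world model and a simple random-shooting search standing in for a real planner; every function and number here is an illustrative assumption.

```python
import numpy as np

def world_model(state, action):
    """Predict the state at t+1 from the state and action at t (toy dynamics)."""
    return state + 0.1 * action

def plan(state, goal, horizon=5, candidates=256, rng=np.random.default_rng(0)):
    """Search for the action sequence whose predicted final state is closest to the goal."""
    best_seq, best_dist = None, np.inf
    for _ in range(candidates):
        seq = rng.uniform(-1, 1, size=(horizon, state.shape[0]))
        s = state
        for a in seq:                       # roll the world model forward
            s = world_model(s, a)
        dist = np.linalg.norm(s - goal)
        if dist < best_dist:
            best_seq, best_dist = seq, dist
    return best_seq

actions = plan(np.zeros(2), goal=np.array([0.3, -0.2]))
print(actions[0])   # first action of the best plan found
```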
Today we do not have any AI systems built on this principle, apart from some very simple robotic systems, and they do not learn very quickly. Once we can scale this up, we will have systems that understand the world, the physical world; that can plan, reason, and understand causality, because they know what effect an action will have; and that are goal-driven, because we can set goals for them through this planning. That is the future architecture of AI systems, and in my view, once we figure out how to do all of this, no one will want to use the current approach anymore.
Original video link: https://www.weforum.org/events/world-economic-forum-annual-meeting-2024/sessions/the-expanding-universe-of-generative-models/