Mei Tao: HiDream.ai's video generation has broken through the industry's 4-second bottleneck and can now support over 15 seconds.

Source: Synced

Image Source: Generated by Wujie AI

The war of text-to-image is not over yet, but video generation is already accelerating.

Since Pika went viral, progress in video generation has become the public's new benchmark for AIGC. Musk has gone so far as to predict that next year will be the "year of AI movies."

Compared with text-to-image, the AI video generation tools currently on the market deliver results that are, in practical use, far from satisfying.

Within the 4-5 second limit, how well mainstream tools understand a prompt varies widely in practice. Entering "a cat playing the violin in the forest" may produce a cat with a human body, one cat paw and one human hand, a cat without a violin, or a violin that is never played.

In practical applications, today's text-to-video technology still struggles with accuracy, consistency, and duration limits, and remains far from the vision of "AI movies."

Synced interviewed HiDream.ai, a visual multimodal large-model company, back in June. The company recently told us that its upcoming product breaks the 4-second limit commonly faced by Runway and Pika and supports a generation duration of around 15 seconds.

HiDream.ai was founded in March this year by Dr. Meitao, former Vice President of JD.com and former senior researcher at Microsoft Research. He is a foreign fellow of the Canadian Academy of Engineering, an IEEE/IAPR/CAAI Fellow, the Chinese scholar with the most international best-paper awards in the field of multimedia (15 awards), and chief scientist of the Ministry of Science and Technology's 2030 AI major project.

Dr. Meitao told Synced that the HiDream.ai team has taken its own approach to text-to-video: rather than converting text directly into video, it first converts text into images to generate keyframes, then expands those keyframes along the time dimension.

"This method not only improves the stability, detail processing, and aesthetics of video generation, but also provides the possibility of extending the duration of video generation. Starting from a short prompt, a large language model can automatically generate a script for each shot. Then, for each shot script, a keyframe is generated in the manner of 'text-to-image,' and these keyframes are converted into single-shot videos through 'image-to-video' methods. Finally, these videos are spliced into a complete video, forming a multi-shot video of 15 seconds or even longer."

Duration is an important constraint on commercial applications of video generation. A 15-second clip can basically cover common short-video needs, whereas 4 seconds is hard to put to practical use.

HiDream.ai's new text-to-video approach builds largely on the company's own foundation. From the outset, HiDream.ai invested heavily in text-to-image and developed its own multimodal base model.

In an interview with Synced six months ago, Dr. Meitao set the goal of "surpassing the latest version of Stable Diffusion at the base-model level by the end of this year, and surpassing Midjourney on the product side." He now tells Synced that this goal was achieved ahead of schedule, in November.

In the past six months, based on their underlying visual large model, HiDream.ai has launched the image generation platform "Pixeling Qianxiang" and the AIGC tool "PixMaker" for e-commerce platforms, both of which have made certain progress in commercialization.

In terms of funding, beyond its first round of investors made up of USTC alumni, HiDream.ai has in the past six months also received backing from the USTC iFLYTEK Venture Capital Fund and Alpha Startups, raising nearly 100 million RMB across two rounds. Dr. Meitao revealed that a third round is now underway and is expected to close in Q1 2024.

Breaking the industry's 4-second bottleneck: 15-second video generation now supported

Synced: Pika has been particularly popular recently. From your perspective, what might be the reason for its popularity and why has it attracted so much attention?

Dr. Meitao: We usually do not comment on any specific company. However, from the perspective of the entire industry, six months ago when we proposed to do multimodal video generation, no one believed it could be achieved. But now, some initial results have been seen.

In fact, whether it's Runway or Pika, although the product functionality is still relatively basic, these companies have gradually gained attention, which is a good sign, indicating that everyone has realized the huge market potential of video generation.

Synced: Six months have passed. What stage has text-to-video technology reached across the field?

Dr. Meitao: Taking image generation as an analogy: if GPT has versions 1.0, 2.0, 3.0, and 4.0, image generation is at roughly the 2.0 stage. The largest model we are currently working on has 10 billion parameters, which roughly corresponds to that 2.0 stage, and we may be moving toward 3.0. Video generation, however, is probably still at an earlier 1.0 stage.

There are many technical challenges in the field of video generation. Firstly, a good video generation model requires an excellent image generation model, which is a high threshold for video generation.

Secondly, video generation also needs to effectively handle the continuity and consistency of movements within a single shot. If pixel-level motion prediction in the time domain cannot be achieved, motion blur or illogical distortion may occur.

Thirdly, in terms of video editing, it is currently very difficult to modify specific elements, such as changing the movement of a certain object or character, because maintaining continuity and consistency is very challenging. At the same time, achieving consistency for a specific IP across multiple shots in long video generation is also very challenging.

We now believe that from the perspective of video generation itself, directly generating videos from text is not reliable.

First, it involves jumping from a one-dimensional signal (text) to a three-dimensional signal (video), skipping over the two-dimensional image in between; recovering a high-dimensional signal from a low-dimensional one is an extremely difficult technical problem.

Second, text-to-video involves a lot of uncertainty. With current computing power, after entering a text prompt you typically wait several minutes to get a 4-second result, and that result may not be what you want. We later found that a considerable number of our users first generate keyframes from a prompt and then expand them along the time dimension, which makes the results more controllable.

So although many companies today have good ideas for generating video directly from text, our team believes the path from text to images and then to video may be better.

In the future, we may continue this approach, achieving a transition from text to images to single-shot videos, and then to multiple shots and storylines. This means that within the system, we need to help users write scripts, then divide a simple prompt into different shots, and each sub-shot can be solved using our current method. That is, first solve the "shooting" process in the traditional video production process.

Of course, we also need to continue to address the continuity inside and outside the shots, including the consistency and coherence of motion discussed today, which means that the same person or IP should maintain consistency in the next shot after appearing in one shot, and should not change arbitrarily. This is actually a challenge. Therefore, our approach is basically from single shots to multiple shots, from simple semantics to complex script generation.

Synced: What might be the differences in the effects between directly generating videos from text and first generating images from text and then generating videos?

Dr. Meitao: For text-to-video, users usually do not perceive whether the generation process has gone through the stages of text-to-image and then image-to-video.

From a design perspective, the user-facing flow is unchanged: enter a prompt, wait, and get a video. But inside our system, converting text to images first and then generating the video brings more certainty to the process. Generating video directly from text tends to produce more distortion or mutation; because the quality of text-to-image generation is relatively high, the step from images to video is more controllable.

We believe that the method of transitioning from text to images and then to videos will improve the stability, detail processing, visual authenticity, and aesthetics of video generation, all based on the feedback we have received from users.

In addition, even if users choose to generate videos with one click, we still implicitly generate a keyframe image internally. If users want more options, we also provide the option for them to view this image first and then expand it to a video based on this image. This provides more possibilities for users.
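
As a small illustration of that design, a hypothetical front end might expose the two modes like this; the function names are placeholders, and both paths share the same internal keyframe stage.

```python
from typing import Any, Callable, Optional

Image = Any
Clip = Any

def one_click_video(prompt: str,
                    text_to_image: Callable[[str], Image],
                    image_to_video: Callable[[Image, str], Clip]) -> Clip:
    keyframe = text_to_image(prompt)      # generated internally, never shown to the user
    return image_to_video(keyframe, prompt)

def preview_then_video(prompt: str,
                       text_to_image: Callable[[str], Image],
                       image_to_video: Callable[[Image, str], Clip],
                       user_approves: Callable[[Image], bool]) -> Optional[Clip]:
    keyframe = text_to_image(prompt)      # shown to the user first
    if not user_approves(keyframe):
        return None                       # user regenerates or edits the keyframe instead
    return image_to_video(keyframe, prompt)
```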

Synced: Following the 1.0/2.0/3.0 framing above, what does each stage of text-to-video look like?

Dr. Meitao: The 1.0 version of text-to-video mainly addresses the generation of single-shot videos of about 4 to 5 seconds.

Currently, the 1.0 version can get close to the industry standard on subjects such as cartoons, animation, science fiction, and scenery shots without people. But some issues remain unsolved, especially the continuity and detail of videos, such as characters' gestures and micro-expressions, and interactions between multiple characters (for example, making a child smile or two people shaking hands).

For the 2.0 version, our goal is to address the duration of single-shot videos, extending a single shot from the current 4 seconds to around 7 seconds or even longer. A single shot in a finished video is generally not very long, so 7 seconds is largely sufficient. If longer single shots are needed, we can also extend them in the time domain by continuing from the last frame of the video.
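
A minimal sketch of that last-frame extension idea, assuming generic image-to-video and frame-extraction callables rather than any real HiDream.ai interface:

```python
from typing import Any, Callable, List

Frame = Any
Clip = Any

def extend_shot(
    first_keyframe: Frame,
    script: str,
    image_to_video: Callable[[Frame, str, float], Clip],  # animates one segment
    last_frame_of: Callable[[Clip], Frame],                # extracts the final frame
    join: Callable[[List[Clip]], Clip],                    # concatenates segments
    segment_seconds: float = 4.0,
    target_seconds: float = 7.0,
) -> Clip:
    """Keep feeding the last frame back in until the shot reaches the target length."""
    segments: List[Clip] = []
    current = first_keyframe
    generated = 0.0
    while generated < target_seconds:
        seg = image_to_video(current, script, segment_seconds)
        segments.append(seg)
        current = last_frame_of(seg)   # condition the next segment on the previous last frame
        generated += segment_seconds
    return join(segments)
```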

The goal of the 3.0 version is to transition from single shots to handling multiple shots, being able to tell longer stories, such as a one-minute video, which may contain 10 to 20 or more shots. This requires solving various detailed issues, such as the consistency of IPs, the continuity of motion, and the handling of multiple camera positions.

For the 1.0 version, we can already do well in videos of cartoon, animation, and science fiction styles. However, for real-life videos, such as subtle facial expression changes and gesture continuity of characters in film and television works, more time is needed to overcome these challenges.

So, although the 1.0 version still faces many challenges in high-definition real-life videos, we will first try cartoon, science fiction, and empty shot styles. Currently, whether it's our product or other companies' products, we have not been able to do particularly well in high-definition real-life video production, and the success rate is still quite limited.

Synced: Where does the difficulty in generating real-life characters in videos manifest technically?

Dr. Meitao: It's not that different technologies are used for real-life and cartoon characters. Rather, with real-life characters the strengths and weaknesses of the technology are more obvious, and subtle flaws are more easily noticed.

For example, even a slight change in a person's eyes or nose will appear unnatural. Details of fingers in real-life characters, such as the thickness or number of fingers, must also be precise and should not have issues like six fingers or distorted hands.

Synced: If we are still at the 1.0 stage, can you share specifically what are the key technical challenges that need to be overcome at this stage?

Dr. Meitao: The main challenge we face is to improve the accuracy of detail generation. Currently, we need to improve not only the resolution but also the overall visual quality of the generation, including the detail processing of characters and IPs.

Secondly, from our perspective, even if we encounter some unresolved challenges in the 1.0 stage, we can still strive to move towards the 2.0 and 3.0 versions. This involves converting text into a script, then dividing the script into multiple segments, and finally splicing them into a longer video, such as 15 to 20 seconds.

In an upcoming paper, we propose our approach: starting from a short prompt, a large language model automatically generates a script or screenplay. Based on each segment of the script, a keyframe is generated, and then each keyframe is converted into a short video, which is finally spliced into a complete long video.

The benefit is that we can better ensure the consistency of characters and scenery across the long video, fixing the scenes within it so the content stays continuous. In effect, this solution bridges the transition from 1.0 to 2.0.

But if I can make 1.0 better, that's certainly a good thing. If we can do the segments from images to videos better, then the entire long video will naturally be done well. In other words, doing each small part of this process better will lead to an improvement in the final result.

Synced: What is the core difficulty in this process?

Dr. Meitao: The core difficulty lies first in generating scripts. Our system can automatically generate a matching script from a prompt. Secondly, it lies in the transformation from the script to images, especially in maintaining the consistency of objects (such as an IP of a mouse) in different shots.

For example, inputting "mouse" may generate different types of mice, such as hamsters or monsters, but we need to maintain their consistency. Finally, in the transformation from images to videos, we need to ensure the stability of the IP and its consistency with the script.

Synced: In commercial applications, what are the differences between 4-second and 15-second video production?

Dr. Meitao: Technically, the 4-second duration is currently indeed a bottleneck. But when we discuss with our business partners, we all agree that 4 seconds is just too short and not practical. That's why we tried to extend it to 15 seconds or even longer.

But at the same time, we also realize that if the content does not change, even if the duration increases, it is meaningless to users. Our system, similar to an agent, can analyze scripts, arrange shots, and ensure the high consistency of IPs through language and visual models. This is not only a technological innovation but also a direct response to customer needs.

In specific commercial applications, the application space for 4 seconds and 15 seconds is very different. For example, in short videos and web series, the usual duration is from a dozen seconds to a minute and a half, which 4 seconds cannot satisfy.

This means extending the timeline to make the work closer to a script or short play. If it matures to a certain extent, we can even produce short plays that are several minutes long. The reason why everyone is so focused on videos is that the imaginative space provided by video generation is much larger than that of simple images.

Synced: In the development of video generation, what experience background does your team have?

Dr. Meitao: Our team has accumulated over a decade of experience in video production, which is an important advantage.

For example, the earliest academic work on caption-to-video, "To Create What You Tell: Generating Videos from Captions," was a bold, forward-looking attempt by our team, published in the Brave New Idea track of ACM Multimedia 2017, a top multimedia conference.

We were one of the earliest research groups to enter the field of "vision and language." As early as 2016, we launched the "MSR-VTT" public evaluation dataset for "video and language." Today, there are already more than 500 institutions worldwide using this dataset, including top teams such as OpenAI and Google, and the related paper citations have exceeded 1,500.

We not only follow the traditional video production workflow but also combine our past experience in video generation with today's video AI techniques. In terms of innovation, ideas matter, but the real challenge is implementing them. At Microsoft we often said "ideas are cheap"; the key is whether you can implement them. So our goal is to be the first team to put these ideas into practice.

Synced: What are your expected goals and plans for video generation products in the future?

Dr. Meitao: Our Pixeling (Qianxiang) platform has supported a series of video generation features since September, including text-to-video, image-to-video, and stylized video editing. We will soon launch generation of multi-shot videos of around 15 seconds.

In the long run, we will continue to address detail issues such as gestures, facial recognition, continuity, and consistency between shots. The product we are launching today is not perfect, but user usage and feedback will help us improve the product. Our product strategy is to get users started and then iterate and improve based on feedback.

Already achieved the goal of surpassing Stable Diffusion

Synced: You previously mentioned the goal of surpassing Midjourney for image generation products and surpassing Stable Diffusion for basic models. How is the completion of this goal now?

Dr. Meitao: Six months ago, we set this goal, and we achieved it in November. Our basic model surpassed the latest version of Stable Diffusion at the end of October. Our model currently has 10 billion parameters, making it the largest single model in the field of visual generative models.

Synced: What standards can be used to verify this conclusion?

Dr. Meitao: We recently went through an evaluation in which the Chinese University of Hong Kong tested a benchmark of 3,200 prompts, divided into four categories, to compare the generation capabilities of different models.

Our model leads in two out of three main indicators, surpassing Midjourney V5, DALL-E 3, and Stable Diffusion XL.


In addition, we conducted an anonymous evaluation with 100 users and designers, combining objective and subjective evaluations, further confirming that our results are on par with those of Midjourney V5 and even surpass Midjourney V5 in many image categories.

Synced: In the process of pursuing this goal, what did you mainly do?

Dr. Meitao: In terms of the model, we made changes at two core levels:

First, we did not simply use CLIP for text encoding; building on CLIP, we trained our own text-encoding framework, which performs better than standard CLIP.

Second, we combined latent-code and pixel-level approaches within the diffusion model, pairing the coverage and efficiency of the former with the fine detail control of the latter to build our own diffusion model.
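
The interview only states that latent-code and pixel-level methods are combined. One common cascade with exactly that trade-off, assumed here purely for illustration and not confirmed as HiDream.ai's architecture, is a latent-diffusion base stage for efficient composition followed by a pixel-space refiner for detail:

```python
from typing import Any, Callable

Latent = Any
Image = Any
TextEmbedding = Any

def generate_image(
    prompt: str,
    encode_text: Callable[[str], TextEmbedding],           # custom CLIP-based text encoder
    latent_diffusion: Callable[[TextEmbedding], Latent],   # efficient, coarse composition
    decode_latent: Callable[[Latent], Image],              # VAE-style decoder back to pixels
    pixel_refiner: Callable[[Image, TextEmbedding], Image] # pixel-space diffusion for fine detail
) -> Image:
    """Latent-space pass for global layout, pixel-space pass for detail control."""
    text = encode_text(prompt)
    coarse = decode_latent(latent_diffusion(text))   # fast pass in latent space
    return pixel_refiner(coarse, text)               # slower pass directly on pixels
```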

In terms of data, we have built a data feedback loop. Monthly active users of our consumer product Pixeling are currently close to 20,000 and still rising. On the front end, each prompt returns four images and two videos, and which of these the user selects is itself a form of feedback for us.
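
A sketch of how such selection feedback might be logged and turned into preference pairs; the schema is hypothetical, not HiDream.ai's actual data pipeline:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FeedbackEvent:
    prompt: str
    candidate_ids: List[str]      # e.g. four images and two videos per prompt
    selected_id: str              # the candidate the user actually chose

@dataclass
class FeedbackLog:
    events: List[FeedbackEvent] = field(default_factory=list)

    def record(self, prompt: str, candidate_ids: List[str], selected_id: str) -> None:
        self.events.append(FeedbackEvent(prompt, candidate_ids, selected_id))

    def preference_pairs(self):
        """Yield (prompt, chosen, rejected) triples usable for preference training or reranking."""
        for e in self.events:
            for cid in e.candidate_ids:
                if cid != e.selected_id:
                    yield e.prompt, e.selected_id, cid
```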

In addition, we have a dedicated team of designers to evaluate the generated results and clean the data, which has a significant impact on our model training and performance improvement.

Synced: Six months ago you named three priorities: iterating the model to the 10-billion-parameter level, surpassing Midjourney in text-to-image, and preliminary validation of the seed product among small business customers. How has the last one gone?

Dr. Meitao: We have now developed two products: the designer tool "Pixeling" and the AI drawing tool "PixMaker" for e-commerce merchants.

In the design of the "Pionex" product, we not only achieve image generation but also provide image editing functions, allowing designers to intelligently expand, redraw, cut, and layout. We have recently launched the function of vector image generation and conversion.

In addition, we have established a community that allows image designers to share their work. We have formed a concept that integrates tools, content, and community. In the future, we also plan to develop more features to better meet the needs of designers.

Our basic model is iterated approximately every two weeks, with a major iteration every one to two months. In terms of functionality, we basically go live with small features every week, so the iteration speed of our product is very fast.

In terms of commercialization, in the less than two months from October to November, our monthly active users have grown to nearly 20,000, and paying users have passed one thousand. This is very important for us: we need to know who the users are, why they come to us, and why they pay. Their willingness to pay is crucial, especially on the consumer side.

"PixMaker" is an AIGC product for the e-commerce scene, which can support using AI to replace photography, batch generate high-quality product images, and replace models and scenes with one click. It can help merchants achieve a more than 5x increase in efficiency and more than 80% cost reduction without the risk of copyright infringement.

In this field, we have already signed contracts with more than a dozen cross-border e-commerce companies, which is quite fast. We hope to become a leading company in the field of AIGC tools for cross-border e-commerce images next year.

Synced: Similar to "PixMaker," Baidu and Alibaba also have similar tools. What are your differences and advantages?

Dr. Meitao: First, we are more focused on the visual aspect, while they may be more focused on language. The closed-source model we use is significantly superior to open-source models in terms of the fineness and realism of image generation. Our model is at least one to two generations ahead of open-source models.

In addition, our product performs better in terms of refinement, usability, controllability, and versatility. We can ensure highly accurate control of SKU product images for e-commerce, achieving 99% to 100% accuracy. Our model can also adapt to a large number of different SKU categories, demonstrating good versatility and reusability.

Customers usually use their real data to evaluate our product. Although the AIGC field is relatively new, customer needs may not be fully met at once, but we have been improving and optimizing in this field for about three months.

We now cover most categories well, such as fast-moving consumer goods in cross-border e-commerce, and are working to improve handling of more challenging categories, such as clothing and wigs.

Synced: Can you give an example of how the fast-moving consumer goods industry in cross-border e-commerce uses your product?

Dr. Meitao: Of course. The fast-moving consumer goods industry in cross-border e-commerce has at least three requirements for our tools: universality, usability, and controllability.

In terms of universality, suppose a cross-border e-commerce business has 100,000 SKUs; the merchant wants our tool to cover as many product categories as possible, including both standard and non-standard products.

In terms of usability, for the task of generating product images for these 100,000 SKUs, the cost needs to be reduced. If the customer needs to select one image from 100 generated images for each product, the cost is clearly too high. On the other hand, if we can provide 80 usable images out of 100 generated images, it demonstrates high usability, as it significantly reduces costs and improves efficiency.

Lastly, controllability. For example, if the customer provides an image of a bottle of Evian water, the generated image cannot be of another brand. This means we must ensure precise matching of the products. For the e-commerce industry, these three aspects are complementary and crucial, whether in terms of cost, coverage, or usability.
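
A small worked example tying these numbers together: the 80-out-of-100 usable rate above, plus the 30-50 RMB studio cost and the "more than 10x" saving quoted later in this interview; the arithmetic itself is just illustration.

```python
# Usability arithmetic using figures quoted in this interview.
generated, usable = 100, 80
usable_rate = usable / generated                       # 0.8

studio_cost_rmb = (30, 50)                             # per studio-shot SKU image
saving_factor = 10                                     # "reduced by more than 10 times"

# At a 10x saving, the all-in cost per *usable* AI image works out to 3-5 RMB.
cost_per_usable = tuple(c / saving_factor for c in studio_cost_rmb)

# Since only 80% of generations are usable, the budget per *generated* image
# must be lower still: cost_per_generated = cost_per_usable * usable_rate.
cost_per_generated = tuple(c * usable_rate for c in cost_per_usable)

print(f"usable rate: {usable_rate:.0%}")                      # 80%
print(f"cost per usable image:    {cost_per_usable} RMB")     # (3.0, 5.0)
print(f"cost per generated image: {cost_per_generated} RMB")  # (2.4, 4.0)
```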

Regarding specific applications, for example, an e-commerce customer provides a white background image of a SKU, which is their product. They may want us to generate marketing images or product main images based on this white background image.

They may want to specify certain scenes, such as on a beach, in a city street, or on a kitchen countertop. Our task is to display this SKU in any desired location based on the customer's input prompts. The challenge is that the generated image must remain consistent with the original product to prevent buyers from finding discrepancies between the product and the seller's description.

The second point is integrating the product naturally with the background. The color, lighting, spatial layout, logic, and even the visual aesthetics of the whole image should meet, or at least come close to, a professional photographer's standard; otherwise the image will not stimulate the buyer's desire to purchase. So these aspects all carry real requirements.
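
A common pattern for this kind of product-preserving scene generation, assumed here for illustration rather than taken from PixMaker, is to segment the product, keep its pixels fixed, and inpaint only the background from the scene prompt, followed by a harmonization pass for lighting and color:

```python
from typing import Any, Callable

Image = Any
Mask = Any

def product_scene_image(
    white_bg_image: Image,
    scene_prompt: str,                                   # e.g. "on a kitchen countertop"
    segment_product: Callable[[Image], Mask],            # mask = 1 on product pixels
    inpaint_background: Callable[[Image, Mask, str], Image],
    harmonize: Callable[[Image, Mask], Image],           # match lighting/color at the seam
) -> Image:
    """Generate a marketing scene around a SKU while leaving the product pixels untouched."""
    mask = segment_product(white_bg_image)
    composed = inpaint_background(white_bg_image, mask, scene_prompt)  # product pixels frozen
    return harmonize(composed, mask)                                   # photographer-style blend
```

Freezing the product pixels is what keeps the generated listing image consistent with the physical item, which is the controllability requirement described above.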

The same principles apply to clothing products. For example, a merchant often needs a regular salesperson or professional model to create finished images, such as models of different skin tones, appearances, and body types. Because they are engaged in cross-border e-commerce, they need to consider the sales location, which may involve different skin tones, such as Caucasian, African, and Asian populations.

In this case, we need to generate model images based on the user's input—showing the user's employees or models wearing specific clothing in different scenes such as beaches, city streets, or indoors, with different skin tones, ages, and expressions. Users can access our services or use our online signing solutions. This way, they can generate product main images for marketing or promotion.

Synced: By using this technology, how much have their costs been reduced?

Dr. Meitao: Costs have been reduced by about 10 times, or even more. The cost of traditional commercial photography is currently very high. For example, if shooting SKU setups in a typical studio, the cost of one image is about 30 to 50 RMB. If models are involved, the cost may reach several hundred RMB. By using our services, the cost has been reduced by more than 10 times.

If usability is improved, costs will be further reduced. Our goal is to reduce costs while also improving efficiency. For example, what used to require the work of 10 designers can now be accomplished by fewer designers, allowing them to handle more product categories.

Synced: What is the current level of controllability of the product?

Dr. Meitao: Our controllability over products, especially clothing products, can reach nearly 100% reliability. Although in some cases, such as group shots, there may be slight overlaps or minor differences, these differences are almost imperceptible to the user. Therefore, our controllability is close to 100%, although we occasionally encounter some particularly difficult cases.

Synced: What is the expected revenue for this area?

Dr. Meitao: We have currently signed contracts with more than a dozen clients, with roughly two dozen more in the testing phase who are close to signing. Next year we hope to attract one to two hundred e-commerce clients, mainly mid-sized and large ones. We also value small clients, since there are roughly hundreds of thousands of such brands nationwide. Our goal is tens of millions of RMB in revenue next year.

Synced: Regarding "Pionex," you mentioned six months ago that your strategy in the market competition is to integrate the text-to-image tool into the designer's workflow. What are your specific practices?

Dr. Meitao: Our target groups include professional designers and a broad base of design enthusiasts: self-media creators, students with design needs, and even company staff responsible for corporate promotion.

Our strategy is to understand and meet their needs through horizontal and vertical methods. Horizontally, we identify and meet the common needs of designers, such as intelligent editing, intelligent redraw, and layered layout. Vertically, we develop more in-depth product collections for specific industries (such as e-commerce), including automatic cutouts, image compositing, lighting effects processing, vector image conversion, and deterministic IP requirements. This way, we not only meet broad basic needs but also delve into the professional needs of specific industries.

Synced: So, between "Pionex" and "PixMaker," which one is the focus in terms of commercialization?

Dr. Meitao: Our main goal is to serve consumer users, namely the broad base of design enthusiasts, while continuing to attend to the business market. So our focus is firmly on tools that meet designers' image output needs, along with video and image editing needs.

In terms of commercialization, we use a subscription model, where users can choose to pay monthly or annually. In China, we will focus on e-commerce, education, and other tracks, and develop together through ecological cooperation. In addition, we will not be limited to the domestic market; we plan to expand our consumer-facing business overseas.

Synced: What are the specific plans for going global?

Dr. Meitao: We are still exploring and experimenting with the process of going global. We will start by promoting our products in foreign communities and then follow cross-border e-commerce to go global.

Synced: How do you see the differences in competition between the domestic and foreign markets?

Dr. Meitao: Judging from our recent exchanges with some startup teams in Silicon Valley, the competitive environment in foreign markets is much healthier than in the domestic market. Foreign markets offer more room, are more globalized, and have users who are more willing to pay. There is also less homogeneous competition and a better-developed ecosystem.

Synced: If you go global, how do you view direct competition with MJ?

Dr. Meitao: I believe we are more advanced than Midjourney. We place more emphasis on interaction design and aim to achieve a deeper, progressive user interaction experience. We have our own independent platform and plan to combine large language models for better interactive design in image and text understanding.

Synced: What is the expected scale of the consumer-facing business?

Dr. Meitao: By our estimates, there are roughly 100 million core designers worldwide, about 20 million of them in China, and over 80% of them need visual design capabilities. That does not include the much broader base of design enthusiasts, such as self-media creators, corporate HR staff, and small e-commerce store owners, who also have practical needs, so the overall market is even larger.

Dr. Meitao: Currently, the wave of investment in large language models and AIGC has largely run its course. Commercialization may lag slightly, because the real-world deployment of large language models depends on applications.

As things stand, large language models have not seen a large-scale breakthrough this year. Although many companies are building language models and AIGC products, their commercial scale remains relatively small and has not met expectations.

A commercial explosion takes time. We expect the commercialization and application explosion of large language models to come next year. For AIGC, we likewise expect large-scale commercialization next year, or at least breakouts in narrower areas.

Synced: What might the scale of this explosion look like?

Dr. Meitao: It's difficult to define this scale because it involves some core data. Publicly available data currently shows that even in leading enterprises, the usage of large language models may not be particularly high.

Companies like Microsoft and OpenAI may still have considerable usage. But even their market revenue scale is not particularly high in the overall North American market. So, I think the explosion might mean a usage level similar to the search volume. For example, if the daily search volume is 1 billion, then the usage of AI-generated content (AIGC) should not be significantly lower than this level.

Synced: What might the commercial explosion look like?

Dr. Meitao: The commercial explosion might mean significant breakthroughs in various fields.

The commercialization of large language models ultimately needs to be adjusted for different fields because each field has different requirements. If there are major innovations in each field that drive the development of underlying large models, it will be a prosperous phenomenon.

Currently, we see various companies constantly releasing different models, but we have not yet seen these models reflected in actual applications. Therefore, the entire ecosystem still needs time to build.

Synced: What might be the reasons for the lack of large-scale applications?

Dr. Meitao: Currently, large models are actually difficult to apply, and their actual implementation costs are very high. For example, when people use large models, it's like a child wanting to climb a high mountain or enter a large forest without finding a clear path.

Therefore, there needs to be a bridge between humans and AGI or large language models, which could be an application or possibly an AI Agent in the future. Without such an intermediary layer, our language models and visual models cannot fully function or create value. So, building this layer is a task that requires the efforts of many people.

Synced: You mentioned the AIGC field, and that many companies have fallen short of expectations in commercialization this year. Why is that?

Dr. Meitao: I don't think there is yet a definitive answer to large-scale commercialization in the AIGC field. The main reason is that visual content generation still faces challenges of uncertainty, limited controllability, and the refinement of detail.

One of the reasons for the low usage at present is that the tools are not yet perfect. Users at the application level need more diverse tools to help them better realize their creativity, such as writing effective prompts.

Current AIGC tools can meet users' creative generation needs, but that covers only 10% to 20% of a designer's overall production chain. What really needs to be done is to go deeper downstream, into material collection, editing and refinement, and final delivery, which account for 70% to 80% of the design process.

Currently, AIGC cannot delve into these aspects, but this does not mean it cannot do so in the future. We need to design better interactions and innovative gameplay to truly integrate AIGC into the workflow, bringing convenience to users, which will be a huge development space in the future.

Synced: What is the expected timeline for AIGC video technology to mature?

Dr. Meitao: Personally, I think there may be some excellent results by the end of 2024. Because we may already be in the early stages of a technological breakthrough, I believe 2024 will be an important year. By then, our technological investment, especially in the video aspect, will increase.

Synced: What are the overall strategic goals going forward?

Dr. Meitao: Next year, we will focus on video. In terms of commercialization, we have three strategic goals: the first is to establish tools and a community for consumer designers and make some breakthroughs in monthly active users and paying users.

The second goal is to become a leader in the B2B e-commerce field and a leading provider of cross-border e-commerce AIGC tools.

The third goal is innovation, especially in video. Currently, to be honest, no company in the world has shown particular maturity in video, and the technology is still iterating, and commercialization is still being explored.
