Source: Guangzhou Intelligent
Image Source: Generated by Wujie AI
The era of AI has entered a new phase as large models are now making their way into mobile phones, extending the reach of AI from the cloud to mobile terminals.
"On entering the AI era, Huawei's PanGu large model will help boost the HarmonyOS ecosystem," said Yu Chengdong, Executive Director and CEO of Huawei's Terminal BG and Intelligent Automotive Solutions BU, on August 4. He explained that the underlying technology of the PanGu large model has enabled Harmony OS to bring the next-generation intelligent terminal operating system.
Using large models on mobile phones is not a new concept. Previously, apps and mini-programs such as ChatGPT, Wenxin Yiyan, and Miaoya served the AI needs of mobile users by calling on cloud computing power.
The next step is to run large models directly on mobile phones.
Since April and May of this year, leading American tech giants - Qualcomm, Microsoft, and NVIDIA - along with the closely watched AI newcomer OpenAI and leading domestic AI companies such as Tencent and Baidu, have all accelerated the deployment of lightweight large models on mobile terminals. Qualcomm has even announced its gradual transformation into a company providing intelligent edge computing services at data sources such as mobile terminals.
Driven by the combined efforts of these tech giants, the industrial trend of large models transitioning from the cloud to the edge has become increasingly clear.
Why should large models "run" on mobile phones?
The most significant feature of large models is their size, often reaching tens of billions, hundreds of billions, or even trillions of parameters. So why should these large models be "squeezed" into palm-sized mobile phones?
Large models can genuinely enhance the user experience on mobile phones. Huawei's terminal intelligent assistant Xiaoyi, for example, can not only recommend restaurants from voice prompts but also handle summarization, information retrieval, and multilingual translation. With large model capabilities, a smartphone assistant can summarize a lengthy English text and translate the summary into Chinese, which is particularly valuable for learning and work efficiency in an era of information explosion.
Jia Yongli, President of Huawei's Terminal BG AI and Intelligent Full-Scene Business Department, explained that large language models have the ability to generalize, which helps smartphone intelligent assistants understand users better. In addition, the plugin capabilities of large models can break down barriers between applications on the phone and extend what the assistant can do with external tools.
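To illustrate the plugin idea in the simplest possible terms, here is a minimal Python sketch of a tool-calling loop. Everything in it - the tool names, the call_llm stub, and the JSON protocol - is a hypothetical stand-in for illustration, not Huawei's actual implementation.

```python
import json

# Hypothetical registry of app capabilities ("plugins") the assistant can invoke.
TOOLS = {
    "get_weather": lambda city: f"Sunny, 28°C in {city}",
}

def call_llm(prompt: str) -> str:
    """Stand-in for the on-device model: returns a JSON tool-call for a
    fresh question, or a plain-text reply once a tool result is present."""
    if "Tool returned:" in prompt:
        return "It's sunny and 28°C in Shenzhen right now."
    return json.dumps({"tool": "get_weather", "args": {"city": "Shenzhen"}})

def assistant_turn(user_message: str) -> str:
    response = call_llm(user_message)
    try:
        request = json.loads(response)           # model asked for a tool
    except json.JSONDecodeError:
        return response                          # plain answer, no tool needed
    result = TOOLS[request["tool"]](**request["args"])
    # Feed the tool output back so the model can phrase the final reply.
    return call_llm(f"User asked: {user_message}\nTool returned: {result}")

print(assistant_turn("What's the weather like?"))
```

The point of the pattern is that the model never touches the apps directly: it only emits structured requests, and the system routes them to whichever application exposes the matching capability.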
Furthermore, AI applications such as ChatGPT have long been dogged by privacy and security controversies. Running them entirely on the edge sidesteps the issue, since data never leaves the device. Responses also come back faster.
On the other hand, the demand for large models on mobile and other edge devices has become increasingly urgent.
The surge of large models has made it increasingly difficult for the cloud alone to handle the demand for computing power. Qualcomm's Senior Vice President, Alex Katouzian, recently stated, "With the accelerated growth of connected devices and data traffic, coupled with the rising costs of data centers, (we) cannot send all content to the cloud."
Beyond the substantial costs of data transmission, network bandwidth, storage, and hardware, cloud computing power is already straining the companies involved: the inference stage of ChatGPT alone is estimated to cost around $10 million per month in computing power.
An even bigger issue than cost, though, is scarcity.
Even OpenAI founder Sam Altman has admitted to a GPU shortage, going so far as to say he wished fewer people used ChatGPT. Industry insiders recently speculated that the large-scale H100 cluster capacity of cloud providers large and small is close to exhaustion, and that demand for the H100 will persist at least through the end of 2024. NVIDIA's H100 production capacity, meanwhile, remains severely constrained by the supply chain.
Therefore, cloud-edge collaboration that taps the idle computing resources of phones and other edge devices has become a clear trend for cutting costs and raising efficiency in large model development. More importantly, compared with a limited number of central nodes, the vast number of mobile terminals act as "capillaries" reaching thousands of scenarios, making them a key entry point for accelerating the penetration of large models.
How to "fit" large models into pockets?
"Compared to traditional PCs or servers, the biggest challenge for mobile terminals is how to balance experience and energy consumption. This is one of the most important core points in the design of the HarmonyOS kernel," emphasized Gong Ti, President of Huawei's Terminal Business Software Department.
Large models demand heavy computing and storage resources, especially given the current hardware configuration of mobile phones, so the software system must coordinate those resources to raise efficiency while cutting energy consumption.
To deliver performance today, a phone needs at least eight chip cores working together, and coordinating them consumes significant computing power in itself. Heterogeneous resource scheduling can coordinate the CPU, GPU, and NPU efficiently; Gong Ti said this can raise scheduling efficiency by more than 60%.
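As a rough illustration of what heterogeneous scheduling means, the toy Python dispatcher below routes each task to the processor class best suited for it. The task types and device table are invented assumptions, not HarmonyOS internals.

```python
# Toy heterogeneous scheduler: send each task to the best-suited processor
# instead of funneling everything through the CPU. Purely illustrative.
BEST_DEVICE = {
    "matrix_multiply": "NPU",   # dense tensor math for model inference
    "image_decode":    "GPU",   # massively parallel pixel work
    "ui_event":        "CPU",   # branchy, latency-sensitive control flow
}

def schedule(tasks):
    queues = {"CPU": [], "GPU": [], "NPU": []}
    for task_type, payload in tasks:
        queues[BEST_DEVICE.get(task_type, "CPU")].append(payload)
    return queues

print(schedule([("matrix_multiply", "attention layer"),
                ("image_decode", "camera frame"),
                ("ui_event", "tap on screen")]))
```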
The smallest unit of computation and scheduling in a mobile operating system is the thread. In a traditional operating system, tens of thousands of threads often run simultaneously, many of them doing no useful work. A more lightweight concurrency model can handle these concurrent operations instead, cutting the computing power wasted on switching between idle threads; according to Gong Ti, such a model can save 50% of task-switching overhead.
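The contrast between OS threads and a lightweight concurrency model can be seen in miniature with Python's coroutines, which are switched cooperatively in user space rather than by the kernel. This toy benchmark only illustrates the direction of the effect; the 50% figure above is Huawei's, measured on its own system.

```python
import asyncio
import threading
import time

N = 10_000

def worker():
    pass  # trivial body: only creation and switching overhead matters

# OS threads: each needs a kernel object and a full context switch.
start = time.perf_counter()
for _ in range(N):
    t = threading.Thread(target=worker)
    t.start()
    t.join()
print(f"threads:    {time.perf_counter() - start:.3f}s")

# Coroutines: lightweight tasks switched in user space by the event loop.
async def aworker():
    pass

async def main():
    await asyncio.gather(*(aworker() for _ in range(N)))

start = time.perf_counter()
asyncio.run(main())
print(f"coroutines: {time.perf_counter() - start:.3f}s")
```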
Task scheduling within the operating system is another fundamental element of a smooth user experience. Dynamic priority scheduling, unlike fair scheduling, can significantly reduce energy consumption: it works like an intelligent traffic system that adjusts signal lights to road conditions and traffic flow, easing congestion and delays when traffic surges in one direction.
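A toy version of dynamic priority scheduling, assuming made-up priority values and task names, might look like the following: background work waits in a priority queue, and a user interaction boosts the UI-facing task so it runs first.

```python
import heapq

# Min-heap of (priority, task): lower number = runs sooner.
# All priorities and task names here are invented for illustration.
queue = [(50, "background sync"), (50, "log upload"),
         (50, "render animation frame")]
heapq.heapify(queue)

def boost_interactive(queue, boost=40):
    """On user input, dynamically raise the priority of UI-facing tasks."""
    boosted = [(p - boost if "render" in name else p, name)
               for p, name in queue]
    heapq.heapify(boosted)
    return boosted

queue = boost_interactive(queue)          # the user just touched the screen
while queue:
    priority, name = heapq.heappop(queue)
    print(f"run (priority {priority}): {name}")
```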
However, simply upgrading and improving the mobile operating system is not enough to deploy large models on mobile phones.
As large models grow deeper and more accurate, the memory consumed by neural networks has become a core issue, and memory bandwidth is a concern as well. While a network runs, it rapidly drains memory, CPU, and battery, a significant burden for current phones.
Therefore, before large models can be deployed on mobile phones, they must be compressed to reduce the computing power needed for inference while keeping the original performance and accuracy largely intact.
Quantization is a common and important compression technique that reduces the memory a model occupies and improves inference performance. In essence, it converts a floating-point model into an integer model: integer operations run faster and use less memory than floating-point operations, at the cost of some numerical precision.
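To make the operation concrete, here is a minimal NumPy sketch of symmetric post-training quantization from FP32 to INT8. Production toolchains (per-channel scales, calibration data, INT4 packing, quantization-aware training) are considerably more sophisticated.

```python
import numpy as np

# Symmetric quantization: map FP32 weights onto INT8 with one shared scale,
# trading a little precision for 4x less memory per weight.
weights = np.random.randn(4, 4).astype(np.float32)

scale = np.abs(weights).max() / 127.0                      # FP32 units per INT8 step
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequant = q_weights.astype(np.float32) * scale             # approximate reconstruction

print("memory:", weights.nbytes, "->", q_weights.nbytes, "bytes")   # 64 -> 16
print("max abs error:", np.abs(weights - dequant).max())
```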
Quantization technology is advancing rapidly as well. Models trained on servers generally use 32-bit floating point (FP32), but Qualcomm has already compressed FP32 models to INT4 on mobile devices, which it says delivers a 64-fold improvement in memory and computational efficiency. Qualcomm's implementation data show that with its quantization-aware training, many AIGC models can be quantized to INT4, improving performance by roughly 90% and energy efficiency by roughly 60% compared with INT8.
Large model compression technology is undoubtedly a key weapon for the AI giants battling over the mobile terminal arena. That partly explains why NVIDIA "quietly" acquired OmniML, an AI startup specializing in compressing large models, in February this year.
Large models driving terminal hardware upgrades
"This year, we will be able to support the operation of generative AI models with up to 10 billion parameters on mobile phones," said Ziad Asghar, Senior Vice President of Product Management and AI at Qualcomm, recently. Models with 10 billion to 15 billion parameters can cover the vast majority of AIGC use cases. If the terminal can already support this parameter level, all computations can be performed on the terminal, making the phone a true personal assistant.
However, current next-generation flagship mobile chips can only handle models of up to about 1 billion parameters, and the large model Qualcomm demonstrated running on Android at the Conference on Computer Vision and Pattern Recognition (CVPR) in June this year had only 1.5 billion parameters.
With parameter counts set to jump nearly tenfold, large models headed for mobile terminals have stepped on the accelerator, forcing phones to upgrade quickly to keep up.
Mobile hardware urgently needs innovation in AI accelerators and memory.
Firstly, larger models require more memory and storage space to store model parameters and intermediate results. This necessitates an upgrade in the memory chip capacity and memory interface bandwidth of mobile terminals.
Secondly, larger parameters inevitably require more powerful computing and inference capabilities to process input data and output results.
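On the memory side, a back-of-the-envelope calculation shows the scale of the problem. Weight storage grows linearly with parameter count and precision; using Qualcomm's 10-billion-parameter target from above:

```python
# Weight memory for a 10B-parameter model at different precisions
# (weights only; activations and caches come on top of this).
params = 10_000_000_000
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {params * bits / 8 / 2**30:.1f} GiB")
# FP32: 37.3 GiB  FP16: 18.6 GiB  INT8: 9.3 GiB  INT4: 4.7 GiB
```

Only the INT4 figure plausibly fits alongside the 12-16 GB of RAM in today's flagship phones, which is why quantization and memory upgrades have to advance together.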
Although AI accelerators on mobile chips (such as various NPU IPs) are almost standard, they are primarily designed for the previous generation of convolutional neural networks and are not entirely tailored for large models.
To adapt to large models, AI accelerators must have greater memory access bandwidth and reduced memory access latency. This requires some changes to the interface of AI accelerators (such as allocating more pins to the memory interface) and corresponding changes to on-chip data interconnects to meet the memory access needs of AI accelerators.
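Why bandwidth specifically: during autoregressive generation, producing each token requires streaming essentially all of the model's weights from memory, so memory bandwidth, not raw compute, often caps token throughput. A rough upper-bound estimate follows, with both numbers below being illustrative assumptions rather than device specifications.

```python
# Roofline-style bound: tokens/s <= memory bandwidth / bytes read per token,
# and one decode step reads roughly the whole weight set once.
model_bytes = 10_000_000_000 * 4 / 8     # assume 10B params at INT4 = 5 GB
bandwidth_bytes_per_s = 50e9             # assume ~50 GB/s LPDDR5-class memory
print(f"upper bound: {bandwidth_bytes_per_s / model_bytes:.0f} tokens/s")  # ~10
```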
One important reason Qualcomm can claim it will "run 10 billion parameters on mobile phones within the year" is the Snapdragon 8 Gen 2 processor, which features Qualcomm's fastest and most advanced AI engine to date: 4.35 times the AI performance of the Snapdragon 8 Gen 1, with a 60% improvement in energy efficiency.
Yet even in the cloud, training and inference for super-large-parameter models still need to break through five barriers: the memory wall, the computing power wall, the communication wall, the tuning wall, and the deployment wall. Mobile phones will have to break through them one by one as well.
Nevertheless, for mobile phones, the transition from "smart" to "artificial intelligence" presents more opportunities than challenges.
"The impact of the innovation cycle on consumer electronics is more important and can even lead an industry out of the economic cycle," said Zhao Ming, CEO of Honor Terminal, judging that the current smartphone industry is in a new round of innovation driven by AI, 5G, and more.