In-depth Analysis of Decentralized AI Inference: How to Achieve a Balance Between Security and Performance


Original Author: Anastasia Matveeva - Co-founder of Gonka Protocol

Table of Contents

  • True "Decentralization"
  • Blockchain and Inference Verification
  • How It Works
  • Trade-offs Between Security and Performance
  • Optimization Space

When we started building Gonka, we had a vision: what if anyone could run AI inference and get rewarded for it? What if we could harness all the unused computing power instead of relying on expensive centralized providers?

The current AI landscape is dominated by a few large cloud service providers: AWS, Azure, and Google Cloud control most of the global AI infrastructure. This centralization brings serious issues that many of us have experienced firsthand. A few companies controlling AI infrastructure means they can set arbitrary prices, censor unwanted applications, and create single points of failure. When OpenAI's API goes down, thousands of applications crash. When AWS experiences an outage, half the internet comes to a halt.

Even the "efficient" frontier of the field is not cheap. Anthropic has said that training Claude 3.5 Sonnet cost "tens of millions of dollars"; Claude Sonnet 4 is now widely available, but Anthropic has not disclosed its training cost. Its CEO, Dario Amodei, has previously predicted that training costs for frontier models would approach $1 billion, with the next wave of models running into the billions. Running inference on these models is not cheap either: serving LLM inference for a moderately active application can cost hundreds to thousands of dollars per day.

Meanwhile, there is a vast amount of computing power in the world that is idle (or used in meaningless ways). Think of those Bitcoin miners consuming electricity to solve worthless hash puzzles or data centers running below capacity. What if this computing power could be used for something truly valuable, like AI inference?

A decentralized approach can pool computing power, lower capital barriers, and reduce single-vendor bottlenecks. Instead of relying on a few large companies, we can create a network where anyone with a GPU can participate and earn rewards by running AI inference.

We know that building a viable decentralized solution will be very complex. From consensus mechanisms to training protocols to resource allocation, there are countless parts that need to be coordinated. Today, I want to focus on one aspect: running inference for specific LLMs. How difficult is this?

What Is True "Decentralization"?

When we talk about decentralized AI inference, we refer to something very specific. It is not just about running AI models on multiple servers; it is about building a system that allows anyone to join, contribute computing power, and be rewarded for honest work.

A key requirement is that the system must be trustless. This means you do not have to trust any individual or company to run the system correctly. If you want strangers on the internet to run your AI model, you need cryptographic guarantees to ensure they are indeed doing what they claim to be doing (with at least a sufficiently high probability).

This trustless requirement has some interesting implications. First, the system needs to be verifiable: you must be able to prove that a given output was produced by the claimed model and parameters. This is particularly important for smart contracts that need to verify the validity of the AI responses they receive.

But there is a challenge: the more verification you add, the slower the entire system becomes, as the network's computing power is consumed in verification. If you fully trust everyone, there is no need to verify inference, and performance is nearly on par with centralized providers. But if you trust no one and must verify everything, the system becomes very slow and cannot compete with centralized solutions.

This is the core contradiction we have been working to solve: finding the right balance in the trade-off between security and performance.

Blockchain and Inference Verification

So, how do you actually verify that someone ran the correct model and parameters? Blockchain becomes an obvious choice—though it has its own challenges, it remains the most reliable way we know to create immutable event records.

The basic idea is quite straightforward. When someone runs inference, they need to provide proof that they used the correct model. This proof is recorded on the blockchain, creating a permanent, tamper-proof record that anyone can verify.
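One common way to keep such a proof compact is to put only a digest of the inference record on-chain. As a minimal sketch (Gonka's actual artifact format is discussed later and is richer than this), a commitment might look like:

```python
import hashlib
import json

def inference_commitment(model_id: str, params_hash: str, prompt: str, output: str) -> str:
    """Hash the full inference record so only a small, tamper-evident digest
    needs to be stored on-chain. Field names here are illustrative."""
    record = json.dumps(
        {"model": model_id, "params": params_hash, "input": prompt, "output": output},
        sort_keys=True,
    )
    return hashlib.sha256(record.encode("utf-8")).hexdigest()
```

Anyone who later obtains the same record can recompute the digest and check it against the on-chain entry.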

The problem is that blockchains are slow. Really slow. If we try to record every step of inference on-chain, the massive data volume would quickly cripple the network. This was a constraint that drove many decisions when designing the Gonka network.

When designing the network and thinking about distributed computing, there are various strategies to choose from. Should you shard the model across multiple nodes, or keep the complete model on a single node? The main constraints are network bandwidth and blockchain speed. To make our solution feasible, we chose to fit a complete model on a single node, although this may change in the future. This does raise the minimum requirements for joining the network, since each node needs enough compute and memory to run the entire model. Nevertheless, a model can still be sharded across multiple GPUs belonging to the same node, which gives us some flexibility within the single-node constraint. We use vLLM, which lets us tune tensor- and pipeline-parallelism parameters for optimal performance.
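For example, a participant whose ML node has four GPUs could shard a model across them with vLLM's tensor parallelism. The model name and parallelism degrees below are placeholders, not network requirements:

```python
from vllm import LLM, SamplingParams

# Shard one model across the GPUs of a single ML node.
# tensor_parallel_size splits each layer across GPUs; pipeline_parallel_size
# splits the layer stack into sequential stages. Values here are illustrative.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    tensor_parallel_size=4,
    pipeline_parallel_size=1,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain decentralized inference in one paragraph."], params)
print(outputs[0].outputs[0].text)
```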

How It Works

So we settled on this: each node hosts a complete model and runs full inference, avoiding any coordination across multiple machines during the actual computation. The blockchain is used only for record-keeping: we record only the transactions and the artifacts needed for inference verification. The actual computation happens off-chain.

We want the system to be decentralized, with no single central point directing inference requests to network nodes. In practice, each participant deploys at least two nodes: one network node and one or more inference (ML) nodes. The network node is responsible for communication (it includes a chain node connected to the blockchain and an API node managing user requests), while the ML nodes execute LLM inference.

When an inference request arrives at the network, it reaches one of the API nodes (acting as a "transport agent"), which randomly selects an "executor" (an ML node from a different participant). To save time and parallelize the blockchain record with the actual LLM computation, the transport agent (TA) first sends the input request to the executor and records the input on-chain while the executor's ML node is running inference. Once the computation is complete, the executor sends the output to the TA's API node, while its own chain node records a verification artifact on-chain. The TA's API node then returns the output to the client and also records it on-chain. Of course, these records still contribute to overall network bandwidth limitations.
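A rough sketch of the transport agent's side of this flow is below. All names are hypothetical placeholders, not Gonka's actual API; the point is that the on-chain input record runs concurrently with the executor's computation:

```python
import asyncio

async def handle_inference(request, chain, executors):
    """Hypothetical transport-agent flow: start the executor and the on-chain
    input record in parallel, then return the output before verification."""
    executor = executors.pick_random()  # an ML node owned by a different participant

    # Kick off the LLM computation and the on-chain input record concurrently,
    # so blockchain latency does not delay the start of inference.
    compute_task = asyncio.create_task(executor.run_inference(request))
    record_input_task = asyncio.create_task(chain.record_input(request))

    output = await compute_task
    await record_input_task

    # The executor's chain node records its verification artifact separately;
    # the TA returns the output to the client and records it on-chain.
    await chain.record_output(request.id, output)
    return output
```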

As you can see, the blockchain records neither slow down the start of inference computation nor delay the time it takes to return the final result to the client. The verification of whether the inference was completed honestly occurs afterward, in parallel with other inferences. If the executor is found to be cheating, they will lose rewards for the entire epoch, and the client will be notified and refunded.

The final question is: what do the artifacts contain, and how often do we verify inference?

Trade-offs Between Security and Performance

The fundamental challenge is that security and performance are in conflict with each other.

If you want maximum security, you need to verify everything. But this is both slow and expensive. If you want maximum performance, you need to trust everyone. But this comes with risks and exposes you to various attacks.

After some trial and error and parameter tuning, we found an approach that balances these two considerations. We carefully tune how much to verify, when to verify, and how to make the verification process as efficient as possible. Too much verification and the system becomes unusable; too little and it becomes insecure.

Keeping the system lightweight is crucial. We keep the artifacts small by storing only the probabilities of the next k tokens. We use them to estimate the likelihood that a given output was indeed generated by the claimed model and parameters, catching tampering, such as substituting a smaller or quantized model, with sufficiently high confidence. We will detail the inference verification process in another article.
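As a rough illustration of the idea (not the exact statistical test Gonka uses), a verifier holding the same model and parameters can recompute the token probabilities for the claimed output and compare them against the artifact:

```python
def artifact_matches(claimed_logprobs, recomputed_logprobs, tolerance=0.5):
    """Compare the per-token log-probabilities stored in the executor's artifact
    with those the verifier recomputes using the claimed model and parameters.

    A smaller or quantized model yields systematically different distributions,
    so a large average gap flags likely tampering. The metric and threshold
    here are illustrative only."""
    if len(claimed_logprobs) != len(recomputed_logprobs):
        return False
    gap = sum(abs(a - b) for a, b in zip(claimed_logprobs, recomputed_logprobs))
    return gap / len(claimed_logprobs) <= tolerance
```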

At the same time, how do we decide which inferences to verify and which to skip? We chose a reputation-based approach. When a new participant joins the network, their reputation starts at 0 and 100% of their inferences are verified by at least one other participant. If issues are found, a consensus mechanism ultimately determines whether the inference passes or the participant's reputation is lowered, potentially removing them from the network. As reputation grows, the share of inferences that need verification shrinks, eventually down to a random 1% sample. This dynamic approach lets us keep the overall verification rate low while still reliably catching those who attempt to cheat.
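A hypothetical sketch of such a schedule, with a verification rate that decays from 100% for new participants toward a 1% floor as reputation grows (Gonka's actual curve and constants may differ):

```python
import math
import random

def verification_probability(reputation: float, floor: float = 0.01, decay: float = 5.0) -> float:
    """Illustrative schedule: brand-new participants (reputation 0) are always
    verified; the rate decays toward a 1% floor as reputation (normalized here
    to [0, 1]) grows."""
    return floor + (1.0 - floor) * math.exp(-decay * reputation)

def should_verify(reputation: float) -> bool:
    """Randomly decide whether this particular inference gets verified."""
    return random.random() < verification_probability(reputation)
```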

Participants receive rewards at the end of each epoch in proportion to their weight in the network. Tasks are also allocated by weight, so rewards end up proportional both to weight and to the amount of work completed. This means we do not need to catch and punish fraudsters immediately; it is enough to catch them within the current epoch, before rewards are distributed.
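In simplified form (not the exact on-chain payout logic), the epoch distribution is just a proportional split of the reward pool, with cheaters removed before it runs:

```python
def epoch_rewards(weights: dict[str, float], epoch_pool: float) -> dict[str, float]:
    """Split the epoch's reward pool proportionally to each participant's weight.
    Participants caught cheating during the epoch are removed from `weights`
    before the split, so they forfeit the whole epoch's rewards."""
    total = sum(weights.values())
    return {participant: epoch_pool * w / total for participant, w in weights.items()}
```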

Economic incentives drive this trade-off as much as technical parameters. By making the cost of cheating high while honest participation is profitable, we can create a system where the rational choice is to participate honestly.

Optimization Space

After months of building and testing, we have a system that combines the blockchain's strengths in record-keeping and security while approaching the performance of centralized providers for a single inference. The fundamental tension between security and performance is real; there is no perfect solution, only different trade-offs.

We believe that as the network expands, it has a real chance to compete with centralized providers while remaining fully under decentralized community control. There is still significant room for optimization as it develops. If you are interested, visit our GitHub and documentation, join our Discord community for discussions, or join the network yourself.

About Gonka.ai

Gonka is a decentralized network designed to provide efficient AI computing power, aiming to maximize the utilization of global GPU computing power to accomplish meaningful AI workloads. By eliminating centralized gatekeepers, Gonka offers developers and researchers access to permissionless computing resources while rewarding all participants with its native token GNK.

Gonka is incubated by US-based AI developer Product Science Inc. The company was founded by Web2 industry veterans, the Liberman siblings, formerly core product directors at Snap Inc., and raised $18 million in 2023 from investors including OpenAI investor Coatue Management, Solana investor Slow Ventures, K5, Insight, and Benchmark Partners. Early contributors to the project include well-known names spanning Web2 and Web3 such as 6 blocks, Hard Yaka, Gcore, and Bitfury.

Official Website | GitHub | X | Discord | White Paper | Economic Model | User Manual

