Advanced Formal Verification of Zero-Knowledge Proofs: In-Depth Analysis of Two ZK Vulnerabilities

In this article, we will focus on the perspective of discovering vulnerabilities, analyze specific vulnerabilities discovered during the audit and verification process, and the experiences and lessons learned from them.

Author: CertiK

In previous articles, we discussed advanced formal verification of zero-knowledge proofs: how to verify a ZK instruction. By formally verifying each zkWasm instruction, we can fully verify the technical security and correctness of the entire zkWasm circuit. In this article, we will focus on the perspective of discovering vulnerabilities, analyze specific vulnerabilities discovered during the audit and verification process, and the experiences and lessons learned from them. For a general discussion of advanced formal verification of zero-knowledge proof (ZKP) blockchains, please refer to the article "Advanced Formal Verification of Zero-Knowledge Proof Blockchains".

Before discussing ZK vulnerabilities, let's first understand how CertiK conducts formal verification of ZK. For complex systems like ZK virtual machine (zkVM), the first step of formal verification (FV) is to clearly define the content to be verified and its properties. This requires a comprehensive review of the design, code implementation, and test setup of the ZK system. This process overlaps with conventional audits, but the difference is that after the review, the verification goals and properties need to be established. At CertiK, we refer to this as verification-oriented audit. Audit and verification work are usually integrated. For zkWasm, we conducted both an audit and formal verification.

What is a ZK Vulnerability?

The core feature of a zero-knowledge proof system is to allow short encrypted proofs of offline or private execution of computations (such as blockchain transactions) to be passed to zero-knowledge proof verifiers for checking and confirmation, to ensure that the computation has indeed occurred as claimed. In this regard, ZK vulnerabilities would allow hackers to submit forged ZK proofs for false transactions and have the ZK proof checker accept them.

For the zkVM prover, the ZK proof process involves running the program, generating the execution records for each step, and converting the data of the execution records into a set of numeric tables (this process is called "arithmeticization"). These numbers must satisfy a set of constraints (i.e., "circuits"), including relational equations between specific table cells, fixed constants, database lookup constraints between tables, and polynomial equations that need to be satisfied between adjacent table rows (i.e., "gates"). On-chain verification can confirm the existence of a table that satisfies all constraints while ensuring that specific numbers in the table are not revealed.

zkWasm arithmeticization table

The accuracy of each constraint is crucial. Any error in a constraint, whether weak or missing, could allow hackers to submit a misleading proof that seemingly represents a valid execution of a smart contract, but in reality, it is not. Compared to traditional VMs, the lack of transparency in zkVM transactions amplifies these vulnerabilities. In non-ZK chains, the computational details of transactions are publicly recorded on the blockchain; zkVM does not store these details on the chain. The lack of transparency makes it difficult to determine the specific circumstances of an attack, and even whether an attack has occurred.

The ZK circuit for executing zkVM instructions is extremely complex. For zkWasm, the implementation of its ZK circuit involves over 6,000 lines of Rust code and hundreds of constraints. This complexity often means that there may be multiple vulnerabilities waiting to be discovered.

zkWasm circuit architecture

Indeed, we discovered several such vulnerabilities through the audit and formal verification of zkWasm. Below, we will discuss two representative examples and discuss their differences.

Code Vulnerability: Load8 Data Injection Attack

The first vulnerability involves the Load8 instruction of zkWasm. In zkWasm, reading from the heap memory is done through a set of LoadN instructions, where N is the size of the data to be loaded. For example, Load64 should read 64 bits of data from the zkWasm memory address. Load8 should read 8 bits of data (i.e., one byte) from memory and pad it with 0 prefixes to create a 64-bit value. Internally, zkWasm represents memory as an array of 64-bit bytes, so it needs to "select" a part of the memory array. This is done using four intermediate variables (u16_cells), which together form the complete 64-bit load value.

The constraints for these LoadN instructions are defined as follows:

These constraints are divided into three cases: Load32, Load16, and Load8. Load64 has no constraints because the memory unit is exactly 64 bits. For the case of Load32, the code ensures that the high 4 bytes (32 bits) of the memory unit must be zero.

For the case of Load16, the high 6 bytes (48 bits) of the memory unit must be zero.

For the case of Load8, it should require the high 7 bytes (56 bits) of the memory unit to be zero. Unfortunately, this is not the case in the code.

As you can see, only the high 9 to 16 bits are restricted to zero. The other high 48 bits can still be any value and can still masquerade as "read from memory".

By exploiting this vulnerability, a hacker can tamper with a legitimate execution sequence of a ZK proof, causing the Load8 instruction to load these unexpected bytes, resulting in data corruption. Moreover, by carefully arranging surrounding code and data, false executions and transfers may be triggered, leading to data and asset theft. This forged transaction can pass the check of the zkWasm checker and be mistakenly recognized as a genuine transaction by the blockchain.

Fixing this vulnerability is actually quite simple.

This vulnerability represents a class of ZK vulnerabilities called "code vulnerabilities" because they originate from code writing and can be easily fixed with minor local code modifications. As you might agree, these vulnerabilities are relatively easier to spot.

Design Vulnerability: Fake Return Attack

Let's look at another vulnerability, this time related to the call and return of zkWasm. Call and return are basic VM instructions that allow a running context (e.g., a function) to call another and resume the execution of the caller's context after the callee completes its execution. Each call expects a return later. The dynamic data tracked by zkWasm during the lifecycle of calls and returns is called "call frames". Since zkWasm executes instructions sequentially, all call frames can be sorted based on their occurrence time during execution. Below is an example of call/return code running on zkWasm.

Users can call the buytoken() function to purchase tokens (possibly through payment or transfer of other valuable items). One of its core steps is to increase the token account balance by 1 by calling the addtoken() function. Since the ZK prover itself does not support the call frame data structure, an execution table (E-Table) and a jump table (J-Table) are needed to record and track the complete history of these call frames.

The above diagram illustrates the process of buytoken() calling addtoken() and returning to buy_token(). It can be seen that the token account balance has increased by 1. In the execution table, each running step occupies a row, including the current call frame number, the current context function name (for illustration purposes only), the current instruction number being executed in the function, and the current instruction stored in the table (for illustration purposes only). In the jump table, each call frame occupies a row, and it contains the caller frame number, the caller function context name (for illustration purposes only), and the next instruction position of the caller frame (so that the frame can return). In both tables, there is a column called jops, which tracks whether the current instruction is a call/return (in the execution table) and the total number of call/return instructions that occurred for that frame (in the jump table).

As expected, each call should have a corresponding return, and each frame should have only one call and one return. As shown in the diagram, the jops value for the first frame in the jump table is 2, corresponding to the first and third rows in the execution table, where the jops value is 1. Everything seems normal at the moment.

But there is actually a problem here: while a call and a return will make the jops count for the frame 2, two calls or two returns will also make the count 2. Having two calls or two returns for each frame may sound absurd, but it is important to remember that this is exactly what hackers are trying to do by breaking the expected behavior.

You might be a little excited now, but have we really found the problem?

The results show that two calls are not a problem because the constraints of the execution table and the jump table make it impossible to encode two calls into the same row of a frame, as each call generates a new frame number, i.e., the current call frame number plus 1.

However, the situation with two returns is not so fortunate: since no new frame is created upon return, hackers can indeed obtain a legitimate execution sequence of the execution table and the jump table and inject forged returns (and corresponding frames). For example, the previous example of buytoken() calling addtoken() in the execution table and the jump table can be tampered with by the hacker as follows:

The hacker injected two forged returns between the original call and return in the execution table and added a new forged frame in the jump table (the original return and the subsequent instruction's running step number in the execution table need to be increased by 4). Since the jops count for each row in the jump table is 2, the constraints are met, and the zkWasm proof checker will accept the "proof" of this forged execution sequence. From the diagram, it can be seen that the token account balance increased 3 times instead of 1. Therefore, the hacker can obtain 3 tokens for the price of 1 token.

There are multiple ways to address this issue. One obvious approach is to separately track calls and returns and ensure that each frame has exactly one call and one return.

You may have noticed that we have not shown a single line of code for this vulnerability so far. The main reason is that there is no problematic line of code; the code implementation fully complies with the table and constraint design. The problem lies in the design itself, and so does the solution.

You might think that this vulnerability should be obvious, but in reality, it is not. This is because there is a gap between "two calls or two returns will also make the jops count 2" and "actually two returns are possible". The latter requires a detailed and complete analysis of various constraints in the execution table and the jump table, making it difficult to conduct complete informal reasoning.

Comparison of the Two Vulnerabilities

The "Load8 Data Injection Vulnerability" and the "Fake Return Vulnerability" both have the potential to allow hackers to manipulate ZK proofs, create false transactions, deceive proof checkers, and engage in theft or hijacking. However, their nature and the way they were discovered are completely different.

The "Load8 Data Injection Vulnerability" was discovered during the audit of zkWasm. This was no easy task, as we had to review over 6,000 lines of Rust code and hundreds of constraints for zkWasm instructions. Despite this, the vulnerability was relatively easy to discover and confirm. Since this vulnerability was fixed before formal verification began, it was not encountered during the verification process. If this vulnerability had not been discovered during the audit, we could expect it to be discovered during the verification of the Load8 instruction.

The "Fake Return Vulnerability" was discovered during formal verification after the audit. One reason we did not discover it during the audit is that zkWasm has a mechanism similar to jops called "mops", which tracks the history of memory access instructions for each memory unit during zkWasm runtime. The mops count constraint is indeed correct, as it only tracks one type of memory instruction, i.e., memory write; and each memory unit's history data is immutable and only written once (mops count is 1). But even if we had noticed this potential vulnerability during the audit, we would still not be able to easily confirm or exclude every possible scenario without a rigorous formal reasoning of all relevant constraints, as there is actually no line of code that is incorrect.

In summary, these two vulnerabilities belong to "code vulnerabilities" and "design vulnerabilities" respectively. Code vulnerabilities are relatively straightforward, easier to discover (incorrect code), and easier to reason and confirm; design vulnerabilities can be very subtle, more difficult to discover (no "incorrect" code), and more difficult to reason and confirm.

Best Practices for Discovering ZK Vulnerabilities

Based on our experience in auditing and formally verifying zkVM and other ZK and non-ZK chains, here are some recommendations on how to best protect ZK systems.

Check the Code and Design

As mentioned earlier, both code and design in ZK systems may contain vulnerabilities. Both types of vulnerabilities can lead to the compromise of ZK systems, so they must be eliminated before the system goes live. One issue with ZK systems compared to non-ZK systems is that any attack is more difficult to expose and analyze because its computational details are not public or retained on the chain. Therefore, people may know that a hacker attack has occurred, but they cannot know how it happened at the technical level. This makes the cost of any ZK vulnerability very high. Accordingly, the value of ensuring the security of ZK systems in advance is also very high.

Conduct Audits and Formal Verification

The two vulnerabilities we have discussed here were discovered through audits and formal verification. Some may think that if formal verification is used, there is no need for an audit because all vulnerabilities will be discovered through formal verification. In reality, our recommendation is to do both. As explained at the beginning of this article, high-quality formal verification work begins with a thorough review, inspection, and informal reasoning of the code and design; and this work itself overlaps with an audit. Additionally, discovering and eliminating simpler vulnerabilities during the audit will make formal verification easier and more efficient.

If both an audit and formal verification are to be conducted for a ZK system, the best time to do both is simultaneously, so that auditors and formal verification engineers can collaborate efficiently (potentially discovering more vulnerabilities, as the object and target of formal verification require high-quality audit input).

If your ZK project has been audited (thumbs up) or audited multiple times (big thumbs up), our recommendation is to conduct formal verification of the circuit based on the previous audit results. Our experience in auditing and formal verification of zkVM and other ZK and non-ZK projects has repeatedly shown that verification often captures vulnerabilities that were overlooked in the audit. Due to the nature of ZKP, while ZK systems are expected to provide better blockchain security and scalability than non-ZK solutions, the criticality of their own security and correctness is far higher than that of traditional non-ZK systems. Therefore, the value of high-quality formal verification of ZK systems is also far higher than that of non-ZK systems.

Ensuring the Security of Circuits and Smart Contracts

ZK applications typically consist of two parts: circuits and smart contracts. For applications based on zkVM, there are generic zkVM circuits and smart contract applications. For applications not based on zkVM, there are application-specific ZK circuits and corresponding smart contracts deployed on the other end of the L1 chain or bridge. Based on zkVM non-zkVM

Our audit and formal verification work on zkWasm only involved zkWasm circuits. From the perspective of the overall security of ZK applications, it is also very important to audit and formally verify their smart contracts. After all, after investing a lot of effort to ensure the security of the circuits, it would be very regrettable if the application is ultimately compromised due to a lack of vigilance in the smart contracts.

There are two types of smart contracts that deserve special attention. The first type is smart contracts that directly handle ZK proofs. Although their scale may not be very large, their risk is very high. The second type is large and medium-sized smart contracts running on zkVM. We know that they can sometimes be very complex, and the most valuable among them should be audited and verified, especially because people cannot see their execution details on the chain. Fortunately, after years of development, formal verification of smart contracts is now practical and ready for suitable high-value targets.

Let's summarize the impact of formal verification (FV) on ZK systems and their components through the following explanations.

FV circuit + non-FV smart contract = non-FV zero-knowledge proof
non-FV circuit + FV smart contract = non-FV zero-knowledge proof
FV circuit + FV smart contract = FV zero-knowledge proof

免责声明：本文章仅代表作者个人观点，不代表本平台的立场和观点。本文章仅供信息分享，不构成对任何人的任何投资建议。用户与作者之间的任何争议，与本平台无关。如网页中刊载的文章或图片涉及侵权，请提供相关的权利证明和身份证明发送邮件到support@aicoin.com，本平台相关工作人员将会进行核查。