From AWS Outage to $19.3 Billion Liquidation Storm: The "Invisible Bomb" of Crypto Infrastructure

Original: YQ

Compiled by: Yuliya, PANews

On October 20, Amazon Web Services (AWS) experienced another major outage, severely impacting cryptocurrency infrastructure. The issues in the US-EAST-1 region (Northern Virginia data center) began around 4 PM Beijing time, causing outages for Coinbase and dozens of major crypto platforms, including Robinhood, Infura, Base, and Solana.

AWS has acknowledged an "increased error rate" in its core database and computing services—Amazon DynamoDB and EC2—which are relied upon by thousands of companies. This real-time disruption provides direct and vivid evidence for the core argument of this article: the dependence of crypto infrastructure on centralized cloud service providers creates systemic vulnerabilities that are repeatedly exposed under pressure.

The timing makes for a stark warning. Just ten days after a $19.3 billion liquidation waterfall exposed failures at the exchange level, the AWS outage shows that the problem extends beyond individual platforms to the underlying cloud infrastructure layer. When AWS fails, the ripple effects simultaneously hit centralized exchanges, "decentralized" platforms that still rely on centralized components, and the countless services that depend on them.

This is not an isolated incident but a continuation of a long-term pattern. Similar AWS outages occurred in April 2025, December 2021, and March 2017, each leading to disruptions in mainstream crypto services. The question is no longer "if" it will happen again, but "when" and "what will trigger it."

Liquidation Waterfall from October 10 to 11, 2025

The liquidation cascade of October 10-11, 2025 serves as a textbook case of infrastructure failure mechanisms. At 20:00 UTC on October 10 (04:00 Beijing time on October 11), a major geopolitical announcement triggered a broad market sell-off. Within a single hour, liquidations reached $6 billion. By the time Asian markets opened, $19.3 billion in leveraged positions had been wiped out, affecting 1.6 million trader accounts.

Figure 1: Timeline of the October 2025 Liquidation Waterfall (UTC time)

Key turning points included API rate limiting, market maker exits, and a sharp decline in order book liquidity.

  • 20:00-21:00: Initial shock—$6 billion liquidation (red zone)
  • 21:00-22:00: Peak liquidation—$4.2 billion, API begins rate limiting
  • 22:00-04:00: Continued deterioration—$9.1 billion, market depth extremely thin

Figure 2: Comparison of Historical Liquidation Events

The scale of this event surpasses any previous crypto market liquidation by at least an order of magnitude, as a comparison with earlier episodes makes clear:

  • March 2020 (during the pandemic): $1.2 billion
  • May 2021 (market crash): $1.6 billion
  • November 2022 (FTX collapse): $1.6 billion
  • October 2025: $19.3 billion, roughly 12 times the previous record

However, the liquidation data is merely the surface. The more critical issue lies at the mechanism level: why can external market events trigger such specific failure modes? The answer reveals systemic weaknesses in the architecture of centralized exchanges and the design of blockchain protocols.

Off-chain Failures: Architectural Issues of Centralized Exchanges

Infrastructure Overload and Rate Limiting

Exchanges typically set rate limiting mechanisms for their APIs to prevent abuse and maintain stable server loads. Under normal conditions, these limits can prevent attacks and ensure smooth trading. However, during extreme volatility, when thousands of traders attempt to adjust their positions simultaneously, this mechanism becomes a bottleneck.

During this liquidation period, centralized exchanges (CEX) limited liquidation notifications to one order per second, while the system actually needed to process thousands of orders. As a result, information transparency plummeted, and users could not understand the severity of the chain liquidations in real-time. Third-party monitoring tools showed hundreds of liquidations per minute, while official data was much lower.

API rate limiting prevented traders from adjusting their positions during the most critical first hour. Connection requests timed out, orders failed, stop-loss instructions were not executed, and position data was delayed in updating—all of these turned market events into operational crises.

Traditional exchanges typically provision resources for "normal load plus a safety margin," but the gap between normal and extreme load is vast. Daily trading volumes are a poor predictor of peak demand under extreme pressure: during chain liquidations, trading volume can surge 100-fold and position-query traffic can spike 1,000-fold. With every user checking their account at once, the system was nearly paralyzed.
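
The dynamic is easy to see in miniature. The sketch below uses a generic token-bucket limiter with made-up numbers (not any exchange's actual configuration) to show how a limit sized for everyday traffic rejects almost everything once requests spike 100-fold:

```python
# Illustrative sketch (not any exchange's real implementation): a token-bucket
# rate limiter tuned for normal load, showing how a 100x request surge is
# mostly rejected once the bucket drains.
import time


class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec          # tokens refilled per second
        self.capacity = burst             # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    limiter = TokenBucket(rate_per_sec=100, burst=200)   # sized for "normal" traffic
    surge = 10_000                                       # hypothetical 100x spike
    accepted = sum(limiter.allow() for _ in range(surge))
    print(f"accepted {accepted}/{surge} requests during the spike")
    # Roughly only the burst allowance (plus a handful of refills) gets through;
    # everything else times out or fails exactly when traders need the API most.
```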

Figure 4.5: AWS Outage Event Impacting Crypto Services

While the automatic scaling of cloud infrastructure helps, it cannot respond instantly. Creating additional database replicas takes minutes, and generating new API gateway instances also requires several minutes. During this time, the margin system continues to mark positions based on distorted price data due to order book congestion.

Oracle Manipulation and Pricing Vulnerabilities

In the October liquidation event, a key design flaw in the margin system was exposed: some exchanges calculated collateral values based on internal spot prices rather than external oracle prices. Under normal market conditions, arbitrageurs can maintain price consistency across different exchanges, but this linkage mechanism fails under infrastructure pressure.

Figure 3: Oracle Manipulation Flowchart

The attack path can be divided into five stages:

  • Initial Sell-off: Applying $60 million of selling pressure on USDe
  • Price Manipulation: USDe plummeting from $1.00 to $0.65 on a single exchange
  • Oracle Failure: The margin system adopts the tampered internal price
  • Triggering the Chain: Collateral is undervalued, triggering forced liquidations
  • Amplification Effect: A total of $19.3 billion in liquidations (322 times amplification)

This attack exploited the mechanism by which Binance uses spot market prices to price wrapped synthetic collateral. When an attacker sold $60 million of USDe into a relatively illiquid order book, the spot price plummeted from $1.00 to $0.65. The margin system, configured to mark collateral based on spot prices, reduced the value of all positions collateralized by USDe by 35%. This triggered margin calls and forced liquidations of thousands of accounts.

These liquidations forced more sell orders into the same illiquid market, further driving down prices. The margin system observed these lower prices and marked down more positions. This feedback loop amplified the $60 million selling pressure by 322 times, ultimately leading to $19.3 billion in forced liquidations.

Figure 4: Liquidation Waterfall Feedback Loop

This feedback loop illustrates the self-reinforcing nature of the waterfall:

Price Drop → Trigger Liquidation → Forced Sell → Further Price Drop → [Repeat Cycle]
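
A toy simulation makes the amplification mechanism concrete. The parameters below (position sizes, liquidation prices, a crude linear price-impact model) are invented for illustration and are not calibrated to the October 2025 data:

```python
# Minimal sketch of the liquidation feedback loop with made-up parameters;
# it only shows how a modest initial sale can cascade once forced selling
# feeds back into price.

def simulate_cascade(initial_sale: float,
                     price_impact_per_million: float,
                     positions: list[tuple[float, float]]) -> float:
    """positions: (notional_usd, liquidation_price) pairs. Returns total forced sales."""
    price = 1.00
    pending_sale = initial_sale
    total_liquidated = 0.0
    remaining = list(positions)

    while pending_sale > 0:
        # Selling pressure pushes the price down (crude linear impact model).
        price -= (pending_sale / 1e6) * price_impact_per_million
        pending_sale = 0.0
        still_open = []
        for notional, liq_price in remaining:
            if price <= liq_price:            # collateral marked down past the threshold:
                total_liquidated += notional  # the position is force-sold...
                pending_sale += notional      # ...and becomes fresh selling pressure
            else:
                still_open.append((notional, liq_price))
        remaining = still_open
    return total_liquidated


if __name__ == "__main__":
    # 200 hypothetical positions of $50M each, liquidation prices spread over 0.70-0.9985.
    book = [(50e6, 0.70 + 0.0015 * i) for i in range(200)]
    total = simulate_cascade(initial_sale=60e6,
                             price_impact_per_million=0.0001,
                             positions=book)
    print(f"${total / 1e9:.1f}B liquidated from a $60M initial sale "
          f"(~{total / 60e6:.0f}x amplification)")
```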

With a well-designed oracle system, this mechanism would not have worked. Had Binance used a time-weighted average price (TWAP) across multiple exchanges, an instantaneous price dislocation would not have affected collateral valuation. Had it used aggregated price feeds from Chainlink or another multi-source oracle, the attack would likewise have failed.
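
As a rough sketch of what such hardening could look like, the snippet below combines a cross-venue median with a rolling time-weighted average, so a single manipulated feed barely moves the price used for margining. The venue names and figures are hypothetical, and this is not a description of any specific exchange's or oracle provider's implementation:

```python
# Hedged sketch of a manipulation-resistant margin price: median across venues
# discards a single bad feed, and a rolling average smooths transient spikes.
from collections import deque
from statistics import median


class RobustOracle:
    def __init__(self, window: int = 30):
        self.history = deque(maxlen=window)   # rolling window of cross-venue medians

    def update(self, venue_prices: dict[str, float]) -> float:
        # Step 1: median across venues ignores one manipulated order book.
        cross_venue = median(venue_prices.values())
        self.history.append(cross_venue)
        # Step 2: time-weighted average (equal intervals assumed) smooths spikes.
        return sum(self.history) / len(self.history)


if __name__ == "__main__":
    oracle = RobustOracle(window=30)
    # 29 normal observations...
    for _ in range(29):
        oracle.update({"venue_a": 1.00, "venue_b": 1.00, "venue_c": 1.00})
    # ...then USDe is dumped on one venue only.
    marked = oracle.update({"venue_a": 0.65, "venue_b": 0.998, "venue_c": 0.999})
    print(f"margin price: {marked:.4f}")   # stays ~1.00 instead of collapsing to 0.65
```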

The recent wBETH incident also exposed similar issues. Wrapped Binance ETH (wBETH) was supposed to maintain a 1:1 exchange rate with ETH. However, during the waterfall, liquidity dried up, and the wBETH/ETH spot market experienced a 20% discount. The margin system correspondingly marked down wBETH collateral, triggering liquidations of positions that were actually fully collateralized by underlying ETH.

Automatic Deleveraging (ADL) Mechanism

When liquidations cannot be executed at current market prices, exchanges implement an automatic deleveraging (ADL) mechanism to socialize losses among profitable traders. ADL forcibly closes profitable positions at current prices to cover the losses of liquidated positions.

During the October waterfall, Binance executed ADL on multiple trading pairs. Traders holding profitable long positions found their trades forcibly closed, not due to their own risk management failures, but because other traders' positions became insolvent.

ADL reflects the underlying architectural choices of centralized derivatives trading: exchanges ensure they do not incur losses, and thus losses must be borne in one of the following ways:

  • Insurance Fund (capital reserved by the exchange to cover liquidation losses)
  • ADL (forced liquidation of profitable traders)
  • Socialized Losses (spreading losses among all users)

The size of the insurance fund relative to open interest determines how often ADL occurs. In October 2025, Binance's total insurance fund was approximately $2 billion. Relative to the $4 billion in open interest for BTC, ETH, and BNB perpetual contracts, this provided 50% coverage. However, during the October waterfall, total open interest across all trading pairs exceeded $20 billion, and the insurance fund could not cover the losses.

After the October waterfall, Binance announced that it will guarantee no ADL occurs whenever the total open interest for BTC, ETH, and BNB USDT perpetual contracts is below $4 billion. While this policy builds trust, it also exposes a structural contradiction: to avoid ADL entirely, the exchange must hold a larger insurance fund, tying up capital that could otherwise be deployed profitably.
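
The loss waterfall itself is simple enough to sketch. The snippet below illustrates the ordering described above (insurance fund first, then ADL ranked roughly by profit and leverage, a common convention); it is not Binance's actual engine, and all figures are hypothetical:

```python
# Illustrative sketch of loss absorption: a bankrupt position's shortfall is
# covered first by the insurance fund, and whatever remains is socialized via
# ADL by closing profitable opposite-side positions.

def absorb_shortfall(shortfall_usd: float,
                     insurance_fund_usd: float,
                     profitable_positions: list[dict]) -> tuple[float, list[str]]:
    """Returns (remaining insurance fund, accounts force-closed by ADL)."""
    covered = min(shortfall_usd, insurance_fund_usd)
    insurance_fund_usd -= covered
    shortfall_usd -= covered

    deleveraged = []
    if shortfall_usd > 0:
        # Common ADL convention: rank counterparties by unrealized profit x leverage,
        # so the most profitable, most leveraged traders are closed first.
        ranked = sorted(profitable_positions,
                        key=lambda p: p["pnl_pct"] * p["leverage"], reverse=True)
        for pos in ranked:
            if shortfall_usd <= 0:
                break
            deleveraged.append(pos["account"])
            shortfall_usd -= pos["unrealized_pnl_usd"]
    return insurance_fund_usd, deleveraged


if __name__ == "__main__":
    winners = [
        {"account": "A", "unrealized_pnl_usd": 8e6, "pnl_pct": 0.40, "leverage": 20},
        {"account": "B", "unrealized_pnl_usd": 5e6, "pnl_pct": 0.25, "leverage": 10},
        {"account": "C", "unrealized_pnl_usd": 2e6, "pnl_pct": 0.10, "leverage": 5},
    ]
    fund_left, closed = absorb_shortfall(shortfall_usd=12e6,
                                         insurance_fund_usd=5e6,
                                         profitable_positions=winners)
    print(f"insurance fund left: ${fund_left / 1e6:.1f}M, ADL closed accounts: {closed}")
```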

On-chain Failures: Limitations of Blockchain Protocols

Figure 5: Major Network Outages - Duration Analysis

  • Solana (February 2024): 5 hours - Voting throughput bottleneck
  • Polygon (March 2024): 11 hours - Validator version mismatch
  • Optimism (June 2024): 2.5 hours - Sequencer overload (airdrop)
  • Solana (September 2024): 4.5 hours - Spam transaction attack
  • Arbitrum (December 2024): 1.5 hours - RPC provider failure

Solana: Consensus Bottleneck

Solana experienced multiple outages between 2024 and 2025. The outage in February 2024 lasted about 5 hours, while the September outage lasted 4-5 hours. These outages stemmed from similar root causes: the network was unable to handle transaction volumes during spam transaction attacks or extreme activity.

Solana's architecture is optimized for high throughput. Under ideal conditions, the network can process 3,000 to 5,000 transactions per second and achieve sub-second finality. This performance is several orders of magnitude higher than Ethereum. However, during stress events, this optimization creates vulnerabilities.

The September 2024 outage was caused by a flood of spam transactions overwhelming the validators' voting mechanism. Solana's validators must vote on blocks to reach consensus. Under normal operation, validators prioritize voting transactions to keep the consensus process moving. However, earlier versions of the protocol treated vote transactions the same as regular transactions in the fee market.

When the transaction memory pool (mempool) was filled with millions of spam transactions, validators struggled to broadcast voting transactions. Without sufficient votes, blocks could not be finalized. Without finalized blocks, the chain stopped producing blocks. Users' pending transactions were stuck in the mempool, and new transactions could not be submitted.
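A simplified model of block building shows why this matters, and why giving votes a dedicated lane rather than making them compete in the open fee auction keeps consensus alive under spam. The sketch below illustrates that idea only; it is not Solana's actual scheduler, and all numbers are invented:

```python
# Toy block builder: with no reserved capacity, cheap vote transactions lose
# the fee auction to high-fee spam and finality stalls; a reserved vote lane
# keeps votes landing under the same flood.
from dataclasses import dataclass, field


@dataclass(order=True)
class Tx:
    neg_fee: float                    # store negative fee so ascending sort = highest fee first
    kind: str = field(compare=False)  # "vote" or "user"


def build_block(mempool: list[Tx], capacity: int, reserved_for_votes: int) -> dict:
    votes = [tx for tx in mempool if tx.kind == "vote"]
    users = sorted(tx for tx in mempool if tx.kind == "user")   # fee auction
    block = votes[:reserved_for_votes]                  # guaranteed consensus lane
    block += users[:capacity - len(block)]              # remaining space goes to the fee market
    return {"votes": sum(tx.kind == "vote" for tx in block),
            "user": sum(tx.kind == "user" for tx in block)}


if __name__ == "__main__":
    spam = [Tx(neg_fee=-0.01, kind="user") for _ in range(100_000)]      # high-fee spam flood
    votes = [Tx(neg_fee=-0.000005, kind="vote") for _ in range(2_000)]   # cheap validator votes
    pool = spam + votes

    # Votes compete in the open fee auction and lose: no votes land, finality stalls.
    print(build_block(pool, capacity=5_000, reserved_for_votes=0))
    # A reserved vote lane keeps consensus alive under the same spam flood.
    print(build_block(pool, capacity=5_000, reserved_for_votes=2_000))
```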

Third-party monitoring tool StatusGator recorded multiple service interruptions for Solana in 2024-2025, while Solana's official channels did not release formal statements. This created information asymmetry, preventing users from distinguishing between their own connection issues and network-wide problems. Although third-party services provided oversight, the platform itself should have a comprehensive status page to establish transparency.

Ethereum: Gas Fee Explosion

Ethereum experienced extreme gas fee surges during the DeFi boom in 2021. Fees for a simple transfer exceeded $100, while complex smart contract interactions cost as much as $500 to $1,000. This rendered the network nearly unusable for small transactions and gave rise to another attack vector: MEV (maximal extractable value) extraction.

Figure 7: Transaction Costs Under Network Stress

  • Ethereum: $5 (normal) → $450 (congestion peak) - 90x increase
  • Arbitrum: $0.50 → $15 - 30x increase
  • Optimism: $0.30 → $12 - 40x increase

In a high gas fee environment, block production became a significant source of profit. MEV refers to the additional profit block producers can extract by reordering, including, or excluding transactions. In this scenario, arbitrageurs raced to front-run trades on large DEXs, while liquidation bots competed to be first to liquidate under-collateralized positions. This competition intensified the gas-fee bidding war, and even lower-cost Layer 2 networks saw significant fee increases under the demand. The high-fee environment in turn amplified MEV profit opportunities, increasing both the frequency and scale of these activities.

During congestion, users hoping to ensure their transactions were included had to bid higher than MEV bots. This led to scenarios where transaction fees exceeded the value of the transactions themselves. Want to claim your $100 airdrop? Please pay $150 in gas fees. Need to add collateral to avoid liquidation? Compete with bots paying $500 for priority.

Ethereum's gas limit represents the total amount of computation that can be executed per block. During congestion, users bid for scarce block space. The fee market operates as designed: higher bidders are prioritized. However, this design makes the network increasingly expensive during peak usage times, precisely when users need access the most.
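
The arithmetic behind those fee spikes is straightforward. The sketch below applies the post-EIP-1559 base-fee update rule (the base fee moves by up to 12.5% per block toward full-block demand) with illustrative, unmeasured numbers:

```python
# Rough sketch of how sustained congestion makes Ethereum expensive: the base
# fee ratchets up by up to 12.5% per block while blocks stay full, and users
# must add a priority tip on top to outbid bots. Figures are illustrative.

def next_base_fee(base_fee_gwei: float, gas_used: int, gas_target: int = 15_000_000) -> float:
    # EIP-1559 update rule: move by at most 1/8 toward the demand imbalance.
    return base_fee_gwei * (1 + 0.125 * (gas_used - gas_target) / gas_target)


def tx_cost_usd(gas: int, base_fee_gwei: float, tip_gwei: float, eth_usd: float) -> float:
    return gas * (base_fee_gwei + tip_gwei) * 1e-9 * eth_usd


if __name__ == "__main__":
    base = 20.0                      # gwei, a calm starting point
    for _ in range(30):              # 30 consecutive full blocks (~6 minutes)
        base = next_base_fee(base, gas_used=30_000_000)
    swap_cost = tx_cost_usd(gas=200_000, base_fee_gwei=base, tip_gwei=50, eth_usd=3_000)
    print(f"base fee after sustained congestion: {base:.0f} gwei")
    print(f"one DEX swap: ${swap_cost:,.0f}")
```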

Layer 2: Sequencer Bottleneck

Layer 2 solutions attempt to address this issue by moving computation off-chain while inheriting Ethereum's security through periodic settlements. Optimism, Arbitrum, and other Rollups process thousands of transactions off-chain and then submit compressed proofs to Ethereum. This architecture successfully reduces the cost per transaction under normal operations.

However, Layer 2 solutions introduce new bottlenecks. In June 2024, when 250,000 addresses simultaneously claimed an airdrop, Optimism experienced an outage. The component responsible for ordering transactions before submission to Ethereum—the sequencer—was overwhelmed. Users were unable to submit transactions for several hours.

This outage revealed that moving computation off-chain does not eliminate the need for infrastructure. The sequencer must handle incoming transactions, order them, execute them, and generate fraud proofs or zero-knowledge proofs for Ethereum settlement. Under extreme traffic, the sequencer faces the same scalability challenges as independent blockchains.

Multiple RPC providers must remain available. If the primary provider fails, users should be able to seamlessly switch to a backup solution. During the Optimism outage, some RPC providers were still operational while others failed. Users whose wallets defaulted to failed providers could not interact with the chain, even if the chain itself remained alive.
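
A minimal failover loop captures the idea: query each configured provider in order and fall back on any timeout or error. The endpoints below are placeholders rather than recommendations, and the only RPC method used is the standard eth_blockNumber call:

```python
# Sketch of client-side RPC failover across multiple providers.
import requests

RPC_ENDPOINTS = [
    "https://primary-rpc.example.com",     # hypothetical primary provider
    "https://backup-rpc.example.com",      # hypothetical backup provider
    "https://public-rpc.example.org",      # hypothetical public fallback
]


def latest_block(endpoints: list[str] = RPC_ENDPOINTS, timeout_s: float = 2.0) -> int:
    payload = {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
    last_error = None
    for url in endpoints:
        try:
            resp = requests.post(url, json=payload, timeout=timeout_s)
            resp.raise_for_status()
            return int(resp.json()["result"], 16)   # hex-encoded block number
        except (requests.RequestException, KeyError, ValueError) as exc:
            last_error = exc                         # provider down: try the next one
    raise RuntimeError(f"all RPC providers failed: {last_error}")


if __name__ == "__main__":
    print("latest block:", latest_block())
```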

AWS outages repeatedly reveal the risks of centralized infrastructure in the crypto ecosystem:

  • October 20, 2025: US-EAST-1 outage affected Coinbase, Venmo, Robinhood, Chime, etc. AWS acknowledged increased error rates in DynamoDB and EC2 services.
  • April 2025: Regional outages affected multiple exchanges, including Binance, KuCoin, and MEXC, causing simultaneous interruptions. Major exchanges' AWS-hosted components failed.
  • December 2021: US-EAST-1 outage caused Coinbase, Binance.US, and the "decentralized" exchange dYdX to go down for 8-9 hours, also affecting Amazon's own warehouses and mainstream streaming services.
  • March 2017: S3 (Simple Storage Service) outage prevented users from logging into Coinbase and GDAX for up to five hours, triggering widespread internet disruptions.

These exchanges host critical components on AWS infrastructure. When AWS experiences regional outages, multiple major exchanges and services become unavailable simultaneously. During outages—precisely when market volatility may require immediate action—users cannot access funds, execute trades, or modify positions.

Polygon: Consensus Version Mismatch

Polygon experienced an 11-hour outage in March 2024 due to validator version inconsistencies. This was the longest incident analyzed among major blockchain networks, highlighting the severity of consensus failures. The root of the problem lay in some validators running outdated software while others had upgraded to the new version. Due to differences in how the two versions computed state transitions, validators reached inconsistent conclusions about the correct state, leading to consensus failure.

The chain could not produce new blocks because validators could not agree on the validity of the blocks. This created a stalemate: validators running the old software rejected blocks from validators running the new software, while validators running the new software also rejected blocks from the old software.

The solution required coordinating validators to upgrade. However, coordinating upgrades during an outage takes time. Each validator operator must be contacted, the correct software version must be deployed, and they must restart their validators. In a decentralized network with hundreds of independent validators, this coordination can take hours or even days.

Hard forks typically use block height as a trigger. All validators complete upgrades before a specific block height to ensure simultaneous activation. However, this requires prior coordination. Incremental upgrades, where validators gradually adopt the new version, carry the risk of version mismatches like the one seen in Polygon's outage.
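
The coordination pattern is easy to sketch: every validator, whichever binary it runs, switches rule sets at the same pre-agreed height, so two nodes can never apply different rules to the same block. The heights and rules below are invented for illustration:

```python
# Schematic sketch of block-height-gated activation for a consensus rule change.

ACTIVATION_HEIGHT = 1_000_000    # hypothetical fork height agreed in advance


def apply_block(state: dict, block_height: int, fee: int) -> dict:
    new_state = dict(state)
    if block_height >= ACTIVATION_HEIGHT:
        # v2 rules: burn half the fee (example of a changed state transition)
        new_state["treasury"] += fee // 2
        new_state["burned"] = new_state.get("burned", 0) + fee - fee // 2
    else:
        # v1 rules: full fee goes to the treasury
        new_state["treasury"] += fee
    return new_state


if __name__ == "__main__":
    state = {"treasury": 0}
    for height in (999_999, 1_000_000):
        state = apply_block(state, height, fee=100)
        print(height, state)
    # Contrast with Polygon's March 2024 incident: validators on mismatched
    # versions applied different rules at the same height and split on state.
```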

Architectural Trade-offs

Figure 6: The Blockchain Trilemma - Decentralization vs. Performance

The "Blockchain Trilemma" reflects the following systems:

  • Bitcoin: Highly decentralized, low performance
  • Ethereum: Highly decentralized, medium performance
  • Solana: Moderately decentralized, high performance
  • Binance (CEX): Lowest degree of decentralization, highest performance
  • Arbitrum/Optimism: Medium to high degree of decentralization, medium performance

Core Insight: No system can achieve maximum decentralization and highest performance simultaneously. Each design makes deliberate trade-offs for different use cases.

Centralized exchanges achieve low latency through architectural simplicity. The matching engine processes orders in microseconds, with state residing in a central database, avoiding the overhead introduced by consensus protocols. However, this simplicity also creates a single point of failure. When infrastructure is under pressure, cascading failures can propagate through tightly coupled systems.

Decentralized protocols distribute state among validators, eliminating single points of failure. Even high-throughput chains keep this property during outages: funds are not lost, only temporarily inaccessible. However, reaching consensus among distributed validators introduces computational overhead, since validators must agree before state transitions are finalized. When validators run incompatible versions or face overwhelming traffic, the consensus process can temporarily halt.

Increasing replicas can enhance fault tolerance but also raises coordination costs. In Byzantine fault-tolerant systems, adding a validator increases communication overhead. High-throughput architectures minimize this overhead through optimized validator communication to achieve superior performance, but this also makes them vulnerable to certain attack patterns. In contrast, security-focused architectures prioritize validator diversity and robustness of consensus, limiting throughput at the base layer while maximizing resilience.

Layer 2 solutions attempt to provide both features through layered design. They inherit Ethereum's security properties through L1 settlements while providing high throughput through off-chain computation. However, they introduce new bottlenecks at the sequencer and RPC layers, indicating that architectural complexity creates new failure modes while solving some problems.

Scalability Remains a Fundamental Issue

These events reveal a recurring pattern: blockchain and transaction systems perform well under normal loads but often collapse under extreme pressure.

  • Solana can effectively handle daily traffic but crashed when transaction volume increased by 10,000%.
  • Ethereum's gas fees remained reasonable before the popularity of DeFi applications but surged significantly due to congestion afterward.
  • Optimism's infrastructure runs smoothly under normal conditions but encountered issues when 250,000 addresses simultaneously claimed an airdrop.
  • Binance's API functions normally during regular trading but became a bottleneck during the liquidation surge. In the October 2025 event in particular, Binance's API rate limits and database connections were adequate for everyday operations, but when every trader tried to adjust positions at once, those limits choked. The forced-liquidation mechanism designed to protect the exchange then made matters worse, turning large numbers of users into forced sellers at the worst possible moment.

Automatic scaling proves inadequate in the face of sudden spikes in load, as new servers take several minutes to come online. During this time, the margin system may generate incorrect price data for position marking based on an illiquid order book. By the time the new servers come online, the chain reaction of liquidations has already spread.

Overprovisioning to handle rare stress events increases daily operational costs, so exchanges typically optimize systems to handle typical loads and accept occasional failures as an economically rational choice. However, this choice shifts the cost of downtime onto users, causing them to face issues such as liquidations, transaction stalls, or inability to access funds during critical market fluctuations.

Infrastructure Improvements

Figure 8: Distribution of Infrastructure Failure Modes (2024-2025)

The main causes of infrastructure failures between 2024 and 2025 include:

  • Infrastructure overload: 35% (most common)
  • Network congestion: 20%
  • Consensus failure: 18%
  • Oracle manipulation: 12%
  • Validator issues: 10%
  • Smart contract vulnerabilities: 5%

Several architectural improvements can be made to reduce the frequency and severity of failures, but each comes with trade-offs:

1. Separate Pricing and Liquidation Systems

The October event was partly caused by binding margin settlements to spot market prices. Using wrapped asset exchange rates instead of spot prices could avoid wBETH valuation distortions. More broadly, key risk management systems should not rely on potentially manipulable market data. Implementing independent oracle systems, multi-source aggregation, and TWAP calculations can provide more reliable pricing.
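
For wrapped collateral specifically, the difference between spot marking and exchange-rate marking can be sketched in a few lines; all figures below are illustrative and not tied to actual wBETH parameters:

```python
# Sketch of "exchange-rate marking" for wrapped collateral: value wBETH by its
# on-contract redemption rate against ETH rather than by its own thin spot
# market, so a temporary spot discount does not shrink the collateral value.

def collateral_value_spot(wbeth_amount: float, wbeth_spot_usd: float) -> float:
    return wbeth_amount * wbeth_spot_usd


def collateral_value_exchange_rate(wbeth_amount: float,
                                   wbeth_per_eth_rate: float,
                                   eth_oracle_usd: float) -> float:
    # Redemption rate comes from the wrapper contract; ETH price from a robust oracle.
    return wbeth_amount * wbeth_per_eth_rate * eth_oracle_usd


if __name__ == "__main__":
    eth_usd = 3_000.0
    redemption_rate = 1.04                              # ETH per wBETH (illustrative)
    stressed_spot = redemption_rate * eth_usd * 0.80    # 20% spot discount in the crash

    spot_mark = collateral_value_spot(1_000, stressed_spot)
    rate_mark = collateral_value_exchange_rate(1_000, redemption_rate, eth_usd)
    print(f"spot marking:          ${spot_mark:,.0f}")  # collateral looks impaired
    print(f"exchange-rate marking: ${rate_mark:,.0f}")  # fully backed positions stay solvent
```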

2. Overprovisioning and Redundant Infrastructure

The AWS outage in April 2025 that affected Binance, KuCoin, and MEXC demonstrated the risks of centralized infrastructure dependence. Running critical components across multiple cloud providers increases operational complexity and costs but eliminates correlated failures. Layer 2 networks can maintain multiple RPC providers with automatic failover capabilities. While the additional overhead may seem wasteful during normal operations, it can prevent hours of downtime during peak demand periods.

3. Strengthen Stress Testing and Capacity Planning

The "runs well until it fails" pattern indicates insufficient stress testing. Simulating 100 times the normal load should become standard practice. Identifying bottlenecks during development is far less costly than discovering them during actual outages. However, real load testing remains challenging. Traffic patterns in production environments exhibit behaviors that synthetic tests cannot fully capture. User behavior during real crashes differs from that during testing.

The Path Forward

Blockchain systems have made significant technical advances, but they still show notable shortcomings under stress. Much of the supporting infrastructure is designed and staffed around traditional business hours, while crypto markets run globally and continuously, so teams end up firefighting outside normal working hours as users absorb losses. Traditional markets pause trading under stress through circuit breakers; crypto markets never stop. Whether that is a feature or a flaw depends on one's perspective and position.

Overprovisioning is a reliable fix but conflicts with economic incentives: maintaining excess capacity is expensive and pays off only during rare events. Unless the cost of catastrophic failure is high enough, the industry is unlikely to act proactively.

Regulatory pressure may become a driving force for change, such as requiring 99.9% uptime or limiting acceptable downtime. However, regulations are often enacted after disasters occur, as seen when Mt. Gox's collapse in 2014 prompted Japan to establish formal regulations for cryptocurrency exchanges. The chain reaction expected from the October 2025 events may trigger similar regulatory responses, though it remains uncertain whether these responses will specify outcomes (such as maximum acceptable downtime or maximum slippage during liquidations) or dictate implementation methods (such as specific oracle providers or circuit breaker thresholds).

The industry needs to prioritize system robustness over growth during bull markets. Downtime issues are often overlooked during market booms, but the next cycle's stress tests may expose new vulnerabilities. Whether the industry will learn from the October 2025 events or repeat past mistakes remains an open question. History shows that the industry typically discovers critical vulnerabilities through billions of dollars in failures rather than proactively improving systems. For blockchain systems to maintain reliability under pressure, a shift from prototype architectures to production-grade infrastructure is necessary, requiring not only financial support but also a balance between development speed and robustness.
