The Tragedy of the Commons Series in Cryptocurrency: The Data Indexing Woes of Polymarket



Written by: shew

Abstract

Welcome to the "Tragedy of the Commons" series in the GCC Research column.

In this series, we focus on the "public goods" of the crypto world that sit at critical junctures yet are gradually losing their footing. They are the infrastructure of the entire ecosystem, yet they often suffer from insufficient incentives, governance imbalances, and even creeping centralization. In these corners, the ideals pursued by crypto technology and the redundancy and stability achievable in practice are undergoing severe tests.

In this issue, we focus on one of the most "out-of-the-box" applications in the Ethereum ecosystem: Polymarket and its data indexing tools. Especially since the beginning of this year, events surrounding Trump's election victory, the manipulation of oracle data regarding Ukraine's rare earth trades, and political bets on Zelensky's suit color have made Polymarket a focal point of public opinion. The scale of funds and market influence it carries makes these controversies impossible to ignore.

However, does this product representing "decentralized prediction markets" truly achieve decentralization in its key foundational module—data indexing? Why has public infrastructure like The Graph failed to fulfill the expected role? What form should a truly usable and sustainable data indexing public good take?

1. A Chain Reaction Triggered by the Downtime of a Centralized Data Platform

In July 2024, Goldsky (a real-time blockchain data infrastructure platform that provides indexing, subgraph hosting, and streaming-data services to help Web3 developers quickly build data-driven decentralized applications) suffered a six-hour outage that paralyzed a significant portion of projects in the Ethereum ecosystem. DeFi front ends could not display users' positions and balances, the prediction market Polymarket could not show correct data, and countless projects appeared completely unusable from the perspective of front-end users.

This should not happen in the world of decentralized applications. After all, the original purpose of blockchain technology design is to eliminate single points of failure, right? The Goldsky incident exposed a disturbing fact: although the blockchain itself has achieved decentralization as much as possible, the infrastructure used by applications built on the chain often contains a large number of centralized services.

The root cause is that blockchain data indexing and retrieval are "non-excludable, non-rivalrous" digital public goods. Users expect to use them for free or at very low rates, yet providing them requires continuous investment in hardware, storage, bandwidth, and operations staff. Without a sustainable profit model, a winner-takes-all centralized pattern emerges: once one service provider gains a first-mover advantage in speed and capital, developers tend to direct all query traffic to it, re-establishing a single point of dependency. Public-goods projects like Gitcoin have repeatedly emphasized that "open-source infrastructure can create billions of dollars in value, but its authors often cannot pay their mortgages with it."

This warns us that the decentralized world urgently needs to diversify Web3 infrastructure through public goods funding, redistribution, or community-driven initiatives; otherwise, centralization will keep re-emerging. We call on DApp developers to build local-first products, and we urge the tech community to consider, when designing DApps, scenarios in which data retrieval services fail, so that users can still interact with a project even without data retrieval infrastructure.

2. Where Does the Data You See in DApps Come From?

To understand why incidents like Goldsky occur, we need to delve into the behind-the-scenes workings of DApps. For ordinary users, DApps typically consist of two parts: on-chain contracts and front-end pages. Most users have become accustomed to using tools like Etherscan to check on-chain transaction statuses and obtain necessary information from the front end while initiating transactions and interacting with contracts through the front end. But where does the data displayed on the user front end actually come from?

Indispensable Data Retrieval Services

Suppose the reader is building a lending protocol that needs to display each user's positions along with their margin and debt status. A straightforward idea is for the front end to read this data directly from the chain. In practice, however, the protocol's contracts typically do not expose a way to query positions by user address; they only provide functions that look up data by position ID. To display a user's positions on the front end, we would therefore have to enumerate every position in the system and filter out the ones belonging to the current user. This is akin to asking someone to manually search through millions of pages of ledgers for specific entries—technically feasible but extremely slow. In practice it is very difficult for a front end to complete this retrieval at all; even large DeFi projects that run the scan on their own servers against local nodes often need hours.
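To make the problem concrete, here is a minimal sketch of the enumerate-and-filter approach described above. All names are hypothetical (this is not the API of any real lending protocol), and on-chain amounts are simplified to plain numbers; in reality each `getPositionById` lookup would be a separate `eth_call`, which is why the scan becomes prohibitively slow.

```typescript
// Sketch: finding one user's positions without an index is O(n) over ALL positions.
// Names and types are illustrative, not a real protocol's interface.

interface Position {
  id: number;
  owner: string;
  collateral: number; // amounts simplified to numbers for this sketch
  debt: number;
}

// Simulated contract view function: look up a single position by its ID.
function getPositionById(positions: Position[], id: number): Position | undefined {
  return positions.find((p) => p.id === id);
}

// Without an indexer, the front end must scan every position ID
// and keep only those owned by the current user.
function getPositionsForUser(
  positions: Position[],
  totalPositions: number,
  user: string,
): Position[] {
  const result: Position[] = [];
  for (let id = 0; id < totalPositions; id++) {
    const p = getPositionById(positions, id); // in reality: one RPC call per ID
    if (p && p.owner === user) result.push(p);
  }
  return result;
}
```

An indexing service inverts this cost: it scans the chain once, stores positions keyed by owner in a database, and answers the same question with a single lookup.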

Thus, we must introduce infrastructure to accelerate data acquisition. Companies like Goldsky provide these data indexing services to users. The following image illustrates the types of data that data indexing services can provide for applications.

At this point, some readers may wonder about the decentralized data retrieval platform The Graph that seems to exist within the Ethereum ecosystem. What is its relationship with Goldsky, and why do many DeFi projects use Goldsky as a data provider instead of the more decentralized The Graph?

The Relationship Between The Graph, Goldsky, and SubGraph

To answer the above questions, we need to understand some technical concepts first.

  1. SubGraph is a development framework that allows developers to write code to read and aggregate on-chain data and expose it for front-end consumption.

  2. The Graph is an early decentralized data retrieval platform and the creator of the SubGraph framework, whose mappings are written in AssemblyScript. Developers use the framework to write programs that capture contract events and write them into a database, after which the data can be queried via GraphQL or, in some setups, read directly from the database with SQL.

  3. We generally refer to service providers running SubGraph as SubGraph operators. The Graph and Goldsky are actually both SubGraph hosts. Since SubGraph is just a development framework, the programs developed using it need to run on servers. We can see the following content in Goldsky's documentation:

Some readers may wonder why there are multiple SubGraph operators. The reason is that the SubGraph framework only specifies how data is read from blocks and written to a database; it does not define how data flows into the SubGraph program or where the final results are stored, which each operator must implement itself. Operators generally modify their nodes to achieve faster indexing, and different operators (such as The Graph and Goldsky) have different strategies and technical solutions.

The Graph currently uses the Firehose technical solution, which allows it to retrieve data much faster than before, while Goldsky has not open-sourced the core program it uses to run SubGraphs.

As mentioned above, The Graph is a decentralized data retrieval platform. Taking the Uniswap v3 subgraph as an example, we can see that many operators provide data retrieval for it. We can therefore also view The Graph as an aggregation platform for SubGraph operators: users publish their SubGraph code to The Graph, and operators within the network index the data on their behalf.
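To make the query side concrete, the sketch below shows how a front end would talk to a published subgraph: a plain HTTP POST whose JSON body carries a GraphQL query. The field names loosely follow the public Uniswap v3 subgraph schema, and the gateway URL with its `<API_KEY>` and `<SUBGRAPH_ID>` placeholders is illustrative only; check the actual subgraph schema before relying on any field.

```typescript
// A GraphQL query string; field names follow the Uniswap v3 subgraph
// schema but should be verified against the live schema before use.
const query = `{
  pools(first: 5, orderBy: totalValueLockedUSD, orderDirection: desc) {
    id
    token0 { symbol }
    token1 { symbol }
    totalValueLockedUSD
  }
}`;

// Build the HTTP request: subgraph queries are just POSTs with { query } as JSON.
function buildGraphQLRequest(endpoint: string, gql: string) {
  return {
    url: endpoint,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: gql }),
  };
}

// Placeholder gateway URL; a real call would be fetch(req.url, req).then(r => r.json()).
const req = buildGraphQLRequest(
  "https://gateway.thegraph.com/api/<API_KEY>/subgraphs/id/<SUBGRAPH_ID>",
  query,
);
```

The same request shape works whether the subgraph is hosted by The Graph's decentralized network or by a centralized host like Goldsky; only the endpoint and authentication differ.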

Goldsky's Pricing Model

For centralized platforms like Goldsky, there is a simple pricing standard based on resource usage, which is the most common billing method for SaaS platforms on the internet, and most technical personnel are very familiar with this method. The following image shows Goldsky's price calculator:

The Graph's Pricing Model

The Graph has a completely different fee structure from conventional billing methods, which is related to the token economics of GRT. The following image shows the overall token economics of GRT:

  1. Whenever a DApp or wallet makes a request to a Subgraph, the Query Fee paid is automatically split: 1% is burned, about 10% flows into the curation pool of that Subgraph (Curator/developer), and the remaining ≈ 89% is distributed to the Indexer and its Delegator providing computational power through an exponential rebate mechanism.

  2. Indexers must first self-stake ≥ 100k GRT to go live; if they return erroneous data, they will be penalized (slashing). Delegators delegate GRT to Indexers and proportionally share the majority of the aforementioned 89%.

  3. Curators (usually the developers themselves) stake GRT on their Subgraph's bonding curve via Signal; the higher the Signal, the more indexer resources are attracted. Community experience suggests that self-funding 5k–10k GRT is enough to get several Indexers to take on the job. Meanwhile, curators also receive the 10% royalty mentioned above.
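The headline percentages above can be sanity-checked with simple arithmetic. The sketch below only reproduces the 1% / ~10% / ~89% split described in this article; the actual on-chain mechanism (exponential rebates shared between Indexers and Delegators) is more involved.

```typescript
// Arithmetic sketch of the query-fee split described above:
// 1% burned, ~10% to the subgraph's curation pool, the remaining
// ~89% to the Indexer and its Delegators.
function splitQueryFee(feeGRT: number) {
  const burned = feeGRT * 0.01;
  const curation = feeGRT * 0.1;
  const indexersAndDelegators = feeGRT - burned - curation; // ≈ 89%
  return { burned, curation, indexersAndDelegators };
}

// Example: a 100 GRT query fee → 1 burned, 10 to curators, 89 to indexers/delegators.
const split = splitQueryFee(100);
```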

The Graph's Per-Request Query Fee:

Developers register an API key in The Graph's dashboard and use it to query data indexed by operators within The Graph. These requests are billed per query, and developers must pre-fund their account with GRT tokens to cover the API costs.

The Graph's Signal Staking Fee:

Subgraph deployers need operators within The Graph platform to index their data. Under the revenue-distribution scheme described above, they must signal to other participants that indexing their Subgraph is worthwhile, which requires staking GRT—similar to advertising, and to underwriting the expected revenue, in order to attract operators.

During testing, developers can deploy Subgraphs to The Graph platform for free; The Graph then assists with some indexing, providing a free quota for testing that cannot be used in production. If a Subgraph runs well in The Graph's official testing environment, developers can publish it to the public network for operators to index. Developers cannot directly pay a specific operator for guaranteed indexing; instead, multiple operators compete to provide the service, which avoids single-point dependencies. This process requires using GRT tokens to curate one's own Subgraph (also called a Signal operation): the developer stakes a certain amount of GRT on the deployed Subgraph, but operators will only join the indexing work once the staked GRT reaches a certain threshold (previously consulted data indicated 10,000 GRT).

A Poor Billing Experience That Confounds Developers and Traditional Accountants

For most project developers, using The Graph is a relatively troublesome task. Purchasing GRT tokens is easy for a Web3 project, but curating a deployed Subgraph and waiting for operators is quite inefficient. This stage has at least the following two issues:

  1. Uncertainty about how much GRT to stake and how long it takes to attract operators. When deploying Subgraphs in the past, the author determined the staking amount by consulting The Graph's community ambassadors directly, but for most developers this information is not easily obtainable. Even after staking sufficient GRT, it still takes time for operators to pick up the indexing work.

  2. The complexity of cost accounting and bookkeeping. Since The Graph uses a token economics mechanism to design its charging standards, this complicates cost calculations for most developers. A more practical issue is that if a company needs to account for this expenditure, accountants may not understand the composition of these costs.

"Is Centralization Better?"

Clearly, for most developers, directly choosing Goldsky is a simpler option. The billing method is understandable to everyone, and as long as payment is made, it can almost be used immediately, significantly reducing uncertainty. This has led to a situation where blockchain data indexing and retrieval services rely on a single product.

Evidently, The Graph's complex GRT token economics has hindered its widespread adoption. Token economics can be complex, but that complexity should not be exposed to users: the GRT curation-staking mechanism, for example, should be hidden behind a simplified payment page rather than surfaced to every developer.

The above criticism of The Graph is not just my personal opinion; well-known smart contract engineer and Sablier founder Paul Razvan Berg expressed the same view in a tweet, noting that the user experience of publishing Subgraphs and of GRT billing is extremely poor.

3. Some Existing Solutions

Regarding the single point of failure in data retrieval, one option has already been mentioned above: developers can use The Graph, although the process is more complex, requiring them to buy GRT for curation staking and to pay API fees.

There are currently many data retrieval tools in the EVM ecosystem. For specifics, see Dune's The State of EVM Indexing or rindexer's EVM Data Retrieval Software Summary. Another recent discussion can be found in this tweet.

This article will not discuss the specific cause of Goldsky's failure: according to Goldsky's report, the company knows the cause but is only prepared to disclose it to enterprise customers, which means no third party currently knows what kind of failure occurred. From the report's content, one can infer that the problem may have arisen when writing indexed data into the database; in the brief report, Goldsky mentioned that the database became inaccessible and that access was restored only after working with AWS.

In this section, we mainly introduce other solutions:

  1. Ponder is a simple, developer-friendly, easy-to-deploy indexing framework that developers can deploy themselves on rented servers.

  2. Local-first is an interesting development philosophy that calls for software to offer a good user experience even without a network. With a blockchain present, we can relax the strict local-first requirements somewhat and instead ensure that users get a good experience whenever they can connect to the chain.

Ponder

Why do I recommend using Ponder instead of other software? The specific reasons include the following points:

  1. Ponder has no vendor dependency. Ponder began as a project built by an individual developer, and unlike enterprise-provided indexing software, it only requires users to supply an Ethereum RPC URL and a PostgreSQL connection string.

  2. Ponder provides a good development experience. I have used Ponder for development multiple times in the past. Since Ponder is written in TypeScript and the core library mainly relies on viem, the development experience is excellent.

  3. Ponder has higher performance.

Of course, there are some caveats: Ponder is still in a phase of rapid development, and developers may find that older projects no longer run due to breaking changes between versions. Since this article is not a technical tutorial, I will not go further into Ponder's development details; readers with a technical background can consult the documentation.
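For orientation only, here is a rough sketch of what the "no vendor dependency" point looks like in a Ponder configuration: the only external inputs are an RPC URL and a database. The module name and option shape follow older `@ponder/core` releases and may differ in current versions; the contract name, address, ABI, and environment variable are all placeholders.

```typescript
// Illustrative ponder.config.ts sketch — treat as a shape, not an API reference;
// Ponder evolves quickly and option names may have changed.
import { createConfig } from "@ponder/core";
import { http } from "viem";

export default createConfig({
  networks: {
    mainnet: {
      chainId: 1,
      transport: http(process.env.PONDER_RPC_URL_1), // your own RPC endpoint
    },
  },
  contracts: {
    // "MyContract" and its address are placeholders for your own deployment.
    MyContract: {
      network: "mainnet",
      abi: [], // import your contract's ABI here
      address: "0x0000000000000000000000000000000000000000",
      startBlock: 0, // block from which indexing begins
    },
  },
});
```

Indexing logic then lives in ordinary TypeScript event handlers that write to the configured PostgreSQL database, which is why no third-party hosting vendor is required.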

A more interesting detail about Ponder is that it has also begun some commercialization, but Ponder's commercialization approach aligns very well with the "Isolation Theory" discussed in the previous article.

Here, we briefly introduce the "Isolation Theory." The public nature of public goods allows them to serve any number of users, so charging for a public good causes some users to stop using it, meaning social welfare is not maximized (in economic terms, the outcome is no longer Pareto optimal). In theory, a public good could charge each individual a differentiated price, but the cost of implementing such pricing would likely exceed the surplus it generates. Public goods are therefore offered for free not because they should be inherently free, but because any fixed fee harms social welfare and there is currently no cheap way to price-discriminate across everyone. The Isolation Theory proposes a way to price within a public good: isolate a relatively homogeneous subgroup and charge only that group. It does not stop everyone else from enjoying the public good for free; it simply identifies a group that can be charged.

Ponder employs a method similar to the Isolation Theory:

  1. First, deploying Ponder still requires certain knowledge, as developers need to provide external dependencies such as RPC and databases during the deployment process.

  2. At the same time, after deployment, developers need to continuously maintain the Ponder application, such as using a proxy system for load balancing to prevent data requests from affecting Ponder's retrieval of on-chain data in the background threads. These aspects can be somewhat complex for general developers.

  3. Currently, Ponder has an automated deployment service in internal testing called Marble, where users only need to submit their code to the platform for automatic deployment.

Clearly, this is an application of the "Isolation Theory," where developers who are unwilling to maintain Ponder services themselves are isolated. These developers can pay to obtain simplified deployment of Ponder services. Of course, the emergence of the Marble platform does not affect other developers' ability to use the Ponder framework for free and self-host their deployments.

Ponder and Goldsky's Audience?

  1. Ponder, as a public good with no vendor dependency, is more popular than other vendor-dependent data retrieval services when developing small projects.

  2. Some developers operating large projects may not necessarily choose the Ponder framework, as large projects often require retrieval services to have sufficient performance, and service providers like Goldsky often provide adequate availability guarantees.

Both have risk points. Given the recent Goldsky incident, developers are advised to maintain their own Ponder service as a fallback against third-party outages. When using Ponder, one may also need to consider the validity of the data returned by the RPC: not long ago, Safe reported an incident in which indexing failed and crashed because the RPC returned erroneous data. Although there is no direct evidence linking the Goldsky incident to invalid RPC responses, I suspect Goldsky may have hit a similar issue.

Local-First Development Concept

Local-first has been a topic of discussion over the past few years. In simple terms, Local-first requires software to have the following functionalities:

  1. Offline operation

  2. Cross-client collaboration

Most current discussions of local-first technology involve CRDTs (Conflict-free Replicated Data Types). A CRDT is a data structure whose concurrent updates can always be merged automatically, so users operating across multiple devices keep consistent data without coordination. A simple way to think of a CRDT is as a data type that carries its own lightweight merge rule, guaranteeing integrity and consistency in distributed scenarios.
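The simplest CRDT makes the "automatic merge" idea concrete: a grow-only counter (G-Counter), where each replica increments only its own slot and merging takes the element-wise maximum. This is a textbook example, not tied to any specific library.

```typescript
// G-Counter: a grow-only counter CRDT.
// Each replica increments its own slot; merge takes the element-wise
// maximum, so concurrent updates from different clients never conflict.

type GCounter = Record<string, number>;

function increment(counter: GCounter, replicaId: string): GCounter {
  return { ...counter, [replicaId]: (counter[replicaId] ?? 0) + 1 };
}

// Merge is commutative, associative, and idempotent — the CRDT properties
// that make conflict-free replication possible.
function merge(a: GCounter, b: GCounter): GCounter {
  const out: GCounter = { ...a };
  for (const [id, n] of Object.entries(b)) {
    out[id] = Math.max(out[id] ?? 0, n);
  }
  return out;
}

// The counter's value is the sum over all replicas' slots.
function value(counter: GCounter): number {
  return Object.values(counter).reduce((sum, n) => sum + n, 0);
}
```

Two devices can increment independently while offline; merging their states in any order always yields the same total.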

However, in blockchain development, we can relax the aforementioned requirements for software under the Local-first concept. We only require that users maintain a minimum level of usability on the front end even when there is no backend indexing data provided by the project developers. At the same time, the requirement for cross-client collaboration in local-first has already been addressed by blockchain technology.

In the context of DApps, the local-first concept can be implemented as follows:

  1. Cache key data: The front end should cache important user data, such as balances and holdings, so that even if the indexing service is unavailable, users can still see the last known state.

  2. Degraded functionality design: When the backend indexing service is unavailable, the DApp can provide basic functionality. For example, when the data retrieval service is down, some data can be directly read from the blockchain using RPC, ensuring that users see the latest status of the available data.
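The two practices above can be combined into one resilient read path: try the indexer, fall back to a direct RPC read, and finally fall back to the local cache. The sketch below is an assumed pattern, not any project's actual code; `fetchFromIndexer` and `fetchFromRpc` are hypothetical stand-ins injected by the caller, and the cache is a plain in-memory map standing in for browser storage.

```typescript
// Degraded-functionality sketch: indexer → direct RPC → cached last-known state.
type Fetcher = () => Promise<string>;

async function getBalanceResilient(
  fetchFromIndexer: Fetcher, // hypothetical indexing-service call
  fetchFromRpc: Fetcher,     // hypothetical direct on-chain read (e.g. eth_call)
  cache: Map<string, string>,
  cacheKey: string,
): Promise<{ balance: string; source: "indexer" | "rpc" | "cache" }> {
  try {
    const balance = await fetchFromIndexer();
    cache.set(cacheKey, balance); // refresh the local cache on success
    return { balance, source: "indexer" };
  } catch {
    try {
      const balance = await fetchFromRpc(); // slower, but needs no indexer
      cache.set(cacheKey, balance);
      return { balance, source: "rpc" };
    } catch {
      const cached = cache.get(cacheKey); // last-known state, possibly stale
      if (cached !== undefined) return { balance: cached, source: "cache" };
      throw new Error("no data source available");
    }
  }
}
```

With this shape, an outage like Goldsky's degrades the experience (slower or slightly stale data) instead of blanking the front end entirely.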

This local-first DApp design concept can significantly enhance the resilience of applications, preventing them from becoming unusable after a data retrieval service crash. Without considering ease of use, the best local-first application would require users to run nodes locally and then use tools like TrueBlocks to retrieve data locally. For discussions on decentralized retrieval or local retrieval, refer to the tweet Literally no one cares about decentralized frontends and indexers.

Conclusion

The six-hour outage incident of Goldsky has sounded the alarm for the ecosystem. Although blockchain itself has decentralized and anti-single-point-failure characteristics, the application ecosystem built on it still heavily relies on centralized infrastructure services. This dependency brings systemic risks to the entire ecosystem.

This article briefly explained why the well-known decentralized retrieval service The Graph is not widely used today, in particular the complexity introduced by GRT token economics. It then discussed how to build more robust data retrieval infrastructure, encouraging developers to adopt Ponder, a self-hostable indexing framework, as an emergency fallback, and noting Ponder's sound commercialization path. Finally, it discussed the local-first development concept, encouraging developers to build applications that can still function without data retrieval services.

Currently, many Web3 developers have realized the single point of failure issue in data retrieval services. GCC hopes more developers will pay attention to this infrastructure and attempt to build decentralized data retrieval services or design a framework that allows DApp front ends to operate even without data retrieval services.

