Tragedy of the Crypto Commons Series: The Tragedy of Polymarket’s Data Indexing

  • The article discusses the "Tragedy of the Crypto Commons" series, focusing on Polymarket and its data indexing tools in the Ethereum ecosystem, highlighting challenges like centralization and insufficient incentives in crypto public goods.
  • A six-hour outage of Goldsky, a centralized data indexing platform, exposed vulnerabilities in the decentralized ecosystem, disrupting services like Polymarket and DeFi front-ends.
  • Blockchain data indexing is a non-excludable, non-rivalrous public good, but centralized services like Goldsky dominate because decentralized alternatives like TheGraph lack sustainable profit models.
  • TheGraph, a decentralized indexing platform, faces adoption barriers due to complex GRT token economics, including staking, curation, and opaque pricing, making centralized options like Goldsky more appealing.
  • Alternative solutions include:
    • Ponder: a self-hosted, vendor-agnostic indexing tool with a developer-friendly experience, though it requires operational maintenance.
    • Local-first development: Encourages caching and fallback mechanisms to ensure DApp functionality even without indexing services.
  • The article advocates for decentralized or self-hosted indexing solutions and resilient DApp design to mitigate reliance on centralized infrastructure.

Author: shew

Welcome to the GCC Research column's "Tragedy of the Crypto Commons" series.

In this series, we focus on critical public goods in the crypto world that are increasingly failing. They are the fundamental infrastructure of the entire ecosystem, yet they often face insufficient incentives, unbalanced governance, and even creeping centralization. It is in these corners that the ideals of crypto technology are most severely tested against the realities of redundancy and stability.

This issue focuses on one of the most influential applications in the Ethereum ecosystem: Polymarket and its data indexing tooling. This year, Polymarket has repeatedly been at the center of public opinion, particularly around Trump's election victory, the oracle manipulation of the Ukraine rare-earth deal market, and political betting on the color of Zelenskyy's suit. The sheer scale of capital and market influence it carries makes these controversies particularly consequential.

But is data indexing, a key building block of this self-described "decentralized prediction market," truly decentralized? Why has public infrastructure like TheGraph failed to fulfill its intended role? What form should a truly usable and sustainable data indexing public good take?

1. The chain reaction caused by the downtime of a centralized data platform

In July 2024, Goldsky (a real-time blockchain data infrastructure platform for Web3 developers, providing indexing, subgraph, and streaming data services for quickly building data-driven decentralized applications) experienced a six-hour outage that paralyzed a large part of the Ethereum ecosystem. DeFi front-ends could not display users' position and balance data, and the prediction market Polymarket could not display correct data. To front-end users, countless projects appeared completely unusable.

This shouldn't happen in the world of decentralized applications. After all, wasn't blockchain technology designed to eliminate single points of failure? The Goldsky incident exposed a disturbing truth: while blockchains themselves are as decentralized as possible, the infrastructure stack of the applications built on them often includes a large number of centralized services.

The reason is that blockchain data indexing and retrieval are non-excludable, non-rivalrous digital public goods. Users expect them to be free or extremely cheap, yet the services require continuous, intensive investment in hardware, storage, bandwidth, and operations staff. In the absence of a sustainable profit model, a winner-takes-all centralized structure emerges: once a provider gains a first-mover advantage in speed and capital, developers tend to direct all query traffic to it, re-creating a single point of dependency. Public-goods projects like Gitcoin have repeatedly emphasized that "open source infrastructure can create billions of dollars in value, but its authors often can't rely on it to pay their mortgages."

This serves as a stark reminder that the decentralized world urgently needs to diversify Web3 infrastructure through public goods funding, redistribution, or community-driven initiatives. Otherwise, centralization will inevitably arise. We urge DApp developers to build local-first products and the technical community to consider data retrieval service failures when designing DApps, ensuring that users can still interact with projects even without data retrieval infrastructure.

2. Where does the data you see in a DApp come from?

To understand why incidents like the Goldsky outage occur, we need to look at how DApps work behind the scenes. To the average user, a DApp consists of just two parts: the on-chain contracts and the front-end. Most users are accustomed to tracking on-chain transaction status with tools like Etherscan, getting the information they need from the front-end, and initiating transactions to interact with contracts. But where does the data displayed on the front-end actually come from?

Indispensable data retrieval service

Suppose you are building a lending protocol that needs to display a user's holdings, along with the margin and debt status of each position. A naive idea is for the front-end to read this data directly from the chain. In practice, however, the lending contract does not let you query positions by user address: it only provides a function that returns a position's data given its position ID. So to display a user's positions on the front-end, we would have to enumerate every position in the system and filter out those belonging to the current user, as the sketch below illustrates. This is like asking someone to manually search millions of ledger pages for specific entries: technically feasible, but extremely slow. In practice the front-end simply cannot carry out this retrieval. Even when large DeFi projects run the retrieval on their own servers against local nodes, it can take hours.
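To make the cost concrete, here is a minimal sketch of the naive approach using viem. The lending-pool ABI, address, and function names are hypothetical stand-ins, not any real protocol's interface.

```typescript
import { createPublicClient, http, parseAbi } from "viem";
import { mainnet } from "viem/chains";

const client = createPublicClient({ chain: mainnet, transport: http() });

// Hypothetical lending-pool interface: positions are only queryable by ID.
const lendingPoolAbi = parseAbi([
  "function positionCount() view returns (uint256)",
  "function getPosition(uint256 id) view returns (address owner, uint256 collateral, uint256 debt)",
]);
const LENDING_POOL = "0x0000000000000000000000000000000000000000"; // placeholder address

// Enumerate every position and keep the ones owned by `user`.
// With millions of positions this means millions of eth_call round-trips,
// which is exactly why a front-end cannot do this directly.
async function positionsOf(user: `0x${string}`) {
  const count = await client.readContract({
    address: LENDING_POOL,
    abi: lendingPoolAbi,
    functionName: "positionCount",
  });
  const mine: { id: bigint; collateral: bigint; debt: bigint }[] = [];
  for (let id = 0n; id < count; id++) {
    const [owner, collateral, debt] = await client.readContract({
      address: LENDING_POOL,
      abi: lendingPoolAbi,
      functionName: "getPosition",
      args: [id],
    });
    if (owner.toLowerCase() === user.toLowerCase()) mine.push({ id, collateral, debt });
  }
  return mine;
}
```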

Therefore, we must introduce infrastructure to accelerate data acquisition. Companies like Goldsky provide these data indexing services. The following diagram illustrates the types of data that indexing services can provide to applications.

At this point, some readers may be curious about the existence of a decentralized data retrieval platform called TheGraph within the Ethereum ecosystem. What is the connection between this platform and Goldsky? And why do a large number of DeFi projects use Goldsky as their data provider instead of the more decentralized TheGraph?

The relationship of TheGraph and Goldsky to SubGraph

To answer the above questions, we need to understand some technical concepts first.

  1. SubGraph is a development framework: developers use it to write code that reads and aggregates on-chain data, and to expose that data for the front-end to query and display.
  2. TheGraph is an early decentralized data retrieval platform; it developed the SubGraph framework, which is written in AssemblyScript. Developers use the framework to write programs that capture contract events and write them to a database; users can then read the data via GraphQL or query the database directly with SQL. (A minimal handler sketch follows this list.)
  3. We generally call service providers that run SubGraphs "SubGraph operators." TheGraph and Goldsky are both SubGraph hosts: because SubGraph is a development framework, applications built with it must run on a server, and the Goldsky documentation makes this hosting model explicit.
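To make the framework concrete, here is a minimal sketch of a SubGraph event handler. It is written in AssemblyScript, the TypeScript dialect the framework uses; the `Transfer` event and `TransferRecord` entity are hypothetical, standing in for the types the Graph CLI generates from a contract ABI and a schema.graphql.

```typescript
// src/mapping.ts (AssemblyScript). The imports below come from code the
// Graph CLI generates; the Token contract and TransferRecord entity are
// placeholders for this sketch.
import { Transfer } from "../generated/Token/Token";
import { TransferRecord } from "../generated/schema";

export function handleTransfer(event: Transfer): void {
  // Transaction hash + log index uniquely identifies one emitted event.
  let id = event.transaction.hash.toHex() + "-" + event.logIndex.toString();
  let record = new TransferRecord(id);
  record.from = event.params.from;
  record.to = event.params.to;
  record.value = event.params.value;
  record.blockNumber = event.block.number;
  record.save(); // the operator persists this entity to its database
}
```

The framework defines only this handler layer; how blocks are fed into it, and which database `save()` ultimately writes to, is up to each operator, as explained below.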

Some readers may wonder: why are there multiple SubGraph operators at all?

This is because the SubGraph framework only specifies how data within a block is read and how results are written to the database.

It says nothing about how data flows into the SubGraph program or what kind of database the final output is written to. These parts must be implemented by each SubGraph operator.

Generally speaking, operators modify their nodes to achieve faster speeds, and different operators (such as TheGraph and Goldsky) adopt different strategies and technical solutions.

TheGraph currently uses the Firehose technology stack, which gives it faster data retrieval than before. Goldsky, by contrast, has not publicly disclosed the core program that runs its SubGraphs.

As mentioned above, TheGraph is a decentralized data retrieval platform. Taking the Uniswap v3 subgraph as an example, a large number of operators provide data retrieval for it. We can therefore also view TheGraph as an aggregation platform for SubGraph operators: users submit their SubGraph code to TheGraph, and operators within the network retrieve data on their behalf.

Goldsky's Pricing Model

Goldsky, as a centralized platform, has a simple billing system based on resource usage. This is the most common SaaS billing model on the internet, and one most technical people are familiar with. The figure below shows Goldsky's price calculator:

TheGraph’s Pricing Model

TheGraph's fee structure is completely different from conventional billing and is tied to the token economics of GRT. The following figure shows GRT's overall token economics:

  1. Whenever a DApp or wallet queries a Subgraph, the query fee is automatically split: 1% is burned, about 10% flows into the Subgraph's curation pool (for Curators/developers), and the remaining ≈ 89% is paid, via the exponential rebate mechanism, to the Indexers providing the computation and their Delegators. (A worked example of this split follows the list.)
  2. Indexers must stake ≥ 100k GRT before going online; returning incorrect data will result in slashing. Delegators delegate GRT to Indexers and receive a proportional share of the aforementioned 89%.
  3. Curators (typically the developers themselves) use Signal to stake GRT on their own Subgraph's bonding curve. A higher Signal attracts more Indexers to allocate resources; community experience suggests staking 5,000–10,000 GRT to secure service from several Indexers. Curators also receive the roughly 10% curation share of query fees mentioned above.
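As a worked example of the split described in point 1, here is a small sketch using the approximate percentages quoted above:

```typescript
// Approximate GRT query-fee split as described above:
// 1% burned, ~10% to the Subgraph's curation pool, and the
// remaining ~89% to Indexers and their Delegators.
function splitQueryFee(feeGrt: number) {
  const burned = feeGrt * 0.01;
  const curators = feeGrt * 0.1;
  const indexersAndDelegators = feeGrt - burned - curators; // ≈ 89%
  return { burned, curators, indexersAndDelegators };
}

// For 1,000 GRT of query fees:
// { burned: 10, curators: 100, indexersAndDelegators: 890 }
console.log(splitQueryFee(1000));
```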

TheGraph's per-query fees:

Developers register an API key in TheGraph's dashboard and use it to request data retrieved by the operators in TheGraph. Requests are billed per query, and developers must deposit GRT on the platform in advance to cover the cost of API requests.
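In code, the flow looks roughly like the sketch below. The API key and subgraph ID are placeholders, and the exact gateway endpoint format may differ between versions, so check the current TheGraph documentation.

```typescript
// Querying a subgraph through TheGraph's gateway with a prepaid API key.
// API_KEY and SUBGRAPH_ID are placeholders; the queried entity reuses the
// hypothetical TransferRecord schema from the earlier sketch.
const API_KEY = "<your-api-key>";
const SUBGRAPH_ID = "<subgraph-id>";
const url = `https://gateway.thegraph.com/api/${API_KEY}/subgraphs/id/${SUBGRAPH_ID}`;

async function queryTransfers() {
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // Standard GraphQL-over-HTTP request body.
    body: JSON.stringify({
      query: "{ transferRecords(first: 5) { id from to value } }",
    }),
  });
  const { data } = await res.json();
  return data; // each query is metered against the deposited GRT
}
```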

TheGraph’s Signal staking fees:

SubGraph deployers need operators within the TheGraph platform to retrieve data for them. Under the profit-distribution scheme described above, a deployer must convince other participants that indexing their SubGraph will pay. They do this by staking GRT on it, which acts both as advertising and as a guarantee that there will be query revenue, so that operators will come.

During testing, developers can deploy a SubGraph to the TheGraph platform for free; TheGraph then assists with some queries, providing a free quota for testing purposes that is not suitable for production. If a developer believes the SubGraph performs well in TheGraph's official testing environment, they can publish it to the public network and wait for operators to start indexing it. Developers cannot directly pay a single operator for guaranteed access; instead, multiple operators compete to provide the service, which avoids a single point of dependency. This process requires curating (also called signaling on) the SubGraph with GRT tokens: the developer stakes a certain amount of GRT into the deployed SubGraph, and operators only begin indexing it once the staked GRT reaches a certain level (the figure I was previously quoted is 10,000 GRT).

Poor billing experience stumps developers and traditional accountants

For most project developers, using TheGraph is a relatively cumbersome process. While purchasing GRT is easy enough for Web3 projects, curating a deployed SubGraph and waiting for operators is quite inefficient. The process presents at least two problems:

  1. The amount of GRT to stake, and the time needed to attract operators, are both uncertain. When I deployed a SubGraph in the past, I determined the staking amount by asking a TheGraph community ambassador directly; for most developers this information is not easy to obtain. And even after staking enough GRT, it takes time for operators to step in and begin indexing.
  2. Cost calculation and accounting are complex. Because TheGraph's fee structure is built on token economics, estimating costs is hard for most developers. More practically, if a business had to book this expense, its accountant might not understand the cost structure.

“Easy to use, or decentralized?”

Obviously, for most developers, it is simpler to choose Goldsky directly. Its billing model is one everyone understands, the service is usable almost immediately after payment, and the uncertainty is greatly reduced. This is how blockchain data indexing and retrieval came to rely on a single product.

Clearly, TheGraph's complex GRT token economics have hindered widespread adoption. However complex the token economics may need to be internally, that complexity shouldn't be exposed to users: GRT's curation and staking mechanisms, for example, could be hidden behind a simple payment page.

This criticism of TheGraph is not mine alone. Paul Razvan Berg, a well-known smart contract engineer and founder of the Sablier project, expressed the same view in a tweet, saying that the experience of launching a SubGraph and of GRT billing was extremely poor.

3. Some existing solutions

One answer to the single point of failure in data retrieval has already been mentioned above: developers can use TheGraph, though the process is more complicated, since they need to buy GRT for curation staking and for API fees.

There is plenty of data retrieval software in the EVM ecosystem. For details, see The State of EVM Indexing written by Dune, or the summary of EVM data retrieval software written by rindexer. For a more recent discussion, see this tweet.

This article will not discuss the specific cause of Goldsky's outage. Goldsky knows the cause but is disclosing it only to enterprise users, which means no third party can currently determine the exact nature of the failure. From the brief public report it can be inferred that there may have been a problem writing retrieved data to the database: Goldsky mentioned that the database became inaccessible and that access was restored only after working with AWS.

In this section, we mainly introduce other solutions:

  1. Ponder is a simple data retrieval framework with a good development experience that is easy to deploy. Developers can rent servers and deploy it themselves.
  2. Local-first is an interesting development philosophy that encourages developers to provide a good user experience even when the network is unavailable. With a blockchain present, we can relax the local-first requirements somewhat: it is enough to ensure users get a good experience whenever they can connect to the blockchain.

Ponder

Why do I recommend Ponder over other software? The reasons are as follows:

  1. Ponder has no vendor dependencies. Ponder began as an individual developer's project, so unlike the data retrieval software offered by companies, it only requires users to supply an Ethereum RPC URL and a Postgres database connection string.
  2. Ponder provides a good development experience. I have used Ponder for development many times; since it is written in TypeScript and its core library mainly relies on Viem, the development experience is very good. (A minimal sketch follows this list.)
  3. Ponder has higher performance.
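As an illustration of that developer experience, here is a minimal sketch of a Ponder project. The ABI module, contract address, and entity handling are placeholders, and Ponder's import paths and database API have changed across versions, so treat this as the shape of the code rather than a copy-paste recipe.

```typescript
// ponder.config.ts: the only external dependencies are an Ethereum RPC
// URL and a Postgres connection (via DATABASE_URL). TokenAbi and the
// address below are placeholders.
import { createConfig } from "ponder";
import { http } from "viem";
import { TokenAbi } from "./abis/TokenAbi";

export default createConfig({
  networks: {
    mainnet: { chainId: 1, transport: http(process.env.PONDER_RPC_URL_1) },
  },
  contracts: {
    Token: {
      network: "mainnet",
      abi: TokenAbi,
      address: "0x0000000000000000000000000000000000000000",
      startBlock: 0,
    },
  },
});
```

```typescript
// src/index.ts: Ponder replays historical logs through handlers like
// this one, then keeps following the chain head in real time.
import { ponder } from "ponder:registry";

ponder.on("Token:Transfer", async ({ event, context }) => {
  // Write the decoded event into the Ponder-managed Postgres store.
  // The exact context.db API differs between Ponder versions; consult
  // the docs for the version you are using.
  console.log(event.args.from, event.args.to, event.args.value);
});
```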

Of course, there are some issues. Ponder is still under rapid development, and developers may encounter situations where previous projects become inoperable due to breaking updates. Since this article isn't a technical introduction, we won't delve into the details of Ponder's development. Readers with a technical background are encouraged to consult the documentation.

A more interesting detail about Ponder is that it has also begun partial commercialization, but Ponder's commercialization path is very consistent with the "isolation theory" discussed in the previous article.

Here we briefly introduce the isolation theory. The public nature of public goods allows them to serve an unlimited number of users, so charging for a public good causes some users to stop using it, and social benefit is no longer maximized (in economic terms, the outcome is no longer Pareto optimal). In theory, a public good could be priced differently for every user, but the cost of such differential pricing would likely outweigh its benefits. Public goods are free, then, not because they are inherently free, but because any fixed fee harms social benefit and there is currently no cheap way to price each user individually. The isolation theory proposes a way to charge within a public good: isolate a relatively homogeneous group of users and charge only them. It does not stop everyone else from enjoying the good for free; it simply identifies a subset of people who can be charged.

Ponder uses a method similar to isolation theory:

  1. First, deploying Ponder still requires a certain amount of knowledge. Developers must provide external dependencies such as an RPC endpoint and a database during deployment.
  2. After deployment, developers must continuously operate and maintain the Ponder application, for example by putting a proxy in front of it for load balancing so that data requests do not interfere with Ponder's background on-chain retrieval. This is somewhat complicated for ordinary developers.
  3. Ponder's fully automatic deployment service, Marble, is currently in private beta. Users only need to hand their code to the platform to get an automatic deployment.

This is clearly an application of the isolation theory: developers who are unwilling to operate and maintain a Ponder service themselves are isolated, and they can obtain a simplified Ponder deployment service for a fee. Of course, the emergence of the Marble platform does not prevent other developers from using the Ponder framework for free and self-hosting their deployments.

Who are Ponder and Goldsky for?

  1. Ponder, a completely vendor-free public good, is more popular for small projects than vendor-dependent data retrieval services.
  2. Developers operating large projects do not necessarily choose the Ponder framework, because large projects often require retrieval services with sufficient performance, and providers such as Goldsky offer strong availability guarantees.

Both carry risks. As the recent Goldsky incident suggests, developers are advised to run their own Ponder services as a hedge against third-party outages. Furthermore, when using Ponder, the validity of RPC return data should be considered: Safe recently reported an indexer crash caused by incorrect RPC return data. While there is no direct evidence linking the Goldsky incident to invalid RPC data, I suspect Goldsky may have hit a similar issue. One simple defense is to cross-check critical responses against a second provider, as sketched below.
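The sketch below (provider URLs are placeholders) compares block hashes from two independent endpoints before trusting the data; this is my illustration of the idea, not something Goldsky or Safe is known to do.

```typescript
import { createPublicClient, http } from "viem";
import { mainnet } from "viem/chains";

// Two independent providers (placeholder URLs).
const primary = createPublicClient({ chain: mainnet, transport: http("https://rpc-a.example") });
const secondary = createPublicClient({ chain: mainnet, transport: http("https://rpc-b.example") });

// Fetch the same block from both providers and require matching hashes;
// on divergence, fail loudly instead of silently indexing bad data.
async function checkedBlock(blockNumber: bigint) {
  const [a, b] = await Promise.all([
    primary.getBlock({ blockNumber }),
    secondary.getBlock({ blockNumber }),
  ]);
  if (a.hash !== b.hash) {
    throw new Error(`RPC divergence at block ${blockNumber}: ${a.hash} vs ${b.hash}`);
  }
  return a;
}
```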

Local-first development philosophy

Local-first has been a hot topic for the past few years. Simply put, local-first requires software to have the following features:

  1. Working offline
  2. Cross-client collaboration

Most current discussions of local-first technology touch on CRDTs (Conflict-free Replicated Data Types). A CRDT is a data structure that lets users operating on multiple devices merge their changes automatically, without conflicts, while preserving data integrity. A simple way to think of a CRDT is as a data type carrying its own simple consensus rule: in a distributed environment, it guarantees that replicas converge to consistent data. (A tiny example follows.)
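As a tiny illustration (my example, not from the original article): a grow-only counter (G-Counter) is one of the simplest CRDTs. Each replica increments only its own slot, and merging takes the per-replica maximum, so replicas converge no matter the order in which updates arrive.

```typescript
// G-Counter: each replica increments only its own entry; merge takes the
// per-replica maximum. Merge is commutative, associative, and idempotent,
// so all replicas converge to the same value.
type GCounter = Record<string, number>;

function increment(c: GCounter, replicaId: string): GCounter {
  return { ...c, [replicaId]: (c[replicaId] ?? 0) + 1 };
}

function merge(a: GCounter, b: GCounter): GCounter {
  const out: GCounter = { ...a };
  for (const [id, n] of Object.entries(b)) {
    out[id] = Math.max(out[id] ?? 0, n);
  }
  return out;
}

function value(c: GCounter): number {
  return Object.values(c).reduce((sum, n) => sum + n, 0);
}

// Two devices count independently while offline, then sync:
const phone = increment({}, "phone");   // { phone: 1 }
const laptop = increment({}, "laptop"); // { laptop: 1 }
console.log(value(merge(phone, laptop))); // 2, in either merge order
```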

In blockchain development, however, we can relax these local-first requirements. We only ask that the front-end retain a minimum level of usability even without the backend indexing data provided by the project developer; the cross-client collaboration requirement is already addressed by the blockchain itself.

In the context of DApps, the local-first concept can be implemented as follows:

  1. Cache key data: the front-end should cache important user data, such as balances and holdings, so that even if the indexing service is unavailable, users can still see the last known state.
  2. Graceful degradation: when the backend indexing service is unavailable, the DApp can still provide basic functions. For example, some data can be read directly from the chain over RPC, ensuring users can see the latest state of data they already know about. (A sketch follows this list.)
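A minimal sketch of this degraded read path, assuming a hypothetical indexer endpoint and using viem for the direct-RPC fallback (the endpoint, cache keys, and 18-decimal assumption are all illustrative):

```typescript
import { createPublicClient, http, parseAbi, formatUnits } from "viem";
import { mainnet } from "viem/chains";

const client = createPublicClient({ chain: mainnet, transport: http() });
const erc20Abi = parseAbi(["function balanceOf(address) view returns (uint256)"]);

async function getBalance(token: `0x${string}`, user: `0x${string}`): Promise<string> {
  const cacheKey = `bal:${token}:${user}`;
  try {
    // Normal path: ask the indexing service (hypothetical endpoint).
    const res = await fetch(`https://indexer.example/balance/${token}/${user}`);
    if (!res.ok) throw new Error(`indexer returned ${res.status}`);
    const { balance } = await res.json();
    localStorage.setItem(cacheKey, balance); // cache the last known state
    return balance;
  } catch {
    try {
      // Degraded path: read the balance directly from the chain over RPC.
      const raw = await client.readContract({
        address: token, abi: erc20Abi, functionName: "balanceOf", args: [user],
      });
      return formatUnits(raw, 18); // assumes 18 decimals for brevity
    } catch {
      // Last resort: show the cached value rather than nothing.
      return localStorage.getItem(cacheKey) ?? "unavailable";
    }
  }
}
```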

This local-first design philosophy significantly improves a DApp's resilience, preventing total unavailability when a data retrieval service crashes. Setting usability aside, the most thorough local-first applications have users run a local node and retrieve data locally with tools like trueblocks. For more discussion of decentralized or local retrieval, see the post "Literally no one cares about decentralized frontends and indexers."

4. Final Thoughts

The six-hour Goldsky outage sounded a wake-up call for the ecosystem. While blockchains inherently offer decentralization and resistance to single points of failure, the application ecosystems built on them remain highly dependent on centralized infrastructure services. This reliance poses systemic risks to the entire ecosystem.

This article briefly explains why TheGraph, a long-standing decentralized retrieval service, is not widely used today, focusing on the complexity introduced by its GRT token economics. It then discusses how to build more robust data retrieval infrastructure: I encourage developers to adopt the self-hosted Ponder framework as an emergency fallback, and I outline Ponder's promising path to commercialization. Finally, it presents the local-first development philosophy, encouraging developers to build applications that still function without a data retrieval service.

Many Web3 developers are now aware of the single-point-of-failure problem in data retrieval services. GCC hopes that more developers will pay attention to this layer of infrastructure and try to build decentralized data retrieval services, or design frameworks that allow DApp front-ends to keep running without them.


Author: GCC Research
