IOSG: When inference becomes a scarce resource, who captures its value?

Author｜Frank Fu @ IOSG

The hole that David Cahn pointed out in 2023 was never filled on the training side. It was filled on the inference side, and the market only started to price this in over the last few weeks. With Nvidia restructuring its financial statements around "service tokens" and Cerebras' IPO being oversubscribed 20 times, the bottleneck debate is over. The real question has become: when inference becomes a scarce resource, where in the computing stack will value accrue?

Following the GPU: From a $200 billion problem to a $600 billion problem

In 2023, Sequoia's David Cahn raised the "$200 billion problem" hanging over the entire AI infrastructure build-out. For every $1 spent on a GPU, roughly another $1 is spent powering it in a data center, and whoever buys that capacity still needs a margin on top. On that basis, each year's GPU CapEx implies that these chips must ultimately generate approximately $200 billion in revenue to recoup the investment. Even with very generous assumptions about AI revenue, he still found a gap of over $125 billion between "investment" and "actual payments from end customers." The concern is clear: GPUs are being overbuilt ahead of actual demand.
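The arithmetic behind the $200 billion figure can be sketched as follows. This is a minimal illustration, not Cahn's own model: the $50 billion annual GPU spend and the 50% end-user margin are assumptions chosen because they reproduce the two figures quoted above ($200 billion required, $125 billion gap), and the $75 billion revenue assumption is simply the difference between them. None of these three inputs is stated in this article.

```python
def required_revenue(gpu_spend_bn: float,
                     datacenter_multiplier: float = 2.0,
                     end_user_margin: float = 0.5) -> float:
    """Revenue end customers must pay to justify a given annual GPU spend.

    datacenter_multiplier: $1 of GPU plus roughly $1 to run it (from the text).
    end_user_margin: assumed gross margin of whoever sells the compute onward.
    """
    total_datacenter_cost = gpu_spend_bn * datacenter_multiplier
    return total_datacenter_cost / (1.0 - end_user_margin)


GPU_SPEND_BN = 50.0           # assumption: annual GPU spend, in $ billions
ASSUMED_AI_REVENUE_BN = 75.0  # assumption: implied by 200 - 125

needed = required_revenue(GPU_SPEND_BN)
gap = needed - ASSUMED_AI_REVENUE_BN
print(f"required revenue: ${needed:.0f}B, gap: ${gap:.0f}B")
```

Because both multipliers are applied to the GPU spend itself, the required revenue scales linearly with CapEx, which is why the gap widens whenever spending grows faster than end-customer revenue.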

A year later, the gap had not narrowed but widened. In his 2024 sequel, Cahn redefined it as the "$600 billion problem" as hyperscalers' CapEx expanded. The bearish logic converged into a familiar shape: overbuilding leads to oversupply, and oversupply burns capital.

Both articles are essentially asking the same question: who will fill this hole? The answer has never appeared on the "training" side of the ledger. It appears on the inference side, and the market has only started to price it in over the last few weeks.

Cerebras IPO and Inference Squeeze

Cerebras went public on Thursday. The IPO was oversubscribed 20 times, with the price nearly double Wednesday's final bid. The demand didn't stem from bets on the "next Nvidia killer," but rather from something simpler: the market is beginning to realize that the real bottleneck in AI is inference, not training.

Cerebras's core strength lies in a chip architecture that enables extremely fast inference. Not training, but inference. This is precisely what excites Wall Street. The inference market is recurring, expanding with usage. Every time Claude answers a question, every time an agent performs a task, it consumes computing power. Training happens only once, but inference never stops.

JP Morgan estimates the inference market to be 10 to 50 times the size of the training market. When machines begin to execute tasks assigned by other machines (the agentic expansion), the demand for inference no longer scales with the number of users, but with the amount of compute itself.

Nvidia Redraws Its Map: Inference Makes Headlines

If Cerebras was a market awakening, then Nvidia's latest quarterly earnings report is confirmation from the top of the industry chain. On the latest earnings call, Jensen Huang made the unspoken message explicit: AI demand is growing parabolically. The reason is simple: agentic AI has arrived. Mainstream AI has moved from one-shot inference to reasoning, and is now entering the agent stage, where it can autonomously call tools and orchestrate tasks. Huang said, "Tokens are now profitable." In the AI era, computing power equals revenue and profit.

This has reshaped the entire industry. Training is a one-time cost of building a model, while inference is a recurring cost of running it. Now, the bottleneck is inference, not training.

Nvidia has built this assessment into its financial statements. It now reports across two platforms rather than one: Data Center and Edge Computing. Data Center (approximately $75 billion in the quarter, up 92% year-over-year) is further broken down into Hyperscale ($38 billion, up 12% quarter-over-quarter) and ACIE, namely AI Cloud, Industrial & Enterprise ($37 billion, up 31% quarter-over-quarter). The new line is Edge Computing: $6.4 billion, up 29% year-over-year, covering the endpoints where agentic and physical AI actually run, such as PCs, workstations, AI-RAN base stations, robots, and automobiles.

Edge computing currently accounts for less than 8% of total revenue, but Nvidia has elevated it to a "second platform" alongside data centers. This signals that inference is splitting into two fronts: cloud inference in data centers, and endpoint inference at the edge, which lets AI see, move, and act in the physical world. The roadmap follows the same logic: Vera Rubin, shipping from Q3, boasts inference throughput up to 35 times that of Blackwell; Huang has also set out a new $200 billion TAM for Vera CPUs designed for agentic workloads. Every frontier model company is expected to adopt it fully from day one.
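The "less than 8%" share can be checked directly from the segment figures quoted above. The sketch below assumes that the two platforms together make up total revenue, which the text implies but does not state outright; the variable names are illustrative.

```python
# Quarterly segment revenue in $ billions, as quoted in the text.
DATA_CENTER_BN = {"Hyperscale": 38.0, "ACIE": 37.0}
EDGE_BN = 6.4

data_center_total = sum(DATA_CENTER_BN.values())
# Assumption: Data Center + Edge Computing = total revenue.
total = data_center_total + EDGE_BN
edge_share = EDGE_BN / total

print(f"Data Center: ${data_center_total:.0f}B, Edge share: {edge_share:.1%}")
```

The two Data Center sub-segments sum to the roughly $75 billion quoted for the platform, and Edge comes out just under the 8% mark.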

The debate over bottlenecks was essentially settled when the world's most valuable company restructured its financial disclosures around "service tokens." The remainder of this article discusses who captures value when inference (rather than training) becomes the scarce resource.

First, let's clarify the scope. In this discussion, we focus on cloud inference, specifically rented data center GPUs that serve tokens through APIs. Endpoint inference runs on a chip inside the device itself (Nvidia's Jetson, RTX, Drive, AI-RAN) and completely bypasses the GPU leasing and aggregation stack. We treat it here as a tailwind that amplifies the overall inference economy and supports the bottleneck argument, not as the market in which Hyperbolic and Venice operate; both sit entirely on the cloud side.

The squeeze has arrived.

Anthropic is the canary in the coal mine. Its usage far exceeds its provisioned capacity, and complaints about Claude being "lobotomized" flood the internet: rate-limited responses, slower inference, and compressed context windows. The solution is raw computing power: in May 2026, Anthropic took over the entire Colossus 1 data center from SpaceX, with over 220,000 Nvidia GPUs and over 300 megawatts, dedicating it specifically to inference, not training.

This unlocking of capacity triggered a series of limit changes, each of them a signal. On May 6th, Anthropic doubled the five-hour limit for Claude Code, removed peak-hour rate limiting, and significantly increased the API rate limit for Opus. On May 13th, it further increased the weekly limit for Claude Code by 50% (until July 13th). Then, starting June 15th, it did the opposite of being "generous": it removed agentic and programmatic usage (Agent SDK, headless claude -p, CI pipelines) from the flat subscription and moved it into a separately metered credit pool ($20 to $200 per month, billed at API prices). This final step condenses the entire argument into one action: agents consume inference far faster than a flat subscription is designed to absorb, so their usage must be priced as what it is, a recurring cost.
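Why a flat subscription breaks under agentic usage can be shown with a small sketch. All numbers here are hypothetical placeholders (a $100 monthly fee, a serving cost of $5 per million tokens, and the two usage levels); they are not Anthropic's prices or costs and are only meant to show the shape of the problem.

```python
def flat_plan_margin(monthly_fee: float,
                     tokens_used_m: float,
                     cost_per_m_tokens: float) -> float:
    """Provider margin on a flat plan: fixed fee minus usage-driven serving cost."""
    return monthly_fee - tokens_used_m * cost_per_m_tokens


def breakeven_tokens_m(monthly_fee: float, cost_per_m_tokens: float) -> float:
    """Monthly usage (in millions of tokens) at which the flat plan earns nothing."""
    return monthly_fee / cost_per_m_tokens


FEE = 100.0          # hypothetical flat monthly fee, in dollars
COST_PER_M = 5.0     # hypothetical serving cost per million tokens, in dollars

human_margin = flat_plan_margin(FEE, tokens_used_m=10.0, cost_per_m_tokens=COST_PER_M)
agent_margin = flat_plan_margin(FEE, tokens_used_m=200.0, cost_per_m_tokens=COST_PER_M)
breakeven = breakeven_tokens_m(FEE, COST_PER_M)

print(f"human: {human_margin:+.0f}, agent: {agent_margin:+.0f}, "
      f"breakeven at {breakeven:.0f}M tokens")
```

Under these placeholder inputs a human-paced user stays well below the breakeven point while an always-on agent blows through it, which is the gap that metered billing closes.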

Training is a one-time capital expenditure. Inference, on the other hand, is a recurring operating cost that compounds with every new user and every new agent.

This stack has six layers and one bottleneck.

Every AI application sits on a supply chain that starts at the TSMC wafer fab and ends at the API endpoint:

Most companies own only one layer. Nvidia owns silicon, CoreWeave owns bare metal, Together AI owns inference optimization, and OpenRouter owns model API routing.

There is only one exception.

Hyperbolic: The only company spanning three layers

Hyperbolic launched its on-demand GPU marketplace in June 2025. Within its first few months, it had over 200,000 developers, with adoption spanning cutting-edge AI labs, search, and major consumer platforms.

What's interesting is its architecture.

Hyperbolic doesn't own a single GPU. Every card comes from neoclouds and data centers, including CoreWeave, Lambda Labs, Nebius, and smaller operators with idle capacity. This might sound like a weakness, but it's actually a moat.

By sitting between GPU suppliers and consumers, Hyperbolic can see real-time data that others cannot. It knows who is buying which GPUs at what price and when. It sees the oversupply before it becomes public knowledge and the surge in demand before it hits the market.

Today, the moat itself is this multi-cloud aggregation. Hyperbolic stitches together fragmented capacity from dozens of independent clouds and data centers into a standardized, unified pool, allowing developers to rent the cheapest available GPUs anywhere without negotiating with each operator or managing a pile of accounts. The more clouds it connects to, the deeper the liquidity and the richer the pricing data. Going forward, the team is exploring how to use this data to model GPU price curves and ultimately deploy its own capital to smooth supply and demand, acting as a market maker for physical computing power. That goal is still in its early stages; what actually compounds today is the aggregation layer.
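The core routing idea, picking the cheapest offer that can actually satisfy a request from a pooled list, can be sketched in a few lines. This is an illustration of the concept only and not Hyperbolic's implementation: the `Offer` type, the `cheapest_offer` function, the provider names, and all prices are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Offer:
    provider: str              # hypothetical provider label
    gpu_model: str
    price_per_gpu_hour: float  # hypothetical price, in dollars
    gpus_available: int


def cheapest_offer(offers: Sequence[Offer],
                   gpu_model: str,
                   gpus_needed: int) -> Optional[Offer]:
    """Return the lowest-priced offer with enough matching GPUs, or None."""
    candidates = [
        o for o in offers
        if o.gpu_model == gpu_model and o.gpus_available >= gpus_needed
    ]
    return min(candidates, key=lambda o: o.price_per_gpu_hour, default=None)


offers = [
    Offer("cloud-a", "H100", 2.40, 16),
    Offer("cloud-b", "H100", 1.95, 4),
    Offer("cloud-c", "H100", 2.10, 64),
]

best = cheapest_offer(offers, "H100", gpus_needed=8)
print(best.provider if best else "no capacity")
```

In this toy pool the lowest headline price loses the 8-GPU request because it lacks capacity, which is why an aggregator needs live availability data as well as prices.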

This is the flywheel:

Connect to more clouds → More aggregated supply
More supply → Deeper market and real-time pricing data
Better data → Smarter routing today and, in the long term, pricing models
Better liquidity and pricing → More developers → More clouds connected

No other company is trying this. Hyperbolic is the only company that spans the GPU leasing layer, deployment layer, and model API layer simultaneously.

Venice, the mirror

Venice is the clearest embodiment of the inference economy at the application layer and a useful contrast to Hyperbolic's position. It is a privacy-first inference application: an OpenAI-compatible API plus consumer subscriptions (Free/Pro/Pro+/Max) that routes requests to approximately 75 models. About two-thirds of these are open-source or self-hosted models (Llama, Mistral, Qwen, DeepSeek), with the remainder being anonymized pass-throughs to closed-source frontier models. Crucially, Venice itself doesn't own meaningful computing power. It rents from undisclosed GPU partners and confidential computing providers (NEAR AI Cloud, Phala) and pays frontier labs for pass-throughs, so its real cost of revenue is inference computing power, not SaaS hosting.

What Venice is really selling is privacy. This "privacy" doesn't mean turning public computing power into private property, but rather wrapping commoditized inference in a layer of protection: no data retention, no use for training, anonymized requests, and part of the workload even runs within a TEE (Trusted Execution Environment), making it invisible even to the operators themselves. The underlying computing power is readily available; what Venice adds is this privacy layer. Moreover, the protection is tiered rather than uniform: for open-source models running on GPUs that Venice controls or on TEE GPUs, near-end-to-end confidential computation is achieved; for anonymized pass-through to closed-source models like Claude and GPT, however, privacy merely strips away the identity, and the frontier labs still process your original prompt. Therefore, the strongest privacy only covers the open-source portion; the frontier model portion is "anonymous," not "truly confidential." Venice's gross profit = subscription price - inference costs paid to its upstream providers, and the extra revenue it earns over the bare API price rests almost entirely on this privacy premium. This is why its profit margin is low and constrained by the pricing of frontier pass-through.
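The gross profit relationship stated above can be written out as a short sketch. The $18 subscription price and $13.50 inference cost are hypothetical placeholders chosen for round numbers; they are not Venice's actual prices or costs.

```python
def gross_profit(subscription_price: float, inference_cost: float) -> float:
    """Gross profit per subscriber: price charged minus inference cost paid out."""
    return subscription_price - inference_cost


def gross_margin(subscription_price: float, inference_cost: float) -> float:
    """Gross profit as a fraction of the subscription price."""
    return gross_profit(subscription_price, inference_cost) / subscription_price


PRICE = 18.0   # hypothetical monthly subscription price, in dollars
COST = 13.5    # hypothetical monthly inference cost per subscriber, in dollars

profit = gross_profit(PRICE, COST)
margin = gross_margin(PRICE, COST)
print(f"gross profit: ${profit:.2f}, gross margin: {margin:.0%}")
```

Since the cost term is set by GPU partners and frontier labs rather than by Venice, any increase in pass-through pricing comes straight out of the margin unless the subscription price moves with it.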

The token design encapsulates this inference demand. Venice runs on two tokens: VVV (staking and platform access) and DIEM, the latter being an inference credit, with each DIEM roughly equivalent to $1 of computing power per day. Paid subscriptions trigger programmatic buybacks and burns of VVV (approximately $2/$5/$10 for Pro/Pro+/Max respectively), with emissions decreasing on a fixed schedule: 6M → 5M → 4M VVV per month, falling to 3M on July 1st. Buybacks are real, but discretionary and still relatively small: approximately $103,000 was burned in each of April and May, climbing slowly to approximately $110,000 in June, well below the $200,000 monthly threshold.
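To put the burn figures in proportion, the sketch below compares each month's burn with the $200,000 threshold and tracks the emission cuts, using only the numbers quoted above. The variable names are illustrative.

```python
# Monthly VVV emissions in millions, following the schedule in the text.
EMISSIONS_M_VVV = [6, 5, 4, 3]

# Approximate monthly buyback-and-burn in dollars, as quoted in the text.
MONTHLY_BURN_USD = {"April": 103_000, "May": 103_000, "June": 110_000}
THRESHOLD_USD = 200_000

# Size of each step down in emissions, in millions of VVV per month.
step_cuts = [a - b for a, b in zip(EMISSIONS_M_VVV, EMISSIONS_M_VVV[1:])]

# Total reduction from the first to the last step of the schedule.
total_cut = (EMISSIONS_M_VVV[0] - EMISSIONS_M_VVV[-1]) / EMISSIONS_M_VVV[0]

# Each month's burn as a fraction of the monthly threshold.
burn_vs_threshold = {
    month: usd / THRESHOLD_USD for month, usd in MONTHLY_BURN_USD.items()
}

for month, ratio in burn_vs_threshold.items():
    print(f"{month}: {ratio:.1%} of threshold")
print(f"emission cut over the schedule: {total_cut:.0%}")
```

On these figures the burn runs at a little over half the threshold, while scheduled emissions fall by half across the four steps.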

The fundamentals are more modest than the headline suggests, but they are real. The publicly circulated figure of "$70 million ARR" is almost certainly a misreading of subscription renewals as net new customer acquisition; a more defensible and observable range is closer to $6 million to $15 million ARR. Beneath that figure, the traction is real: approximately 136,000 cryptocurrency addresses, about 9.9 million website visits per month (about 330,000 per day), and new Pro subscriptions hovering around 1,400 per day. This is a real business, but a low-margin one, with its economics constrained by the computing power it purchases.

This is precisely why Hyperbolic sits a level above it. If Venice is a gas station, Hyperbolic is an oil refinery. Venice buys computing power from the same constrained supply that everyone relies on; Hyperbolic aggregates and standardizes that fragmented supply, then sells it to Venice and every player like it. As inference demand grows, value accrues not only to the applications that consume computing power, but also to the layer that aggregates and routes it and captures the cost of revenue these applications pay.

Why this matters now

Nvidia restructured its financial reporting around "service tokens." Cerebras' IPO proved the market understands that inference is the bottleneck. Anthropic's scramble for capacity shows the problem is real. Agentic and physical AI will amplify demand by orders of magnitude, across both cloud and edge computing.

This also closes the loop on the "$600 billion problem" from another angle. Cahn's bearish logic, overbuilding followed by oversupply, is likely to be validated in the end. But oversupply is precisely the ideal market for asset-light aggregators: when GPU prices decline and supply is fragmented across dozens of clouds, the player who owns no hardware and routes every workload to the cheapest available card profits from the price difference, while operators holding steadily depreciating GPUs bear the losses. Hyperbolic is going long on oversupply, not shorting it.

The company that ultimately wins will not be the one with the most GPUs, but the one that can tell you where the GPUs are located, at what price they are available, and route each workload to where it can run at the lowest cost.

Hyperbolic is building a company that owns no GPUs, operates purely in software, and spans three layers of the stack, with the aim of becoming the ultimate aggregation layer for inference computing power.