Inference Chokepoint 2026
In the last week of February 2026, $1.5 billion moved into Nvidia alternatives.
None of it went into a foundation model or a new lab. It went into inference. All of it went into the problem of serving tokens faster, cheaper, and with less silicon than Nvidia’s GPUs were designed to provide. MatX raised $500 million from Jane Street and Leopold Aschenbrenner’s Situational Awareness Capital. D-Matrix closed $275 million, backed by Microsoft’s venture fund M12. SambaNova announced a $350 million Series E alongside its SN50 chip. Taalas came out of stealth with over $200 million proposing something architecturally extreme: hardcoding model weights directly into transistors at the manufacturing stage. Axelera raised $250 million, the largest investment ever in a European AI semiconductor company, for low-power, edge-first inference.
Five companies raised that money in seven days. The training-era chokepoint has moved.
The training era had a simple logic. Whoever built the biggest model won. GPUs were the only hardware that could run the workloads. CUDA was the only software stack with two decades of optimizations, libraries, and institutional memory embedded in every major research organization on the planet. Switching off Nvidia meant rewriting everything your engineers built. Nobody did it. The moat was the switching cost, not the chip.
The frontier models exist now. GPT-5.4, Claude Opus 4.6, Gemini 3.1, DeepSeek V4. The race is to serve a billion people at a cost that doesn’t bankrupt you.
Training is batch-parallel, forgiving of latency, designed to saturate compute. Inference is sequential, user-facing, punishing of delay. The GPU was built for the first problem. Run single-stream inference on an H100 and you use approximately 0.17% of its theoretical compute capacity. Not 17%. Zero-point-seventeen. The chip sits waiting for data to arrive from off-chip memory. That transit is the bottleneck: High Bandwidth Memory is fast, but it lives off the die. More FLOPS don’t help. The constraint is the memory bus. As Vamshi and I write in Peak Inference: you bought a Ferrari. You’re stuck in school zone traffic.
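The 0.17% figure can be reproduced with a back-of-envelope roofline calculation. The numbers below are illustrative assumptions, not measurements: public H100 SXM spec-sheet values (HBM3 bandwidth, FP16 tensor-core peak with sparsity) and a hypothetical 70B-parameter model served at batch size 1.

```python
# Back-of-envelope roofline for batch-1 decode on an H100.
# Spec-sheet numbers below are assumptions; real kernels, batching,
# and KV-cache traffic all shift the result.
HBM_BANDWIDTH = 3.35e12   # bytes/s, H100 SXM HBM3
PEAK_FLOPS = 1.979e15     # FLOP/s, FP16 tensor-core peak (with sparsity)

params = 70e9             # hypothetical 70B-parameter model
bytes_per_param = 2       # FP16 weights

# Generating one token must stream every weight in from HBM once.
time_per_token = params * bytes_per_param / HBM_BANDWIDTH  # seconds
flops_per_token = 2 * params                               # ~2 FLOPs/param

achieved = flops_per_token / time_per_token  # FLOP/s actually sustained
utilization = achieved / PEAK_FLOPS

print(f"{time_per_token * 1e3:.0f} ms/token, "
      f"{achieved / 1e12:.2f} TFLOPS achieved, "
      f"{utilization:.2%} of peak")
```

The model weights have to cross the memory bus once per generated token, so achieved throughput is pinned at roughly bandwidth times arithmetic intensity, no matter how many FLOPS sit idle above it. Batching raises utilization, which is exactly why the batch-1, latency-sensitive case is the one the GPU handles worst.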
Purpose-built inference chips go after that bottleneck directly. Groq’s Language Processing Units put memory on-chip: no transit, deterministic dataflow, inference an order of magnitude faster. Cerebras’s wafer-scale chips put the entire model on a single piece of silicon the size of a dinner plate, eliminating chip-to-chip communication latency entirely. SambaNova’s Reconfigurable Dataflow Units are built for agentic workloads, the scenario where one user action triggers dozens of sequential model calls, each waiting on the previous result.
Nvidia read the room before anyone else did.
On Christmas Eve 2025, Nvidia acquired Groq for $20 billion. Jonathan Ross, the engineer who designed Google’s first TPU, and the entire Groq hardware team moved to Nvidia. The most credible independent inference architecture was absorbed before it could be deployed against Nvidia at scale.
Two days later, OpenAI committed $10 billion to Cerebras over three years: 750 megawatts of compute, wafer-scale chips accelerating ChatGPT inference. OpenAI had been shopping for inference alternatives. Reuters reported it was seeking hardware for roughly 10% of future inference needs. The Nvidia-Groq deal closed that option. OpenAI moved to the next one. Nvidia simultaneously invested $30 billion in OpenAI.
OpenAI’s inference costs ran approximately $7 billion in 2025 against roughly $4 billion in revenue. A 10x improvement in inference efficiency is $6.3 billion in annual cost reduction. At that scale you don’t pick a winner. You fund every credible architecture and let the workloads decide. The company with more inference workload data than anyone else in the industry does not know which architecture wins. If it knew, it would bet on one. It is betting on three.
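The savings arithmetic behind that logic is simple and worth making explicit. Using the $7 billion figure above, with hypothetical efficiency multiples:

```python
# Annual savings from inference-efficiency gains, using the $7B
# 2025 inference-spend figure above. Efficiency multiples are
# hypothetical: the same workload served `gain`x more cheaply.
inference_cost = 7.0  # $B per year

def annual_savings(gain):
    """Cost reduction if the same token volume runs gain-x more efficiently."""
    return inference_cost * (1 - 1 / gain)

for gain in (2, 5, 10):
    print(f"{gain:>2}x -> ${annual_savings(gain):.2f}B saved per year")
```

At a 10x gain the saving is $6.3 billion a year, which is why funding three competing architectures at a few billion each is cheap insurance rather than indecision.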
February 5 made the downstream consequences visible.
Claude Opus 4.6 and GPT-5.3 Codex released within twenty minutes of each other. $285 billion in market value erased from software, legal-tech, and financial services stocks in a single session. Goldman Sachs announced the same week that it was adopting Claude for accounting and compliance.
This was not a better product replacing another at the same layer. The intelligence layer moved beneath the application layer. SaaS sold you a box to hold your workflows. The box is gone. The workflows remain.
When Goldman adopts something for accounting and compliance, the math already works. Running the workflow through a model costs less than the software plus the human augmenting it. A year earlier that math had not worked. The only thing that changed was inference pricing.
The model does not need to be magical. It just needs to be cheap enough.
Inference cost is the floor that determines whether replacing a human workflow with a model call makes economic sense. Every order-of-magnitude drop produces a new cohort of workflows that cross from not-yet-viable to now-viable. That floor has been collapsing: GPT-4-equivalent performance cost $20 per million tokens in late 2022; by January 2026 it costs $0.40, a 50x drop in just over three years.
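The viability crossover is a single inequality: a model call replaces a human task when tokens-per-task times price-per-token falls below the human cost of the task. The workflow numbers below (an agentic task consuming 500k tokens, $4 of analyst time) are invented for illustration; only the two token prices come from the text.

```python
# Workflow-viability check. Token prices are from the text; the
# workflow parameters are hypothetical, for illustration only.
price_late_2022 = 20.0 / 1e6   # $/token, GPT-4-class, late 2022
price_jan_2026 = 0.40 / 1e6    # $/token, January 2026

tokens_per_task = 500_000      # hypothetical agentic workflow, many model calls
human_cost_per_task = 4.00     # hypothetical $ of analyst time per task

def viable(price_per_token):
    """True when running the workflow through a model undercuts the human."""
    return tokens_per_task * price_per_token < human_cost_per_task

print(viable(price_late_2022))  # $10.00 per task vs $4.00 -> False
print(viable(price_jan_2026))   # $0.20 per task vs $4.00 -> True
```

Nothing about the workflow changed between those two lines. Only the price of tokens did, which is the whole mechanism behind a February 5.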
There will be another February 5. Different companies, different workflows, same repricing. The SaaSpocalypse is not an event; it is a process.
The hyperscalers are not waiting.
Microsoft launched Maia 200 in January: 100 billion transistors, 10 petaflops at 4-bit precision, deployed internally for Copilot and available to external developers via Azure. Microsoft is Nvidia’s largest cloud customer. Every Copilot query served on Maia 200 is margin Microsoft doesn’t pay Nvidia, while the Nvidia relationship stays alive for everything CUDA still does better. Google has TPUs. Amazon has Trainium and Inferentia. The playbook is the same: keep Nvidia for the workloads CUDA wins, build alternatives for everything else.
Lenovo launched three dedicated AI inference servers at CES in January, citing Deloitte’s estimate that inference had reached half of all AI compute in 2025, climbing to two-thirds in 2026. The inference capital wave is a lagging indicator of the deployment wave. The deployment wave has been running for two years.
The Wall Street Journal reported that Nvidia is building a dedicated inference processor integrating Groq’s LPU technology. OpenAI has early access. It will be the first customer.
OpenAI pursued inference alternatives to escape Nvidia GPU dependency. It found Groq. Nvidia acquired Groq. It found Cerebras. It then committed $10 billion to Cerebras. Now it is the launch customer for an Nvidia chip built on Groq’s architecture.
Whether this closes the window for the inference startups depends on one thing: does the chip actually deliver LPU-level performance, or GPU-plus? CUDA compatibility means the software stack doesn’t fragment. If the performance is genuinely there, the $1.5 billion raised in early 2026 is chasing a window that closes with this chip. If the chip is incrementally better but not architecturally different, the window stays open.
Nvidia has done this twice in six months.
Anthropic stated publicly in February that DeepSeek, Moonshot, and MiniMax ran industrial-scale distillation campaigns against Claude. 24,000 fraudulent accounts. 16 million exchanges extracted. Training smaller models on Claude’s outputs to transfer frontier capabilities into cheaper, faster models they control.
Distillation as a technique is legitimate. What DeepSeek ran was something else. That’s not research. But the structural implication doesn’t change based on who’s doing it: if frontier model capabilities transfer into models running at $0.23 per million tokens rather than $15, frontier model providers lose pricing power. Not gradually. As a function of how reliably the transfer works.
Deflationary for model revenue. Not for inference infrastructure demand. More tokens at lower prices means more total inference volume. The margin moves away from whoever trained the frontier model toward whoever runs the infrastructure serving it.
India launched its sovereign large language model in February. South Korea’s AI Basic Act took effect January 22. The EU’s AI gigafactory regulation authorizes national-scale inference infrastructure. None of them want to find out what happens when their national AI infrastructure runs on someone else’s chips. Distillation makes the alternative cheaper. You don’t need years of frontier model development. You need the outputs of someone else’s frontier model and enough inference hardware to run what you build from them.
Every major technology transition produces a chokepoint. Bandwidth in the internet era: Cisco. The OS in mobile: Apple and Google. Provisioned compute in cloud: Amazon. GPU compute in training: Nvidia.
The chokepoint in inference is token delivery at scale, at low latency, at a cost that makes the economics above it work. The $1.5 billion raised in a single week is capital betting that the chokepoint is still open. Nvidia’s acquisition of Groq is the incumbent trying to close it before those bets mature. The OpenAI-Cerebras deal is what hedging looks like when the stakes are measured in billions.
