Inference Blindness
Chip-level optimization beats software tricks. Until it doesn't.
Your $30,000 GPU is running at 3% of its theoretical capacity. Because of physics you can’t optimize away.
The first instinct has been to fix this with software. Flash Attention (fewer memory trips). KV cache compression (don't reread the book). Speculative decoding (intern drafts, senior corrects). Continuous batching (don't hold the bus until every seat fills). PagedAttention (no reserved parking in GPU memory). Quantization (turn the model's JPEG quality down). Each technique is clever. Each delivers real gains. 2x here, 3x there. Stack them together and you get maybe 30–50x improvement, if everything compounds perfectly.
It does not compound perfectly. Flash Attention changes memory access patterns that break batching optimizations. Quantization introduces accuracy loss that limits how aggressive you can go. Speculative decoding adds compute overhead that partially offsets the latency gain. The techniques interfere with each other.
Meanwhile, a 24-person team in Toronto just shipped a chip that eliminates the memory wall entirely and delivers 100x faster inference. Not by optimizing around the constraint. By removing it.
This is a post about inference. It is also about a pattern that repeats through every technology stack in history: the tension between chip-level optimization and software-level optimization, between integrated stacks and commodity hardware, between the faster horse and the car.
Bottleneck everyone misses
The bottleneck in LLM inference has never been compute. It’s memory bandwidth.
This sounds like a technical detail. It decides who wins the inference war, and most teams building inference infrastructure don’t know it.
For each output token, the model reads billions of parameters from memory, does a relatively small amount of math per parameter, and produces one token. The arithmetic intensity is low. The GPU's thousands of cores sit idle, waiting for weights to arrive from HBM (the GPU's slow warehouse). The compute is a Formula 1 engine bolted to a garden hose. The engine isn't the problem. The fuel line is.
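A back-of-envelope roofline check makes the fuel-line problem concrete. The hardware figures below are illustrative assumptions (roughly H100-class peaks), not vendor specifications:

```python
# Roofline sketch: is autoregressive decode compute-bound or memory-bound?
# Illustrative hardware figures (roughly H100-class); assumptions, not specs.
PEAK_FLOPS = 989e12   # dense BF16 FLOP/s
PEAK_BW    = 3.35e12  # HBM bytes/s

def decode_intensity(n_params: float, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte moved for one decode token at batch size 1:
    ~2 FLOPs per weight, every weight read once from HBM."""
    flops = 2.0 * n_params
    bytes_moved = n_params * bytes_per_param
    return flops / bytes_moved

ridge = PEAK_FLOPS / PEAK_BW        # intensity needed to saturate compute
intensity = decode_intensity(70e9)  # ~1 FLOP/byte at BF16
print(f"ridge point: {ridge:.0f} FLOPs/byte, decode: {intensity:.0f} FLOPs/byte")
```

Decode sits two orders of magnitude below the ridge point, so adding compute does nothing; only bandwidth moves the number.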
Moore’s Law gave us compute scaling for decades. Memory bandwidth followed a different, slower curve. NVIDIA GPU compute grew 80x from 2012 to 2022. Memory bandwidth grew 17x. Every generation widened the gap. We crossed the threshold years ago for inference workloads. Most teams still don’t see it because their monitoring tools were built for the compute-bound era.
GPU utilization is visible in every dashboard. Memory bandwidth saturation is invisible unless you instrument for it. Teams optimize what they can see. They’re inference blind because the tooling makes the wrong thing salient.
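One way to make saturation visible is to derive it from what dashboards already report: tokens per second times bytes of weights read per token. A minimal sketch, with made-up throughput numbers and an assumed H100-class peak:

```python
def bandwidth_utilization(tokens_per_sec: float, weight_bytes: float,
                          peak_bw: float) -> float:
    """Achieved HBM bandwidth implied by decode throughput, as a fraction
    of peak. Assumes every weight is read once per token (batch size 1)."""
    return tokens_per_sec * weight_bytes / peak_bw

# Example: 20 tok/s on a 70B FP16 model (140 GB of weights), 3.35 TB/s peak.
util = bandwidth_utilization(20, 140e9, 3.35e12)
print(f"memory bandwidth utilization: {util:.0%}")
```

A dashboard showing single-digit "GPU utilization" while this derived number sits near saturation is the signature of a memory-bound workload.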
This is a physical constraint. Moving data costs more energy and time than computing on data. A physical constraint to be navigated, not an engineering problem to be solved.
Seven-layer inference stack
LLM inference has two phases with opposite characters. Prefill (reads everything once) processes the entire input at once - parallel, compute-bound, similar to training. Decode (writes word by word) generates one token at a time - sequential, memory-bound, irreducibly slow. Every bottleneck this essay describes lives in decode. Prefill is not the problem.
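A two-phase cost model makes the asymmetry concrete. All hardware figures are illustrative assumptions (H100-class peaks), and the decode estimate ignores KV reads and batching:

```python
def prefill_seconds(n_params: float, prompt_tokens: int,
                    peak_flops: float) -> float:
    # Compute-bound: ~2 FLOPs per parameter per token, all tokens in parallel.
    return 2 * n_params * prompt_tokens / peak_flops

def decode_seconds(n_params: float, out_tokens: int,
                   bytes_per_param: float, peak_bw: float) -> float:
    # Memory-bound: full weight read from HBM for every generated token.
    return n_params * bytes_per_param * out_tokens / peak_bw

# 70B model, 2,000-token prompt, 500 generated tokens, H100-ish peaks.
p = prefill_seconds(70e9, 2000, 989e12)
d = decode_seconds(70e9, 500, 2, 3.35e12)
print(f"prefill: {p:.2f}s for 2,000 tokens, decode: {d:.1f}s for 500 tokens")
```

Under these assumptions the 2,000-token prefill costs a fraction of a second while the 500-token decode costs tens of seconds, which is why everything below focuses on decode.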
The observation that reframed how I think about inference infrastructure came from an unlikely analogy a colleague offered: Netezza, the data warehouse appliance that IBM acquired for $1.7 billion. The thesis is simple. Do what Netezza did. Own the full seven-layer stack.
Netezza was the first data warehouse appliance. It integrated database, server, and storage into a single purpose-built box. No indexing, no tuning, no partitioning. Plug it in. Run your query. 100x faster than Oracle on expensive Sun servers. The genius wasn’t better query optimization. It was eliminating the I/O bottleneck at the hardware level by controlling the full stack.
Today’s inference stack is the opposite of Netezza. It’s fragmented across seven layers: interconnect and networking at the base, compute silicon, the memory hierarchy, runtime and compiler, orchestration and GPU management, serving frameworks, and API gateway and routing at the top. Seven vendors, seven invoices, seven points of failure. Every boundary between layers is a tax. Latency, serialization, protocol negotiation.
Interconnect is where Nvidia’s structural advantage compounds. NVLink (GPU highway system) delivers 900 GB/s between H100s. Everyone else uses PCIe at 32–64 GB/s. The chip startups attacking the memory wall inherit the interconnect wall by default.
The tax compounds as you move up. The memory hierarchy and the compute silicon sit right next to each other and can barely talk; that's an arithmetic intensity gap of roughly 590x between what the silicon can compute and what the memory can feed it. The runtime layer tries to paper over it with Flash Attention and KV cache tricks. The orchestration layer tries to paper over that with batching. The serving framework papers over that with speculative decoding. Each layer compensates for the failure of the layer below it to solve the underlying problem.
Integration beats composition when the bottleneck is at the boundary, not inside the component. If your GPU is the bottleneck, better GPUs help. If the boundary between GPU and memory is the bottleneck, better GPUs don’t help. Tighter integration helps.
This is why the most interesting chip startups aren't building faster GPUs. They're attacking the boundary. Taalas collapses layers two and three entirely: weights and compute become the same thing. d-Matrix pushes compute into layer three so data never has to leave. FuriosaAI redesigns the data path inside layer three so the boundary costs less. All three are doing what Netezza did, replacing the boundary with a box.
Nine chip startups, three approaches
I looked at every inference chip startup with working silicon or serious funding. They cluster into three architectural theses, each a different strategy for navigating the same memory wall.
Thesis one: Eliminate the wall entirely
Taalas is the most radical bet in the market. They don’t solve the memory-compute boundary. They erase it. Their HC1 chip takes a specific model, Llama 3.1 8B for the first product, and etches the weights and architecture directly into the wiring of the silicon. The model IS the chip. No HBM, no DRAM, no SRAM, no memory hierarchy at all. A weight and its associated multiply happen in a single transistor.
The results redefine what's possible. 17,000 tokens per second on Llama 8B. For reference, an H200 does approximately 230 tokens per second per user in comparable latency-optimized single-user inference. Cerebras, previously the fastest at around 2,200 tokens per second, is roughly 7–8x slower than Taalas.
The trade-off is total: one model per chip. When the model changes, the silicon is dead. Taalas addresses this with a two-month model-to-silicon cycle and claims 60–75% capex savings over four years even including three chip refreshes. Twenty-four people, $219 million raised, $30 million spent. The economics are either brilliant or delusional. Time will tell.
Thesis two: Move compute to where the data lives
d-Matrix takes a different approach. Instead of eliminating memory, they embed compute into it. Their Corsair platform uses Digital In-Memory Computing, SRAM cells that can perform vector-matrix multiplications directly, without fetching data to a separate compute engine. Two gigabytes of SRAM on-chip, 30,000 tokens per second on Llama 70B, 2ms per token latency.
MatX blends SRAM and HBM in a splittable systolic array. Weights live in SRAM for low latency. KV caches live in HBM for long-context support. The architecture attempts both the throughput of traditional GPUs and the speed of SRAM-first designs. $500 million Series B, led by Jane Street and Leopold Aschenbrenner’s Situational Awareness fund. Not shipping yet. Tapeout expected in 2026, volume shipments in 2027.
Cerebras went wafer-scale. The entire silicon wafer is one chip, packed with SRAM. Keep everything on-chip, never go off-chip. Cerebras and Groq both bet that SRAM alone could avoid the memory wall. LLMs outgrew on-chip SRAM capacity. Both had to retrofit external DRAM.
Cerebras's engineering is extraordinary. Also power-hungry. The CS-3 draws approximately 23kW per system, liquid cooling required.
Thesis three: Redesign the data path
FuriosaAI doesn’t try to eliminate the memory wall or put compute inside memory. They redesign how data moves. Their Tensor Contraction Processor uses a circuit-switching fetch network that minimizes external memory transfers and maximizes data reuse across compute units. The result: 2.25x better performance per watt versus GPU solutions, validated in LG AI Research’s production deployment of its EXAONE models, at 180 watts TDP, air-cooled, in a standard PCIe card.
FuriosaAI is the pragmatist in this landscape. Multi-model support. Drop-in vLLM replacement. Standard PCIe interconnect. Kubernetes-native. Meta offered $800 million to acquire them. They said no. Mass production started January 2026, with the first delivery of 4,000 units from TSMC.
Groq built a deterministic dataflow processor, the LPU, that eliminates caching entirely. Pure throughput. Nvidia noticed and acquired their IP for approximately $20 billion, absorbing the threat. SambaNova built reconfigurable dataflow. Both interesting architecturally. Groq is now inside Nvidia and SambaNova has struggled with commercial traction.
Optimization at the chip level always beats the software stack
Every software optimization technique in the inference stack exists because of a chip-level constraint it cannot overcome.
Flash Attention reduces memory reads from O(n²) to O(n) for the attention computation. It still reads weights from HBM. It still pays the bandwidth tax on every token. It’s optimizing the math on top of a slow pipe. It cannot make the pipe faster.
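The O(n²) to O(n) claim concerns attention-matrix traffic, not weight traffic. A rough byte count under simplifying assumptions (FP16, one write plus one read for the score matrix, constant factors ignored):

```python
def attn_matrix_bytes(seq_len: int, n_heads: int, dtype_bytes: int = 2) -> int:
    """HBM traffic to materialize and re-read the n x n score matrix
    in naive attention (one write + one read, per head)."""
    return 2 * n_heads * seq_len * seq_len * dtype_bytes

def flash_attn_bytes(seq_len: int, n_heads: int, head_dim: int,
                     dtype_bytes: int = 2) -> int:
    """Flash Attention streams Q, K, V, O through SRAM tiles; HBM traffic
    is linear in sequence length (constant factors ignored)."""
    return 4 * n_heads * seq_len * head_dim * dtype_bytes

naive = attn_matrix_bytes(8192, 32)
flash = flash_attn_bytes(8192, 32, 128)
print(f"naive: {naive/1e9:.1f} GB, flash: {flash/1e9:.1f} GB per layer")
```

Even with the quadratic term gone, the weight reads in the surrounding projection and MLP layers still pay the full bandwidth price, which is the point above.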
KV cache optimization reduces redundant computation by caching key-value pairs. The cache lives in HBM. Every cache read pays the same bandwidth tax. You’re caching to avoid recomputing, in slow memory.
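The cache itself is large enough that its reads matter. A size estimate under assumed Llama-70B-like dimensions (grouped-query attention, FP16):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    # 2x for keys and values, stored per layer, per KV head, per position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Llama-70B-like: 80 layers, 8 KV heads, head_dim 128, 8k context.
gb = kv_cache_bytes(80, 8, 128, 8192) / 1e9
print(f"KV cache per sequence: {gb:.1f} GB")
```

At high batch sizes and long contexts the cache, not the weights, starts to dominate HBM traffic, and every byte of it travels over the same slow pipe.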
Continuous batching amortizes weight reads across multiple requests. It only works at high batch sizes, which means higher latency per request. You’re trading latency for throughput. A tradeoff that exists because the underlying hardware forces you to choose.
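The latency-throughput trade can be written down directly. An idealized sketch assuming one full weight read serves the whole batch (real systems hit compute and KV-cache limits before this scaling holds):

```python
def decode_step_seconds(weight_bytes: float, peak_bw: float) -> float:
    # One decode step reads all weights once, regardless of batch size.
    return weight_bytes / peak_bw

def tokens_per_sec(batch_size: int, weight_bytes: float,
                   peak_bw: float) -> float:
    # The whole batch advances one token per step, so aggregate throughput
    # scales with batch size while each request still waits a full step.
    return batch_size / decode_step_seconds(weight_bytes, peak_bw)

# 70B FP16 model (140 GB of weights), 3.35 TB/s assumed peak bandwidth.
for b in (1, 8, 64):
    print(f"batch {b:>2}: {tokens_per_sec(b, 140e9, 3.35e12):>6.0f} tok/s aggregate")
```

Per-request latency is pinned at one full weight read per token no matter the batch size; only the aggregate number improves, which is exactly the trade described above.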
Speculative decoding generates candidate tokens with a small draft model and verifies them in parallel. It adds compute overhead and requires maintaining two models. The technique wouldn’t need to exist if single-token generation weren’t so slow.
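The economics of speculative decoding reduce to an acceptance-rate calculation. A standard expected-yield estimate, ignoring draft-model cost and assuming each drafted token is accepted independently with probability p:

```python
def expected_tokens_per_step(k: int, p: float) -> float:
    """Expected tokens emitted per verification pass: k drafted tokens
    are checked in order, each surviving with probability p, and the
    verifier always emits one token itself.
    Equals (1 - p**(k+1)) / (1 - p) for p < 1."""
    return sum(p**i for i in range(k + 1))

# 4 drafted tokens, 70% acceptance: ~2.8 tokens per large-model pass.
print(f"{expected_tokens_per_step(4, 0.7):.2f} tokens per verify step")
```

A ~2.8x reduction in large-model passes under these assumptions, before subtracting the draft model's own compute: exactly the overhead the chip-level approaches avoid paying at all.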
Stack all of these together. Flash Attention, KV cache compression, quantization, batching, speculative decoding. You get maybe 30–50x. In practice less, because the techniques interfere with each other.
Now look at what happens at the chip level. Taalas delivers 100x by eliminating the memory-compute boundary. One architectural decision delivers more than the entire software optimization stack combined. d-Matrix delivers significant multiples by doing the multiply where the weight already lives. FuriosaAI delivers 2.25x per watt by redesigning the data path.
Optimization at a lower layer of the stack always beats optimization at a higher layer, because higher-layer optimizations are constrained by lower-layer physics.
Flash Attention is rearranging the route your pizza delivery driver takes. Valuable. Real improvement. Twenty percent faster. Chip-level optimization is teleporting the pizza.
Except when it doesn’t
The historical record is brutal for integrated hardware stacks.
Sun Microsystems built the integrated stack. Proprietary SPARC chips, Solaris OS, purpose-built servers. Died. Replaced by commodity x86 running Linux.
SGI built the integrated stack. Custom graphics hardware, IRIX OS, proprietary interconnect. Died. Replaced by commodity GPUs on PCs.
Cray built the integrated stack. Custom vector processors, custom cooling, custom everything. Nearly died. The workloads moved to clusters of commodity nodes.
DEC built the integrated stack. VAX hardware, VMS OS. Died. Replaced by commodity PCs.
Even Netezza. The appliance I held up as the model. Acquired by IBM and slowly marginalized as Snowflake and BigQuery ran analytics on commodity cloud infrastructure. The appliance lost to disaggregated commodity compute plus storage.
The pattern is so consistent it’s basically a law of nature. Phase one: integrated stack delivers 10–100x performance. Phase two: commodity hardware gets good enough. Phase three: software layer on commodity closes the gap to 2–3x. Phase four: integrated stack vendor dies, gets acquired, or retreats to a niche. Commodity wins the volume market. Every single time.
Nvidia is trying to break that pattern: this time the commodity architecture also controls the interconnect. NVLink isn't just faster than PCIe. It's proprietary. A startup that beats Nvidia on tokens per second still loses if the model needs four chips to fit in memory.
The argument against Taalas isn’t that their silicon is slow. It’s that commodity GPUs with HBM4, HBM5, and clever software will get good enough for 90% of workloads, and Nvidia will absorb the architectural insights. Exactly as they absorbed Groq’s. The technically inferior commodity architecture beats every technically superior integrated stack because it improves on an industry-wide curve. Thousands of engineers, billions in R&D, shared tooling, interchangeable parts. The integrated stack improves on a company-wide curve. Brilliant, and alone.
The case nobody wants to hear
If commodity always wins, the most uncomfortable inference thesis isn’t Taalas or FuriosaAI. It’s the CPU.
The default assumption that GPU beats CPU for inference is a training-era hangover. Training is compute-bound. GPUs win there. Inference, especially autoregressive decode, is memory-bound. The compute-to-memory ratio inverts. The GPU’s thousands of cores sit idle.
A modern AMD EPYC or Intel Xeon can address terabytes of DDR5. The cost per gigabyte is 10–20x lower than HBM. A 70B model at 4-bit quantization fits in 35 gigabytes. One DIMM slot. GPU servers cost $200K–$400K. CPU servers cost $15K–$30K.
The cost gap is widening, not closing. HBM cost per GB grew to 1.35x its 2023 level by 2025. DDR fell to 0.54x over the same period. Two curves going in opposite directions.
You deploy 10x more CPU nodes for the same capital. No liquid cooling. No proprietary interconnect. Standard datacenter infrastructure. Every ops team already knows how to manage them.
GPU inference achieves high utilization only at high batch sizes. At batch size one (single user, real-time), which is most enterprise inference, GPU utilization collapses. CPUs with deeper cache hierarchies handle small-batch, latency-sensitive workloads better than most people realize.
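The same bandwidth arithmetic that condemns GPUs at batch one makes the CPU case legible. Hardware figures below are illustrative assumptions (a 12-channel DDR5 server versus an HBM GPU):

```python
def max_decode_tps(model_bytes: float, mem_bw: float) -> float:
    # Upper bound: one full weight read per token, perfectly streamed.
    return mem_bw / model_bytes

ddr5_bw = 460e9    # ~12-channel DDR5 server, bytes/s (assumed)
hbm_bw  = 3.35e12  # H100-class HBM, bytes/s (assumed)

model_8b_int4 = 8e9 * 0.5  # 8B model at 4-bit: 4 GB of weights
print(f"CPU bound: {max_decode_tps(model_8b_int4, ddr5_bw):.0f} tok/s")
print(f"GPU bound: {max_decode_tps(model_8b_int4, hbm_bw):.0f} tok/s")
```

The GPU ceiling is roughly 7x higher, but over a hundred tokens per second is already well past interactive; at batch one you're paying 10–20x the capital for speed the use case doesn't need.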
The fact that llama.cpp, the most popular local inference framework, is CPU-first tells you something about where the real market is.
Everyone’s buying a Ferrari to deliver pizza. Most enterprise inference doesn’t need a $300K GPU server. It needs a $20K CPU server running a quantized 8B model that responds in 200 milliseconds.
A framework for deciding
The inference hardware landscape isn’t GPU versus CPU versus ASIC. It’s a spectrum of memory-compute coupling strategies, from loose to total.
CPU sits at the loose end. Cheap memory, low coupling, low utilization. The commodity play. GPU sits in the middle. Expensive HBM, medium coupling, medium utilization. FuriosaAI is tighter. Efficient data reuse, high utilization per watt. d-Matrix is tighter still. In-memory compute, weights live in SRAM. Taalas is the far extreme. No coupling at all because the weights are the chip.
The right answer depends on the workload. Frontier models at massive throughput need the tight end of the spectrum. Enterprise inference on quantized models at batch-one needs the loose end. Most teams don’t know which end they’re on because they’ve never measured.
The customer asking “which GPU, A100 or H100 or B100?” is asking the wrong question. The customer asking “CPU or GPU?” is asking a slightly less wrong question. The right question is: what’s my memory-compute coupling ratio, and what’s the cheapest way to achieve it for my workload?
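That question can at least be triaged mechanically. A hypothetical decision sketch in the spirit of the spectrum above; the thresholds are illustrative assumptions, not calibrated benchmarks:

```python
def coupling_recommendation(arithmetic_intensity: float, batch_size: int,
                            latency_slo_ms: float) -> str:
    """Crude triage of where a workload sits on the coupling spectrum.
    arithmetic_intensity: FLOPs per byte of the decode workload.
    Thresholds are illustrative, not measured."""
    if arithmetic_intensity > 100:
        return "compute-bound: commodity GPUs are fine"
    if batch_size == 1 and latency_slo_ms >= 100:
        return "loose coupling: a quantized model on CPU may suffice"
    if latency_slo_ms < 10:
        return "tight coupling: SRAM-first or model-on-silicon hardware"
    return "medium coupling: GPU plus software-stack optimizations"

# Single user, 200ms budget, memory-bound decode: the CPU end of the spectrum.
print(coupling_recommendation(1.0, 1, 200))
```

The point isn't the thresholds; it's that the inputs are measurable, and most teams have never measured them.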
Where value migrates
The pattern from databases tells you where inference is heading. First era: buy the best hardware. Sun servers, proprietary disk arrays. Second era: the database software layer captures all value. Oracle. Third era: the orchestration and cloud layer captures value. AWS RDS, Snowflake. The hardware becomes a commodity.
Inference is in the transition from era one to era two. Everyone’s buying H100s like they bought Sun servers in 1998. The value is about to migrate up the stack to whoever controls the serving framework and orchestration layer.
The chip startups are betting they can restart era one with fundamentally different hardware. History says the software and orchestration layer wins. Physics says the chip layer might hold a permanent advantage this time. Both are probably right for different segments of the market.
Meter matters more than the engine
For every piece of wisdom that works, an equal and opposite wisdom works too. Chip-level optimization beats software optimization. Until commodity hardware closes the gap. Integrated stacks deliver 100x. Until disaggregated stacks get good enough. The faster horse is faster. Until the car arrives. Both truths coexist.
What doesn’t coexist is ignorance. Most teams running inference today don’t know whether they’re physics-constrained or economics-constrained. They don’t know if their workload sits at the tight end or the loose end of the coupling spectrum. They don’t know if they need a Taalas or a $20K CPU server. They’re inference blind.
The teams that win won’t be the ones who pick the right chip. They’ll be the ones who can measure which constraint they’re actually hitting. The meter matters more than the engine.
The constraint always migrates to the boundary you’re not watching. You can’t see migration if you’re staring at just one layer.
