Introduction
For decades, system architects have lived with the consequences of the Von Neumann bottleneck. Compute kept getting faster, and memory kept getting bigger, but the path between them never kept up. So, we built caches, hierarchies, prefetchers, NUMA domains, and increasingly complex memory trees to work around it. We moved memory closer, widened buses, tuned access methods, and accepted that performance was often dictated less by raw compute than by how efficiently data could be fed to it. Each attempt to narrow the gap introduced another layer meant to pull state closer to compute, but still the hierarchy remained. That way of thinking shaped system design for nearly 30 years, and for a long time, it was enough.
When GPUs arrived, it felt like we had found a loophole, but in reality we only shifted the bottleneck away from the primary CPU. By tightly coupling massive parallel compute with high-bandwidth memory, GPUs appeared to have torn down the memory wall entirely. HBM was the game changer, and it still is, but not for the reasons many assumed. Bandwidth exploded, latency dropped, and an entire class of workloads suddenly became viable outside the national labs and private research facilities running supercomputers. The idea that the wall had fallen quickly fizzled, though, as we discovered it never really disappeared; it was simply being masked by supercomputer-class GPUs coupled with HBM.
Today, the same fundamental constraint that once lived between CPU and DRAM now lives inside the GPU itself. I call this the Nebula Gap: the growing structural distance between on-package HBM and everything else the workload depends on to execute at scale in production environments.
The GPU Is a Microcosm of the Original Problem
Modern GPUs are no longer accelerators in the traditional sense, and they are not your everyday general-purpose x86 servers, not by a long shot. They have their own processors, memory hierarchy, internal fabric, and rules about locality. In effect, they are self-contained mini super-compute islands, purpose-built for one thing: intensive, complex computation. Central to today's GPUs is High Bandwidth Memory, and it is remarkable, delivering extreme bandwidth at very low latency. It was the key that unlocked the core of this article for me. However, even with everything it delivers, HBM is also finite, very expensive, power-dense, and supply-constrained.
At the beginning of the AI gold rush, the results we all witnessed were not only impressive but also incredibly fast, faster than many anticipated. As models grew, context windows expanded and intermediate state ballooned, putting the working set on a path to inevitably outgrow what can be held entirely in on-package memory. When that happens, performance does not slowly degrade; it falls off a cliff. The moment data spills out of HBM and must be fetched from system memory, or worse, from storage, the GPU shifts from being compute-bound to being starved. This is the classic cache hit-and-miss scenario: every miss introduces significant latency, and the cores sit idle waiting for data to arrive across interfaces that were never designed to sustain that level of demand.
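To make the cliff concrete, here is a back-of-the-envelope sketch in Python. The bandwidth figures are illustrative assumptions, not specs for any particular GPU or interconnect; the point is how quickly even a small miss fraction drags effective throughput down, because time per byte, not bandwidth, is what averages out.

```python
# A back-of-the-envelope sketch (not a benchmark) of why spilling out of HBM hurts.
# Both bandwidth numbers are illustrative assumptions, not real device specs.

HBM_BW_GBPS = 3000.0      # assumed on-package HBM bandwidth, GB/s
HOST_PATH_BW_GBPS = 64.0  # assumed effective path to system memory, GB/s

def effective_bandwidth(hbm_hit_fraction: float) -> float:
    """Harmonic blend of the two paths: time per byte averages, bandwidth does not."""
    miss_fraction = 1.0 - hbm_hit_fraction
    time_per_gb = hbm_hit_fraction / HBM_BW_GBPS + miss_fraction / HOST_PATH_BW_GBPS
    return 1.0 / time_per_gb

for hit in (1.0, 0.99, 0.95, 0.90):
    bw = effective_bandwidth(hit)
    print(f"{hit:>4.0%} served from HBM -> ~{bw:,.0f} GB/s effective "
          f"({bw / HBM_BW_GBPS:.0%} of peak)")
```

With these assumed numbers, serving just 5 percent of accesses from system memory drops effective throughput to roughly a third of peak, which is exactly what "starved cores" looks like from the outside.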
This is what fueled the high-performance parallel filesystem narrative, built on the assumption that if GPUs were starving, then storage must be the chokepoint. Again, we went back to our 30-year-old assumptions. The logical answer was to make the filesystem faster, increase throughput, reduce latency, and feed the cores. For a time, I subscribed to that assumption as well, because it did seem to make sense, especially based on past conversations with supercomputing customers. When I was working with the team at the Missile and Space Intelligence Center in Huntsville, Alabama in 2006, the director told me, “David, if we are not constantly feeding these cores with data for compute, then we are burning valuable taxpayer money. We need a filesystem that can perform to meet those expectations.” He was right about something that still holds true today: idle cores are expensive, and nobody wants to see idle cores, because that destroys any ROI suppositions. Whether those cores sit in a government supercomputer in 2006, a hyperscale AI cluster, or a customer’s datacenter in 2026, the economics are the same. Compute that waits is like the light you leave on in your house while you are on vacation: it is utility you are paying for without any return.
At the time, it seemed obvious. If we make the filesystem faster by focusing on throughput and lowering latency, we can consistently feed the cores. That thinking was not irrational; it was consistent with the architectural logic we had all grown up with. If compute is starving, the assumption is that the layer feeding it is the chokepoint, so we optimize that layer and expect the problem to go away. What we have learned over time, though, is that feeding compute has never been about one layer alone. The chokepoint does not stay put. It moves as each layer improves; relieve pressure in one part of the hierarchy and it simply shows up somewhere else, because the hierarchy itself still exists. That is basic systems engineering: look for friction and eliminate it where you can. What was true in 2006 is still true today because, let’s face it, you cannot change physics, but you can create the illusion that you have. The difference is that today the hierarchy has been compartmentalized and now lives inside the GPU, and the bottleneck is no longer just about storage bandwidth or filesystem latency. It is about how close the working set can remain to the execution core before physics and economics force it farther away.
All this to say, the memory wall did not go away; it just moved. Amdahl’s Law has long reminded us that accelerating one portion of a system only magnifies the impact of whatever remains unaccelerated. In AI systems, that remaining fraction increasingly points to memory locality.
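A quick worked example makes the ceiling obvious. The fractions below are made-up numbers chosen only to illustrate the shape of the curve:

```python
# Amdahl's Law: overall speedup is capped by the fraction you did not accelerate.

def amdahl_speedup(accelerated_fraction: float, acceleration: float) -> float:
    serial_fraction = 1.0 - accelerated_fraction
    return 1.0 / (serial_fraction + accelerated_fraction / acceleration)

# Assume 80% of runtime is compute we can accelerate 10x, while 20% stays
# stuck waiting on memory locality.
print(amdahl_speedup(0.80, 10.0))    # ~3.6x overall, not 10x
print(amdahl_speedup(0.80, 1000.0))  # ~5x ceiling, no matter how fast compute gets
```

Under those assumptions, even an infinitely fast accelerator cannot push the system past a 5x gain; the untouched, memory-bound fraction owns the rest.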
I have always thought about systems the way a plumber thinks about water flow. If you want a powerful stream from the faucet or spigot, you cannot just widen the final section of pipe. Every segment of the line has to support the pressure, and the source feeding it must exceed the capacity of the pipe itself. If somewhere upstream the pipe narrows, pressure drops and flow becomes inconsistent, and no matter how large the spigot at the end, you will never get the output you expect. That is what happened with GPUs: we increased compute performance dramatically, but the upstream staging and locality layers were never redesigned to sustain that level of demand. The result is turbulence in the system that shows up as idle cores and unpredictable, bursty performance.
The answer is not simply faster storage or more HBM, but recognizing that memory locality is still the dominant architectural chokepoint. Today’s AI systems do not lack compute density; they lack an intelligently staged hierarchy that keeps the working set as close as possible to the execution core without forcing everything into the most expensive tier: HBM.
This is where the missing layer becomes clearer. Between on-package HBM and traditional storage there must be a memory tier that absorbs the spillover without killing performance. It has to be close enough to avoid starvation, elastic enough to scale, and affordable enough to deploy across the environment.
Until that layer exists and is orchestrated correctly, the Nebula Gap remains.
Why “Just Add More HBM” Is Not a Strategy
It’s tempting to treat this as a capacity problem that can be solved by simply adding more on-package memory, but HBM brings its own set of challenges: supply allocations, thermals, power density, and cost. We have seen this play out in other areas before. What may be obscuring the vision of some today is the promise of HBM and the belief that it can do so much more, but we have to remember that HBM should be reserved for what it does best: serving as the G0 memory tier that feeds the GPU. Going back to my plumbing example, expanding it would be like placing a 3-inch copper pipe at the end of a pipeline that is only 1½ inches from the source. Expensive and impressive at the outlet, but still limited by what feeds it upstream. Even as densities improve, HBM will remain a scarce and premium tier for the foreseeable future, and treating it as the sole tier of fast memory does not scale economically or operationally. More importantly, to quote something I have been saying for over two decades now, “not all data is created equal,” and therefore not all data deserves to live in HBM. For most AI workloads, a small fraction of the data is truly hot, a much larger fraction is warm, and the rest is cold but still performance-relevant. That’s why forcing everything into the hottest tier is neither necessary nor efficient. The real problem isn’t that HBM is too small, but that everything else is too far away.
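To put rough numbers behind that hot/warm/cold argument, here is a toy comparison. The cost ratios, the working-set size, and the temperature split are illustrative assumptions, not market data or measurements from any real deployment:

```python
# A toy cost comparison behind "not all data is created equal": keep only the
# truly hot fraction in HBM and stage the rest in a hypothetical middle tier,
# versus forcing everything into HBM. All numbers are illustrative assumptions.

WORKING_SET_GB = 2048  # assumed total working set
COST = {"hbm": 10.0, "middle_tier": 2.0, "storage": 0.1}        # assumed relative cost per GB
SPLIT = {"hbm": 0.10, "middle_tier": 0.55, "storage": 0.35}      # assumed hot / warm / cold split

all_hbm = WORKING_SET_GB * COST["hbm"]
tiered = sum(WORKING_SET_GB * frac * COST[tier] for tier, frac in SPLIT.items())

print(f"Everything in HBM : {all_hbm:>10,.0f} cost units")
print(f"Tiered placement  : {tiered:>10,.0f} cost units "
      f"({tiered / all_hbm:.0%} of the all-HBM bill)")
```

With these assumed ratios, tiered placement lands at roughly a fifth of the all-HBM cost while still keeping the genuinely hot 10 percent on package, which is the whole point of reserving HBM for what it does best.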
The Emergence of the Missing Link
As I’ve said throughout this article, the issue is not a lack of compute power or raw bandwidth. What AI workloads lack is the right kind of memory: not necessarily more HBM, but a memory tier that sits between HBM and storage, close enough to avoid starvation, elastic enough to scale, and purpose-built for the way AI systems actually behave.
Sounds simple when it is said in a short paragraph like that, doesn’t it? But this is an “Admiral Grace Murray Hopper” moment. In other words, it requires rethinking how we have always done things and, in this case, how memory is provisioned, exposed, and governed across the stack.
For a while, fabric-attached memory has been positioned as the answer, and technologies like CXL absolutely change the conversation because they allow us to decouple memory from a single device and treat it as something shared and accessible across the fabric. That is a step in the right direction because it allows memory to be pooled, staged, and accessed beyond the physical confines of a single server or GPU, and when you scale that model to ubiquitous rack-level access, it begins to close part of the gap.
This may come as a surprise to many who have followed me for some time, but CXL is not the missing link; it is one section of the end-to-end chain that moves us closer.
The deeper issue is that we are still treating most system memory as if it were general-purpose server DRAM, just larger and more distributed, when AI has already proven that purpose-built compute changes the game. GPUs were not incremental improvements to CPUs; they were architectural leaps designed around one objective. So, if AI compute required that level of specialization, why would AI memory continue to be treated as a generic extension of legacy server design? What is emerging is not just fabric-attached memory, but the need for memory that acts differently: memory that understands locality, staging, and workload dynamics at the system level, memory that can absorb spillover from HBM without stalling performance, and that can do so predictably across the fabric. In practical terms, this becomes a tier between HBM and storage, but architecturally it is something much larger: the beginning of memory being orchestrated and managed with the same intentionality as storage.
We are approaching the limits of treating memory as a supporting actor in the general server era. AI is the forcing function that drove the development of specialized compute, and it now demands specialized memory, a purpose-built tier to support continued growth, if we choose to develop it. The next chapter for AI will not be defined by faster accelerators alone, but by how intentionally memory is designed, intelligently orchestrated, and made available across the fabric.
That is the missing link: not a single protocol or bus, but a shift in how memory is designed, staged, and treated across the system.
What This Means for Today
Data pipelines are top of mind in every AI discussion right now, but the pipeline does not depend on any single component. It is not just the data, not just the filesystem, and not just the number of GPUs in an environment, although ironically it can feel that way, because the more GPUs you buy, the more HBM you get, and HBM is the rare mineral in this entire end-to-end system. In that last clause lies the constraint: HBM is finite, expensive, and physically bound to the accelerator, and without enough of it, your GPUs stall. When more memory is required, the only immediate option is often to acquire more GPUs simply to gain access to more HBM. That leads to overprovisioning and ultimately a very expensive GPU farm that looks impressive on paper but sits underutilized in practice, a distortion of capital allocation that compounds over time and ultimately erodes ROI.
So the tradeoff becomes economics versus outcomes. The people who want faster results do not want to wait, but the same people signing the checks do not want to see idle GPUs either, and if you do not solve for memory locality, you end up paying for peak compute that you cannot consistently feed. At the same time, the software layer is finally starting to catch up. Memory-aware runtimes, intelligent tiering, and transparent migration of state across memory domains are making it possible for applications to see what looks like a large, contiguous memory space without having to know where every byte physically lives. Slowly but surely, the system, not the developer, begins deciding what is hot, what is warm, and what can be staged elsewhere.
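As a rough illustration of the kind of decision such a runtime makes, here is a minimal sketch. The tier names, thresholds, and the Region abstraction are hypothetical, invented for this example; real memory-aware runtimes operate on pages or objects with far richer telemetry than this:

```python
# A purely illustrative sketch of tiering logic: track access recency and
# frequency per region, then decide which tier it should live in.
# Tier names and thresholds are hypothetical, not any vendor's API.

import time
from dataclasses import dataclass, field

@dataclass
class Region:
    name: str
    size_gb: float
    access_count: int = 0
    last_access: float = field(default_factory=time.monotonic)

    def touch(self) -> None:
        # The application only records accesses; placement is not its problem.
        self.access_count += 1
        self.last_access = time.monotonic()

def choose_tier(region: Region, now: float) -> str:
    """Toy policy: hot -> HBM, warm -> fabric-attached tier, cold -> storage."""
    idle_seconds = now - region.last_access
    if region.access_count > 100 and idle_seconds < 1.0:
        return "hbm"
    if idle_seconds < 60.0:
        return "fabric_memory"
    return "storage"

# The system, not the developer, decides where this state lives.
kv_cache = Region("kv_cache", size_gb=40)
for _ in range(500):
    kv_cache.touch()
print(choose_tier(kv_cache, time.monotonic()))  # -> "hbm"
```

The design point is not the specific thresholds but where the decision lives: placement becomes a system-level policy informed by observed behavior, rather than something the application has to hard-code.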
The Deeper Implication
You may be thinking this is getting overly complex. Why not just expand server memory with more DRAM and leverage it there? Why introduce pooling at all when the goal is simply more memory close to compute? But this is not just about adding memory; it is about changing how memory acts inside the system and how it is made available to compute. If we only look at this from a capacity perspective, then yes, adding more DRAM to a server feels right, because that is how we have solved problems for years: scale up the box and move on. What I’m arguing is that we stop thinking about memory as something statically attached to a single device and start thinking about it as a shared resource that is pooled, managed, and orchestrated across the fabric, much like storage has been for decades.
Once memory becomes elastic and coordinated at the system level, the center of gravity shifts. Performance is no longer defined solely by the fastest chip in the rack; it is defined by the shortest and most predictable path between data and compute across the entire environment, and that path now spans interconnects, fabrics, and memory domains. Think about what we have actually been doing over the last two decades. As CPUs advanced and then GPUs accelerated even further, much of the innovation was not about eliminating stalls but about recovering from them faster. Deeper pipelines, out-of-order execution, massive parallelism, warp scheduling, aggressive prefetching: all of it allowed processors to tolerate imperfect data paths by switching context, masking latency, and resuming execution quickly when the data finally arrived. In many ways we have been designing faster failure and recovery mechanisms, not removing the underlying cause of the stall itself.
Now ask some different questions:
- What does it look like when we fix the data path rather than build faster mechanisms to survive it?
- What happens when compute is no longer optimized to recover from starvation but is fed in a way that makes starvation the exception rather than the norm?
At that point we are no longer designing systems that fail fast and recover faster; we are designing systems that stream data into the core in a sustained and predictable way, allowing the accelerator to operate at the level it was actually engineered for. As packaging evolves from 2.5D toward 3D integration and tightens the physical relationship between compute and memory, the importance of controlling that path at the system level only increases, because once the local boundary becomes incredibly efficient, every inefficiency upstream becomes magnified. That is precisely why AI is the forcing function to finally confront these long-standing architectural issues, memory chief among them. This reduces the strategic advantage of concentrating memory inside a single device and elevates interconnects, memory semantics, and orchestration to first-class design concerns, turning memory locality into the dominant factor in AI efficiency. None of this diminishes the importance of the GPU, but it does make clear that without addressing this constraint at the system level, the GPU alone is no longer sufficient.
Closing Thought
The Von Neumann bottleneck never really went away; it evolved. We thought we addressed it once by moving memory closer to compute, but now we are encountering it again, compartmentalized inside the GPU, amplified by scale, and exposed by AI workloads that refuse to fit neatly into fixed architectural boundaries. I am not saying there is anything wrong with GPUs or with the advancements we have seen in this space. On the contrary, without the year-over-year acceleration in GPU capability, we might never have forced the underlying issue to surface again in such a visible way. The Nebula Gap is not a failure of GPUs; it is the beacon that points to the next inflection point in system architecture. If this unfolds the way I believe it will, the organizations that close the gap will not be defined by faster accelerators alone, but by how intentionally memory is designed, staged, and intelligently orchestrated across the fabric.
