Walk into the semiconductor lab at Stanford’s School of Engineering on any given afternoon and you will find graduate students gathered around a machine doing what looks like incredibly tedious work: running automated electrical tests on dinner-plate-sized wafers, searching for the faint signals that indicate something new is truly working. The work is meticulous, slow, and unglamorous. Yet what sits on those wafers could change how computers are made for the next few decades, the researchers will tell you, with the quiet intensity of people who think they are onto something.
Instead of spreading flat, the chip in question rises upward. That sounds straightforward, and in principle it nearly is: stack memory and computing layers vertically, connect them with dense banks of internal wiring, and suddenly you have addressed two problems that have quietly been holding back modern AI hardware. One is the gap between a processor’s operating speed and the rate at which data can be fed to it through the chip’s constrained pathways, which engineers call the memory wall. The other is the looming physical limit known as the miniaturization wall: transistors have shrunk to a few atoms across and cannot get much smaller without losing reliability. Flat chips have been pressing against both boundaries for years. Building upward, as the team at Stanford, Carnegie Mellon, Penn, and MIT demonstrates, attacks both at once.
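The memory wall is easy to see with simple arithmetic. The sketch below compares the time a hypothetical accelerator spends computing a matrix-vector product against the time it spends just moving the matrix in from memory. The figures (100 TFLOP/s of compute, 1 TB/s of bandwidth) are round, illustrative assumptions, not measurements of any chip in this story:

```python
# Illustrative sketch of the "memory wall": for a memory-bound operation,
# total time is dominated by moving data, not by doing arithmetic.
# All numbers are hypothetical round figures chosen for illustration.

FLOPS = 100e12      # assumed accelerator compute rate: 100 TFLOP/s
BANDWIDTH = 1e12    # assumed off-chip memory bandwidth: 1 TB/s

def matvec_times(n):
    """Time to multiply an n x n FP32 matrix by a vector."""
    flops = 2 * n * n        # one multiply + one add per matrix element
    bytes_moved = 4 * n * n  # the matrix is read once, 4 bytes per float
    return flops / FLOPS, bytes_moved / BANDWIDTH

compute_t, memory_t = matvec_times(16_384)
print(f"compute-bound time: {compute_t * 1e6:.1f} us")
print(f"memory-bound time:  {memory_t * 1e6:.1f} us")
print(f"memory wall factor: {memory_t / compute_t:.0f}x")
```

Even with generous bandwidth, data movement dominates by two orders of magnitude in this toy example; closing that gap with dense vertical wiring between stacked layers is exactly what the 3D design is after.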
| Field | Details |
|---|---|
| Research Lead | Subhasish Mitra — William E. Ayer Professor, Electrical Engineering & Computer Science, Stanford University |
| Institutions Involved | Stanford University, Carnegie Mellon University, University of Pennsylvania, MIT |
| Manufacturing Partner | SkyWater Technology — Bloomington, Minnesota (largest exclusively U.S.-based pure-play semiconductor foundry) |
| Breakthrough | Monolithic 3D chip stacking memory and compute vertically; prototype beats comparable 2D chips ~4× in hardware tests; simulations show up to 12× on AI workloads |
| Funding | DARPA, U.S. National Science Foundation, Department of Energy, Samsung, Microelectronics Commons AI Hardware Hub |
| Chinese Rival (ACCEL) | Tsinghua University — photonic/analogue chip reaching 4.6 PFLOPS; claimed 3,000× faster than Nvidia A100 on specific vision tasks; 4 million× lower energy consumption |
| Princeton Project | Prof. Naveen Verma leading DARPA-funded OPTIMA program ($18.6M grant) — in-memory computing architecture via startup EnCharge AI, Santa Clara, CA |
| Key Problem Being Solved | The “memory wall” (data movement bottleneck) and the “miniaturization wall” (transistor scaling limits) |
| Broader Context | AI computing demand grew ~1,000,000% between 2012 and 2022; current leading models trained using trillions of variables |
| Reference | engineering.stanford.edu — Stanford School of Engineering |
Tathagata Srimani, a senior author of the research paper now at Carnegie Mellon, reached for almost architectural language to describe it. Moving data vertically through stacked layers, he said, is like having elevator banks in a high-rise instead of forcing everyone through a single congested ground-floor hallway. Robert Radway at Penn offered a different building metaphor: the Manhattan of computing, more capacity in less space. Perhaps the best technical descriptions are always the ones that reach back to the physical world, because the engineering itself is so abstract that analogies are the only way in. Whatever the comparison, the prototype outperformed comparable flat chips by roughly four times in early hardware tests, and simulation models predict twelve-fold gains on real AI workloads as the design matures.
“Such breakthroughs are, of course, about performance. But they are also about capability: if we can create sophisticated 3D chips, we’ll be able to innovate faster, respond faster, and shape the direction of AI hardware.”
— H.-S. Philip Wong, Stanford School of Engineering
The method used here differs from previous stacked-chip experiments. Earlier attempts at three-dimensional chip architecture often built individual chips separately and then bonded them together, which produced relatively coarse connections between layers and created new bottlenecks at the joins. The Stanford-led team instead used a technique known as monolithic 3D integration, in which each new layer is grown directly on top of the one below, at temperatures low enough not to damage the circuitry already built underneath. The resulting connections are fast, tight, and dense. Crucially, the whole thing was fabricated not in an academic cleanroom but at SkyWater Technology’s foundry in Bloomington, Minnesota. With the politics of semiconductor supply chains now as contentious as any issue in technology policy, that matters a great deal.
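A rough calculation shows why the manufacturing distinction matters: inter-layer connection density scales with the inverse square of the connection pitch. The pitches below are order-of-magnitude figures typical of published bonding and monolithic processes, used here as illustrative assumptions rather than numbers from the team’s paper:

```python
# Why monolithic integration beats die bonding on connection density:
# halve the pitch and you quadruple the vertical connections per area.
# Pitch values are illustrative order-of-magnitude assumptions.

BONDED_PITCH_UM = 40.0     # microbump pitch for die-to-die bonding (~tens of um)
MONOLITHIC_PITCH_UM = 0.4  # inter-layer via pitch for monolithic 3D (~sub-um)

def vias_per_mm2(pitch_um):
    """Vertical connections per square millimeter at a given pitch."""
    per_mm = 1000.0 / pitch_um
    return per_mm * per_mm

ratio = vias_per_mm2(MONOLITHIC_PITCH_UM) / vias_per_mm2(BONDED_PITCH_UM)
print(f"monolithic 3D offers ~{ratio:,.0f}x more vertical connections per mm^2")
```

Under these assumptions a 100-fold finer pitch yields roughly a 10,000-fold density advantage, which is why bonded stacks bottleneck at the joins while monolithic stacks do not.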
On the other side of the world, a research team at Beijing’s Tsinghua University has been attacking the problem from an entirely different angle. Their chip, called ACCEL, uses no electrical current at all for its core computation: it uses light. Photons replace transistors, and optical signals replace electrical ones. In lab settings, that yields computing speeds of 4.6 peta-floating-point operations per second on specific AI tasks, roughly 3,000 times faster than Nvidia’s A100 GPU, the workhorse of high-end AI development, at least for those who can still afford one. The researchers also report that ACCEL uses four million times less energy than the A100. That figure is striking enough to deserve some skepticism, and the chip’s analog architecture does restrict it to particular tasks such as image recognition and traffic analysis rather than general computing. Still, if those efficiency numbers hold outside carefully controlled laboratory settings, even a partial application would be noteworthy.
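The claims can be translated into concrete numbers. The sketch below takes the figures reported for ACCEL at face value and works out what they imply; the 1-joule A100 baseline is a placeholder of my own, so only the ratios carry meaning:

```python
# Back-of-the-envelope check on the ACCEL figures reported above.
# The three inputs come straight from the article's claims; the 1-joule
# A100 baseline is a hypothetical placeholder, so only ratios matter.

ACCEL_FLOPS = 4.6e15      # 4.6 peta-operations per second (claimed)
SPEEDUP = 3_000           # claimed speedup over an Nvidia A100 on vision tasks
ENERGY_RATIO = 4_000_000  # claimed energy advantage over the A100

# What effective A100 throughput do the two claims jointly imply?
a100_equiv_flops = ACCEL_FLOPS / SPEEDUP
print(f"implied A100 effective rate on these tasks: {a100_equiv_flops / 1e12:.2f} TFLOP/s")

# If the A100 spends 1 joule on a batch (placeholder), ACCEL would spend:
accel_joules = 1.0 / ENERGY_RATIO
print(f"ACCEL energy for the same batch: {accel_joules * 1e6:.2f} microjoules")
```

Notably, dividing 4.6 PFLOPS by 3,000 implies the A100 runs these particular vision tasks at only about 1.5 TFLOP/s of effective throughput, far below its peak, which underlines how task-specific the comparison is.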
Watching these two projects develop in parallel, you get the sense that the industry is nearing a long-anticipated turning point. The best AI available today lives almost exclusively in data centers, notes Naveen Verma of Princeton, who leads a separate DARPA-funded project called OPTIMA. Chips powerful enough to run serious models are too large and power-hungry to fit in a phone, a hospital diagnostic device, or a factory-floor sensor. The real value, he argued, will come from unlocking AI from the data center: not training ever-larger models, but running them cheaply, locally, and efficiently where the actual work needs to be done.
It is worth sitting with that last point. The Nvidia GPUs that have, for the moment, made the company the most valuable semiconductor firm in history are remarkable devices, but once you count the server racks needed to power and cool them, each installation is the size of a small refrigerator. They are so costly and hard to replace that big companies are reportedly transporting them in armored vehicles. The idea that AI processing could one day run on a chip in a wearable gadget or an electric vehicle, using a small fraction of that energy, is not science fiction; it is the stated goal of several serious research programs at serious institutions, backed by serious government funding. Whether any of them arrive on schedule is another matter entirely. These things hardly ever do.
It is hard to miss how the conversation around chips has shifted: once almost entirely about raw speed, it is now increasingly about efficiency, thermal control, and domestic production. Beneath it all sits the geopolitical layer. China’s ACCEL chip was built on a 20-year-old manufacturing process not because Tsinghua preferred it, but because U.S. export controls on the advanced lithography equipment needed for finer fabrication have effectively cut Chinese chip companies off from the newest tools. That constraint has pushed Chinese researchers toward entirely different architectural strategies: photonics, analog computing, light-based systems that sidestep the arms race of shrinking transistors. Whether those routes lead to genuinely competitive products, or are the kind of ingenious workaround that looks good in a lab paper but falters in practice, remains unclear. History suggests both outcomes are possible.
The era of simply shrinking transistors and waiting for performance to rise appears to be ending. Stanford researchers argue that future AI systems will require 1,000-fold hardware gains, and that incremental refinements to today’s flat chip designs will not get there. The solution demands a fundamental rethinking of what a chip is, whether through vertical 3D stacking, photonic computing, in-memory processing, or some combination not yet fully described in any paper. The problem is more fascinating than it may sound. And the labs working on it, running their quiet tests on their tiny wafers, seem to know it.
