This topic is, unfortunately, inextricably intertwined with AI, so I could as easily be talking about one axis of the problem as the other. But let’s start with the data centers.
It seems to me that nobody is asking the really interesting questions about data centers, namely, “why do they need to be like that?” It’s too easy to just point and say “there are too many of them!” or “they use too much power!” or whatever. Before going further, I should disclose that I come from a different time of computing: the era when a 450MB hard drive (a NEC2363) weighed 400lb and cost $30,000. Nowadays, you can buy a lot of storage for that kind of weight and money, and fill it with imagery that nobody will ever bother to look at, much less comprehend. What’s interesting, to me, is that much of the data we are storing is one of two kinds: 1) surveillance and 2) derivative. Derivative information is stuff that can be reproduced exactly given certain conditions, e.g.: if we were playing an infinity of games of rogue and just stored the high score and the random number generator seed, we could reproduce all the maps in the original condition. The “what happened?” data would be surveillance data. What do we need all this for? Well, the government wants it for a great big retro-scope, but it has been slowly sinking in that a retro-scope has no predictive power. So, they’re adding AI (Palantir, etc.) to generalize from the starting conditions to what maybe happened. Otherwise, a lot of it is in the ballpark of “who cares?” The NSA’s data center (doubtless one of many) in Utah is – so what – petabytes of so what.
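To make the rogue example concrete, here is a minimal sketch of "derivative" data, assuming a toy map generator. The function `generate_map`, its parameters, and the saved-game values are all hypothetical and only there for illustration. The "certain conditions" are that the generator code and the random number algorithm stay the same.

```python
import random

def generate_map(seed, width=40, height=10, wall_chance=0.3):
    """Rebuild a toy dungeon map purely from its seed (hypothetical generator)."""
    rng = random.Random(seed)  # private generator, so global state can't interfere
    return [
        "".join("#" if rng.random() < wall_chance else "." for _ in range(width))
        for _ in range(height)
    ]

# All we "store" per game: the seed and the high score (made-up values).
saved_game = {"seed": 8675309, "high_score": 1234}

first = generate_map(saved_game["seed"])
replay = generate_map(saved_game["seed"])
assert first == replay  # same seed, same map, every time
print("\n".join(first[:3]))
```

A few bytes of seed stand in for the whole map, which is the sense in which the map itself never needed to be stored.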
[NSA data motel]
Has the existence of the NSA’s massive data heap improved US government security? Not a whit; the SECDEF holds military planning meetings on Zoom. Everyone leaks everything, to the point where The Epstein Files are an open secret – lots of congresspeople have seen them, the FBI is pretending to have forgotten them, and many in Washington are playing this great big gooney bird dance that involves practically everything except just asking the fucking victims who are alive and among us, increasingly furious. Also, for that matter, I wish they’d say why they aren’t talking, too. But that’s another inexplicable aspect of l’air du temps that we shouldn’t try to dig around in.
The folks who are building data centers and AIs are telling us two things (jointly, in harmony):
1) We need all this electricity to run our AI models, which gobble power via stacks of GPUs at a truly amazing rate, and
2) We need all the water we can get because data centers need a gigantic amount of cooling because of the aforementioned stacks of GPUs.
I remember my first raised floor computing room; it had an IBM 2363 in it. My favorite data center was the one at Dupont, which had a Cray Y-MP. That was a really neat-looking thingamajig. The coolest cool thing about the Cray was that it was liquid cooled. The poor thing was grunting and straining to generate enough instructions/second to run an iWatch and it only cost $13 million. I was a system administrator back then (more precisely: an ULTRIX 32 kernel systems hacker and all around pot washer) and I had a passing understanding of the heat issues in modern computing. Where I worked we had a raised floor room, but the AC was off and the doors were usually open, because the systems we had were all mostly room-temperature stuff and the offices were quite a bit cooler than they needed to be: so there was a fan propping the door open. One day, someone added a DECsystem 5500 full of hard drives to the room, and suddenly a bunch of systems shut down because they were over temperature. Very nice.
What’s going on? It’s all waste heat. Problem 1 is that we need more and more electricity because we are running more and more cycles. Problem 2 is getting the waste heat away from our CPUs and hard drives, where it will cause a disaster fairly quickly. (In case this has never happened to you, a modern CPU without a heat sink can blow its top hard enough to punch a hole in the motherboard; it sounds like a .22 rifle.) Now’s where I’m going to get myself a bit into the weeds. I may be wrong about a few things, so don’t hesitate to tell me.
There is a simple cure for “too many cycles” – which is to use fewer cycles. This is an old trick. I used to own an HP something or other laptop that had amazing battery life because, by design, it didn’t do much. The rocket scientists at HP had figured out that most of the system could be asleep if all you were doing was running Word (there’s probably a metaphor in there somewhere) and – almost as importantly – if most of the system was asleep by design, then turning it on and off was just a matter of waking part of the system. Why did it get such great battery life: fewer cycles. I think most laptops nowadays have ways of accomplishing this – you just don’t run everything at full speed when you’re not hooked up to the wall. Of course, in a data center, you’re running everything at full speed, all the time, right?
[A data center from the future. All of the hoses are carrying beautiful blue-glowing glycol to a central heat-exchanger. Other than the heat put off by the hoses, the data center runs at room temperature, with maybe a small air conditioner]
Yep, except that a lot of it is crap. What the laptop builders have figured out is that (for example) when you’re running an application mix that is mostly making Windows UI calls, there’s no need to wake up the GPU. It’s Microsoft Word, for god’s sake – it doesn’t require 3D texturing and caustics for the ugly marketing document you’re working on. We used to call this system performance tuning and it was my bailiwick. [USENIX San Diego, 1996] It’s probably almost impossible, now, since a lot of where systems spend their time is their CPUs trying to non-destructively predict the future.
But back in the 90s it was possible to generate a bunch of hooks into the O/S and see where it was spending most of its time. Around that time, it was a big affair – huge vendors like Sun and Digital would crush each other on benchmarks by changing kernel behavior to process a bit better on the benchmark and be interrupted a bit less. But there were also massive performance boosts. Some of the guys at Berkeley discovered that, as TCP/IP networking began to take over computing per se, the amount of time spent figuring out (in the kernel) which socket was getting a blob of data was becoming expensive. BSD’s networking stack was designed for hundreds of connections and by 1996 websites were managing tens of thousands of connections. So, the BSD guys put a little hash-table lookup in front of the socket list, and overnight the kernel was about 40% faster for some loads. Several times faster for loads like “being a web server” which was a big deal in those days.
Then there was application performance tuning, which consisted of compiling your code with markers in it, and seeing where it spent most of its time. In those days I was still coding, so a Marcus application could get sent a signal and it would dump all of its statistics to a file while it was still running. Things like cache hits versus misses, cache age-outs, etc.
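For readers who never did this, the "send it a signal and it dumps its statistics" trick looks roughly like the sketch below. This is an illustration, not the original code: the counter names, the `app_stats.json` file name, and the choice of SIGUSR1 are all assumptions, and the handler only works on Unix-like systems.

```python
import json
import signal
import time

# Hypothetical counters a long-running program might keep about its cache.
stats = {"cache_hits": 0, "cache_misses": 0, "cache_age_outs": 0}
STATS_PATH = "app_stats.json"  # assumed file name, purely for illustration

def record(event):
    """Bump one counter; called from the code paths being measured."""
    stats[event] += 1

def dump_stats(path=STATS_PATH):
    """Write a snapshot of the counters to a file without stopping the program."""
    snapshot = dict(stats, dumped_at=time.time())
    with open(path, "w") as f:
        json.dump(snapshot, f, indent=2)
    return snapshot

def install_stats_handler():
    """Dump statistics whenever the process receives SIGUSR1 (Unix only)."""
    signal.signal(signal.SIGUSR1, lambda signum, frame: dump_stats())
```

Call `install_stats_handler()` once at startup, sprinkle `record(...)` calls through the code, and then `kill -USR1 <pid>` from another terminal writes the current numbers to the file while the program keeps running.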
This was all part of what we then understood as “understanding your software” and, to be honest, it was why some people had a reputation for writing stuff that ran remarkably fast, and others – didn’t. There was one time, on a challenge, that I improved a client’s benchmark code so it ran 30,000 times faster and demonstrated that, to their amazement, on a DECsystem 5500 before stopping and explaining, honestly, that the performance wasn’t because of the hardware, it was because their application had been doing a linear search (good god!) and I put a simple hash index in front of it. To keep all this tied together, consider a linear search versus a hashed search: suppose there are 1,000,000 entries in your wossname. A linear search looks at (on the average) 500,000 entries each time you search for one. If you put a hash table in front of it with 1075 buckets, each bucket holds about 930 entries, so on average you’re looking at something like 465. If you put a b-tree in front of it, you’re looking at something like 2 or 3 nodes, depending on the fan-out.
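The comparison is easy to check by counting. The sketch below uses the entry and bucket counts from the text; everything else (the `key % BUCKETS` toy hash, the random storage order, the sample of 100 lookups) is an assumption made for illustration. With 1,000,000 entries spread over 1075 buckets, each bucket ends up with roughly 930 entries, and a lookup that finds its target scans about half of a bucket on average.

```python
import random

N = 1_000_000    # entries, as in the text
BUCKETS = 1075   # bucket count, as in the text

keys = list(range(N))
random.Random(0).shuffle(keys)  # arbitrary but repeatable storage order

# "Hash index": each bucket holds the keys that hash to it, in arrival order.
buckets = [[] for _ in range(BUCKETS)]
for k in keys:
    buckets[k % BUCKETS].append(k)  # toy hash function: key mod bucket count

def linear_probes(target):
    """Entries examined by a front-to-back scan of the whole list."""
    return keys.index(target) + 1

def hashed_probes(target):
    """Entries examined when only the target's bucket is scanned."""
    return buckets[target % BUCKETS].index(target) + 1

sample = random.Random(1).sample(keys, 100)
avg_linear = sum(linear_probes(t) for t in sample) / len(sample)
avg_hashed = sum(hashed_probes(t) for t in sample) / len(sample)
print(f"linear: ~{avg_linear:,.0f} entries examined per lookup")
print(f"hashed: ~{avg_hashed:,.0f} entries examined per lookup")
```

The exact averages depend on the sample, but the gap of about three orders of magnitude does not, and a real hash table with far more buckets would shrink the second number to a handful.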
The point is that the implementation needs to be optimized for the problem that is at hand. And, most importantly, for the execution branch that takes the most time. Consider something that LLMs do: they convert input words into tokens, basically numbers, which then get marshaled through assloads of matrix multiplications (a faster way to implement a directed graph). So, I know one thing for a fact right now: there are at least 2 companies exploring developing Very Large Scale Integration (VLSI) silicon that does the parts of AI processing that GPUs do. What most people don’t know or care about is that almost all of the nifty cool functionality of the GPU is unnecessary – it’s mainly valuable as a pool of memory that’s not on the system main bus – think of it as a magic box you hand a ton of stuff to, along with the program to run – and you’re using a tiny percentage of its capabilities really hard and really heavy. Of course different implementations will be different but right now the way to write AI software is to use a GPU running CUDA and use as many cores as you can for parallelism, and … that’s how it’s done. A VLSI matrix multiplier/RAM cache board will (once it’s developed) completely kick a GPU’s ass into the weeds, once someone writes code that either runs its native framework, or translates the necessary parts of CUDA and nulls the rest out. That may sound crazy, but I remember back in the day when one of the engineers up in Maynard developed a cross-compiler that ate 80486 code and spit out Alpha2164 code. Rarely-used instructions were replaced with jumps to software emulations on the Alpha, but the frequently used stuff ran with the control rods out, superscalar, superpipelined, and with a snort of meth washed down with tequila. It turned out to be a problem because it ran Windows faster than anything Microsoft had ever seen, which made them ask Intel “WTF?” and it made Intel unhappy because it blew the 80486 out of the water, too.
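The aside about matrix multiplication being a faster way to implement a directed graph can be shown on a tiny case. This is a sketch with a made-up network (three input nodes, two output nodes, five weighted edges), not anything taken from a real model: walking the edges one at a time and doing a single matrix-vector product give the same answer, and the matrix form is the one that silicon can do in bulk.

```python
# Hypothetical toy network: 3 input nodes feeding 2 output nodes.
# Each edge is (source, destination, weight).
edges = [
    (0, 0, 0.5), (1, 0, -1.0), (2, 0, 2.0),
    (0, 1, 1.5), (2, 1, 0.25),
]
inputs = [1.0, 2.0, 3.0]

def propagate_graph(edges, inputs, n_out):
    """Walk every edge, pushing weight * input into the destination node."""
    out = [0.0] * n_out
    for src, dst, w in edges:
        out[dst] += w * inputs[src]
    return out

def to_matrix(edges, n_in, n_out):
    """Store the same edges as a dense n_out x n_in weight matrix (0 = no edge)."""
    m = [[0.0] * n_in for _ in range(n_out)]
    for src, dst, w in edges:
        m[dst][src] = w
    return m

def matvec(m, v):
    """Plain matrix-vector product, one dot product per output node."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

graph_out = propagate_graph(edges, inputs, n_out=2)
matrix_out = matvec(to_matrix(edges, n_in=3, n_out=2), inputs)
assert graph_out == matrix_out
print(graph_out)  # [4.5, 2.25]
```

The edge walk chases pointers; the matrix version is nothing but multiply-and-add over a regular grid, which is why a chip that does only that, very wide and very fast, covers most of the work.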
My point is not that someone is going to do exactly this, but rather that THE CURRENT SITUATION IS NOT SUSTAINABLE. Therefore it will not be sustained.
The data center scalability problem is going to be solved with better algorithms, better tailored hardware, better data representations, better frameworks, etc. If you look at every generation of computing, it’s easy to go “what the hell were they thinking?” because ‘they’ hadn’t thought of the better solution – yet. Imagine that humanity suddenly develops a great desire to move dirt. One farmer discovers that his pickup truck is good for moving dirt. Soon all the humans are buying pickup trucks and moving dirt. Industry analysts predict that humanity will starve because the pickup trucks will consume all mankind’s resources. Investors stroke huge checks for new factories for pickup trucks. Pickup truck operators stagger drunkenly into casinos and bet a month’s wages on a single throw of the dice and walk away laughing. Then, one engineer at some pickup truck company invents a dump truck. And another invents a backhoe.
Analogy aside, another extremely interesting possibility may be that the problem optimizes away most of itself. What do I mean? I lived through the internet firewall industry. Arguably, I created the internet firewall industry over Christmas break in 1987 when I was working as a consultant for Digital. The first firewalls were software stacks that ran in application space. The second generation ran in kernel space and were much faster. Now, a firewall is a feature on any number of network physical interface chips – you just plug in the behavior you want and it works. You buy a phone, one comes baked in for free. Does it have all the features that the first generation firewalls had? No, of course not. But what has happened is, like a finely aged prosciutto, all the non-firewally bits have been sliced away until the concept is so purely distilled it can be implemented in VLSI on any physical interface that also has a CAM table for address lookup. It took about 20 years for firewalls to mature themselves out of existence. It’s going to take less than that for LLMs; there will be LLMs running in phones and cars, which means they will have to be running on custom VLSI or using algorithms that some guy somewhere (probably in China) is coming up with just by saying, “hey… what if instead of doing this we did that?”
That’s enough for part 1. Something’s going to break – in a good way – and when it does, it’ll be interesting. The power consumption may drop spectacularly. Most of the power consumption of the data centers is spent running pickup trucks. The industry is unlikely to keep adding more and more pickup trucks because, right now, the plans for dump trucks and conveyor belts are being implemented. Laissez les bons temps rouler!
Now, let’s talk waste heat.
What the fucking croque de merde!? What the HELL is “waste heat”? Sure, joking aside, what we’re really talking about is “heat where we don’t want heat to be.” So how do we move heat? There are techniques for this! It turns out that heat is useful! In a fit of absurdity, modern data centers are focused on “how do we get rid of waste heat?” In a sensible world, people would ask, “who might buy assloads of heat?” Alright, now, it’s not hot enough to make steam, so we can’t generate electricity with it, but… Hang on, I’m getting ahead of myself. Picture a data center that, instead of having fans blowing heat into the air, which is then cooled with air conditioning, has each computer connected by some hoses to a loop of hoses to a central manifold, and all the hoses are running glycol. Not drinking water: glycol. It’s a much better heat exchange medium, and if you have it in a closed loop, you don’t lose it and you don’t have to replenish it constantly. Also, it does not grow bacteria and fuck up the thermal joint at the CPU. Now, there’s a massive hot loop of glycol that goes into a chiller/heat exchanger. There, the “waste” heat is now captured in the form of hot water, which is practically of infinite usefulness. A smart data center could:
- Contact the local township and offer to sell them hot water in the winter.
- Build an evaporation chamber where hot water steams off the surface, and is distilled into fresh water. Desalinization is a big deal. Is this the most efficient way to do it? Of course not, but since the whole purpose of the system is to cool down water, consuming that heat in the latent heat of evaporation is just common sense. Instead of consuming water, the data center now produces fresh distilled AI SWEAT(tm) water.
- Build a large greenhouse next to the data center, which grows delicious mango-nanas year-round.
- Build a wet sandmound heat battery that captures some of the heat, so it can be sold as pre-heated water to any process requiring hot but not boiling water.
- Even processes that could use boiling water will save a lot of money if they’re taking in plausibly hot water then boosting it to boiling. Imagine a solar farm of, oh, mirrors coinciding on a small pipe full of pure steaming hot AI sweat which flashes into steam and, uh, drives a piston. It’s amazing how energy capture always has steam; I have come to believe that the steam transition is so miraculous I should believe in god, but I don’t.
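To put a rough number on the evaporation idea in the list above, here is a back-of-the-envelope sketch. The 1 MW heat load is a made-up figure, and the latent heat value is the textbook approximation for water near 100 °C (it is somewhat higher at lower temperatures). The result is an upper bound, since it assumes every joule goes into evaporation and nothing leaks away.

```python
# Assumed figures for illustration only; none of them come from a real facility.
WASTE_HEAT_KW = 1_000.0        # hypothetical: 1 MW of heat to get rid of
LATENT_HEAT_KJ_PER_KG = 2_260  # approx. latent heat of vaporization of water near 100 C

def evaporation_rate_kg_per_hour(heat_kw, latent_kj_per_kg=LATENT_HEAT_KJ_PER_KG):
    """Water evaporated per hour if all of the heat goes into evaporation."""
    kg_per_second = heat_kw / latent_kj_per_kg  # a kW is a kJ per second
    return kg_per_second * 3600

rate = evaporation_rate_kg_per_hour(WASTE_HEAT_KW)
print(f"{rate:,.0f} kg of water per hour, roughly {rate / 1000:.1f} cubic metres")
```

Under those assumptions, each megawatt of heat could distill on the order of a cubic metre and a half of water per hour, which is the scale at which "who might buy this?" becomes a sensible question.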
I’m going to conclude by pointing out something that ought to be obvious: in tech, change is evolutionary, progressive, and relentless. Whatever you think is a super cool idea right now, will eventually be looked upon with mild distaste or shocked horror – like telling kids nowadays that my first real program was a shoebox full of punch-cards and I lost it when the tape tore at the corner of the box and the cards went cascading down the back steps of Gilman High School. From there, I graduated, indeed, to 5-something-inch floppy disks and then 3-inch floppy disks and then a hard drive that held a whopping 10 megabytes and cost $200. There’s a point to that progression: the floppies were an incremental change in technology. The switch from floppy to hard drive was a mind-blowing adjustment for me. 5 years after that, I was managing systems that had 100 megabytes of hard disk space. The more it changes, the different it is.
As I write this, I saw some interesting stuff (replete with AI art of its own) regarding supposed Chinese data centers that are submersible. So, you have wave-generation for power, and an ocean for cooling. It seems like fiction, to me, since I tend to think the Chinese are smarter than that. For one thing, the entire system is integrated so it’s tightly coupled – which means that an “oops” over here is a deadly expensive crash over there. I can see ocean power generation; great idea. The Chinese are killing it, in terms of extracting power from all sorts of things. But the whole value of having a “power grid” is that the elements are decoupled except through a common interface, the grid. I also suspect that the Chinese will be some of the leaders in re-writing AI software to make it lighter, faster, and better. They recently did a version of an AI LLM+ called “DeepSeek” which flabbergasted US AI strategists. The US had expected to be able to stymie Chinese attempts at building AI, by making it hard for them to buy pickup trucks, er, GPUs in massive quantities. One of the big costs in producing AIs is training the knowledge-bases that the AIs use. Companies like OpenAI were bragging about spending billions building their knowledge models, and the Chinese built an AI teacher – an AI trained to train other AIs. So, instead of throwing a billion tons of data at the thing and expecting it to sort through it, the Chinese AI simply told the trainee AI, “these are a billion important things.” I don’t want to rain on anybody’s parade but it’s possible that the US’ immediate reaction of restricting Chinese access to pickup trucks, er, GPUs encouraged them to invent the dump truck, er, AI-training AI. At present, I characterize the US AI research environment as:
- Collect underpants
- ??
- Profit!
Joking aside, some good will come of it. Some bad, too. No doubt. That’s how technological progress works.
