Analysis Hotter and more power-hungry CPUs and GPUs were already causing headaches for datacenter operators before Nvidia unveiled its 1,200W Blackwell GPUs at GTC last week.
Over the past year, datacenter operators and colocation providers have expanded support for high-density deployments through the use of rear-door heat exchangers (RDHx) and, in some cases, direct-to-chip (DTC) liquid cooling, in anticipation of rising chip temps.
Looking at Nvidia's Blackwell lineup, it appears these modifications have been warranted. At roughly 60kW per rack — 14.3kW per node — a stack of four DGX B200 systems is already pushing the limits of standard air-cooled racks in Digital Realty's facilities.
And that's not even Nvidia's most powerful system. Its latest GB200 NVL72 rack-scale systems, which we looked at in detail last week, are rated for 120kW and - to no one's surprise - absolutely demand liquid cooling.
That's a lot of heat to wick away from a single rack, but there's more to the story. Let's take a look at Blackwell's power and efficiency gains.
Performance per watt
During the Blackwell launch, Nvidia made bold claims about the performance and efficiency of its chips. We'll get to those a bit later, but for now let's take a look at how these chips stack up in terms of raw floating point operations per watt (FLOPS/W).
| GPU Perf/W | B100 (SXM) | B200 (SXM) | GB200 (GPU only) | H100 (SXM) | A100 (SXM) |
|---|---|---|---|---|---|
| TDP | 700W | 1,000W | 2,400W | 700W | 400W |
| TF32 | 2.6 TFLOPS/W | 2.2 TFLOPS/W | 2.08 TFLOPS/W | 1.41 TFLOPS/W | 0.78 TFLOPS/W |
| FP16 | 5 TFLOPS/W | 4.5 TFLOPS/W | 4.16 TFLOPS/W | 2.82 TFLOPS/W | 1.56 TFLOPS/W |
| FP8/INT8 | 10 T(FL)OPS/W | 9 T(FL)OPS/W | 8.33 T(FL)OPS/W | 5.65 T(FL)OPS/W | 3.12 TOPS/W |
| FP4 | 20 TFLOPS/W | 18 TFLOPS/W | 16.66 TFLOPS/W | NA | NA |
Note: We didn't include FP64 performance in this lineup as Blackwell actually performs worse than Hopper in double precision workloads.
Looking solely at GPU efficiency, Blackwell shows strong gains, offering about 1.7x the efficiency of Hopper and 3.2x that of Ampere when normalized to FP16. Obviously, if your workload can take advantage of lower precision then you can expect to see even stronger gains, but the conclusions remain about the same.
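The efficiency figures above reduce to simple division: peak throughput over TDP. A quick sketch reproduces the ratios quoted here (the FP16 throughput values are the spec-sheet numbers implied by the table, so treat them as approximations):

```python
# Peak FP16 throughput (TFLOPS) and TDP (W) implied by the table above;
# these are vendor spec-sheet figures, not measured numbers.
specs = {
    "B100": (3500, 700),   # 5.0 TFLOPS/W
    "H100": (1979, 700),   # ~2.82 TFLOPS/W
    "A100": (624, 400),    # 1.56 TFLOPS/W
}

# Efficiency is just throughput divided by power draw
eff = {gpu: tflops / watts for gpu, (tflops, watts) in specs.items()}

vs_hopper = eff["B100"] / eff["H100"]   # roughly 1.7x
vs_ampere = eff["B100"] / eff["A100"]   # roughly 3.2x
```

Swapping in the TF32 or FP8 figures shifts the absolute numbers but leaves the generational ratios largely intact, which is why the conclusions hold across precisions.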
But when we compare the Blackwell GPU SKUs, we start to see diminishing returns on performance past the 700W mark. While it might look like we're just trading power for FLOPS with the 1,000W B200 and the GB200's twin 1,200W accelerators, that's not quite accurate.
Unlike the H100, none of the Blackwell parts is available as a standalone PCIe card, yet. That means buying them as part of an HGX, DGX, or Superchip-derived configuration, which puts the minimum at two GPUs for the GB200 or eight for HGX B100 or B200-based systems.
| System Perf/W | HGX B100* | DGX B200 | GB200 NVL72 | DGX H100 | DGX A100 |
|---|---|---|---|---|---|
| TDP | 10.2kW | 14.3kW | 120kW | 10.2kW | 6.5kW |
| TF32 | 1.41 TFLOPS/W | 1.23 TFLOPS/W | 1.5 TFLOPS/W | 0.77 TFLOPS/W | 0.38 TFLOPS/W |
| FP16 | 2.74 TFLOPS/W | 2.51 TFLOPS/W | 3 TFLOPS/W | 1.55 TFLOPS/W | 0.76 TFLOPS/W |
| FP8/INT8 | 5.49 T(FL)OPS/W | 5.03 T(FL)OPS/W | 6 T(FL)OPS/W | 3.10 T(FL)OPS/W | 1.53 T(FL)OPS/W |
| FP4 | 10.98 TFLOPS/W | 10.06 TFLOPS/W | 12 TFLOPS/W | NA | NA |
Note: Since there is no DGX B100 configuration, our "HGX B100" figures are based around the DGX H100's 10.2kW max power draw, since it's a drop-in replacement designed to work within the same thermal and power constraints.
Looking at the efficiency of the air-cooled systems fully loaded with CPUs, memory, networking, and storage, we see that even with a larger 10U chassis to accommodate a bigger heatsink stack, the DGX B200 appears to be less efficient than the HGX B100.
So what's going on? As you might already suspect, 1,000W is one heck of a lot harder to cool than 700W, especially since the fans have to spin faster to push more air through the heat sinks.
We get a better view of this when we add in Nvidia's power-hungry GB200 NVL72 with its 120kW appetite.
At the rack scale, we're comparing four DGX-style systems per rack against a single GB200 NVL72 setup. Again we see a familiar trend. Despite the water-cooled system's GPUs running 200W hotter than the DGX B200's, the rackscale system manages to deliver 2.5x the performance while consuming a little over twice the power.
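The rack-level comparison works out like this (using Nvidia's sparse FP4 peak figures of 144 petaFLOPS per DGX B200 and 1.44 exaFLOPS for the NVL72, plus the TDPs from the table above):

```python
# Rack-level comparison: four DGX B200 boxes vs one GB200 NVL72,
# using Nvidia's sparse FP4 peak throughput figures.
dgx_b200_fp4_pflops = 144      # petaFLOPS per DGX B200 system
dgx_b200_power_kw = 14.3
nvl72_fp4_pflops = 1440        # 1.44 exaFLOPS for the NVL72
nvl72_power_kw = 120

rack_pflops = 4 * dgx_b200_fp4_pflops   # 576 PFLOPS per air-cooled rack
rack_kw = 4 * dgx_b200_power_kw         # 57.2 kW per air-cooled rack

perf_ratio = nvl72_fp4_pflops / rack_pflops   # 2.5x the performance...
power_ratio = nvl72_power_kw / rack_kw        # ...for roughly 2.1x the power
```

The same arithmetic at FP8 or FP16 gives the same ratios, since both systems scale throughput linearly with precision.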
From the graph, you can also see that the liquid cooled NVL system is actually the most efficient of the bunch, no doubt owing to the fact it isn't dumping 15-20 percent of its power into fans.
Bear in mind you'll still need to power facility equipment like coolant distribution units (CDUs), which isn't accounted for in these figures; then again, neither are the air handlers required to cool the conventional systems.
And here is where we can start drawing some conclusions about what Blackwell will mean from a practical standpoint for datacenter operators.
Nvidia's HGX is still a safe bet
Nvidia's latest Blackwell HGX carrier board on display at GTC24.
One of the biggest takeaways from Nvidia's Blackwell generation is that power and thermals matter. The more power you give these chips and the cooler you can keep them, the better they perform — up to a point.
If your facility is right on the edge of being able to support Nvidia's DGX H100, B100 shouldn't be any harder to manage, and, of the air-cooled systems, it looks to be the more efficient option, at least based on our estimates.
While the DGX B200 might not be as efficient at full load, it is still 28 percent faster than the B100 box. In the real world, where chips are seldom running right up to the redline 24/7, the two may be closer than they look on paper.
In either case, you're still looking at a considerable improvement in compute density over Hopper. Four DGX B200 boxes are able to replace 9-18 H100 systems depending on whether or not you can take advantage of Blackwell's FP4 precision.
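The 9-18x figure falls out of the peak throughput specs. A sketch, assuming Nvidia's sparse peak numbers (32 petaFLOPS of FP8 for the DGX H100, which has no FP4 support, versus 72 petaFLOPS of FP8 or 144 petaFLOPS of FP4 for the DGX B200):

```python
# How many DGX H100 systems does a rack of four DGX B200s replace?
# Figures are Nvidia's sparse peak throughput specs, in petaFLOPS.
dgx_h100_fp8_pflops = 32    # DGX H100 tops out at FP8; no FP4 support
dgx_b200_fp8_pflops = 72
dgx_b200_fp4_pflops = 144

boxes_per_rack = 4

# Matching precision (FP8 vs FP8): each rack replaces 9 H100 systems
replaced_fp8 = boxes_per_rack * dgx_b200_fp8_pflops / dgx_h100_fp8_pflops

# Dropping to FP4 on Blackwell doubles that to 18 systems
replaced_fp4 = boxes_per_rack * dgx_b200_fp4_pflops / dgx_h100_fp8_pflops
```

The spread entirely comes down to whether your inference workload tolerates 4-bit quantization; at matched precision the advantage is the smaller 9x figure.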
Fewer, denser racks point the way to a liquid cooled future
One of the bigger challenges datacenter operators are likely to face in accommodating the DGX B200 is its higher rack power density. With four boxes in a rack, we're looking at roughly 40 percent higher power and cooling requirements than the H100 systems — 57.2kW versus 40.8kW.
If your datacenter can't support these denser configurations, then you may be forced to opt for two-node racks, effectively eliminating any space savings Blackwell might have bought you. This might not be a big deal if your models haven't gotten any bigger, or if you can accommodate a longer training time and take advantage of Blackwell's capacious 192GB of HBM3e. But if your models have grown, or your training or fine-tuning timetables have shrunk, this could prove something of a headache.
The GB200 NVL72 is a rackscale system that uses NVLink switch appliances to stitch together 36 Grace-Blackwell Superchips into a single system.
The situation is a little different for the GB200 NVL72 range. More than 22 HGX H100 systems can be condensed into just one of these liquid cooled systems. Or put another way, in the space required to support one model, you can now support one 5.5x larger.
Having said that, doing so is going to require liquid cooling if you want to push Blackwell to its full potential.
The good news is that many of the bit barns we've seen announce support for Nvidia's DGX H100 systems, including Equinix and Digital Realty, already employ some form of liquid cooling — usually rear-door heat exchangers — and DTC is becoming more common.
Some of these rear door configurations claim to support 100 or more kilowatts of heat rejection, so theoretically you could strap one of these to the NVL72 and dump that heat into the hot aisle. Whether or not your facility air handlers can cope with that is another matter entirely.
As such, we suspect that liquid-to-liquid CDUs are going to be the preferred means of cooling racks this dense.