Why liquid cooling for AI data centers is harder than it looks

While liquid cooling for data centers is rightly considered an emerging technology, it’s not new. Early IBM mainframes in the 1960s and Cray supercomputers in the 1970s and ’80s featured liquid cooling. Notably, a Cray system purchase included a full-time technician for installation, operation and maintenance.

Why AI is accelerating liquid cooling demand

Today, generative AI is reshaping the way compute and data centers are designed. Accelerated compute servers now incorporate two to 16 graphics processing units (GPUs) per server, alongside central processing units (CPUs) and even data processing units (DPUs). These servers are powerful number crunchers optimized for AI model training, but they consume over 20 times the power of standard Intel-based CPU cloud servers, and they output 20 times more heat per server.

This heat output means these servers can only be liquid-cooled. Most now come standard with input and output piping for circulating liquid coolant.
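The arithmetic behind that claim is straightforward: essentially every watt a server draws is released as heat the cooling system must remove. A quick back-of-envelope sketch makes the point; the baseline draw of a standard CPU cloud server below is an illustrative assumption, not a figure from the article.

```python
# Back-of-envelope heat estimate: essentially all electrical power a
# server draws is converted to heat the cooling system must remove.
# The baseline server draw below is an illustrative assumption.

BASELINE_CPU_SERVER_KW = 0.8   # typical CPU cloud server draw (assumed)
MULTIPLIER = 20                # "over 20 times" per the article

accelerated_server_kw = BASELINE_CPU_SERVER_KW * MULTIPLIER
heat_output_kw = accelerated_server_kw  # power in ~= heat out

print(f"Estimated draw per accelerated server: {accelerated_server_kw:.0f} kW")
print(f"Heat to remove per server: {heat_output_kw:.0f} kW")
```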


Managing heat: Power, density and design challenges

Rack thermal demands have surged with each new generation of GPU-accelerated servers. A rack fully loaded with the latest NVIDIA-based GPU servers draws 132 kW of power, and densities continue to increase: the next generation, expected in under a year, will require 240 kW per rack.

The dominant cooling method is direct-to-chip, or cold plate, cooling. As the name suggests, it cools only the chips, not the rest of the components in the chassis or rack, so supplemental air cooling must still cover 20% to 30% of the total thermal load.
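To make that split concrete, the sketch below divides an assumed 132 kW rack between the liquid and air paths and estimates the coolant flow the cold-plate loop needs using the standard relation Q = m_dot * cp * delta_T; the liquid fraction and loop temperature rise are illustrative assumptions.

```python
# Split a rack's thermal load between the cold-plate (liquid) loop and
# supplemental air cooling, then estimate the coolant flow the liquid
# loop needs via the standard relation Q = m_dot * cp * delta_T.
# The liquid fraction and temperature rise are illustrative assumptions.

RACK_POWER_KW = 132      # current-generation figure cited above
LIQUID_FRACTION = 0.75   # midpoint of the 70%-80% captured by cold plates
CP_WATER = 4186          # J/(kg*K), specific heat of water
DELTA_T_K = 10.0         # assumed coolant temperature rise across the rack

liquid_load_kw = RACK_POWER_KW * LIQUID_FRACTION
air_load_kw = RACK_POWER_KW - liquid_load_kw

mass_flow_kg_s = liquid_load_kw * 1000 / (CP_WATER * DELTA_T_K)
flow_lpm = mass_flow_kg_s * 60  # water is roughly 1 kg per litre

print(f"Liquid loop load: {liquid_load_kw:.0f} kW; air load: {air_load_kw:.0f} kW")
print(f"Required coolant flow: {flow_lpm:.0f} L/min per rack")
```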

Cooling is a complex architecture

Whether you’re a large enterprise or a seasoned data center operator, it’s unlikely you have the in-house expertise to design and deploy hybrid (liquid and air) cooling systems at these extreme densities. Specialized expertise is essential in designing, procuring, deploying, operating and maintaining such systems.

Direct-to-chip cooling systems require two separate cooling loops: one for the IT room and another for heat rejection. Coolant distribution units (CDUs) interface between the two. When designing these systems, select a partner experienced with the full cooling architecture: manifolds, piping, CDUs, chillers, pumps and cabinets.
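A minimal sketch of the relationship the CDU mediates: the secondary (IT room) loop can never run colder than the primary (facility) loop's supply temperature plus the heat exchanger's approach temperature. All figures below are illustrative assumptions, not values from any specific product.

```python
# The CDU's heat exchanger couples the two loops: the secondary (IT room)
# loop can only be cooled to the primary (facility) supply temperature
# plus the exchanger's "approach". All figures are illustrative assumptions.

FACILITY_SUPPLY_C = 30.0   # primary-loop water arriving at the CDU (assumed)
CDU_APPROACH_C = 3.0       # heat-exchanger approach temperature (assumed)
RACK_DELTA_T_C = 10.0      # coolant temperature rise across the racks (assumed)

secondary_supply_c = FACILITY_SUPPLY_C + CDU_APPROACH_C
secondary_return_c = secondary_supply_c + RACK_DELTA_T_C

print(f"Coolant supplied to racks: {secondary_supply_c:.1f} C")
print(f"Coolant returning to CDU: {secondary_return_c:.1f} C")
```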

These components must function together, requiring compatibility, integrated controls and performance tuning. Choose vendors familiar with piping, fluid dynamics, pressure and flow rates—and ideally, ones that offer warranties and have certifications from GPU manufacturers.

The role of simulation and software

Given the extreme heat densities, trial-and-error approaches will extend the “time to cooling” and reduce the odds of success. Choose a partner that uses digital twin modeling and simulation to validate the high-density data center cooling system design virtually before deployment.

Prioritize vendors that work directly with GPU manufacturers, have conducted lab testing or have proven deployments. Some vendors also offer pre-engineered and prefabricated cooling systems, which accelerate deployment and reduce risk.

Downtime is not an option

At these densities, even a brief interruption in liquid flow can lead to thermal throttling or overheating in seconds. CDUs must include redundancy—dual pumps and power supplies should be standard.
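Basic physics shows why the window is seconds, not minutes: with flow stopped, the rack's full heat output pours into the modest thermal mass of the cold plates and trapped coolant. The thermal mass and throttling headroom in the sketch below are illustrative assumptions.

```python
# Estimate how quickly temperatures climb if coolant flow stops: the
# rack's heat accumulates in the cold plates and trapped coolant.
# Thermal mass and throttling headroom are illustrative assumptions.

RACK_POWER_W = 132_000          # heat still being generated
THERMAL_MASS_J_PER_K = 60_000   # cold plates plus trapped coolant (assumed)
HEADROOM_K = 15.0               # margin before thermal throttling (assumed)

heating_rate_k_per_s = RACK_POWER_W / THERMAL_MASS_J_PER_K
seconds_to_throttle = HEADROOM_K / heating_rate_k_per_s

print(f"Temperature rise with no flow: {heating_rate_k_per_s:.1f} K/s")
print(f"Time to throttling threshold: {seconds_to_throttle:.0f} s")
```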

Uninterruptible power supplies must support CDUs to ensure continuity during transitions to backup systems or generators. Leak detection software is also critical in the data center’s white space; even a small leak can crash a server or cluster.
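One common heuristic such software applies is a flow balance check: if less coolant returns than was supplied, some of it is escaping somewhere. The sketch below is a simplified illustration; the threshold, function name and readings are hypothetical, and production systems pair this logic with dedicated leak-sensing cable under the piping.

```python
# Simplified flow-balance leak check: alarm when measurably less coolant
# returns than was supplied. The threshold, function name and readings
# are hypothetical; real systems also use dedicated leak-sensing cable.

MISMATCH_THRESHOLD_LPM = 0.5  # assumed alarm threshold

def check_for_leak(supply_lpm: float, return_lpm: float) -> bool:
    """Flag a possible leak when return flow lags supply flow."""
    return (supply_lpm - return_lpm) > MISMATCH_THRESHOLD_LPM

# Example readings from hypothetical loop flow meters
if check_for_leak(supply_lpm=142.0, return_lpm=140.8):
    print("ALERT: possible coolant leak; isolate the affected loop")
```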

Optimization requires AI, too

Once operational, your liquid cooling system needs continuous tuning. Precision matters: even minor temperature increases can degrade GPU performance and slow down AI model training.

AI software can dynamically adjust cooling system parameters—like water temperatures, flow rates and airflow—in real time. These systems can even learn from operational data to optimize performance over time.
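As a simplified stand-in for what such software does (real products layer learned models on live telemetry), a basic proportional control loop captures the idea; the setpoint, gain and pump limits below are illustrative assumptions.

```python
# Simplified stand-in for AI-driven tuning: a proportional controller
# nudges coolant flow toward a GPU temperature setpoint. Real products
# layer learned models on telemetry; all values here are assumptions.

SETPOINT_C = 70.0       # target hottest-GPU temperature (assumed)
GAIN_LPM_PER_K = 2.0    # flow correction per degree of error (assumed)
MIN_FLOW, MAX_FLOW = 80.0, 200.0  # pump limits in L/min (assumed)

def adjust_flow(current_flow_lpm: float, hottest_gpu_c: float) -> float:
    """Raise flow when GPUs run hot, ease off when they run cool."""
    error = hottest_gpu_c - SETPOINT_C
    return max(MIN_FLOW, min(MAX_FLOW, current_flow_lpm + GAIN_LPM_PER_K * error))

flow = 140.0
for temp_c in (72.5, 74.0, 71.0, 69.5):  # sample telemetry readings
    flow = adjust_flow(flow, temp_c)
    print(f"hottest GPU {temp_c:.1f} C -> flow {flow:.1f} L/min")
```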

Choose vendors with an eye on the future

The pace of GPU evolution is placing intense demands on cooling vendors. When selecting a partner, ask about their technology roadmaps—can they support future generations of GPUs with even higher thermal densities?

Liquid cooling may still be categorized as “emerging,” but it is quickly becoming essential infrastructure. Companies aiming to scale AI must partner with vendors capable of supporting today’s and tomorrow’s liquid cooling requirements.

This article was originally published in Forbes.
