The rise of direct-to-chip cooling as a top AI cooling system

When data center operators ask, “What are the top AI cooling systems for AI workloads in data centers today?” they are rarely looking for a product comparison. What they are really seeking is an architecture that reliably supports high-density AI at scale.

The two main liquid cooling systems for AI workloads are direct-to-chip cooling and immersion cooling, both available in single-phase and two-phase variants. Direct-to-chip cooling circulates coolant through cold plates mounted on CPUs and GPUs, while immersion cooling submerges IT equipment in a dielectric fluid.

Most liquid-cooled capacity in use today is single-phase direct liquid cooling. It’s expected to continue dominating thanks to advances in cold plate design as heat loads from next-generation accelerator chips continue to rise.

The use of two-phase direct liquid cooling is likely to grow gradually, with adoption accelerating as chip-level thermal design power (TDP) and thermal flux exceed the practical limits of single-phase systems. Until then, deployments are likely to remain focused on pilots and early large-scale implementations. Immersion cooling, by contrast, is finding its place through selective adoption, where its architectural trade-offs are justified by performance or operational requirements.

As single-phase direct-to-chip cooling gains momentum, let’s turn our focus to how the technology operates and the role it plays in next-generation cooling architectures.

Explore the AI liquid cooling solutions resource site

How direct-to-chip cooling works

Direct-to-chip cooling is a closed-loop AI data center cooling system with cold plates that remove heat directly from GPUs and CPUs. The coolant is distributed at the rack level through manifolds and quick-disconnect interfaces that feed multiple servers.
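
The thermal sizing behind a cold-plate loop comes down to one relation: the heat removed equals coolant mass flow times specific heat times the temperature rise across the loop (Q = ṁ · cp · ΔT). Here is a minimal sketch of that calculation; the rack power, ΔT target, and fluid properties are illustrative assumptions for a water-glycol mix, not vendor specifications.

```python
# Minimal sketch: coolant flow required for a direct-to-chip rack loop.
# Governing relation: Q = m_dot * c_p * delta_T.
# All figures below are illustrative assumptions, not vendor specs.

RACK_POWER_W = 80_000.0      # assumed heat load captured by cold plates (80 kW rack)
DELTA_T_K = 10.0             # assumed coolant temperature rise across the rack (K)
C_P_J_PER_KG_K = 3800.0      # approx. specific heat of a water-glycol mix (J/kg*K)
DENSITY_KG_PER_M3 = 1030.0   # approx. density of the mix (kg/m^3)

mass_flow_kg_s = RACK_POWER_W / (C_P_J_PER_KG_K * DELTA_T_K)
volume_flow_lpm = mass_flow_kg_s / DENSITY_KG_PER_M3 * 1000.0 * 60.0

print(f"Mass flow:   {mass_flow_kg_s:.2f} kg/s")    # ~2.1 kg/s
print(f"Volume flow: {volume_flow_lpm:.1f} L/min")  # ~123 L/min
```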

At the center of the direct-to-chip loop is a coolant distribution unit (CDU), which circulates the coolant while controlling temperature and pressure to prevent overheating. The CDU keeps the coolant and the data center’s water separate to protect equipment from temperature and pressure variations in the facility’s water. However, the system ultimately rejects heat through the facility’s water infrastructure, leveraging outdoor cooling when temperatures are low (known as economization) and switching to mechanical cooling otherwise.
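
The economization decision itself reduces to comparing outdoor conditions against the supply temperature the loop needs, with some hysteresis to avoid rapid mode cycling. The sketch below illustrates that switchover logic; the threshold, dead band, and mode names are hypothetical, not drawn from any specific CDU controller.

```python
# Simplified sketch of facility-side heat rejection mode selection.
# The threshold and dead band are illustrative assumptions.

ECONOMIZER_MAX_OUTDOOR_C = 18.0  # assumed: below this, outdoor conditions can reject heat
DEAD_BAND_C = 2.0                # hysteresis to prevent rapid mode cycling

def select_mode(outdoor_temp_c: float, current_mode: str) -> str:
    """Choose 'economizer' or 'mechanical' heat rejection with hysteresis."""
    if current_mode == "economizer":
        # Stay on free cooling until outdoor temperature clearly exceeds the limit.
        if outdoor_temp_c > ECONOMIZER_MAX_OUTDOOR_C + DEAD_BAND_C:
            return "mechanical"
    else:
        # Switch back only once outdoor temperature is clearly below the limit.
        if outdoor_temp_c < ECONOMIZER_MAX_OUTDOOR_C - DEAD_BAND_C:
            return "economizer"
    return current_mode

mode = "mechanical"
for reading_c in [15.0, 17.5, 19.0, 21.0, 16.5]:
    mode = select_mode(reading_c, mode)
    print(f"{reading_c:5.1f} C -> {mode}")
```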

Power distribution hardware, including the busway and PDUs, physically intersects with liquid routing, rack layouts, and service access. This makes power and cooling inseparable in high-density environments. Some retrofit environments leverage rear door heat exchangers (RDHx) alongside direct-to-chip systems in hybrid cooling environments to reduce room-level heat load and accelerate liquid readiness.

Why is direct-to-chip cooling dominating?

Direct-to-chip is the dominant GPU cooling solution because liquid removes heat more efficiently than air, allowing GPUs to sustain high utilization without thermal throttling. As rack power density increases, air cooling becomes impractical due to the energy and space requirements it entails.

Besides saving space, liquid cooling requires far less energy than the fans used in air cooling, which helps improve power usage effectiveness (PUE) in AI data centers. In addition, liquid cooling enables stable thermal conditions through sustained electrical loading without overprovisioning or derating.
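
PUE makes that efficiency gain concrete: it is total facility power divided by IT power, so every watt shaved off fans and chillers moves the ratio toward 1.0. Here is a back-of-the-envelope comparison, treating cooling as the dominant overhead for simplicity; the power figures are assumptions for illustration only.

```python
# Back-of-the-envelope PUE comparison: air-cooled vs. direct-to-chip.
# PUE = total facility power / IT power. All figures are illustrative,
# and cooling is treated as the dominant non-IT overhead for simplicity.

IT_POWER_KW = 1000.0

air_overhead_kw = 450.0     # assumed chiller + CRAH fan overhead (air-cooled hall)
liquid_overhead_kw = 150.0  # assumed CDU pump + economized rejection overhead

pue_air = (IT_POWER_KW + air_overhead_kw) / IT_POWER_KW
pue_liquid = (IT_POWER_KW + liquid_overhead_kw) / IT_POWER_KW

print(f"Air-cooled PUE:     {pue_air:.2f}")     # -> 1.45
print(f"Direct-to-chip PUE: {pue_liquid:.2f}")  # -> 1.15
```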

This explains why AI data center reference designs increasingly assume liquid interfaces, predictable flow envelopes, and standardized rack distribution architectures. Physics, scalability, and operational predictability are driving the shift to direct-to-chip, not vendor preferences.

How should the power–cooling stack be designed?

There are several important considerations in designing high-density rack cooling, starting with the fact that chip power density determines heat flux and, in turn, cooling performance requirements. At the rack level, power distribution hardware competes for physical space with the liquid cooling system’s manifolds, hoses, and routing paths.
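
To see why chip power density sits at the top of the stack, divide the package’s thermal design power by its cold-plate contact area to get heat flux, the number the cold plate must actually handle. A quick illustration follows; the TDP and contact-area figures are assumed, not specific to any accelerator.

```python
# Heat flux at the cold plate: package TDP divided by contact area.
# Both figures are illustrative assumptions, not accelerator specs.

tdp_w = 1000.0           # assumed accelerator thermal design power (W)
contact_area_cm2 = 30.0  # assumed cold-plate contact area (cm^2)

heat_flux_w_cm2 = tdp_w / contact_area_cm2
print(f"Cold-plate heat flux: {heat_flux_w_cm2:.0f} W/cm^2")  # ~33 W/cm^2
```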

As operators design CDUs, they must account for sudden load swings caused by AI workloads. This means pump redundancy, heat exchanger capacity, filtration strategy, and control logic need to accommodate worst-case electrical loads and transient behavior. Electrical capacity planning determines how much heat the cooling system must continuously remove, which directly shapes thermal and redundancy requirements.
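
One way to frame that sizing exercise: take the worst-case electrical load, add a transient margin for AI load swings, and confirm the loop still meets it with one pump out of service. The sketch below works through that check; the redundancy scheme, margin, and pump ratings are hypothetical assumptions.

```python
# Sketch: CDU capacity check against worst-case load with N+1 pumps.
# Every figure here is an illustrative assumption.

worst_case_it_load_kw = 1200.0  # assumed peak electrical load on the loop
transient_margin = 0.15         # assumed headroom for AI workload swings

pump_count = 3                  # installed pumps (N+1 scheme assumed)
pump_capacity_kw_each = 750.0   # assumed heat-transport capacity per pump

required_kw = worst_case_it_load_kw * (1.0 + transient_margin)
available_kw = (pump_count - 1) * pump_capacity_kw_each  # one pump failed

print(f"Required capacity: {required_kw:.0f} kW")   # -> 1380 kW
print(f"Capacity at N-1:   {available_kw:.0f} kW")  # -> 1500 kW
print("OK" if available_kw >= required_kw else "Undersized: add pump or exchanger margin")
```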

To drive energy efficiency, a facility’s cooling infrastructure should prioritize non-chiller, economized heat rejection, resorting to mechanical cooling only in peak and contingency conditions. This can’t be achieved by optimizing each layer in isolation; it requires an integrated cooling system.

What causes direct-to-chip cooling to fail or succeed?

Direct-to-chip deployments don’t always deliver the desired performance and reliability. The issue typically isn’t component failure but system-level mismatches. For instance, if the pumping capacity is inadequate, it can create a bottleneck over time as rack power increases. Other common issues are:

  • Insufficient heat exchanger margin, which creates instability during peak electrical loading and transient training events. The margin provides extra surface area to handle fluctuations.
  • Limited telemetry, preventing operators from correlating power draw with flow rates, temperature rise (ΔT), and return temperatures (a sketch of this cross-check follows the list).
  • Electrical upgrades, which alter thermal behavior and require coordinated modeling and validation across the cooling and power domains.
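
To make the telemetry point concrete, the sketch below cross-checks each rack’s measured power draw against the heat implied by its coolant flow and ΔT, flagging racks where the two diverge. The sensor fields, tolerance, and fluid constants are illustrative assumptions.

```python
# Sketch: cross-check rack power draw against heat implied by coolant telemetry.
# Implied heat: Q = m_dot * c_p * (T_return - T_supply).
# Sensor fields, tolerance, and fluid constants are illustrative assumptions.

C_P_J_PER_KG_K = 3800.0  # assumed coolant specific heat (J/kg*K)
TOLERANCE = 0.10         # flag racks where implied heat and power diverge >10%

racks = [
    # (rack_id, power_kw, flow_kg_s, supply_c, return_c) -- synthetic readings
    ("rack-01", 80.0, 2.1, 30.0, 40.0),
    ("rack-02", 80.0, 1.4, 30.0, 40.0),  # flow too low for the measured power
]

for rack_id, power_kw, flow_kg_s, supply_c, return_c in racks:
    implied_kw = flow_kg_s * C_P_J_PER_KG_K * (return_c - supply_c) / 1000.0
    drift = abs(implied_kw - power_kw) / power_kw
    status = "OK" if drift <= TOLERANCE else "MISMATCH: check flow, fouling, or sensors"
    print(f"{rack_id}: power={power_kw:.0f} kW, implied={implied_kw:.0f} kW -> {status}")
```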

It’s important to remember that a successful deployment treats hydraulic capacity, electrical growth, and observability as a single engineered envelope.

Next-generation cooling technologies – An integrated power-and-thermal platform

The top strategy for AI-driven cooling optimization is an integrated power and thermal platform. Direct-to-chip liquid cooling has increasingly become the standard next-generation cooling technology for this integrated approach in high-density AI environments. Designed properly, a direct-to-chip cooling solution delivers long-term success through tightly integrated power delivery, thermal management, controls, and operations. As such, the ability to scale watts and heat together without introducing reliability or efficiency penalties provides a competitive advantage for data center operators. Explore different liquid cooling architectures to determine which approach works best for your AI workload needs.
