While liquid cooling is rightly considered an emerging technology, it’s not new. Early IBM mainframes dating back to the 1960s and Cray supercomputers featured liquid cooling. It’s interesting to note that the purchase price of the Cray included a full-time technician for installation, operation, and maintenance as part of the package.
Times have changed. Today, generative AI is overhauling the way compute and data centers are designed. Accelerated compute servers incorporate 2 to 16 graphics processing units (GPUs) in every server, alongside central processing units (CPUs) and even data processing units (DPUs). These servers are the most prolific and most efficient number crunchers for training AI models, but they can use more than 20 times the power of standard Intel-based CPU cloud servers. And 20 times more power equals 20 times more heat output per server. That is so much heat that these servers can only be liquid cooled, and they come standard with inlet piping for cool liquid and outlet piping for hot liquid.
When fully loaded, the latest NVIDIA-based GPU server racks require 142 kW of power, and densities are only going higher: the next evolution is scheduled for release in less than a year and will require around 240 kW per rack. The current method, and the best liquid cooling solution for AI data centers, is called direct to chip, or cold plate. But direct to chip, as the name implies, only cools a few components (the chips), not the rest of the chassis or rack, so air-based cooling must supplement it for the remainder of the server and the data center. That supplemental cooling is not insignificant and can represent 20-30% of the total cooling load.
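To make that split concrete, here is a rough back-of-the-envelope sketch in Python. The 142 kW rack figure comes from above; the liquid capture fraction is an illustrative assumption consistent with the 20-30% supplemental-air share, not a vendor specification.

```python
# Rough heat-budget sketch for a direct-to-chip rack (illustrative figures only).
# The 142 kW rack power is cited in the article; the capture fraction is an
# assumption used purely for illustration.

RACK_POWER_KW = 142.0           # fully loaded GPU rack
LIQUID_CAPTURE_FRACTION = 0.75  # assumed share removed by cold plates

heat_to_liquid_kw = RACK_POWER_KW * LIQUID_CAPTURE_FRACTION
heat_to_air_kw = RACK_POWER_KW - heat_to_liquid_kw

print(f"Heat removed by the liquid loop: {heat_to_liquid_kw:.0f} kW")
print(f"Heat left for supplemental air cooling: {heat_to_air_kw:.0f} kW")
# -> roughly 106 kW to liquid and 36 kW to air at a 75% capture fraction
```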
Cooling architecture is complicated
If you run an enterprise company, no matter how large, or if you are an experienced data center operator, it’s unlikely you will have the in-house engineering expertise to design a hybrid cooling (liquid and air) approach at these extreme densities. Expertise is needed across hybrid cooling design, procurement, deployment, operation, and maintenance. Direct to chip requires multiple cooling loops to be built: one for the IT room and one for heat rejection. Coolant distribution units (CDUs) manage both loops and serve as the interface between them.
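As a rough mental model of that topology (not a design reference; the names and values below are invented for illustration), the two loops and the CDU that sits between them can be sketched like this:

```python
# Minimal sketch of the two-loop, CDU-centered topology described above.
# All names and numbers are illustrative assumptions, not a vendor specification.
from dataclasses import dataclass

@dataclass
class CoolingLoop:
    name: str
    supply_temp_c: float   # coolant temperature entering the loop's load
    return_temp_c: float   # coolant temperature leaving the load
    flow_lpm: float        # flow rate in liters per minute

@dataclass
class CDU:
    """Coolant distribution unit: the interface between the two loops."""
    facility_loop: CoolingLoop    # primary loop out to heat rejection (chillers, dry coolers)
    technology_loop: CoolingLoop  # secondary loop to the cold plates in the IT room

    def approach_temp_c(self) -> float:
        # How closely the secondary supply tracks the primary supply across the
        # CDU's heat exchanger -- a common sizing and health metric.
        return self.technology_loop.supply_temp_c - self.facility_loop.supply_temp_c

cdu = CDU(
    facility_loop=CoolingLoop("facility water", supply_temp_c=30.0, return_temp_c=38.0, flow_lpm=400.0),
    technology_loop=CoolingLoop("IT room coolant", supply_temp_c=32.0, return_temp_c=45.0, flow_lpm=350.0),
)
print(f"CDU approach temperature: {cdu.approach_temp_c():.1f} °C")
```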
When it comes to picking a partner to design these cooling systems, you must find one that has experience with the cooling architecture’s components, such as the manifolds, piping, CDUs, chillers, pumps, and cabinets. These components need to operate as a system, but sourcing components that work together, programming them for operation, and later tuning them for the highest performance is challenging. It is advisable to work with vendors that have experience with piping, fluids, pressures, and flow rates to ensure reliable operation. Vendors like Schneider Electric, which has acquired Motivair, also offer warranties and hold certifications from GPU companies. A vendor that can demonstrate this experience, along with warranties and certifications, gives you the best chance of a successful deployment, both initially and over the long term.

The role of simulation and software
Due to the extreme densities, designing a system through trial and error will dramatically extend the “time to cooling,” and the chances of success are slim. It’s desirable to pick a partner that uses digital twin modeling and simulation to prove its approach in the digital world before deploying in the physical world.
Additionally, preferred vendors will be working with the leading GPU manufacturers and will have conducted physical testing in the lab or at deployment sites. Schneider Electric collaborates with NVIDIA on reference designs for its DGX SuperPODs ahead of new platform releases incorporating each new generation of GPUs. Given the complexity and challenge, Schneider also develops prefabricated cooling solutions, like the IT Pod, which are already tested and provide faster, more predictable deployment for high-density accelerated compute.
Downtime is not an option
At these high densities in an IT rack, any break in the liquid supply cooling the chips will result in “thermal throttling” and overheating within seconds. Redundancy must be built into the CDU; for example, redundant pumps and dual power supplies should be standard. Immediate power backup, such as uninterruptible power supplies, must be used on the CDUs to ensure continuous operation and a clean transfer to longer-term backup like generators. Leak detection software must also be deployed in the white space of the data center, as even a small leak can bring down an AI server or an entire AI cluster. And AI-enabled software should be used proactively for risk mitigation, leveraging data from sensors throughout the cooling system for predictive analytics that identify patterns and potential trouble.
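As a simplified illustration of the kind of sensor-driven monitoring involved (the thresholds, sampling window, and the read_pressure_bar() helper below are hypothetical placeholders, not part of any vendor product), a leak watchdog might flag a sustained pressure drop on the secondary loop:

```python
# Simplified sketch of sensor-driven leak monitoring: flag a sustained coolant
# pressure drop, a classic leak signature. All values and helpers are assumptions.
from collections import deque
from statistics import mean

WINDOW = 12               # number of recent samples to average (e.g., one per 5 s)
DROP_THRESHOLD_BAR = 0.2  # assumed alarm threshold for pressure loss

def read_pressure_bar() -> float:
    """Placeholder for a real sensor/BMS read of secondary-loop pressure in bar."""
    raise NotImplementedError("wire this to your monitoring system")

def monitor(baseline_bar: float) -> None:
    samples: deque[float] = deque(maxlen=WINDOW)
    while True:
        samples.append(read_pressure_bar())
        if len(samples) == WINDOW and baseline_bar - mean(samples) > DROP_THRESHOLD_BAR:
            # A real deployment would raise an alarm, isolate the affected
            # manifold, and trigger the leak-response procedure.
            print("ALERT: sustained pressure drop on secondary loop - possible leak")
            break
```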
Optimization requires AI
Once the liquid cooling system is operational, best practice is to continually evaluate and improve efficiency and resource utilization, because precision matters when liquid cooling AI workloads. A few degrees too high will degrade GPU performance and can dramatically slow AI training and inference. AI software can dynamically adjust cooling system parameters, such as supply and return water temperatures, airflow, and water flow, in real time to match current demand. AI systems can even learn from operational feedback and continuously enhance cooling system performance.
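To show the basic idea behind that real-time adjustment (the production systems described here are far more sophisticated; the setpoint, gain, and limits below are illustrative assumptions), a bare-bones proportional controller on secondary-loop flow looks like this:

```python
# A deliberately simple proportional-control sketch of real-time cooling adjustment.
# Setpoints, gains, limits, and telemetry inputs are assumptions for illustration.

GPU_TEMP_SETPOINT_C = 65.0     # assumed target for the hottest GPU cold-plate temperature
KP_FLOW = 0.05                 # proportional gain: flow change per °C of error
MIN_FLOW, MAX_FLOW = 0.4, 1.0  # pump command as a fraction of maximum flow

def next_flow_command(current_flow: float, hottest_gpu_temp_c: float) -> float:
    """Nudge secondary-loop flow up when GPUs run hot, down when there is headroom."""
    error_c = hottest_gpu_temp_c - GPU_TEMP_SETPOINT_C
    proposed = current_flow + KP_FLOW * error_c
    return max(MIN_FLOW, min(MAX_FLOW, proposed))

# Example: GPUs trending 4 °C above setpoint -> raise flow from 0.6 to 0.8 of maximum
print(next_flow_command(current_flow=0.6, hottest_gpu_temp_c=69.0))
```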
Schneider Electric is leading the way in liquid cooling
GPU evolutions are coming at a staggeringly fast pace, putting tremendous stress on cooling vendors’ ability to deliver the required performance. When picking a vendor, ask about their solution roadmap, as future GPU servers will have even higher thermal density, making deployment more challenging.
Schneider Electric has the experience: back in 2019, Motivair was the liquid cooling provider for Cray supercomputers with densities up to 400 kW per rack. Yes, liquid cooling is an emerging mainstream technology for supporting accelerated compute. Companies that want to deploy AI will need to partner with an experienced vendor that is leading the way today and into the future. Explore how Schneider Electric can help you future-proof your AI data center with liquid cooling.