AI data center design and deployment are moving at an incredible pace: 3 ways to approach a changing landscape

For the last 25 years, data centers for cloud and enterprise have been dominated by X86 “pizza box” general-purpose servers. The average power density began modestly and grew slowly from 3 kilowatts (kW) per rack to around 10 kW as their central processing units (CPUs) became more powerful. The average data center building had a useful life of at least 30 years and would experience many server refresh cycles. It was common to build a facility with extra white space for growth as server power and cooling requirements changed modestly.

Based on shipments of graphics processing units (GPUs) for AI server applications, I am forecasting that 60% of all servers being installed will support AI applications. These accelerated compute servers are real workhorses, packing many GPUs (up to 16) alongside multiple CPUs and data processing units (DPUs). They draw far more power than general-purpose servers, and most require liquid cooling.

To make it even more challenging for data center operators (if that was not enough), GPU leaders like Nvidia are now releasing new generations roughly every year, with power consumption nearly doubling with each release.

The evolution of Nvidia's AI GPUs, and the average density per rack when they are configured in DGX SuperPod designs, is telling. In 2022, DGX SuperPods powered by A100 GPUs had an average power density of about 25 kW per rack. This figure climbed significantly in 2023 with the introduction of the H100, reaching approximately 40 kW per rack. The momentum continued in 2024, with the GH200 generation pushing densities to 72 kW per rack.

Looking ahead, Nvidia’s 2025 release of the GB200 is set to nearly double that again, bringing rack densities to around 132 kW. If projections hold, the 2026 VR200 generation will mark another leap, potentially reaching 240 kW per rack.
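To put that trajectory in perspective, here is a rough back-of-the-envelope sketch in Python using the approximate per-rack figures above. The numbers are illustrative estimates, not vendor specifications.

```python
# Rough back-of-the-envelope look at the per-rack density figures cited above.
# Values are approximate figures for DGX SuperPod-style deployments, not specs.
densities_kw = {
    "A100 (2022)": 25,
    "H100 (2023)": 40,
    "GH200 (2024)": 72,
    "GB200 (2025)": 132,
    "VR200 (2026, projected)": 240,
}

previous = None
for generation, kw in densities_kw.items():
    if previous is None:
        print(f"{generation}: {kw} kW/rack")
    else:
        print(f"{generation}: {kw} kW/rack ({kw / previous:.1f}x the prior generation)")
    previous = kw

# Net effect: roughly a 10x jump in per-rack power in about four years,
# versus the ~3x rise (3 kW to 10 kW) the industry absorbed over two decades.
```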

Power and cooling for the next generation of GPUs

If the speed of evolution were not enough, these new extreme density levels present serious challenges for both cooling and power. The higher the density, the harder it becomes to design a hybrid liquid- and air-cooled data center (both are needed): at 132 kW per rack versus 10 kW per rack, physical space constraints, the potential for overheating, and the need to maintain resiliency and efficiency all become far more demanding.
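A simple sensible-heat estimate shows why. The sketch below assumes, purely for illustration, that all rack heat is removed by air with a 20-degree Fahrenheit temperature rise across the rack, and it uses the common rule of thumb CFM ≈ 3.16 × watts / ΔT(°F).

```python
# Illustrative sensible-heat calculation showing why air alone stops scaling.
# Assumes all rack heat is removed by air at standard conditions, with an
# assumed 20 degF (about 11 degC) temperature rise across the rack.
# Rule of thumb: CFM ~= 3.16 * watts / delta_T_degF.

DELTA_T_F = 20.0  # assumed air temperature rise across the rack

def airflow_cfm(rack_kw: float, delta_t_f: float = DELTA_T_F) -> float:
    """Approximate airflow (cubic feet per minute) needed to carry away rack heat."""
    return 3.16 * (rack_kw * 1000) / delta_t_f

for rack_kw in (10, 40, 72, 132):
    print(f"{rack_kw:>4} kW/rack -> ~{airflow_cfm(rack_kw):,.0f} CFM of air")

# ~1,600 CFM at 10 kW is routine; ~20,000+ CFM at 132 kW is far beyond what a
# single rack's fans and floor tiles can move, which is why most of the heat
# must be rejected to liquid and only the remainder to air.
```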

Remember, the slow migration up to 10 kW/rack happened over decades, and those designs were deployed, tested and optimized. We don't have that luxury now. We also won't get multiple IT refresh cycles out of an existing facility: the power and cooling needed for the next generation of GPUs will be much higher and cannot be supported without significant upgrades to power and power distribution, as well as additional liquid cooling with special racks, manifolds, cooling distribution units and chillers.

Data centers that are optimized to use the latest and greatest GPUs must be designed in anticipation of the power density needed, a year or two in advance. Again, these will be fresh designs without the opportunity to test and optimize, and most data center operators don't have large staffs of designers to create new designs for every new Nvidia GPU generation.

Designing for accelerated compute: Practical strategies

Data center designers were not significantly challenged by GPUs over the years, so this expertise was never needed or developed. For organizations approaching this fast-changing landscape, I suggest the following:

1. Simulate performance with digital twins.

Despite the density challenges, it may still be possible to do "paper" designs. However, these take longer, require an extremely talented engineer and can only be tested after physical deployment. Advanced design and simulation software can instead create digital twins of the power system, the cooling systems and, ideally, the entire data center.

While this process is simpler, faster and more reliable, the software does require training. Build confidence by simulating basic scenarios before progressing to more complex system models.

Electrical design software streamlines the process by automating routine tasks, and precise calculations early in the design phase can reduce errors and costs. With simulation tools, you can gain a fundamental understanding of expected performance and of the effect of different failure scenarios on availability.
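For a sense of what that failure-scenario analysis looks like, here is a minimal sketch of a simplified availability model in Python. The topology and component availability figures are assumptions made for illustration, not outputs from any particular design tool.

```python
# Minimal sketch of the kind of availability estimate a simulation tool
# automates. Component availability figures below are made-up placeholders,
# not measured values for any real product.

def series(*availabilities: float) -> float:
    """All components must work: multiply their availabilities."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def redundant_pair(a: float) -> float:
    """Two identical units in parallel (N+1): fails only if both fail."""
    return 1.0 - (1.0 - a) ** 2

utility_plus_genset = redundant_pair(0.999)   # assumed
ups = 0.9999                                  # assumed
busway_and_pdu = 0.99995                      # assumed
cooling_plant = redundant_pair(0.998)         # assumed

single_path = series(utility_plus_genset, ups, busway_and_pdu, cooling_plant)
print(f"Single-path availability: {single_path:.5f}")
print(f"Dual-path (2N) estimate:  {redundant_pair(single_path):.7f}")
```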

2. Start with proven reference designs.

Because of the extreme densities involved, power and cooling infrastructure providers release data center reference designs to accompany each Nvidia generation, giving operators a starting point for high-density AI deployments.

Publicly available on vendor websites in basic or very detailed engineering forms, these reference designs include detailed drawings, schematics, bills of material and performance specifications. They provide the base components necessary to streamline the design process, and local consulting engineers can easily adapt them to meet local regulations. This approach is faster than designing from scratch but not as fast as using prefabricated modules.

3. Accelerate builds with prefabricated modules.

With proper ordering and lead time consideration, prefabrication will be the fastest and most predictable deployment method. Prefabricated modules are available for the IT room, cooling and power systems.

These modules come in various capacities and sizes, and once the necessary site preparation is complete, they function as plug-and-play solutions. Built and tested in factories, they can reduce design time and costs while accelerating data center deployment. Although containerized and skid-mounted power and cooling modules have been used for some time, new prefabricated modules tailored for AI clusters are now emerging.

For example, an AI cluster module comes fully assembled—with racks, power busways, power distribution, liquid cooling connections and manifolds. To deploy it, users simply need to connect power and cooling sources and install the accelerated compute servers.

Improving deployment success

The design and deployment process for accelerated compute AI data centers differs from that of traditional data centers because AI chip development moves on a lightning-fast cycle. This cycle creates extreme challenges for companies that want to deploy the latest GPUs immediately, as they must develop entirely new designs for the rising power densities.

As designers work through the design and deployment process, leveraging a combination of these practices can help ensure optimal results.

This article was previously published in Forbes.
