The insatiable appetite for AI, and for the data centers that enable it, is expected to require copious amounts of accelerated compute. Consider the eye-popping number of AI deployments planned, the data center developers scrounging for large allocations of grid power, and the seemingly endless supply of GPUs needed today and in the years ahead.
We are only at the beginning of this AI rollout, where the training of models remains the focus.
However, the focus is shifting toward optimizing the resources required for inference, the stage in which a pre-trained AI model makes predictions or decisions based on new, unseen data rather than the data it was trained on. In other words, inference is when AI applies what it has learned to the data it encounters in use, rather than learning from the data it is fed during training.
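To make the distinction concrete, here is a minimal sketch using scikit-learn: the model is fit once on labeled historical data (training) and then asked for predictions on data it has never seen (inference). The dataset and model are arbitrary toys chosen only to illustrate the two phases.

```python
# Minimal illustration of training vs. inference (toy example).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Training: the model learns its parameters from labeled historical data.
X_train, y_train = make_classification(n_samples=1000, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Inference: the trained model makes predictions on new, unseen data.
X_new, _ = make_classification(n_samples=5, n_features=10, random_state=1)
print(model.predict(X_new))
```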
Optimizing this stage is most likely where the financial rewards of AI will materialize.
Inference is where value will be realized
Generative AI requires models to be trained on accelerator-based servers, which in most cases use graphics processing units (GPUs) as the accelerators. Nvidia's H100, H200 and GB200 and AMD's MI300 and MI325, for example, are well suited for training because they offer large numbers of cores and high-bandwidth memory.
These GPUs run in parallel, effectively as one large GPU, because that is the fastest and most efficient way to process massive volumes of data and parameters.
While training has been the focus, inference is where AI's value is realized. Training clusters need large amounts of power. Optimized inference workloads that run over and over again on new data, on the other hand, should ideally consume as few IT resources and as little power as possible.
Businesses will likely see process and automation improvements when verticalized inference workloads are put into action. The goal should be to develop and deploy highly optimized and streamlined IT stacks with compressed models.
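One common route to a compressed model is quantization, which stores weights at lower precision. As a rough sketch (assuming PyTorch and a toy stand-in model, not any specific production stack), dynamic quantization shrinks a model's linear layers to 8-bit integers; real deployments would also weigh pruning, distillation and dedicated inference runtimes.

```python
import torch
import torch.nn as nn

# A toy model standing in for a much larger network.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Dynamic quantization: Linear-layer weights are stored as 8-bit integers,
# cutting memory use and often speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

print(quantized)  # Linear layers are replaced by dynamically quantized versions
```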

The rush for AI inference workloads
A rush is on among larger players like Microsoft and AWS to get AI inference workloads into operation, but there is less urgency to deploy them on smaller, lower-power platforms. The reasons for this are valid and include the following:
1. The applications are not mature.
Pilot applications are being developed on large training clusters. The models need to be tested for speed, completeness and accuracy, and tuned, before they can be deployed as production models.
Many of these AI inference applications are works in progress, nowhere close to final models, and the value they deliver may still be limited. That means it is too early for many companies to deploy compressed models on optimized IT stacks that run more efficiently.
2. Inference demand is volatile: the number of requests that hit a model is rarely steady and can be spiky.
If a model gets one request every two seconds on average and each request takes 15 seconds to handle, a server with five or six GPUs running at full utilization may be needed. Most providers would want some buffer, though, so a server with eight GPUs would be more reasonable (a rough sizing sketch appears at the end of this point).
But what happens if a market event drives inquiries up to thousands a second? What happens if the inquiries are seasonal or vary dramatically at different times during the day? This volatility is why current inference AI workloads are, for the most part, being handled by AI IT clusters that were originally deployed for AI training and are located in large data centers.
These training clusters are “overkill” for many of today’s inference AI workloads and are not the most effective use of AI IT resources. For example, a training cluster could have 800 GPUs and the AI inference model from our example could potentially run on eight GPUs most of the time.
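For readers who want to check that arithmetic, here is a back-of-the-envelope sketch in Python. It applies Little's law (concurrent requests = arrival rate x service time); the gpus_needed function and the requests_per_gpu batching factor are illustrative assumptions, not any provider's actual sizing tool.

```python
import math

def gpus_needed(requests_per_sec, seconds_per_request, requests_per_gpu=1.0):
    """Little's law: concurrent requests = arrival rate x service time.
    requests_per_gpu is a rough stand-in for batching (how many requests
    one GPU can serve at the same time)."""
    concurrent = requests_per_sec * seconds_per_request
    return math.ceil(concurrent / requests_per_gpu)

# The example above: one request every two seconds, 15 seconds per request.
print(gpus_needed(0.5, 15))        # 8  -> one GPU per in-flight request
print(gpus_needed(0.5, 15, 1.5))   # 5  -> modest batching lands in the 5-6 range
print(gpus_needed(1000, 15))       # 15000 -> a spike to 1,000 requests per second
```

The last line is the volatility problem in miniature: the same model that idles on a handful of GPUs most of the day can briefly demand a cluster-sized pool.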
3. Deploying inference as a standalone workload at the edge will require significant backup capacity.
As AI is integrated into core business practices, it becomes business critical. Data centers already have redundant power, battery systems and emergency generators, so they are equipped to ride through power outages. A standalone edge site would need comparable backup built out before business-critical inference could safely run there.
4. Inference is evolving fast.
Today's AI can write and code fairly well, given enough guidance about the desired result. But newer AI models will use multi-modal inputs (not just text or voice), chain-of-thought reasoning for complex decision-making and multi-phase planning with complex, interconnected stages.
All of this will require more computing power to be deployed in accelerated compute AI clusters. Providers are using different techniques to make these models more efficient for AI inference, including GPU sharing, reservations and workload scheduling (a toy illustration appears at the end of this point).
Smaller verticalized applications will perform narrow functions and should, therefore, be able to run on less powerful IT while adding the needed value.
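To make those terms slightly more concrete, below is a toy sketch of priority-based scheduling over a shared GPU pool, with a small reservation held back for bursts. It is purely illustrative and is not any provider's actual mechanism; the Job class, the reservation size and the workload names are invented for the example.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int                 # lower number = more urgent (e.g., interactive chat)
    name: str = field(compare=False)
    gpus: int = field(compare=False)

def schedule(jobs, free_gpus, reserved_gpus=2):
    """Start the most urgent jobs first, keeping a small reservation for bursts."""
    queue = list(jobs)
    heapq.heapify(queue)
    running = []
    while queue and free_gpus - reserved_gpus >= queue[0].gpus:
        job = heapq.heappop(queue)
        free_gpus -= job.gpus
        running.append(job.name)
    return running, [j.name for j in sorted(queue)]

running, waiting = schedule(
    [Job(0, "chat-assistant", 2), Job(1, "document-summaries", 4), Job(2, "nightly-batch", 8)],
    free_gpus=8,
)
print("running:", running)   # interactive work shares the pool first
print("waiting:", waiting)   # batch work waits for capacity or the reserved headroom
```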
The evolution of inference AI
AI is at the beginning of a potentially world-changing cycle. As the IT industry works through this inference development phase and the models start to add real value and benefits, that work will be done on accelerated compute AI clusters in larger data centers.
In the future, I see a wave of progress in which accelerated IT stacks evolve and their associated power use is optimized.
This article was previously published in Forbes.