As physical infrastructure systems age beyond their warranties, software tools no longer reflect reality, and operations & maintenance (O&M) programs grow outdated or understaffed, the risk of a service interruption rises significantly. Aging data centers must either be modernized or have their IT outsourced to cloud service and colocation providers to minimize the risk of business disruption. Sites that delay modernizing also fail to benefit from recent technological advances, which make data centers simpler, more efficient, easier to manage, and less expensive to operate.
This blog lays out a simple 4-step framework for modernizing a facility. It begins with developing performance standards, which are then used to perform a gap analysis that identifies risks and needs. This approach, developed by a team of Solution Architects at Schneider Electric, should be used to cover the three key domains of data center modernization: (1) equipment hardware (electrical & mechanical), (2) software systems, and (3) operations & maintenance programs. Keeping the IT systems running depends on all three of these domains, so it is critical that all are considered in a modernization project.
Four-step framework for modernization
4 Steps of the Data Center Modernization Framework
The four-step framework is a measured and methodical process to determine what to modernize in your facility, and how and when to do it. It is best applied holistically, across domains, but can be applied to a single domain within your data center as well.
- Develop performance standards: Start by documenting the specific objectives and goals of the modernization project. What do you want the data center to look like at the end of the project? How should it perform and what is needed to achieve that? It is useful to begin with the larger business and IT objectives. These may well have changed since you first built the data center, and the criticality and power capacity needs may have changed significantly as a result. Re-evaluating your needs in the context of today’s organizational objectives will help you figure out, for example, what level of electrical redundancy is really needed or what the operations team staffing levels should be at a given site.
A performance standard should be documented for each of the key domains. If, for example, the decision is that the data center should meet a particular tier or criticality standard, then what it takes specifically to meet those requirements should be documented in the performance standard. Make sure you have buy-in from all the key stakeholders and an understanding of what the IT outsourcing strategy is. An example performance standard is shown below for the power and cooling systems:
Example of performance standards
- Benchmark performance & identify health risks: With the performance standards documenting in detail where you want to be, the next step is to evaluate the current state of the data center across all three domains and to note what is at risk of causing incidents. This involves physically investigating the infrastructure equipment and its interconnections. You want to understand each device’s age, maintenance contract status, load vs. capacity, etc. It also means interviewing the O&M team and reviewing their methods of procedure and training documentation. You should not rely on drawings or written reports alone. Data center infrastructure management (DCIM) tools should be checked against the equipment benchmark to see how well the software map of assets and their interconnections matches reality. Use the performance standard documents as scorecards to record the current reality and to identify gaps in performance and health risks.
- Determine options to address gaps: With the current situation documented, the next step is to consider and document what it would take to bridge each of the gaps. Vendors and consulting engineers may be needed to clearly understand what your options are, as well as their costs. This effort will begin to form a picture of what it will take in time, money, and labor to achieve the project goals. This, in turn, could cause you to re-evaluate the performance standards. That is OK; this is designed to be an iterative process.
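To make the scorecard idea concrete, here is a minimal Python sketch of a gap analysis. It is only an illustration under our own assumptions: the `Requirement` class, the `find_gaps` function, and every requirement name and number in it are hypothetical and are not taken from the white paper or from any real performance standard.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    """One scorecard row: a target from the performance standard
    and the value actually found during benchmarking."""
    domain: str                    # "hardware", "software", or "operations"
    name: str
    target: float                  # value the performance standard calls for
    measured: float                # value observed on site
    higher_is_better: bool = True  # False for metrics where lower is better


def find_gaps(scorecard):
    """Return the requirements whose measured value misses the target."""
    gaps = []
    for req in scorecard:
        if req.higher_is_better:
            met = req.measured >= req.target
        else:
            met = req.measured <= req.target
        if not met:
            gaps.append(req)
    return gaps


# Illustrative values only
scorecard = [
    Requirement("hardware", "UPS runtime at full load (min)", target=10, measured=6),
    Requirement("hardware", "PUE", target=1.5, measured=1.9, higher_is_better=False),
    Requirement("operations", "Staff trained on EOPs (%)", target=100, measured=100),
]

for gap in find_gaps(scorecard):
    print(f"[{gap.domain}] {gap.name}: target {gap.target}, measured {gap.measured}")
```

Each item returned by `find_gaps` is a gap for which options, costs, and timelines then need to be worked out.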
As part of the benchmarking step (step 2), in addition to comparing against your performance standard, a basic “health check” should be conducted to identify systems at risk. This health check should evaluate:
- age of devices and warranty status
- maintenance history and service contract status
- current load vs. capacity of systems
As devices age, they present greater downtime risks because components are more likely to fail or require maintenance. Many older devices are also no longer covered by maintenance contracts.
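The health check items above lend themselves to a simple automated pass over an equipment inventory. The sketch below is one hypothetical way to do it in Python. The `Device` fields, the `health_flags` function, and especially the thresholds (10 years, 365 days, 80% load) are assumptions chosen for illustration, not recommended values.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Device:
    name: str
    install_date: date
    warranty_end: date
    last_maintenance: date
    under_service_contract: bool
    load_kw: float
    capacity_kw: float


# Hypothetical thresholds; set these from your own performance standard
MAX_AGE_YEARS = 10
MAX_DAYS_SINCE_MAINTENANCE = 365
MAX_LOAD_FRACTION = 0.8


def health_flags(device, today):
    """Return a list of health-check findings for one device."""
    flags = []
    age_years = (today - device.install_date).days / 365.25
    if age_years > MAX_AGE_YEARS:
        flags.append("age")
    if device.warranty_end < today:
        flags.append("warranty expired")
    if (today - device.last_maintenance).days > MAX_DAYS_SINCE_MAINTENANCE:
        flags.append("maintenance overdue")
    if not device.under_service_contract:
        flags.append("no service contract")
    if device.load_kw / device.capacity_kw > MAX_LOAD_FRACTION:
        flags.append("load near capacity")
    return flags


# Illustrative inventory entries
today = date(2024, 1, 1)
ups = Device("UPS-A", date(2010, 3, 1), date(2013, 3, 1),
             date(2021, 5, 1), False, 85.0, 100.0)
crac = Device("CRAC-1", date(2022, 6, 1), date(2027, 6, 1),
              date(2023, 9, 1), True, 20.0, 50.0)

for device in (ups, crac):
    print(device.name, health_flags(device, today))
```

A device with several flags is a candidate for closer inspection; a clean result does not replace the physical investigation described in step 2.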
- Prioritize actions based on cost, feasibility & risk: The last step before implementation of upgrades and replacements begins is to prioritize the actions needed to close the gaps and bring the data center to the performance levels spelled out in the performance standards. Because this is (presumably) a mission-critical data center, every gap needs to be evaluated by the amount of risk it poses to the continued functioning of the IT. For each gap uncovered in the audit, you must assess the risk of not addressing it. Gaps with the biggest risk go to the top of your list. This risk needs to be balanced against cost, time, how disruptive the work might be to ongoing operations, and any other objectives deemed important, such as energy efficiency goals.
Note that there are third-party vendors who can assist you with, or even lead, this evaluation process. Not only would they simplify and likely accelerate the process for you, but you would benefit from their experience with many data centers. Their independence might also make for a more accurate, unbiased judgement of what risks exist in your facility.
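One simple way to turn the prioritization step into a ranked list is a likelihood-times-impact risk score, with cost and disruption used as a tie-breaker. The Python sketch below assumes that scheme purely for illustration: the 1 to 5 scales, the `Gap` class, and the ranking rule are our own hypothetical choices, not a method prescribed by the framework, and the example gaps are invented.

```python
from dataclasses import dataclass


@dataclass
class Gap:
    name: str
    likelihood: int  # 1 (rare) to 5 (likely): chance the gap causes an incident
    impact: int      # 1 (minor) to 5 (severe): effect on IT if it does
    cost: int        # 1 (cheap) to 5 (expensive) to close the gap
    disruption: int  # 1 (none) to 5 (major) impact of the fix on operations


def risk(gap):
    """Risk of leaving the gap unaddressed."""
    return gap.likelihood * gap.impact


def effort(gap):
    """Rough combined burden of closing the gap."""
    return gap.cost + gap.disruption


def prioritize(gaps):
    """Highest risk first; among equal risks, lowest effort first."""
    return sorted(gaps, key=lambda g: (-risk(g), effort(g)))


# Illustrative gaps and scores
gaps = [
    Gap("Add DCIM asset audit", likelihood=3, impact=2, cost=1, disruption=1),
    Gap("Restore redundant power path to row C", likelihood=4, impact=5, cost=4, disruption=4),
    Gap("Replace end-of-life UPS batteries", likelihood=4, impact=5, cost=3, disruption=2),
]

for rank, gap in enumerate(prioritize(gaps), start=1):
    print(rank, gap.name, f"risk={risk(gap)}", f"effort={effort(gap)}")
```

In practice you may want to weight other objectives too, such as energy efficiency goals, and revisit the scores as vendor quotes and schedules firm up.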
Identify & Address the Basics (Low-hanging Fruit)
While creating the performance standards and benchmarking performance, you will likely uncover easy-to-fix issues, i.e., items that take little or no CapEx and little time to implement. These should be addressed right away. Low-hanging-fruit actions we often see include:
- Power: conducting preventative maintenance (PM) services on units that are past due, removing unused power modules from UPSs, redistributing unbalanced loads, correcting mistakes in PDU/Rack PDU assignment if redundancy rules are found to be broken, etc.
- Cooling: conducting past-due PM services, adding blanking panels to racks, plugging holes in raised floors, removing obstructions from underfloor air pathways, making sure floor tiles are in the right places, making sure racks are aligned properly, etc.
- Operations: updating/correcting as-built drawings, ensuring methods of procedure (MOPs) and emergency operating procedures (EOPs) are correct and in the right places, verifying staff is properly trained on emergency procedures, etc.
- Software systems: making sure all software tools have an accurate map of assets and resources, confirming that their dependencies are mapped correctly, and reviewing alarm thresholds and notification policies.
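For the software-systems item, checking whether a tool's asset map matches reality comes down to comparing two inventories. Here is a minimal Python sketch that assumes both the DCIM export and the physical audit can be reduced to a dictionary of asset ID to location. The `reconcile` function, the asset IDs, and the locations are all hypothetical, and real DCIM exports will need their own parsing.

```python
def reconcile(dcim_assets, audited_assets):
    """Compare a DCIM asset map with a physical audit.

    Both arguments map an asset ID to its recorded location.
    """
    dcim_ids = set(dcim_assets)
    audit_ids = set(audited_assets)
    return {
        # On the floor but unknown to the software
        "missing_from_dcim": sorted(audit_ids - dcim_ids),
        # In the software but not found on the floor
        "not_found_on_floor": sorted(dcim_ids - audit_ids),
        # Known to both, but recorded in different places
        "location_mismatch": sorted(
            asset for asset in dcim_ids & audit_ids
            if dcim_assets[asset] != audited_assets[asset]
        ),
    }


# Illustrative data
dcim = {"PDU-01": "Row A / Rack 3", "UPS-02": "Room 1", "CRAC-04": "Row B"}
audit = {"PDU-01": "Row A / Rack 5", "UPS-02": "Room 1", "PDU-07": "Row C / Rack 1"}

for finding, assets in reconcile(dcim, audit).items():
    print(finding, assets)
```

The same comparison can be extended to dependencies, for example which PDU feeds which rack, by keying the dictionaries on connections instead of assets.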
Following this framework will simplify the modernization process and reduce risk. It will optimize costs by focusing spending on the process improvements, hardware upgrades, and replacements that do the most to reduce critical incidents and failures that can cause downtime of the IT systems and applications. New business requirements may also mean that the infrastructure needed today is much less than what you needed when the facility was first built. When you combine that with the likely efficiency gains that modern infrastructure and its management tools bring, the real total cost of ownership of a newly modernized facility is often less than you expected.
To learn more, download Schneider Electric white paper 272: A Framework for How to Modernize Data Center Facility Infrastructure.