Accounting for Risk in Your Data Center Design/Build Strategy

This is the second in a series of blog posts exploring four key considerations that go into a data center design/build strategy: cost, risk, quality and speed. The first post focused on cost; this one tackles risk.

If we had an unlimited amount of time and money, we could eliminate most, if not all, risk.  The reality is we cannot afford to eliminate all risk; instead, we must manage risk so that it falls within our acceptable risk tolerance and financial constraints.

Data center risk comes in many forms – operational, financial, reputational and regulatory. The sources of data center operational risk can further break down into a number of areas, including:

  • Design risks (mechanical & electrical) – reliability, operability, maintainability
  • Location risks
    • Natural – hurricane, tornado, flood, earthquake
    • Man-made – fuel storage depots, railroads, highways, airports
  • Outsourcing risks – design standard, contract terms and conditions, SLAs, maintenance practices, capital replacement program, financial stability

In this post, I’ll focus on data center operational risks stemming from the design of a data center.

Whenever you are assessing data center design risk, it’s important to reflect on the definition of risk.  Risk can be expressed as an equation: Risk = Probability x Impact.  For example, the probability of a utility outage is fairly high, perhaps occurring multiple times per year. However, the impact of an outage to a particular business varies based on a number of factors.  If the probability is high and the impact is high, it might be necessary to design a data center to protect the operation from utility outages.  If the probability is high and the impact is low because an IT application outage would not materially impact the business, or the IT applications have failover capability between two data centers, then there may be no need to provide protection against periodic utility outages.
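
To make the equation concrete, here is a minimal sketch that treats probability as an expected number of events per year and impact as a cost per event. The function name and every figure are hypothetical, chosen only to contrast the two cases described above; they are not industry data.

```python
def expected_annual_loss(events_per_year: float, cost_per_event: float) -> float:
    """Risk = Probability x Impact, with probability expressed as an annual frequency."""
    return events_per_year * cost_per_event


# Hypothetical figures for illustration only: three utility outages per year,
# with a very different business impact per outage in each scenario.
scenarios = {
    "high probability, high impact": expected_annual_loss(3.0, 250_000.0),
    "high probability, low impact": expected_annual_loss(3.0, 2_000.0),
}

for name, loss in scenarios.items():
    print(f"{name}: {loss:,.0f} per year")
```

With the same outage frequency, the first scenario may justify investment in protection while the second may not, which is the point of weighing probability and impact together.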

In most cases, a data center needs to be protected at some level from mechanical and electrical failures. Typically, if you design and build redundancy into your mechanical and electrical systems (e.g. generators, UPS systems and HVAC systems), you will reduce the frequency of incidents that impact the operation. The challenge for every data center owner/occupant is determining how much redundancy is enough. Redundancy comes at a cost, in both capital expense and operating expense, and it can be difficult to quantify how much additional reliability a given increase in redundancy spend will buy.
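
One rough way to get a feel for the diminishing returns of redundancy is a k-out-of-n availability calculation. The sketch below assumes identical units that fail independently, which is a simplification: common-mode failures and human error are not modeled. The function name and the 0.99 unit availability are hypothetical.

```python
from math import comb


def system_availability(units_installed: int, units_required: int,
                        unit_availability: float) -> float:
    """Probability that at least `units_required` of `units_installed`
    independent, identical units are available at a given moment."""
    a = unit_availability
    return sum(
        comb(units_installed, k) * a**k * (1 - a) ** (units_installed - k)
        for k in range(units_required, units_installed + 1)
    )


# Hypothetical example: the load needs 2 units, each available 99% of the time.
for label, installed in [("N", 2), ("N+1", 3), ("2N", 4)]:
    print(f"{label}: {system_availability(installed, 2, 0.99):.6f}")
```

Under these assumptions, going from N to N+1 removes most of the unavailability, while going from N+1 to 2N buys a much smaller absolute improvement for a larger increase in equipment.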

The maintainability and serviceability of a data center are important design attributes that are sometimes overlooked in the risk assessment process. All mechanical and electrical systems will need to be taken offline for maintenance, repair or replacement at some point in their life cycle. While a system is offline, the redundancy of your design is compromised and your risk profile increases. Therefore, it's important to assess how well the design limits the mean time to repair. A good design should allow you to isolate, replace, commission and place equipment or systems back online in the shortest possible time.
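
The effect of repair time can be sketched with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The function name and the hour figures below are hypothetical and serve only to compare a design that is quick to service with one that is not.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time a repairable component is in service."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component: the same failure rate, two different repair times.
fast_repair = steady_state_availability(mtbf_hours=10_000, mttr_hours=8)
slow_repair = steady_state_availability(mtbf_hours=10_000, mttr_hours=72)

print(f"8 h repair:  {fast_repair:.5f}")
print(f"72 h repair: {slow_repair:.5f}")
```

The component fails equally often in both cases; only the time spent with reduced redundancy differs, and that is the part the design can influence.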

You can’t rely on the design of the data center mechanical and electrical systems alone to manage operational risk. While the reliability of a data center may go up with increased equipment redundancy, so do the amount of equipment and the complexity of the design. Industry data suggests that the number one cause of impact incidents in a data center is human error. Anytime a human is involved with mechanical and electrical switching activity in a data center, the risk of an incident increases. More equipment typically means more maintenance, more break/fix activity, and increased operational complexity. With this in mind, simple is better. The less complex the system, the less chance an operator will make a switching error during routine maintenance or emergency break/fix situations.

Every business is unique, and so is its tolerance for risk. Whenever decisions are being made about the design of a data center, they must be weighed against the operational risk tolerance as defined by all key stakeholders, including IT, facilities and executive management. Understanding your risk tolerance is extremely important because the cost of a data center rises in proportion to, or perhaps exponentially with, increases in resiliency, reliability and maintainability. If you get this part wrong, you are subject to one of two outcomes: too little spend may expose you to operational incidents that financially impact your company, while too much spend could increase your total cost of operation and hurt your competitive advantage.
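
The trade-off between the two outcomes can be framed as a small selection problem: discard the options whose residual risk exceeds the agreed tolerance, then pick the lowest total of spend plus expected loss. This is only a sketch; the `DesignOption` type, the option names and every figure are hypothetical placeholders for numbers your own stakeholders would supply.

```python
from typing import NamedTuple, Optional, Sequence


class DesignOption(NamedTuple):
    name: str
    annual_cost: float           # annualized capital plus operating expense
    expected_annual_loss: float  # probability x impact of residual incidents


def cheapest_within_tolerance(options: Sequence[DesignOption],
                              max_acceptable_loss: float) -> Optional[DesignOption]:
    """Return the option with the lowest total cost among those that meet
    the risk tolerance, or None if no option meets it."""
    acceptable = [o for o in options if o.expected_annual_loss <= max_acceptable_loss]
    return min(acceptable,
               key=lambda o: o.annual_cost + o.expected_annual_loss,
               default=None)


# Hypothetical options, for illustration only.
options = [
    DesignOption("N", 100_000, 500_000),
    DesignOption("N+1", 180_000, 60_000),
    DesignOption("2N", 400_000, 10_000),
]

choice = cheapest_within_tolerance(options, max_acceptable_loss=100_000)
print(choice.name if choice else "no option meets the tolerance")
```

Tightening the tolerance in this example (say, to 20,000) pushes the choice to the more expensive option, which mirrors how a stricter risk tolerance drives up the cost of the design.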

Everybody strives to spend less and get more. But in this case you need to ask yourself if spending less will get you more risk.  The ultimate objective is to strike the right balance between facility resiliency and IT application resiliency in an effort to maximize the overall reliability and minimize the total cost of operation.

