What CIOs can do when the lights go out

Kathy Gibson reports from Data Centre Central – CIOs can do a lot to mitigate the effects of an unpredictable power supply – in fact, most data centres that function in the mission-critical realm know exactly what it is to run at least four hours out of every 72 hours without mains power.
“There is a cost to this,” says Lee Smith, director: data centre services and training at Dee Smith Consulting. “And this means there is a loss of funds.”
The last few months have been quiet on the power supply issue, but Smith cautions that this is not the time to be complacent.
“The demand for data centre space is growing, and these issues are going to come to a head again, clashing heads with unpredictability of power supply.”
Power outages are a reality and data centres are in the thick of it, Smith stresses. “Generators have run in anger longer than originally anticipated – and not everyone used continuous rating on their generators.”
This means fuel usage has increased, while static UPS batteries are also losing their capacity more quickly.
The increased costs mean that budgets are being redirected to cover operations.
“This all takes strain on the management of your data centre, and the service you provide your client through the data centre. Some organisations have suffered severe impacts.”
The impact on the data centre starts with the increased cost of operations. The fuel cost is obvious, but water, sanitation, transport, time loss and logistics all have to be taken into account.
SLAs (service level agreements) have not been adapted to reflect the changing environment, failing to take into account the increase in fuel demand. A shortened duration between maintenance windows means costs increase even more.
There is a knock-on effect of increasing unscheduled maintenance, which affects not just the equipment but people as well, leading to increased staff fatigue.
“Cheap equipment is becoming expensive to run,” Smith says. “The cheap equipment is not doing what it should under difficult circumstances.”
Data centre documentation is also rearing its head as a problem, and CIOs are finding it difficult to predict outcomes.
A global shortage of suitably skilled data centre staff just adds to the problem – in fact, Smith believes 1,5-million people could easily be absorbed into the market, particularly in data centre facilities.
Smith suggests that CIOs accept the fact that they can do nothing about the power supply, and not panic when the power does go down.
“You must prevent the loss of operations after the utility failure, minimise the potential for component failure, and minimise the impact of any components that do fail,” Smith says.
It is also important to reduce the risk of human error within data centre operations, especially under emergency conditions.
“This means you need a well-trained and motivated data centre team. That is the best asset you can have in your data centre.”
To prevent sustainability issues, Smith suggests that CIOs look at what can fail, and assess the impact if it does fail. “How will you recover?” he asks.
“If you know that you can’t or won’t recover, you must understand the processes you need to invoke disaster recovery or business continuity – and communicate to all stakeholders.”
It’s important to identify and address any single points of failure, he says, conducting power audits and assessments – preferably by an outsider. This would include fuel reticulation from bunker to generator.
“Keep track of your needs and your current peak operational load – so you know in the event of a failure that you have enough to deal with the minimum need.”
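As a rough illustration of the kind of check Smith describes, the sketch below estimates how long the fuel on site would carry the current peak load, and how much more fuel would be needed to reach a target runtime. The function names and every figure in the example are hypothetical and do not come from the presentation. It also assumes fuel burn scales linearly with load, which real generators only approximate.

```python
def runtime_hours(usable_fuel_litres: float,
                  peak_load_kw: float,
                  litres_per_kwh: float) -> float:
    """Estimate generator runtime at the current peak load.

    Assumption: fuel burn is proportional to load (litres_per_kwh is constant).
    """
    if peak_load_kw <= 0 or litres_per_kwh <= 0:
        raise ValueError("load and consumption rate must be positive")
    burn_litres_per_hour = peak_load_kw * litres_per_kwh
    return usable_fuel_litres / burn_litres_per_hour


def fuel_shortfall_litres(usable_fuel_litres: float,
                          peak_load_kw: float,
                          litres_per_kwh: float,
                          required_hours: float) -> float:
    """Return how many extra litres are needed to reach required_hours (0 if none)."""
    needed = peak_load_kw * litres_per_kwh * required_hours
    return max(0.0, needed - usable_fuel_litres)


if __name__ == "__main__":
    # Hypothetical site: 5 000 litres usable, 400 kW peak load, 0.25 litres per kWh.
    print(runtime_hours(5000, 400, 0.25))              # hours of autonomy
    print(fuel_shortfall_litres(5000, 400, 0.25, 72))  # extra litres for 72 hours
```

In practice the consumption rate would come from the generator's own data sheet or from measured burn during real outages rather than from a single constant.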
CIOs need to assume they could lose an entire power distribution line. “There are many data centres, especially clients in co-location scenarios, who have only single power on their ICT equipment,” Smith says. “In that case, the equipment is going to fail if there are power outages. Make sure you and your clients understand the risk, and what needs to be done about it.”
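One way to make the single-feed risk visible is to run it against an equipment inventory. The sketch below is a minimal example under assumed names: the `Device` record, the feed labels "A" and "B", and the sample equipment are all hypothetical. It lists devices that draw from fewer than two distinct feeds, and the devices that would go dark if a given feed were lost.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Device:
    name: str
    feeds: Tuple[str, ...]  # power feeds the device is plugged into, e.g. ("A", "B")


def single_fed_devices(devices: List[Device]) -> List[str]:
    """Names of devices connected to fewer than two distinct feeds."""
    return [d.name for d in devices if len(set(d.feeds)) < 2]


def devices_lost_if_feed_fails(devices: List[Device], failed_feed: str) -> List[str]:
    """Names of devices left with no power when failed_feed goes down."""
    return [d.name for d in devices if not (set(d.feeds) - {failed_feed})]


if __name__ == "__main__":
    # Hypothetical inventory for illustration only.
    inventory = [
        Device("core-switch", ("A", "B")),
        Device("client-server-1", ("A",)),
        Device("client-server-2", ("B", "B")),  # two cords, same feed
    ]
    print(single_fed_devices(inventory))
    print(devices_lost_if_feed_fails(inventory, "A"))
```

Note that a device with two cords on the same feed is still treated as single-fed, since losing that one distribution line takes it down.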
Some organisations are paying a premium to fuel suppliers to ensure that they are first in line when emergency fuel supplies are needed, so companies need to ensure they will get the fuel they need.
Smith suggests that CIOs monitor and measure the performance of all equipment, especially under emergency circumstances.
And he recommends that there are procedures in place to deal with any interruption for the data centre at a system and component level. “Under these conditions things can fail sooner than expected, so make sure you deal with emergency failures as well.”
It’s not just the impact of losing equipment, but the consequential impact that needs to be understood, with procedures in place to deal with it.
A highly redundant data centre is not a guarantee against downtime, Smith warns. “Make sure you understand that even a Tier IV data centre is not a guarantee – and there have been some spectacular failures.”
The level of operations, management and leadership is what will be the differentiator, he adds. “It doesn’t matter how good the data centre is. What’s important is the well-trained professionals and a consistent culture of improvement.”
Staff redundancy also needs to be considered, in case staff leave or cannot get onsite because of other issues.
“It’s not what you have in your data centre environment – it’s what you do with it,” Smith stresses. “You have to continuously focus on what you have to do, and what needs to be improved. Complacency is the enemy.”
To mitigate risks, Smith suggests that all site documentation is evaluated – the processes, procedures, disaster recovery plan – and the re-instate to normal (RTN) plan. “Look to safety and security, with particular attention to levels of security when things are under pressure.”
Alarmingly, most documentation in data centres is either outdated, or cannot be found – and CIOs are urged to get their documentation in order.
They should also adapt processes and procedures as site conditions change, and should check or update data centre documentation at least annually.
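The annual review rule is simple enough to automate. The sketch below is a minimal example that assumes a register of last-review dates is kept somewhere; the document names, dates and the `overdue_documents` helper are hypothetical. It flags any document not reviewed within the past 365 days.

```python
from datetime import date, timedelta
from typing import Dict, List


def overdue_documents(last_reviewed: Dict[str, date],
                      today: date,
                      max_age_days: int = 365) -> List[str]:
    """Return the names of documents last reviewed more than max_age_days ago."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, reviewed in last_reviewed.items()
                  if reviewed < cutoff)


if __name__ == "__main__":
    # Hypothetical review register for illustration only.
    register = {
        "Disaster recovery plan": date(2022, 1, 10),
        "RTN plan": date(2024, 3, 1),
        "Generator start-up procedure": date(2023, 2, 20),
    }
    for name in overdue_documents(register, today=date(2024, 6, 1)):
        print("Review overdue:", name)
```

A check like this only covers review dates. It cannot tell whether a document still matches site conditions, which is why procedures also need updating whenever the site changes.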
“Also, consider the impact on the workforce in terms of continuity or redundancy,” Smith says.
“And what happens if the ‘guru’ cannot get to site in an emergency, or is away?”
A gap analysis around operational readiness should be conducted, Smith adds. “What do you assume you should be able to do; and what can you actually do?
“What support will you get if you ask for it? What is the gap that needs to be closed?”
A worst-case scenario study should be conducted. “Be honest,” says Smith. “Check SLAs and see what’s covered under what circumstance. Check your maintenance contracts and warranties.
“And ensure you are monitoring everything that is relevant in the data centre. And can you do this manually if you need to?”
Training and assessment is critical, Smith says. “The more people run through scenarios, the more confident people become, and everyone will benefit.”
The data centre does not operate in isolation, he adds. “It’s not just the data centre, but everything around it could have an impact on the organisation. If the rest of the distribution network is not capable of sustaining the business, there is no point in having a data centre.”
Critically, CIOs shouldn’t forget the water situation. “Consider what might happen if there is a long water outage: how long could the data centre survive and sustain your organisation and client interests? Do you have the ability to procure water provisioning if needed? What can you switch off and what must stay running? And can you improve efficiencies in water utilisation?”
