What to do when disaster strikes

Kathy Gibson reports from SMEXA 2015 – Organisations that don’t prepare for major incidents could end up having to deal with uncontrolled chaos.

This is the word from Thinking Dimensions’ Adriaan du Plessis, who thinks that incident management should be made a process within enterprises.

When service professionals think about incident management, they sometimes think too much about their processes – and forget about the unhappy customers.

“Think about the growth of online users, and the Internet of Things. You have more users using the outputs of IT,” says Du Plessis. “At the same time, the ability of the user to talk in IT terms is going down – so the gap between users and IT is getting bigger.

“The user doesn’t give a hoot about IT’s problems, so the pressure is not going to get better; it is going to get worse.”

As incidents become more severe, the customer experience becomes worse and the business becomes more unstable, Du Plessis adds.

The outcomes that IT should try to avoid, therefore, start with a long time to restore service, accompanied by enormous pressure to fix the problem.

“And we have to accept that, while we have very skilled people, we also hold opinionated views so we could have a lot of unnecessary debate among ourselves, and this could be destructive.”

There is a lot of stress, which needs to be avoided if possible. “And often, because we are desperate, we go for trial and error.”

There are processes that should be followed in any incident, however, says Du Plessis.

“You should be able to focus on the trouble area, eliminate all the issues that are not important. Unfortunately, unless we are continuously progressing, someone will put a flame thrower on us, so we have to show progress.”

It’s also important to manage client expectations, and give them continuous feedback while aiming for an appropriate resolution in good time, Du Plessis adds.

“So incident management is a pretty structured process. You plan it up front, structure it – there is a way to do these things.”

The basic skills required for major incident management include technical cause analysis, which makes it possible to conduct a proper service recovery analysis and get the service back up to standard as quickly as possible. Soon after the incident, a root cause analysis can be done to prevent the issue from happening again.

“This is important, because the technical cause is why something happened, but the root cause is because someone did something or failed to do something.”

There are three roles in managing any major incident, Du Plessis says. “There has to be a stream in there that works the restoration; a stream to find the cause; and there has to be management of the process.”

Unfortunately, technical knowledge alone will not go very far in managing and resolving a major incident. The first thing to do is contain the problem, then figure out a workaround, followed by analysis and then restoration.

“You want the team to find the technical cause, decide how to restore the service, manage the process priorities, and communicate this. Ensuring it doesn’t happen again is next week’s problem, but it must happen.”

Importantly, there must be one person who manages each incident. “They need to get everyone together and define what the problem is.

“Then we can get someone to work on containment, and someone to get busy with a workaround. This means you need someone to lead the restoration function and someone to be the incident lead.”

Once these are done, the process to restore the service can begin, with someone charged with monitoring progress. Analysis, meanwhile, should go on in parallel and feed back to the restoration team.

“There is a case to be made for very direct roles in an incident, and very specific skills for people to function in a major incident.”

The main message, says Du Plessis, is for IT to think carefully when it devises internal roles and values for major incident management.