What the CrowdStrike outage has taught us

In today’s always-on, cloud-driven technology environment, companies and their customers expect their IT services to run constantly. While no system is completely error-free, any outages or downtime should be measured in seconds, or minutes at the most.

By Richard Firth, CEO of MIP Holdings

An outage lasting days or weeks is almost unheard of, and more than a week of downtime is not only unacceptable, but could put even the largest organisations out of business. The recent CrowdStrike outage is a perfect illustration of this, with Delta Air Lines not only having to cancel about 7 000 flights over five days, but also facing an investigation by the US Department of Transportation over the disruptions.

Estimates put the airline’s loss at around $500-million, excluding the cost of regulatory and legal action facing the company as a direct result of the outage. Delta wasn’t the only business affected, with banks and hospitals also having to deal with the repercussions of what some are calling the world’s largest IT outage.

According to Microsoft, 8,5-million Windows computers around the world crashed as a result of a bug in a CrowdStrike update, and it took 10 days for the company to fix the problem. It’s no wonder that the security software company is facing multiple lawsuits, one of which was launched by its own shareholders, who have accused CrowdStrike of making “false and misleading” statements about its software testing.

Delta’s CEO, Ed Bastian, has publicly faulted both CrowdStrike and Microsoft for failing to provide an “exceptional service”. Both tech companies have responded with declarations that they will be defending themselves “aggressively” and “vigorously” in the case of further legal action. Microsoft has tried to pass the responsibility back to Delta Air Lines, saying its preliminary review suggested that Delta, unlike its competitors, apparently had not modernised its IT infrastructure.

Why Microsoft should stay in its lane

When we make use of cloud services, we trust those providers to follow thorough testing procedures before making changes to their infrastructure. If they don’t, a CrowdStrike scenario will inevitably happen. Microsoft trusted CrowdStrike to the point that it accepted updates pushed by CrowdStrike directly into its production Azure infrastructure.

While CrowdStrike was to blame for the fault, Microsoft should surely have had processes in place to test updates on “canary servers” before allowing them into production.

And the same should be true of any IT service. If you choose to outsource critical services to external providers, you expose yourself to the quality of their processes. If you choose to keep it in-house, you remain in control of the phases of roll-out to production. Of course, many people who did keep their stuff in-house still suffered – because they did not implement any “canary server” testing themselves.
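For readers who want to see what “canary server” testing looks like in practice, the idea can be sketched in a few lines of Python. This is a minimal illustration only, not a description of how Microsoft, CrowdStrike or anyone else actually deploys updates: the function name staged_rollout, the apply_update and is_healthy callbacks and the 1%, 10%, 100% stage sizes are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    updated: List[str]   # hosts from stages that passed their health checks
    halted: bool         # True if the rollout stopped before reaching every host
    failed_stage: int    # index of the stage that failed, or -1 if none did


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], None],   # hypothetical: installs the update on one host
    is_healthy: Callable[[str], bool],     # hypothetical: health probe run after the update
    stage_fractions: Sequence[float] = (0.01, 0.10, 1.0),  # assumed stage sizes
) -> RolloutResult:
    """Update hosts in growing batches, stopping at the first unhealthy batch."""
    updated: List[str] = []
    done = 0  # number of hosts covered by earlier stages
    for stage, fraction in enumerate(stage_fractions):
        # Cumulative number of hosts that should be updated by the end of this stage.
        target = min(len(hosts), max(1, int(len(hosts) * fraction)))
        batch = hosts[done:target]
        for host in batch:
            apply_update(host)
        # The first, smallest batch plays the role of the canary servers.
        if any(not is_healthy(host) for host in batch):
            return RolloutResult(updated, True, stage)
        updated.extend(batch)
        done = max(done, target)
    return RolloutResult(updated, False, -1)
```

With 200 hosts and these assumed stage sizes, a faulty update would reach two machines before the rollout halts, rather than the entire estate at once.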

While Microsoft has been happy to play the blame game with CrowdStrike, the reality is that the software giant has been pushing Office 365 into every type of business functionality it can, including mission-critical and customer-facing operations like billing services and call centres.

A situation like the CrowdStrike outage just highlights how short-sighted a complete reliance on Microsoft products can be for organisations that require more specialised and reliable solutions.

For years, companies have been increasingly buying into the Microsoft PR that the software giant can provide everything they need, but this has resulted in organisations placing all of their proverbial eggs in one basket.

This not only increases the risk of something going wrong, it also makes problems harder to solve when the fix depends on software developers in another time zone who may not understand the urgency or magnitude of an outage.

There’s no doubt that Microsoft excels in certain areas, but there is a reason that software companies like MIP exist, and that reason is their ability to design and develop solutions tailored to the specific needs of organisations. Using specialist solutions not only ensures that companies can provide uninterrupted service to their customers, but also that security and other risks are minimised.

It’s all about skills

Unfortunately, Microsoft’s success is partly a result of the fact that very few software engineering companies have the skills and capabilities to deliver specialised solutions to organisations like Delta Air Lines.

In some cases, the lack of entrepreneurial skills in building IT platforms shows only in the ubiquity of out-of-the-box solutions that require a lot of investment to perform properly. In others, this lack is causing difficulties in business processes, directly affecting how well companies can operate.

If more people had the development skills needed to create tailored solutions, and the skills to integrate them effectively with common programs like those offered by Microsoft, companies would have access to a broader variety of tools. This would not only ensure better recourse for companies dealing with any tech challenges, but would also ensure that the technologies used were chosen to mitigate any risks.

Microservices, for example, would have ensured that the impact of the CrowdStrike outage was limited at every organisation affected, allowing companies to continue to operate while the problem was being fixed. Microservices would also have negated Microsoft’s complaint that Delta Air Lines hadn’t modernised its IT environment, allowing for specific services to be organised around business capabilities rather than infrastructure.

If the CrowdStrike outage proved anything, it’s that software development skills are more important than ever. In today’s technology-driven world, everyone should have a programming or software engineering background, if only to be able to understand CrowdStrike’s explanation of what caused the outage and how it intends to ensure this type of scenario never happens again.

Maybe the biggest lesson here is this: you can’t simply outsource everything and assume it will run perfectly. Ultimately, you remain responsible for your business operations, and if you choose to trust someone else to do something for you, you may be shifting some workload, but you cannot really shift responsibility. You should still be cautious. And if you take the risk of outsourcing, don’t cry when the risk materialises.