If you want to protect your data in an emergency – whether that means encryption by ransomware or the aftermath of a natural disaster – you need to understand that disaster recovery (DR) and IT service continuity management (ITSCM) must be fully integrated into the technology and stakeholder ecosystem.
By Thomas Sandner, senior director: technical sales Germany at Veeam Software
However, DR requires considerable staff and time to remain aligned with, and appropriately responsive to, the needs of the business.
DR and ITSCM (as an important component of business continuity) are all about one thing: aligning business functionality with IT dependencies and then determining how those business processes can continue if IT systems are disrupted.
This doesn’t just involve server failover. It’s also about planning for the complexity of re-homing IT resources and ensuring that plans and processes are well documented and depend as little as possible on people. Because people – as every employee and every CISO must admit – make mistakes, especially under stress.
Therefore, applications must be digitally resilient to maintain business processes. In turn, for applications to be resilient, servers must be recoverable, not just data. The process of doing this on alternative infrastructure (on-premises, off-premises or in the cloud) is not trivial and requires careful planning.
For servers to come back online on alternate infrastructure, three things are required (see the sketch after this list):
* Orchestrated workflows: a variety of detailed steps that are planned by humans but executed in an automated fashion.
* Routine testing of these complex workflows in an isolated sandbox, so as not to impact the production network and resources.
* Monitored service level agreements (SLAs), along with documented processes and proven readiness in the event of an emergency.
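To make this concrete, the following is a minimal sketch of how such an orchestrated workflow might be expressed as data and executed automatically. It is not any vendor's actual format: the plan structure, step names and execute_step() helper are illustrative assumptions.

```python
# Minimal illustrative sketch of an orchestrated DR workflow.
# The plan structure, step names and targets are hypothetical --
# real orchestration tools ship their own plan formats.

import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("dr-plan")

DR_PLAN = {
    "name": "failover-erp-to-cloud",
    "rto_minutes": 60,           # agreed recovery time objective
    "steps": [                   # planned by humans ...
        "verify_latest_backup",
        "provision_cloud_vms",
        "restore_database",
        "restore_app_servers",
        "switch_dns_to_dr_site",
        "run_smoke_tests",
    ],
}

def execute_step(step: str) -> bool:
    """Placeholder for the real automation behind each step."""
    log.info("executing step: %s", step)
    return True  # a real implementation would report success or failure

def run_plan(plan: dict) -> bool:
    """... but executed automatically: run every step in order, abort on failure."""
    for step in plan["steps"]:
        if not execute_step(step):
            log.error("step failed: %s -- plan %s aborted", step, plan["name"])
            return False
    log.info("plan %s completed", plan["name"])
    return True

if __name__ == "__main__":
    run_plan(DR_PLAN)
```

The point of the structure is that every detailed decision is made in advance, in writing, so that execution in a crisis requires no improvisation.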
Human error must be compensated for preemptively
It often turns out that many people cannot meet these key requirements optimally – and, given their proneness to error, should not be expected to. The requirements in question are:
* Consistently performing detailed tasks across multiple different servers, often under the stress of a crisis such as a natural disaster or cyber-attack.
* Testing constantly, because tests are often considered less important and are subordinated to other tasks in day-to-day business.
* Monitoring and documenting, which are neglected because these tasks are judged even less urgent and important than testing.
Orchestration becomes more essential due to complex backup scenarios
There are two categories of threat to data: analog damage, such as fires, floods, earthquakes or sabotage, and digital damage, such as hacker attacks and ransomware. Modern orchestration to protect the network environment against these disaster scenarios should always be developed in close alignment with the in-house IT architecture.
If this is not the case, the result is a so-called silo solution: one that stands alone, is not aligned with the other security solutions in the network environment, and should therefore be avoided.
DR and business continuity (BC), however, must be in harmony with existing IT security, and fortunately most companies have recognised the importance of this synthesis: according to a recent study, 82% of companies have fully or mostly aligned their BC and DR measures with IT security.
Given that 85% of organisations surveyed are hit by at least one ransomware attack annually, and one in four servers fails at least once a year, perfectly aligned orchestration for emergencies is more important than ever.
The second question companies need to ask themselves is where their data will be recovered. The status quo is split between two scenarios: 54% of organisations restore their data on-premises, within the company network, while 46% recover their data in the cloud – mostly via a hyperscaler such as AWS, Azure or Google Cloud, optionally using virtual machines or automatic fallback to backup servers in the event of a disaster.
As these two solutions are often combined, complexity increases and a well-planned, automated strategy that takes effect in the event of an emergency becomes essential. It’s impossible to manage by hand.
However, only 18% of organisations have orchestrated workflows – the remaining 82% rely on (sometimes outdated) scripts or even recreate work processes manually. In the worst case, these already less reliable measures are not even subjected to regular tests of their functionality.
Disaster recovery is more than just an emergency exercise
This is exactly the difference between a backup routine “by the book” and DR as a reliable emergency plan.
There are good reasons why there are periodic alert drills in schools, nationwide warning days and siren tests: simulated crises put emergency plans to the test to see if they can protect what they are supposed to protect. Why should data protection be any different?
Accordingly, testing of DR plans must routinely occur in an automated fashion and be documented.
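As a rough illustration of what "automated and documented" can mean in practice, the sketch below runs a hypothetical DR plan and writes a timestamped test report. Here, run_plan() is a stand-in for the orchestration sketch shown earlier, and a real solution would execute the test in an isolated sandbox, never against production.

```python
# Illustrative sketch of an automated, documented DR test run.
# run_plan() stands in for the orchestration sketch shown earlier;
# a real solution would execute the plan in an isolated sandbox.

import json
from datetime import datetime, timezone

def run_plan(plan: dict) -> bool:
    """Stand-in for the orchestrated workflow from the earlier sketch."""
    return True

def run_sandbox_test(plan: dict) -> dict:
    """Execute the DR plan in a sandbox and write a timestamped report."""
    started = datetime.now(timezone.utc)
    success = run_plan(plan)  # sandboxed execution, isolated from production
    report = {
        "plan": plan["name"],
        "started": started.isoformat(),
        "finished": datetime.now(timezone.utc).isoformat(),
        "success": success,
    }
    # Persisting each result documents readiness -- useful for audits.
    with open(f"dr-test-{started:%Y%m%d-%H%M%S}.json", "w") as fh:
        json.dump(report, fh, indent=2)
    return report

if __name__ == "__main__":
    run_sandbox_test({"name": "failover-erp-to-cloud"})
```

A run like this can be triggered on a schedule, and the accumulated reports become the documentation trail that auditors ask for.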
Here, organisations face two additional questions: how quickly must they be able to recover their data in the event of a disaster (RTO – recovery time objective), and what level of data loss can be tolerated (RPO – recovery point objective)?
Aiming for a value close to zero in both cases is not overambitious but the only correct approach when you consider how high the costs of a business interruption can be: in 2021, IT managers estimated those costs at over 80,000 euros per hour.
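A back-of-the-envelope calculation with that hourly figure shows what is at stake; the outage durations below are hypothetical examples:

```python
# Back-of-the-envelope downtime cost, using the ~80,000 EUR/hour
# estimate cited above; the outage durations are hypothetical.

COST_PER_HOUR_EUR = 80_000

for outage_hours in (0.5, 4, 24):
    cost = outage_hours * COST_PER_HOUR_EUR
    print(f"{outage_hours:>5} h outage ≈ {cost:,.0f} EUR")

# Output:
#   0.5 h outage ≈ 40,000 EUR
#     4 h outage ≈ 320,000 EUR
#    24 h outage ≈ 1,920,000 EUR
```

Against numbers like these, every hour shaved off the RTO pays for itself quickly.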
A common misconception, especially among executives concerned about productivity, is that regular testing can impact day-to-day operations.
DR testing, however, whether scheduled or on-demand, can be performed in the background without disrupting employees. In general, orchestrating dedicated DR plans does not come at the expense of accessibility: both individual applications and large files can be restored with just one click using the right solution.
Moreover, the benefits of smooth orchestration let managers sleep easy before audits. Regularly tested, functioning and, above all, documented and therefore presentable recovery procedures are a criterion of excellence for auditors.
Plans that satisfy compliance requirements, in turn, close the loop to the previously set RPO and RTO goals: those who can prove that their disaster recovery plans recover data quickly and seamlessly will get back on their feet fast in the event of a business interruption and will also shine during annual audits.
Keeping up with the times
Orchestrated data protection – like economic or human resource goals – should be a part of any business strategy. Managing the digital and analog threats that data can face requires defined, automated, and orchestrated steps that eliminate human carelessness.
Testing in an isolated sandbox, in particular, is essential to determine whether internal countermeasures are doing their job despite changing circumstances. Manual processes and old scripts as an emergency solution must be a thing of the past so that data in the data center, in the cloud or in hybrid environments is protected seamlessly and is available at all times.