Recovery process

The recovery process consists of a preliminary stage of preparing DR plans, regular DR strategy testing, and the actions taken in case of a disaster. "Actions in case of a disaster" means: configuring and launching a cloud site, changing the configuration of a running cloud site, failing back to production and, finally, stopping the cloud site.

Create and prepare DR plans

DR plans are scenarios for the recovery process in case of a disaster. They include a description of devices (number of vCPUs, RAM, rank, etc.) and networks. More details about the rules for creating DR plans and their syntax can be found in the section ACP - Disaster Recovery plans.
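To make the contents of a DR plan more concrete, here is a minimal Python sketch of the information listed above. It is not the actual DR plan syntax (that is documented in ACP - Disaster Recovery plans). All class and field names are hypothetical, and the meaning given to rank (devices with a lower rank start first) is an assumption made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    # Hypothetical field names that mirror the attributes named in the text.
    name: str
    vcpus: int
    ram_mb: int
    rank: int  # assumption: a lower rank means the device starts earlier


@dataclass
class DRPlan:
    name: str
    devices: list[Device] = field(default_factory=list)
    networks: list[str] = field(default_factory=list)  # e.g. CIDR strings

    def start_order(self) -> list[list[str]]:
        """Group device names by rank, in ascending rank order."""
        groups: dict[int, list[str]] = {}
        for device in self.devices:
            groups.setdefault(device.rank, []).append(device.name)
        return [groups[rank] for rank in sorted(groups)]


plan = DRPlan(
    name="example-plan",
    devices=[
        Device("app-server", vcpus=4, ram_mb=8192, rank=1),
        Device("db-server", vcpus=8, ram_mb=16384, rank=0),
        Device("web-server", vcpus=2, ram_mb=4096, rank=1),
    ],
    networks=["10.0.0.0/24"],
)
print(plan.start_order())  # [['db-server'], ['app-server', 'web-server']]
```

Keeping the plan as structured data like this makes it easy to review what would be launched, and in which order, before a disaster happens.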

image0

Having an up-to-date DR plan available is enough for quick recovery in case of a disaster.

DR plans testing

Protection from disaster is not limited to the one-time creation of a DR plan: the plan's relevance must be checked constantly, and the DR strategy must be tested regularly (recommended frequency: once every 3-4 weeks). To do this, it is enough to create cloud sites from the DR plan on a regular basis and run a set of tests against them. The set of tests should be adapted to the customer's infrastructure and prepared as part of the DR strategy; the customer and the DR service provider should be responsible for preparing it.
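The recommended testing frequency can be tracked with a few lines of code. The sketch below is only an illustration: the function names are hypothetical, and the 3-week and 4-week bounds simply restate the recommendation above.

```python
from datetime import date, timedelta


def next_test_window(last_test: date) -> tuple[date, date]:
    """Return the earliest and latest recommended dates for the next DR test."""
    return last_test + timedelta(weeks=3), last_test + timedelta(weeks=4)


def is_test_overdue(last_test: date, today: date) -> bool:
    """True if more than 4 weeks have passed since the last DR test."""
    _, latest = next_test_window(last_test)
    return today > latest


last_test = date(2024, 1, 1)
print(next_test_window(last_test))  # (datetime.date(2024, 1, 22), datetime.date(2024, 1, 29))
print(is_test_overdue(last_test, today=date(2024, 2, 5)))  # True
```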

Based on the test results, adjust the DR plan (add descriptions of new devices, remove obsolete ones) and troubleshoot the recovery process if problems occur in the business application. For example, a blue screen (BSOD) may appear when one of the devices is started.

Testing steps:

  1. Update the DR plans.
  2. Create a cloud site from the current recovery point.
  3. Run a set of tests in the running cloud site.
  4. Update and adjust the DR plan, and troubleshoot problems.

Following these steps makes it possible to prepare for an accident in advance and to minimize problems.
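The middle of the testing cycle (creating a cloud site, running the tests and collecting what failed) can be sketched as follows. This is an illustration under stated assumptions, not the product's API: `create_cloud_site`, `delete_cloud_site` and the individual tests are hypothetical callables that you would supply yourself.

```python
from typing import Callable, Dict, List

# Assumption: a test is a callable that returns True on success.
SiteTest = Callable[[], bool]


def run_dr_test_cycle(
    create_cloud_site: Callable[[], str],
    delete_cloud_site: Callable[[str], None],
    tests: Dict[str, SiteTest],
) -> List[str]:
    """Create a cloud site, run every test, remove the site, return failed test names."""
    site_id = create_cloud_site()
    failed: List[str] = []
    try:
        for name, test in tests.items():
            try:
                ok = test()
            except Exception:
                ok = False  # a crashing test counts as a failure
            if not ok:
                failed.append(name)
    finally:
        # Remove the test site so that it does not keep consuming resources.
        delete_cloud_site(site_id)
    return failed


# Usage with stand-in callables instead of a real cloud site.
deleted: List[str] = []
failed = run_dr_test_cycle(
    create_cloud_site=lambda: "test-site-1",
    delete_cloud_site=deleted.append,
    tests={"web responds": lambda: True, "db accepts logins": lambda: False},
)
print(failed)  # ['db accepts logins']
```

The list of failed tests is the input for the last step: adjusting the DR plan and troubleshooting.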

Recovery at the time of the accident

At the time of the accident, start the cloud site (failover) and restore the business application to working order on the backup site. Use the previously prepared DR plan as the recovery scenario.

A detailed description of how to configure the recovery process can be found in the section ACP - Recovery process.

image1

The recovery process may take some time: its duration depends on the structural complexity of the DR plan and on the orchestration and dependency levels between components of the business application. When all components reach the Active status, the business application is ready for operation, and you can proceed to transferring the main traffic to the backup site and setting up individual components.
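Waiting for every component to reach the Active status is a typical polling task. The sketch below assumes a hypothetical `get_statuses` callable that returns the current status of each component by name; only the Active status comes from this document, and the other status names in the example are placeholders.

```python
import time
from typing import Callable, Dict


def wait_until_active(
    get_statuses: Callable[[], Dict[str, str]],
    timeout_s: float = 3600.0,
    poll_interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until all components are Active; return False if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        statuses = get_statuses()
        if statuses and all(s == "Active" for s in statuses.values()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)


# Usage with canned snapshots; "Starting" and "Pending" are placeholder names.
snapshots = iter([
    {"db-server": "Starting", "app-server": "Pending"},
    {"db-server": "Active", "app-server": "Starting"},
    {"db-server": "Active", "app-server": "Active"},
])
ready = wait_until_active(lambda: next(snapshots), sleep=lambda _: None)
print(ready)  # True
```

Only after this check succeeds does it make sense to run the pre-switch tests and move traffic to the backup site.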

The step of switching the main traffic to the DC is not part of the solution's functionality and should be discussed with the service provider in advance.

It is recommended to run a simplified set of tests (to save time) in the cloud site before switching production traffic to it.

Cloud site settings

After the cloud site is created, perform any additional actions needed to set up the business application, such as adding machines to the cloud site or stopping and starting machines. A description of cloud sites and their settings can be found in the section ACP - Cloud sites.

image2

Once DR strategy testing is finished, or after the main site has been restored and the changes have been moved to it, the cloud site needs to be removed so that it does not take up resources.

Failback to production

When the main site is restored after the accident, the business application must be failed back to it together with all the changes accumulated on the backup site since the cloud site was started, and user traffic must be redirected to it.

The process consists of:

  1. Downloading the agent with the prepared DR plan and launching it in the production environment to download changes from the last recovery point.
  2. Testing the business application in production.
  3. Stopping devices in the cloud site to finalize changes.
  4. Transferring changes from the cloud site to production.
  5. Starting devices in production and redirecting traffic to it.
  6. Setting up protection for the new production environment.

A detailed description of the failback process can be found in the section Failback to production.

Once these steps are completed, the business application is failed back to production with all the changes accumulated since traffic was redirected to the backup site.
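Because the failback steps depend on each other (for example, devices in the cloud site are stopped before the final transfer of changes), it helps to think of them as an ordered sequence that halts at the first failure. The sketch below only illustrates that ordering: `run_failback` and `placeholder` are hypothetical helpers, and each action merely prints its name instead of calling the product.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_failback(steps: List[Step]) -> List[str]:
    """Run steps in order, stop at the first failure, return names of completed steps."""
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            print(f"Failback stopped at step '{name}': {exc}")
            break
        completed.append(name)
    return completed


def placeholder(name: str) -> Step:
    """Build a step whose action only prints; a real action would do the work."""
    return name, lambda: print(f"running: {name}")


steps = [
    placeholder("download agent and pull changes from the last recovery point"),
    placeholder("test business application in production"),
    placeholder("stop devices in cloud site"),
    placeholder("transfer remaining changes to production"),
    placeholder("start production devices and redirect traffic"),
    placeholder("protect new production"),
]
completed = run_failback(steps)
```

If a step fails, the returned list shows exactly how far the failback got, which is the starting point for troubleshooting.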