Recovery process

The recovery process consists of a preliminary stage (preparing a DR plan), regular testing of the DR strategy, and the actions taken in case of a disaster. These actions are: configuring and launching a cloud site, changing the settings of a running cloud site, failing back to production and, finally, removing the cloud site.

Create and prepare DR plans

DR plans are scenarios for the recovery process in case of a disaster. They describe machines (number of vCPUs, RAM, rank, etc.) and networks. For details on the rules for creating DR plans and their syntax, refer to the section ACP - Disaster Recovery plans.
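To make the kind of information a DR plan carries more concrete, here is a minimal Python sketch. It is not the actual DR plan syntax (that is covered in ACP - Disaster Recovery plans). The class names, the field names, the `cidr` attribute and the reading of rank as a start order are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    vcpu: int     # number of virtual CPUs
    ram_mb: int   # RAM in megabytes
    rank: int     # assumption: machines with a lower rank are started first


@dataclass
class Network:
    name: str
    cidr: str     # hypothetical attribute, e.g. "10.0.0.0/24"


@dataclass
class DRPlan:
    machines: list = field(default_factory=list)
    networks: list = field(default_factory=list)

    def start_order(self):
        """Return machine names sorted by rank, then by name."""
        ordered = sorted(self.machines, key=lambda m: (m.rank, m.name))
        return [m.name for m in ordered]


# Illustrative data only: a database that should come up before the application.
plan = DRPlan(
    machines=[
        Machine(name="app", vcpu=4, ram_mb=8192, rank=1),
        Machine(name="db", vcpu=8, ram_mb=16384, rank=0),
    ],
    networks=[Network(name="prod", cidr="10.0.0.0/24")],
)
print(plan.start_order())
```

Keeping the machine and network descriptions in one structure mirrors the idea that a single plan is sufficient to recreate the whole site.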

image0

Having an up-to-date DR plan at hand is what makes a quick recovery possible in case of a disaster.

DR plan testing

Protection from disaster is not limited to a one-time creation of a DR plan. The plan’s relevance has to be checked continuously, and the DR strategy should be tested regularly, with a recommended frequency of once every 3-4 weeks. The following steps are enough to perform this check:

  • create cloud sites from a DR plan at regular intervals;
  • carry out a set of tests (these should be prepared as part of the DR strategy and adapted to the customer’s infrastructure; the customer and the DR service provider are jointly responsible for preparing them).
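The recommended cadence can be tracked with a simple date check. The sketch below is an illustration only: the function name and the choice of four weeks (the upper bound of the 3-4 week recommendation) as the threshold are assumptions.

```python
from datetime import date, timedelta

# Assumption: treat the upper bound of the recommended 3-4 week interval as the limit.
MAX_TEST_AGE = timedelta(weeks=4)


def dr_test_overdue(last_test: date, today: date) -> bool:
    """Return True if more than MAX_TEST_AGE has passed since the last DR test."""
    return today - last_test > MAX_TEST_AGE


if __name__ == "__main__":
    # Example with made-up dates: 31 days since the last test.
    print(dr_test_overdue(date(2024, 1, 1), date(2024, 2, 1)))
```

A check like this could run from a scheduler and remind the responsible team when a new test cloud site is due.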

Based on the test results, the DR plan is adjusted (descriptions of new machines are added, obsolete ones are removed). If issues with the customer’s business application are found, the recovery process is troubleshot. For example, one of the machines may show a blue screen (BSOD) on startup.

Testing steps:

  1. Update the DR plan.
  2. Create a cloud site from the current recovery point.
  3. Run a set of tests in the running cloud site.
  4. Update and adjust the DR plan, troubleshoot problems.

Following these steps allows the customer to prepare in advance for a disruptive event and to minimize the related problems.
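Adjusting the plan after a test largely means comparing the machines it describes with the machines that currently exist in the customer infrastructure. The following sketch shows one way to compute that difference. The function name and the use of plain machine names as identifiers are assumptions for illustration.

```python
def plan_drift(plan_machines, live_machines):
    """Compare machine names in the DR plan with those in the live infrastructure.

    Returns the machines to add to the plan (new in the infrastructure) and the
    machines to remove from it (obsolete, no longer present).
    """
    plan_set = set(plan_machines)
    live_set = set(live_machines)
    return {
        "add": sorted(live_set - plan_set),
        "remove": sorted(plan_set - live_set),
    }


if __name__ == "__main__":
    # Made-up inventory: "cache" is new, "old-report" has been decommissioned.
    drift = plan_drift({"db", "app", "old-report"}, {"db", "app", "cache"})
    print(drift)
```

An empty result for both keys suggests the plan still matches the infrastructure, at least at the level of machine names.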

Recovery at the time of an accident

In case of an accident, start a cloud site (failover) to restore the business application on the backup site. Use the previously prepared DR plan as the recovery scenario.

A detailed description of the recovery process configuration can be found in the section ACP - Recovery process.

image1

The recovery process may take some time: its duration depends on the complexity of the DR plan and on the orchestration and dependency levels between the components of the business application. As soon as all components reach the Active status, the business application is ready for operation, and it is possible to proceed to the next stage, which includes transferring the main traffic to the backup site and setting up individual components.
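The readiness condition described here (every component is Active) can be expressed as a small polling loop. This is a sketch under stated assumptions: `fetch_statuses` is a hypothetical callable that returns a mapping of component names to status strings, it does not correspond to a documented API, and status names other than Active are made up.

```python
import time


def pending_components(statuses):
    """Return the names of components that are not yet Active, sorted."""
    return sorted(name for name, status in statuses.items() if status != "Active")


def wait_until_active(fetch_statuses, max_polls=60, interval_s=10.0, sleep=time.sleep):
    """Poll until all components are Active or max_polls is exhausted.

    fetch_statuses is a hypothetical callable supplied by the caller.
    Returns True when the site is ready, False on timeout.
    """
    for attempt in range(max_polls):
        statuses = fetch_statuses()
        # An empty mapping is treated as "not ready" rather than trivially ready.
        if statuses and not pending_components(statuses):
            return True
        if attempt < max_polls - 1:
            sleep(interval_s)
    return False
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays. Only after such a check succeeds does it make sense to move on to traffic redirection.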

Attention

Redirecting the main traffic to a DC is not part of the solution’s current functionality and should be coordinated with the service provider in advance.

To save time, it is recommended to run a simplified set of tests in the cloud site before switching production traffic to the new target.

Cloud site settings

After a cloud site is created, a number of additional actions can be performed to configure the runtime of the business application, such as adding machines to the cloud site or stopping and starting machines. Cloud sites and their settings are described in the section ACP - Cloud sites.

image2

During DR strategy testing, or after restoring the main site and reconciling the accumulated changes, it is recommended to remove redundant cloud sites so that they do not consume resources unnecessarily.

Failback to production

Once the main site is restored after an accident, the business application usually has to be failed back to its original site together with all the changes accumulated on the backup platform since the launch of the cloud site, and user traffic has to be redirected accordingly.

The process consists of:

  1. Downloading an agent with the prepared DR plan and running it in the production environment to download changes from the last recovery point.
  2. Testing the business application in the production environment.
  3. Stopping machines in the cloud site to finalize the changes.
  4. Downloading the remaining changes from the cloud site to the new production.
  5. Starting machines in production and redirecting traffic accordingly.
  6. Protecting the new production.
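Because the order of these steps matters (for example, machines in the cloud site are stopped before the final changes are downloaded), a runbook can enforce the sequence. The sketch below is a hypothetical checklist, not part of the product: the step identifiers and the class name are assumptions that simply mirror the list above.

```python
# Hypothetical identifiers mirroring the six failback steps listed above.
FAILBACK_STEPS = (
    "download_agent_and_sync_last_recovery_point",
    "test_application_in_production",
    "stop_cloud_site_machines",
    "download_changes_from_cloud_site",
    "start_production_and_redirect_traffic",
    "protect_new_production",
)


class FailbackChecklist:
    """Tracks failback progress and rejects steps completed out of order."""

    def __init__(self):
        self.completed = []

    def next_step(self):
        """Return the next expected step, or None when all steps are done."""
        if len(self.completed) < len(FAILBACK_STEPS):
            return FAILBACK_STEPS[len(self.completed)]
        return None

    def complete(self, step):
        """Mark a step as done; raise ValueError if it is not the expected one."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected step {expected!r}, got {step!r}")
        self.completed.append(step)
```

A typical use would be to call `complete()` after each manual or automated action, so that an operator cannot, for instance, redirect traffic before the final changes have been downloaded.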

A detailed description of the failback to production process can be found in the section Failback to production.

Once these steps are complete, the business application is failed back to production with all the changes accumulated since traffic was redirected to the backup site.