Troubleshooting

Operations and actions on both the agent side and the server side of the solution are logged, as are notifications about the main events.

Notifications

The notification page displays messages about the main events: some of them are informational in nature, but some require intervention from the customer or technical support:

Replication has not run for longer than 2x RPO (recovery point objective) but less than 5x RPO (warning) - check that the replication agent is working and open its console to verify that there are no error messages. The main reasons can be:
  • no connection with AWS or KVM platform. Check connection to AWS or KVM platform and fix the problem.
  • failure of one of the critical agent services. Restart the agent and inform the support team about the problem so that it can be corrected.
  • the agent is turned off. Find out why the agent was stopped. If it was stopped by mistake, turn it on. If the agent turned off on its own, restart it and inform the support team about the problem so that it can be fixed.

DR plan has not been updated for more than 2 months (warning) - it is recommended to update DR plans on a regular basis. Otherwise, updating them will take time after a disaster happens, which directly increases the downtime of business applications. The system will regularly remind you to update the DR plan.

Replication has not started for longer than 5x RPO (error) - check that the replication agent is working and open its console to verify that there are no error messages. The main reasons can be:
  • no connection with AWS or KVM platform. Check connection to AWS or KVM platform and fix the problem.
  • failure of one of the critical agent services. Restart the agent and inform the support team about the problem so that it can be corrected.
  • the agent is turned off. Find out why the agent was stopped. If it was stopped by mistake, turn it on. If the agent turned off on its own, restart it and inform the support team about the problem so that it can be fixed.

Replication error (error) - check that the replication agent is working and open its console to verify that there are no error messages. The main reasons can be:
  • no connection with AWS or KVM platform. Check connection to AWS or KVM platform and fix the problem.
  • failure of one of the critical agent services. Restart the agent and inform the support team about the problem so that it can be corrected.
  • the agent is turned off. Find out why the agent was stopped. If it was stopped by mistake, turn it on. If the agent turned off on its own, restart it and inform the support team about the problem so that it can be fixed.

There are no status updates from the agent for more than 5 minutes (error) - check that the replication agent is working and open its console to verify that there are no error messages. The main reasons can be:
  • no connection with AWS or KVM platform. Check connection to AWS or KVM platform and fix the problem.
  • failure of one of the critical agent services. Restart the agent and inform the support team about the problem so that it can be corrected.
  • the agent is turned off. Find out why the agent was stopped. If it was stopped by mistake, turn it on. If the agent turned off on its own, restart it and inform the support team about the problem so that it can be fixed.

After fixing the problem, mark the notification as Solved in the UI.
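
The warning and error thresholds for replication delay listed above can be expressed as a small shell function. This is only a sketch and not part of the product: the function name classify_replication_lag and the use of seconds as the unit are assumptions. The text does not say how a delay of exactly 5x RPO is treated, so the sketch counts it as a warning.

```shell
# Classify the time since the last replication against the RPO.
# $1 - seconds elapsed since the last replication (assumed unit)
# $2 - RPO in seconds (assumed unit)
# Prints one of: ok, warning, error.
classify_replication_lag() {
    if [ "$1" -gt $(($2 * 5)) ]; then
        echo "error"      # longer than 5x RPO
    elif [ "$1" -gt $(($2 * 2)) ]; then
        echo "warning"    # longer than 2x RPO, up to 5x RPO (assumption for exactly 5x)
    else
        echo "ok"         # within 2x RPO, no notification
    fi
}
```

For example, classify_replication_lag 301 60 prints error, because 301 seconds is more than 5 x 60 seconds.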

Troubleshooting machine replication

Error message at VMware agent start. Check the amount of vCPU and RAM allocated to the agent. Minimum requirements: 2 vCPU and 4 GB RAM; recommended requirements: 4 vCPU and 8 GB RAM. When diagnosing problems, it may be useful to enable detailed logging. To enable it, run the command touch /var/run/hvragent/debug_logs in the agent console; to disable it, run rm -f /var/run/hvragent/debug_logs. The main reasons can be:
  • no connection with AWS or KVM platform. Check connection to AWS or KVM platform and fix the problem.
  • failure of one of the critical agent services. Restart the agent and inform the support team about the problem so that it can be corrected.
  • the agent is turned off. Find out why the agent was stopped. If it was stopped by mistake, turn it on. If the agent turned off on its own, restart it and inform the support team about the problem so that it can be fixed.
  • attempt to run more than one agent for vSphere within the same ESXi host. Leave only one machine with the agent and restart it.
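
The two detailed-logging commands mentioned above can be wrapped in small helper functions for the agent console. This is a sketch under assumptions: the function names and the DEBUG_FLAG variable are hypothetical, and only the flag file path and the touch and rm -f commands come from this section.

```shell
# Flag file named in this section. The variable can be overridden
# for testing; DEBUG_FLAG itself is a hypothetical name.
DEBUG_FLAG="${DEBUG_FLAG:-/var/run/hvragent/debug_logs}"

# Enable detailed logging by creating the flag file.
enable_debug_logs() { touch "$DEBUG_FLAG"; }

# Disable detailed logging by removing the flag file.
disable_debug_logs() { rm -f "$DEBUG_FLAG"; }

# Succeeds if the flag file currently exists.
debug_logs_enabled() { [ -e "$DEBUG_FLAG" ]; }
```

A typical session would call enable_debug_logs, reproduce the problem, collect the logs, and then call disable_debug_logs.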

Machines do not appear in the interface after agent start. Check that the agent is running and open its console to verify that there are no error messages. Check connection to AWS or KVM platform. If the problem is not solved, contact the support team.

Machines do not go to the status Replicated. Check that the agent is running on the host with the machine. Open its console and verify that there are no error messages. Check connection to AWS or KVM platform. Wait for several minutes and click Start replication. If the problem is not solved, contact the support team.

Machines do not change the status between Start/Stop replication. Check that the agent is running on the host with the machine. Open its console and verify that there are no error messages. Check connection to AWS or KVM platform. Wait for several minutes and click Start/Stop replication. If the problem is not solved, contact the support team.

Machine goes to the status Error. Check that the agent is running on the host with the machine. Open its console and verify that there are no error messages. Check connection to AWS or KVM platform. Stop replication and after several minutes click Start replication. If the problem is not solved, contact the support team.

Troubleshooting migration

Cloud site is not created. Wait for several minutes and try again. If the problem repeats, contact the support team.

Machines in cloud site are in the status Active, but they are not available. Remove the machines from cloud site and add them again via the menu “Add machine”. If the problem repeats, contact the support team: a specialist should analyze driver problems when the operating system boots.

Machines in cloud site are not in the Active status. If the Migration plan has orchestration rules and dependencies between machines, check that all machines of lower rank (those that come earlier in the start order) are running. Wait for 2-3 minutes for the status to update. If the problem is not solved, remove the machines from cloud site and add them again. If the problem repeats, contact the support team: a specialist should analyze driver problems when the operating system boots.

During failback to production with the option “Upload the latest state of machines in cloud site” selected, part of a machine does not work or some data is not displayed. The option “Upload the latest state of machines in cloud site” is unsafe and does not guarantee data consistency when testing failback to production. With this option, the latest state of the machine in cloud site is uploaded to production. Data consistency is guaranteed only when a machine is turned off in cloud site before its changes are uploaded to production. To avoid the problem, try starting the synchronization process again, or disable the option “Upload the latest state of machines in cloud site”: log in to the failback agent in production as the user issued when the agent was downloaded and run the sync --safe command. In this case, upload data to production from the last copy before the crash, configure production with it, and then synchronize with cloud site, with all machines there stopped beforehand, by running the sync --final command. After that, the problem should disappear and production will contain consistent data.

At the last step of failback to production, when machines in cloud site are stopped and the latest changes are loaded to production, some machines do not work and are unavailable. Run the sync --final command again. If the problem repeats, contact the support team. While the problem is being investigated, it is recommended to start the machines in cloud site again and continue working with them.

Recovered Linux system is not available:
  • if LVM is not used, disk names may change. In this case, check the /etc/fstab file. If it contains names such as /dev/[sd|hd]*, change them to disk UUIDs.
  • the network adapter name has changed. In this case, use udev. In Debian/Ubuntu, use the udev rules file /etc/udev/rules.d/70-persistent-net.rules. Use the MAC address to identify the network adapter.

Recovered RHEL / CentOS is not available. RHEL / CentOS may fail to start in cloud site for several reasons. RHEL uses the dracut system. By default, the dracut initramfs does not include the qemu driver, so:
  • for RHEL / CentOS 7 run the command dracut -fMa qemu
  • for RHEL / CentOS 6 run the command dracut -f --add-drivers qemu
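
For the /etc/fstab check described for recovered Linux systems above, the following sketch lists entries that still mount by device name and prints the UUID form for a given device. The function names are hypothetical. The blkid utility is assumed to be installed (it is part of util-linux) and usually needs root privileges.

```shell
# Print fstab lines (with line numbers) that mount by device name,
# that is, entries starting with /dev/sd* or /dev/hd*.
# $1 - path to an fstab file (defaults to /etc/fstab)
find_device_name_mounts() {
    grep -nE '^[[:space:]]*/dev/(sd|hd)' "${1:-/etc/fstab}"
}

# Print the UUID=... specifier for a block device, to be used in
# /etc/fstab in place of the device name.
# $1 - block device, for example /dev/sda1
uuid_spec_for() {
    printf 'UUID=%s\n' "$(blkid -s UUID -o value "$1")"
}
```

Run find_device_name_mounts first; for each device it reports, uuid_spec_for prints the value to put in the first column of the corresponding /etc/fstab line.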

Logging

All logs are stored in Elasticsearch and are available at https://acura-logs:8080

Cloud provider administrators or customer administrators have access to product logs in case of on-premise installation. Also, the solution comes with a set of auxiliary utilities.

To separate global records from records that are specific to a particular user, use filters.

Customer logs

To filter the logs of a specific customer, write the filter “customer: <customer_id>”. To get the customer ID, run the acura_debug utility with the following parameters:

acura_debug <admin_user> <admin_password> list_customers

As a result, a table of customers is printed, including each customer's ID.

Customer audit log

To get the customer audit log, write the filter "customer:<customer_id>&type:action". As a result, records are displayed that describe the customer, user, token and action performed on behalf of the client. All API calls to the system are logged this way.
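
The customer and audit filters above differ only in the optional type part, so they can be built by one small helper. This is a sketch: the function name customer_filter is hypothetical, the customer ID in the example is a placeholder, and the filter is written without a space after the colon, as in the audit-log filter. Adjust it if your installation expects a different form.

```shell
# Build a log filter string for one customer.
# $1 - customer ID (as reported by acura_debug list_customers)
# $2 - optional record type, for example "action" for the audit log
customer_filter() {
    if [ -n "${2:-}" ]; then
        printf 'customer:%s&type:%s\n' "$1" "$2"
    else
        printf 'customer:%s\n' "$1"
    fi
}
```

For example, customer_filter 42 action prints customer:42&type:action, which can be pasted into the filter field.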

System logs

System logs are available on the main page of Elasticsearch and do not require additional information.