Site Protection & Disaster Recovery for VMware Cloud Foundation Validated Solution

Along with the release of VMware Cloud Foundation 4.3.1, we are excited to announce the general availability of the Site Protection & Disaster Recovery for VMware Cloud Foundation Validated Solution. The solution documentation, introduction, and other associated collateral can be found on the Cloud Platform Tech Zone here.

The move from VMware Validated Designs to VMware Validated Solutions has been covered in detail by my teammate Gary Blake here, so I won't repeat that here. Instead, I will concentrate on the work Ken Gould and I (along with a supporting team) have been delivering for the past few months.

The Site Protection & Disaster Recovery for VMware Cloud Foundation Validated Solution delivers an end-to-end validated way to protect your mission-critical applications. You get a set of documentation tailored to the solution that includes:

  • Design objectives
  • A detailed design, including not just the design decisions but the justifications and implications of those decisions
  • Detailed implementation steps, with PowerShell alternatives for some steps to speed up time to deploy
  • Operational guidance on how to use the solution once it is deployed
  • Solution interoperability between it and other Validated Solutions
  • An appendix containing all of the solution design decisions in one easy place for review
  • A set of frequently asked questions that will be updated with each release

Disaster recovery is a huge topic for everyone lately. Everything from power outages to natural disasters to ransomware and beyond can be classed as a disaster, and regardless of the type, you must be prepared. To adequately plan for business continuity in the event of a disaster, you must protect your mission-critical applications so that they can be recovered. In a VMware Cloud Foundation environment, cloud operations and automation services are delivered by vRealize Suite Lifecycle Manager, vRealize Operations Manager & vRealize Automation, with authentication services delivered by Workspace ONE Access.

To provide DR for our mission-critical apps, we leverage two VCF instances with NSX-T Federation between them. The primary VCF instance runs the active NSX-T Global Manager and the recovery VCF instance runs the standby NSX-T Global Manager. All load balancing services are served from the protected instance, with a standby load balancer (disconnected from the recovery site NSX Tier-1 until required, to avoid IP conflicts) in the recovery instance. Using our included PowerShell cmdlets you can quickly create and configure the standby load balancer to mimic your active load balancer, saving you a ton of manual UI clicks.
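To make the "detached standby" idea concrete, here is a minimal sketch of what sits underneath: in the NSX-T Policy API a load balancer is an LBService object, and leaving its connectivity_path unset is what keeps it disconnected from the recovery site Tier-1. This is not the solution's own cmdlet code (which also recreates virtual servers, pools, and monitors for you); the FQDN, credentials, and object names below are placeholders, and it assumes PowerShell 7 for -SkipCertificateCheck.

# Hedged sketch only: create a small, detached LBService on the recovery site NSX Local Manager.
# The solution's cmdlets automate this (plus virtual servers, pools, and monitors); names here are placeholders.
$lm   = 'recovery-nsx-lm.rainpole.io'            # recovery site NSX Local Manager (placeholder)
$cred = Get-Credential                           # NSX admin credentials
$auth = 'Basic ' + [Convert]::ToBase64String(
            [Text.Encoding]::UTF8.GetBytes("$($cred.UserName):$($cred.GetNetworkCredential().Password)"))

$body = @{
    display_name = 'standby-lb01'                # mirror of the active load balancer (placeholder name)
    size         = 'SMALL'
    enabled      = $true
    # connectivity_path deliberately omitted: the standby LB stays detached from the
    # recovery site Tier-1 until failover, avoiding IP conflicts with the active LB.
} | ConvertTo-Json

Invoke-RestMethod -Method Patch -SkipCertificateCheck `
    -Uri "https://$lm/policy/api/v1/infra/lb-services/standby-lb01" `
    -Headers @{ Authorization = $auth } -ContentType 'application/json' -Body $body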

In the (hopefully never) event that you need to fail over the cloud management applications, you can easily bring the standby load balancer online to enable networking services for the failed-over applications.

Using Site Recovery Manager (SRM) you can run planned migrations or disaster recovery migrations. With a single set of SRM recovery plans, regardless of the scenario, you will be guided through the recovery process. In this post I will cover what happens in the event of a disaster.

When a disaster occurs on the protected site (once the panic subsides), there is a series of tasks you need to perform to bring those mission-critical apps back online.

First? Fix the network! Log into the standby NSX Global Manager (GM) on the recovery site and promote it to Active. (Note: this can take about 10-15 minutes.)

To cover the case of an accidental “Force Active” click, we’ve built in an “Are you absolutely sure this is what you want to do?” prompt!

Once the promotion operation completes, our standby NSX GM is active and can be used to manage the surviving site's NSX Local Manager (LM).

Once the recovery site GM is active, we need to ensure that the cross-instance NSX Tier-1 is now directing egress traffic via the recovery site. To do this, we must update the locations on the Tier-1: navigate to GM > Tier-1 Gateways > Cross-Instance Tier-1 and, under Locations, make the recovery location Primary.
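For the API-inclined, the same change can be sketched against the now-active Global Manager's Policy API. This is a hedged sketch only: the FQDN and object IDs are placeholders, and the exact site path and intersite_config field names should be verified against the NSX-T API guide for your version.

# Hedged sketch: make the recovery location the primary site on the cross-instance Tier-1
# via the Global Manager Policy API. Verify the site path (see GET .../global-infra/sites)
# and field names against your NSX-T version before relying on this.
$gm   = 'recovery-nsx-gm.rainpole.io'            # now-active Global Manager (placeholder)
$cred = Get-Credential
$auth = 'Basic ' + [Convert]::ToBase64String(
            [Text.Encoding]::UTF8.GetBytes("$($cred.UserName):$($cred.GetNetworkCredential().Password)"))

$body = @{
    intersite_config = @{
        primary_site_path = '/global-infra/sites/recovery-site'   # placeholder site ID
    }
} | ConvertTo-Json -Depth 5

Invoke-RestMethod -Method Patch -SkipCertificateCheck `
    -Uri "https://$gm/global-manager/api/v1/global-infra/tier-1s/xreg-tier1" `
    -Headers @{ Authorization = $auth } -ContentType 'application/json' -Body $body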

The next step is to make sure we have an active load balancer running in the recovery site so that our protected applications come up correctly. To do this, log into what is now our active GM, select the recovery site NSX Local Manager (LM), and navigate to Networking > Load Balancing. Edit the load balancer and attach it to the recovery site standalone Tier-1.
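If you prefer the API to UI clicks, this attach step is simply the inverse of the detached state sketched earlier: set the LBService's connectivity_path to the recovery site standalone Tier-1. Again a hedged sketch with placeholder names, assuming PowerShell 7, run against the recovery site Local Manager.

# Hedged sketch: attach the standby load balancer to the recovery site standalone Tier-1
# by setting connectivity_path on the LBService (placeholder IDs; verify the paths in your LM).
$lm   = 'recovery-nsx-lm.rainpole.io'            # recovery site NSX Local Manager (placeholder)
$cred = Get-Credential
$auth = 'Basic ' + [Convert]::ToBase64String(
            [Text.Encoding]::UTF8.GetBytes("$($cred.UserName):$($cred.GetNetworkCredential().Password)"))

$body = @{
    connectivity_path = '/infra/tier-1s/recovery-standalone-tier1'   # placeholder Tier-1 ID
} | ConvertTo-Json

Invoke-RestMethod -Method Patch -SkipCertificateCheck `
    -Uri "https://$lm/policy/api/v1/infra/lb-services/standby-lb01" `
    -Headers @{ Authorization = $auth } -ContentType 'application/json' -Body $body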

At this point we are ready to run our SRM recovery plans. The recommended order for running the recovery plans (assuming you have all of the protected components listed below) is as follows. This ensures lifecycle and authentication services (vRSLCM & WSA) are up before the applications that depend on them (vROps & vRA); a scripted sketch of working through the plans in this order follows the list.

  • vRSLCM – WSA – RP
  • Intelligent Operations Management RP
  • Private Cloud Automation RP
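The recovery plans themselves are run from the SRM UI, but if you like to script things, here is a hedged PowerCLI sketch of exercising the plans in the recommended order. The vCenter FQDN is a placeholder, and the recovery mode and state values should be double-checked against the SRM API guide for your release; the sketch uses Test mode, whereas a real failover uses a different recovery mode.

# Hedged sketch (VMware.VimAutomation.Srm module): test the recovery plans in dependency order.
# Placeholder vCenter FQDN; verify recovery mode and state values against the SRM API docs.
Import-Module VMware.VimAutomation.Srm

Connect-VIServer -Server 'recovery-vcenter.rainpole.io'   # recovery site vCenter (placeholder)
Connect-SrmServer                                         # SRM instance paired with that vCenter

$srmApi  = $global:DefaultSrmServers[0].ExtensionData
$ordered = 'vRSLCM – WSA – RP',
           'Intelligent Operations Management RP',
           'Private Cloud Automation RP'

foreach ($name in $ordered) {
    $plan = $srmApi.Recovery.ListPlans() | Where-Object { $_.GetInfo().Name -eq $name }

    # Test mode exercises the plan without impacting production; a real failover uses a
    # different SrmRecoveryPlanRecoveryMode value (see the SRM API documentation).
    $plan.Start([VMware.VimAutomation.Srm.Views.SrmRecoveryPlanRecoveryMode]::Test)

    # Wait for this plan to finish before starting the next, so vRSLCM/WSA are up
    # before the applications that depend on them.
    do { Start-Sleep -Seconds 30 } while ($plan.GetInfo().State -eq 'Running')
}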

I'm not going to go through each recovery plan in detail here; they are documented in the Site Protection and Disaster Recovery Validated Solution. In some of them you will be prompted to verify things along the way to ensure a successful failover.

The main thing in a DR situation is: DO NOT PANIC. And what is the best way to get to a place where you DO NOT PANIC? Test your DR plans.

Trust the plan… test the plan… relax… you have a plan!

Hopefully this post was useful. If you want to learn more, please reach out in the comments, and if you're attending VMworld and would like to ask some questions, please drop into our Meet the Experts session on Thursday.

Take a look at Ken’s post on the Planning & Preparation Workbook for this validated solution for more details.

vRealize Suite Lifecycle Manager Logs: The Easy Way

vRealize Suite Lifecycle Manager (vRSLCM) is a one-stop shop for lifecycle management (LCM) of your VMware vRealize Suite (vRA, vRB, vROps, vRLI). VMware Validated Designs leverage it via Cloud Builder for the initial SDDC deployment, but it also covers upgrades, reducing the need to jump between interfaces by bringing all LCM tasks into a single UI. This doesn't come without its challenges, however, as vRSLCM is now responsible for aggregating all the install/upgrade logs and presenting them in a coherent manner to the user… which isn't always the case. vRSLCM logs activity in /var/log/vlcm/vrlcm-server.log, but at best you get something like this:

GET http://localhost:8080/suite/status/1c4a2929-e09c-4a22-b9f1-2834ec1bd65c: 200 null

Which, let's face it, isn't very helpful… or is it? At first glance it's just a job ID, but thanks to @leahy_s in the VMware CMBU, I can now make this job ID give me more information in a much more structured way, similar to tail -f. Here's how.
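A hedged sketch of the approach: take the job ID from vrlcm-server.log, query the same /suite/status endpoint visible in the GET line above, and pretty-print the JSON. The endpoint listens on localhost:8080 on the appliance, so run this wherever you can reach it (a curl piped through a JSON formatter in an SSH session on the appliance achieves the same thing).

# Hedged sketch: query the /suite/status endpoint seen in vrlcm-server.log and pretty-print
# the JSON. Re-run it (or wrap it in a loop) to follow a job's progress, similar to tail -f.
$jobId = '1c4a2929-e09c-4a22-b9f1-2834ec1bd65c'   # the job ID from the log line above

Invoke-RestMethod -Uri "http://localhost:8080/suite/status/$jobId" |
    ConvertTo-Json -Depth 20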

And now you should have some readable JSON, hopefully with some more info on the error you are hitting.