Site Protection & Disaster Recovery for VMware Cloud Foundation Validated Solution

Along with the release of VMware Cloud Foundation 4.3.1, we are excited to announce the general availability of the Site Protection & Disaster Recovery for VMware Cloud Foundation Validated Solution. The solution documentation, intro and other associated collateral can be found on the Cloud Platform Tech Zone here.

The move from VMware Validated Designs to VMware Validated Solutions has been covered by my team mate Gary Blake in detail here so I wont go into that detail here. Instead I will concentrate on the work Ken Gould and I (along with a supporting team) have been working to deliver for the past few months.

The Site Protection & Disaster Recovery for VMware Cloud Foundation Validated Solution includes the following to deliver an end-to-end validated way to protect your mission critical applications. You get a set of documentation that is tailored to the solution that includes: design objectives, a detailed design including not just design decisions, but the justifications & implications of those decisions, detailed implementation steps with PowerShell alternatives for some steps to speed up time to deploy, operational guidance on how to use the solution once its deployed, solution interoperability between it and other Validated Solutions, an appendix containing all the solution design decisions in one easy place for review, and finally, a set of frequently asked questions that will be updated for each release.

Disaster recovery is a huge topic for everyone lately. Everything from power outages to natural disasters to ransomware and beyond can be classed as a disaster, and regardless of the type of disaster you must be prepared. To adequately plan for business continuity in the event of a disaster you must protect your mission critical applications so that they may be recovered. In a VMware Cloud Foundation environment, cloud operations and automation services are delivered by vRealize Lifecycle Manager, vRealize Operations Manager & vRealize Automation, with authentication services delivered by Workspace ONE Access.

To provide DR for our mission critical apps we leverage 2 VCF instances with NSX-T federation between them. The primary VCF instance runs the active NSX-T global manager and the recovery VCF instance runs the standby NSX-T global manager. All load balancing services are served from the protected instance, with a standby load balancer (disconnected from the recovery site NSX Tier-1 until required, to avoid IP conflicts) in the recovery instance. Using our included PowerShell cmdlets you can quickly create and configure the standby load balancer to mimic your active load balancer, saving you a ton of manual UI clicks.

In the (hopefully never) event of the need to failover the cloud management applications, you can easily bring the standby load balancer online to enable networking services for the failed over applications.

Using Site recovery Manager (SRM) you can run planned migrations or disaster recovery migrations. With a single set of SRM recovery plans, regardless of the scenario, you will be guided through the recovery process. In this post I will cover what happens in the event of a disaster.

When a disaster occurs on the protected site (once the panic subsides) there are a series of tasks you need to perform to bring those mission critical apps back online.

First? Fix the network! Log into the passive NSX Global Manager (GM) on the recovery site and promote the GM to Active. (Note: This can take about 10-15 mins)

To cover the case of an accidental “Force Active” click..we’ve built in the “Are you absolutely sure this is what you want to do?” prompt!

Once the promotion operation completes our standby NSX GM is now active, and can be used to manage the surviving site NSX Local Manager (LM)

Once the recovery site GM is active we need to ensure that the cross-instance NSX Tier-1 is now directing the egress traffic via the recovery site. To do this we must update the locations on the Tier-1. Navigate to GM> Tier-1 gateways > Cross Instance Tier-1. Under Locations, make the recovery location Primary.

The next step is to ensure we have an active load balancer running in the recovery site to ensure our protected applications come up correctly. To do this log into what is now our active GM, select the recovery site NSX Local Manager (LM), and navigate to Networking > Load Balancing. Edit the load balancer and attach it to the recovery site standalone Tier-1.

At this point we are ready to run our SRM recovery plans. The recommended order for running the recovery plans (assuming you have all of the protected components listed below) is as follows. This ensures lifecycle & authentication services (vRSLCM & WSA) are up before the applications that depend on them (vROPS & vRA)

  • vRSLCM – WSA – RP
  • Intelligent Operations Management RP
  • Private Cloud Automation RP

I’m not going to go through each recovery plan in detail here. They are documented in the Site Protection and Disaster Recovery Validated Solution. In some you will be prompted to verify this or that along the way to ensure successful failover.

The main thing in a DR situation is, DO NOT PANIC. And what is the best way to getting to a place where you DO NOT PANIC? Test your DR plans…so when you see this…

Your reaction is this…

BookReview: Rest — Fresh Perception

Trust the plan…test the plan…relax…you have a plan!

Hopefully this post was useful..if you want to learn more please reach out in the comments…if you’re attending VMworld and would like to learn more or ask some questions, please drop into our Meet The Experts session on Thursday.

Take a look at Ken’s post on the Planning & Preparation Workbook for this validated solution for more details.

PowerShell Script to Configure an NSX-T Load Balancer for the vRealize Suite & Workspace ONE Access

As part of my role in the VMware Hyper-converged Business Unit (HCIBU) I spend a lot of time working with new product versions testing integrations for next-gen VMware Validated Designs and Cloud Foundation. A lot of my focus is on Cloud Operations and Automation (vROPs, vRLI, vRA etc) and consequently I regularly need to deploy environments to perform integration testing. I will typically leverage existing automation where possible and tend to create my own when i find gaps. Once such gap was the ability to use PowerShell to interact with the NSX-T API. For anyone who is familiar with setting up a load balancer for the vRealize Suite in NSX-T – there are a lot of manual clicks required. So i set about creating some PowerShell functions to make it a little less tedious and to speed up getting my environments setup so i could get to the testing faster.

There is comprehensive NSX-T API documentation posted on code.vmware .com that I used to decipher the various API endpoints required to complete the various tasks:

  • Create the Load Balancer
  • Create the Service Monitors
  • Create the Application Profiles
  • Create the Server Pools
  • Create the Virtual Servers

The result is a PowerShell module with a function for each of the above and a corresponding JSON file that is read in for the settings for each function. I have included a sample JSON file to get you started. Just substitute your values.

Note: You must have a Tier-1 & associated segments created. (I’ll add that functionality when i get a chance!)

PowerShell Module, Sample JSON & Script are posted to Github here

Automate your VMware Validated Design NSX-V Distributed Firewall Configuration

A few weeks back I mentioned on twitter that i was working on automating the VMware Validated Design NSX-V Distributed Firewall Configuration in my lab. (I admit it took longer than i had planned!) Currently this is a manual post deployment step once VMware Cloud Builder has completed the deployment. This will likely be picked up by Cloud Builder in a future release but for now its a manual, and somewhat tedious, but required, step!

Full details on the manual steps required for this configuration can be found here. Please take the time to understand what these rules are doing before implementing them.

So in an effort to make this post configuration step a little less painful i set out to automate it. I’ve played with the NSX-V API in the past and found it much easier to interact with by using PowerNSX, rather than leveraging PostMan and the API directly. PowerNSX is the unofficial, official automation tool for NSX. Hats off to VMware engineers Nick Bradford, Dale Coghlan & Anthony Burke for creating and documenting this tool. Anthony also published a FREE book on Automating NSX for vSphere with PowerNSX. More on that here.

Disclaimer: This script is not officially supported by VMware. Use at your own risk & test in a development/lab environment before using in production.

I’ve posted the script to GitHub here as its a bit lengthy! There may be a more efficient way to do some parts of it and if anyone wants to contribute please feel free!

As with a lot of the scripts i create it is menu based and has 2 main options:

  1. Create DFW exclusions, IP Sets & Security Groups
  2. Create DFW Rules

The reason i split it into 2 distinct operations is to allow you to inspect the exclusion list, IP Sets & Security Groups before creating the firewall rules. This will ensure that you dont lock yourself out of vCenter by creating an incorrect rule.

Required Software

  • PowerCli
    • The script will check for PowerCli and if not found will attempt to install the latest version from the PowerShell Gallery
    • Currently tested on Windows only
    • If you dont have internet access you can manually install PowerCli by opening a PowerShell console as administrator and running:
    • Find-Module -Name VMware.PowerCLI | Install-Module
  • PowerNSX
    • The script will check for PowerNSX and if not found will attempt to install the latest version from the PowerShell Gallery
    • Currently tested on Windows only
    • If you dont have internet access you can manually install PowerNSX by opening a PowerShell console as administrator and running:
    • Find-Module -Name PowerNSX | Install-Module

Required Variables

Before you can run the script you need to edit the User Variables to provide the following:

  • Target vCenter details
    • Required to establish a PowerCli Connection with vCenter Server. This is used when updating the DFW exclusion list
  • Target NSX Manager details
    • Required to establish a connection with NSX manager to configure the DFW
  • IP Addresses for the various SDDC components

Hopefully you will find this useful!

What not to do when your Platform Services Controllers are Load Balanced!

I needed to do some validation around vRealize Operations Manager & vRealize Orchestrator for an upcoming VVD release and a physical lab environment was made available. The environment is a dual region VVD deployment. Upon verifying that I had access to all the components I needed it became obvious there was an issue with SSO in the primary region (SFO). Browsing to the web client for the SFO management vCenter I was seeing this:

As i mentioned this is a VVD deployment and per VVD guidelines there are 2 Platform Services Controllers (PSCs) behind an NSX load balancer per region. Like so: (Diagram from the VMware Validated Design 5.0 Architecture & Design guide)

Like any good (lazy!) IT person the first thing i did was google the error to find the quick fix! That led me to this communities post which had some suggestions around disk space etc. None of which were relevant to my issue. Running the following on the PSCs and vCenters showed that some services were not starting

service-control –status

Restarting the services didn’t help. Next up i checked the usual suspects:

  • NTP
  • DNS
  • SSL Certificates

All of the above looked ok. Next I turned my attention to the load balancer. Because the vCenter Web Client was inaccessible I was not able to access the load balancer settings through the UI so I turned to the NSX API using Postman

To connect to the NSX manager that is associated with the load balancer you need to configure a Postman session with basic authentication and enter the NSX manager admin user & password.

To retrieve information on the load balancer you need to run the following GET:

https://sfo01m01nsx01.sfo01.rainpole.local/api/4.0/edges/edge-1/loadbalancer/config

I wont post the full response from the above command as it’s lengthy but scanning through it I noticed that the condition of each load balancer pool member was disabled. In the immortal words of Bart Simpson:




The response above is from a more targeted API call to /pools/pool-1.

Now I dont know how it got into this state – maybe someone was doing some jenga style doomsday testing, pulling one brick at a time until the tower crashes! – but this certainly looked to be the cause of the issue. So I figured the quickest fix would be to do a PUT API call to NSX with condition enabled for the pool members and I’d be all set. Not so easy!

Running the following PUT appears to work temporarily (running a GET at the same time confirms this)

But the change does not get fully applied and reverts the conditions to disabled after about 30 seconds with the below error:

So to apply the change to the load balancer NSX requires a handoff with the PSC that is is mapped to…in this case its the load balanced PSC that is not functional. So the command fails.

So it was clear I needed to get at least 1 PSC operational before i could attempt to make a change. Time to play with some DNS redirects to “fool” the PSC services into starting.

As my PSCs are setup in HA mode behind a load balancer the SSO endpoint URL is https://sfo01psc01.sfo01.rainpole.local which both PSCs will respond from. So to get my first PSC up I changed the IP for sfo01psc01.sfo01.rainpole.local in DNS to point to the first PSC’s IP.

So now, pings to the load balancer VIP FQDN sfo01psc01.sfo01.rainpole.local respond from the first PSC IP

Next I set a static entry in /etc/hosts on each of my PSCs, and vCenters to do the same as i’ve seen vCenter especially cache DNS entries in it’s local dnsmasq.

Next step was to stop & start all services on each PSC

service-control –stop –all

service-control –start –all

And hey presto the services started! Ran the same on vCenter and the services also started. This allowed me to go in and modify the load balancer pools to set the members to enabled.

Once the load balancer was back as it should be it was just a case of removing the /etc/hosts entries on each VM and reverting the DNS server change to point the load balancer FQDN back to its correct IP address.

For completeness I restarted all the services on each appliances in the above mentioned order

Moral of the story? Dont disable both nodes in a load balancer pool!

Now onwards with the original testing i needed to do!

NSX IPSec VPN between datacenters (multi site/region)

I’m doing some lab work with my team at the moment and we were gifted some hardware to do some multi region validation. Both systems (a VxRack SDDC & a VxRail) are in 2 separate datacenters, and both are using private IP addressing that is not routable between datacenters. As part of the validation we need both systems to be able to communicate with each other, however we dont control the inter lab switching to put in place the necessary routes to enable this. Rather than go through a change control process with the keepers of that gate we decided to get creative and have some fun (and hopefully learn something!) by setting up an NSX IPSec VPN between the labs.

Disclaimer: There are many better ways to do this for a permanent lab setup (i.e. BGP to the core with routes) but this was done on borrowed kit that was never initially designed with inter lab routing as a requirement, with no direct control on the inter lab switches, and we would also like to put it back the way we found it so dont want to make sweeping architectural changes!

Continue reading “NSX IPSec VPN between datacenters (multi site/region)”

vRA Network and Security Inventory Data Collection Failed

I’ve been playing around with Dell EMC RP4VM & vRA and needed to setup cross vCenter NSX in my lab. I’m not going to go into that setup as there are many blogs on the subject. What i will cover is an error i hit when trying to do Network and Security Inventory data collections on one of my NSX endpoints. The error from the Dem logs in vRA (Infrastructure > Monitoring > Log ) was as follows:


Workflow 'vSphereVCNSInventory' failed with the following exception:
'object' does not contain a definition for 'clusters'

After digging around for VMware KBs and blogs on the subject and coming up empty handed i went back to review my entire setup and discovered i had missed adding a vCenter cluster to the universal transport zone on the offending NSX endpoint, which is my DR site.

Once i rectified this the Network and Security Inventory Data Collection worked as expected.

Backup NSX Manager

Was playing around with NSX today and found that you can enable backup of the NSX manager from within the administration page. While using a backup solution like Avamar or VMware vSphere Data Protection (VDP) is preferred for backing up your VMs this is a quick and easy way to backup your NSX manager config that enables you to quickly restore in the case of losing the NSX manager (or screwing it up by changing something!!)

Continue reading “Backup NSX Manager”