VCF Recovery Models

Protection & more importantly, recovery of VMware Cloud Foundation (VCF) is something I and Ken Gould have worked closely on for a number of years now. Whether it was a VVD based deployment or in more recent years, a VCF deployment, we have tested, validated & documented the process & procedures you need to follow to be successful, should you need to recover from a failure.

When we talk about VCF Recovery Models, there are 3 main pillars.

Backup & Restore

This is the traditional scenario, where one or more components have failed and you need to recover/restore them in-place. The table below shows what is required for each VCF 9.0 component in the event of a single component failure. https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vvs/9-X/site-protection-and-disaster-recovery-for-vmware-cloud-foundation.html

The VCF 9.0 documentation for component-level backup & restore covers the manual steps required to configure backup for each component, and then how to recover/restore each component, and is available here https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/fleet-management/backup-and-restore-of-cloud-foundation.html

VCF Instance Recovery

VCF Instance Recovery takes things a step further where an entire VCF Instance has failed and needs to be recovered. This could be due to a site outage (where a recovery site is not available), a catastrophic hardware failure, or a cyber attack. Many financial customers also have regulatory requirements like DORA (Digital Operational Resilience Act) where they must be able to demonstrate to the regulator that they can bring their critical infrastructure back in the case of a cyber attack. The table below shows what is required for each VCF 9.0 component in the event of an entire VCF Instance failure.

The VCF 9.0 documentation for VCF Instance Recovery leverages backups for the components that support it, and redeploy for those that don’t to bring the VCF instance back with the same identity as it had before it went down. We also provide a PowerShell module to automate many of the manual, potentially error-prone & onerous tasks to expedite the recovery time. The documentation is available here https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/fleet-management/-vmware-cloud-foundation-instance-recovery.html

VCF Disaster Recovery

VCF Disaster Recovery is used to protect a VCF Instance across physical sites/regions. It requires a second, recovery VCF Instance available in another site/region with IP mobility between the instances. How you provide the IP mobility is up to you. As VMware, we would love you to use NSX Federation, but you can just as easily use fabric enabled stretched L2. For VCF 9.0 we use a combination of VMware Live Recovery & redeploy/restore from backup to recover the management components from the protected to the recovery sites. The same process can be used to recover business workloads.

The table below shows how each VCF 9.0 component is recovered on the recovery site in the event of an entire VCF site failure.

The VCF 9.0 documentation for VCF Disaster Recovery was published as a validated solution. It leverages live recovery replication and recovery plans for the components that support it, and redeploy & restore from backup for those that don’t. The documentation is available here https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vvs/9-X/site-protection-and-disaster-recovery-for-vmware-cloud-foundation.html

In the next major VCF release we are planning to bring these VCF Recovery Model pillars together so they are easier to find and compare side by side, and also add additional ones around ransomware recovery and lateral security to round out the VCF Protection & Recovery story. Credit to my colleague Tom Harrington for the colourful tables!

VMware Cloud Foundation 5.1 is GA!

While VMware Explore EMEA is in full swing, VMware Cloud Foundation 5.1 went GA! As of yesterday, you can now review the 5.1 design guide to see the exciting additions to VMware Cloud Foundation 5.1 is GA! Some of the main highlights are listed below.

Support for vSAN ESA: vSAN ESA is an alternative, single-tier architecture designed ground-up for NVMe-based platforms to deliver higher performance with more predictable I/O latencies, higher space efficiency, per-object based data services, and native, high-performant snapshots.
Non-DHCP option for Tunnel Endpoint (TEP) IP assignment: SDDC Manager now provides the option to select Static or DHCP-based IP assignments to Host TEPs for stretched clusters and L3 aware clusters.
vSphere Distributed Services engine for Ready nodes: AMD-Pensando and NVIDIA BlueField-2 DPUs are now supported. Offloading the Virtual Distributed Switch (VDS) and NSX network and security functions to the hardware provides significant performance improvements for low latency and high bandwidth applications. NSX distributed firewall processing is also offloaded from the server CPUs to the network silicon.
Multi-pNIC/Multi-vSphere Distributed Switch UI enhancements: VCF users can configure complex networking configurations, including more vSphere Distributed Switch and NSX switch-related configurations, through the SDDC Manager UI.
Distributed Virtual Port Group Separation for management domain appliances: Enables the traffic isolation between management VMs (such as SDDC Manager, NSX Manager, and vCenter) and ESXi Management VMkernel interfaces
Support for vSphere Lifecycle Manager images in management domain:VCF users can deploy management domain using vSphere Lifecycle Manager (vLCM) images during new VCF instance deployment
Mixed-mode Support for Workload Domains: A VCF instance can exist in a mixed BOM state where the workload domains are on different VCF 5.x versions. Note: The management domain should be on the highest version in the instance.
Asynchronous update of the pre-check files: The upgrade pre-checks can be updated asynchronously with new pre-checks using a pre-check file provided by VMware.
Workload domain NSX integration: Support for multiple NSX enabled VDSs for Distributed Firewall use cases
Tier-0/1 optional for VCF Edge cluster: When creating an Edge cluster with the VCF API, the Tier-0 and Tier-1 gateways are now optional.
VCF Edge nodes support static or pooled IP: When creating or expanding an Edge cluster using VCF APIs, Edge node TEP configuration may come from an NSX IP pool or be specified statically as in earlier releases.
Support for mixed license deployment: A combination of keyed and keyless licenses can be used within the same VCF instance.
Integration with Workspace ONE Broker: Provides identity federation and SSO across vCenter, NSX, and SDDC Manager. VCF administrators can add Okta to Workspace ONE Broker as a Day-N operation using the SDDC Manger UI.
VMware vRealize rebranding: VMware recently renamed vRealize Suite of products to VMware Aria Suite. See the Aria Naming Updates blog post for more details..
VMware Validated Solutions: All VMware Validated Solutions are updated to support VMware Cloud Foundation 5.1. Visit VMware Validated Solutions for the updated guides.

Site Protection & Disaster Recovery for VMware Cloud Foundation Validated Solution

Along with the release of VMware Cloud Foundation 4.3.1, we are excited to announce the general availability of the Site Protection & Disaster Recovery for VMware Cloud Foundation Validated Solution. The solution documentation, intro and other associated collateral can be found on the Cloud Platform Tech Zone here.

The move from VMware Validated Designs to VMware Validated Solutions has been covered by my team mate Gary Blake in detail here so I wont go into that detail here. Instead I will concentrate on the work Ken Gould and I (along with a supporting team) have been working to deliver for the past few months.

The Site Protection & Disaster Recovery for VMware Cloud Foundation Validated Solution includes the following to deliver an end-to-end validated way to protect your mission critical applications. You get a set of documentation that is tailored to the solution that includes: design objectives, a detailed design including not just design decisions, but the justifications & implications of those decisions, detailed implementation steps with PowerShell alternatives for some steps to speed up time to deploy, operational guidance on how to use the solution once its deployed, solution interoperability between it and other Validated Solutions, an appendix containing all the solution design decisions in one easy place for review, and finally, a set of frequently asked questions that will be updated for each release.

Disaster recovery is a huge topic for everyone lately. Everything from power outages to natural disasters to ransomware and beyond can be classed as a disaster, and regardless of the type of disaster you must be prepared. To adequately plan for business continuity in the event of a disaster you must protect your mission critical applications so that they may be recovered. In a VMware Cloud Foundation environment, cloud operations and automation services are delivered by vRealize Lifecycle Manager, vRealize Operations Manager & vRealize Automation, with authentication services delivered by Workspace ONE Access.

To provide DR for our mission critical apps we leverage 2 VCF instances with NSX-T federation between them. The primary VCF instance runs the active NSX-T global manager and the recovery VCF instance runs the standby NSX-T global manager. All load balancing services are served from the protected instance, with a standby load balancer (disconnected from the recovery site NSX Tier-1 until required, to avoid IP conflicts) in the recovery instance. Using our included PowerShell cmdlets you can quickly create and configure the standby load balancer to mimic your active load balancer, saving you a ton of manual UI clicks.

In the (hopefully never) event of the need to failover the cloud management applications, you can easily bring the standby load balancer online to enable networking services for the failed over applications.

Using Site recovery Manager (SRM) you can run planned migrations or disaster recovery migrations. With a single set of SRM recovery plans, regardless of the scenario, you will be guided through the recovery process. In this post I will cover what happens in the event of a disaster.

When a disaster occurs on the protected site (once the panic subsides) there are a series of tasks you need to perform to bring those mission critical apps back online.

First? Fix the network! Log into the passive NSX Global Manager (GM) on the recovery site and promote the GM to Active. (Note: This can take about 10-15 mins)

To cover the case of an accidental “Force Active” click..we’ve built in the “Are you absolutely sure this is what you want to do?” prompt!

Once the promotion operation completes our standby NSX GM is now active, and can be used to manage the surviving site NSX Local Manager (LM)

Once the recovery site GM is active we need to ensure that the cross-instance NSX Tier-1 is now directing the egress traffic via the recovery site. To do this we must update the locations on the Tier-1. Navigate to GM> Tier-1 gateways > Cross Instance Tier-1. Under Locations, make the recovery location Primary.

The next step is to ensure we have an active load balancer running in the recovery site to ensure our protected applications come up correctly. To do this log into what is now our active GM, select the recovery site NSX Local Manager (LM), and navigate to Networking > Load Balancing. Edit the load balancer and attach it to the recovery site standalone Tier-1.

At this point we are ready to run our SRM recovery plans. The recommended order for running the recovery plans (assuming you have all of the protected components listed below) is as follows. This ensures lifecycle & authentication services (vRSLCM & WSA) are up before the applications that depend on them (vROPS & vRA)

vRSLCM – WSA – RP
Intelligent Operations Management RP
Private Cloud Automation RP

I’m not going to go through each recovery plan in detail here. They are documented in the Site Protection and Disaster Recovery Validated Solution. In some you will be prompted to verify this or that along the way to ensure successful failover.

The main thing in a DR situation is, DO NOT PANIC. And what is the best way to getting to a place where you DO NOT PANIC? Test your DR plans…so when you see this…

Your reaction is this…

Trust the plan…test the plan…relax…you have a plan!

Hopefully this post was useful..if you want to learn more please reach out in the comments…if you’re attending VMworld and would like to learn more or ask some questions, please drop into our Meet The Experts session on Thursday.

Take a look at Ken’s post on the Planning & Preparation Workbook for this validated solution for more details.