VCF Recovery Models

Protection and, more importantly, recovery of VMware Cloud Foundation (VCF) is something Ken Gould and I have worked closely on for a number of years now. Whether it was a VVD-based deployment or, in more recent years, a VCF deployment, we have tested, validated & documented the processes & procedures you need to follow to be successful, should you need to recover from a failure.

When we talk about VCF Recovery Models, there are three main pillars.

Backup & Restore

This is the traditional scenario, where one or more components have failed and you need to recover/restore them in place. The table below shows what is required for each VCF 9.0 component in the event of a single component failure.

The VCF 9.0 documentation for component-level backup & restore covers the manual steps required to configure backups for each component and how to recover/restore each one. It is available here: https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/fleet-management/backup-and-restore-of-cloud-foundation.html

VCF Instance Recovery

VCF Instance Recovery takes things a step further where an entire VCF Instance has failed and needs to be recovered. This could be due to a site outage (where a recovery site is not available), a catastrophic hardware failure, or a cyber attack. Many financial customers also have regulatory requirements like DORA (Digital Operational Resilience Act) where they must be able to demonstrate to the regulator that they can bring their critical infrastructure back in the case of a cyber attack. The table below shows what is required for each VCF 9.0 component in the event of an entire VCF Instance failure.

The VCF 9.0 documentation for VCF Instance Recovery leverages backups for the components that support them, and redeployment for those that don't, to bring the VCF instance back with the same identity it had before it went down. We also provide a PowerShell module that automates many of the manual, potentially error-prone & onerous tasks to expedite the recovery. The documentation is available here: https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/fleet-management/-vmware-cloud-foundation-instance-recovery.html

VCF Disaster Recovery

VCF Disaster Recovery is used to protect a VCF Instance across physical sites/regions. It requires a second, recovery VCF Instance in another site/region, with IP mobility between the instances. How you provide the IP mobility is up to you. As VMware, we would love you to use NSX Federation, but you can just as easily use fabric-enabled stretched L2. For VCF 9.0 we use a combination of VMware Live Recovery and redeploy/restore from backup to recover the management components from the protected site to the recovery site. The same process can be used to recover business workloads.

The table below shows how each VCF 9.0 component is recovered on the recovery site in the event of an entire VCF site failure.

The VCF 9.0 documentation for VCF Disaster Recovery was published as a validated solution. It leverages VMware Live Recovery replication and recovery plans for the components that support them, and redeploy & restore from backup for those that don't. The documentation is available here: https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vvs/9-X/site-protection-and-disaster-recovery-for-vmware-cloud-foundation.html

In the next major VCF release, we are planning to bring these VCF Recovery Model pillars together so they are easier to find and compare side by side, and to add additional pillars around ransomware recovery and lateral security to round out the VCF Protection & Recovery story. Credit to my colleague Tom Harrington for the colourful tables!

Upgrading VCF 5.2 to 9.0 – Part 8 – Deploy VCF Fleet Management Components

VCF 9.0 introduced the concept of VCF fleet, which is defined as:

An environment that is managed by a single set of fleet-level management components – VCF Operations & VCF Automation. A VCF fleet contains one or more VCF Instances and may contain one or more standalone vCenter instances, managed by the VCF Operations instance for the fleet. The management domain of the first VCF Instance in the VCF fleet typically hosts the fleet-level management components.

When deploying a new VCF fleet, you get the option to deploy the fleet-level management components using the VCF installer. Because I am upgrading from VCF 5.2, where I did not have Aria Operations or Aria Automation, I need to deploy new instances of each component (if I had pre-existing instances, they could be upgraded). You can deploy them manually from OVA; however, there is a new SDDC Manager API that automates the process using a JSON payload.

The API can be accessed via the SDDC Manager Developer Center, under VCF Management Components.

The JSON payload to deploy VCF Operations (including a collector & the fleet management appliance) and VCF Automation is as follows. Note: this spec is for a simple, single-node deployment of the fleet management components, where VCF Operations & VCF Automation are deployed to an NSX overlay segment and the VCF Operations collector is deployed to the management DVPG.

{
    "vcfOperationsFleetManagementSpec": {
        "hostname": "flt-fm01.rainpole.io",
        "rootUserPassword": "VMw@re1!VMw@re1!",
        "adminUserPassword": "VMw@re1!VMw@re1!",
        "useExistingDeployment": false
    },
    "vcfOperationsSpec": {
        "nodes": [
            {
                "hostname": "flt-ops01a.rainpole.io",
                "rootUserPassword": "VMw@re1!VMw@re1!",
                "type": "master"
            }
        ],
        "useExistingDeployment": false,
        "applianceSize": "medium",
        "adminUserPassword": "VMw@re1!VMw@re1!"
    },
    "vcfOperationsCollectorSpec": {
        "hostname": "sfo-opsc01.sfo.rainpole.io",
        "rootUserPassword": "VMw@re1!VMw@re1!",
        "applianceSize": "small"
    },
    "vcfAutomationSpec": {
        "hostname": "flt-auto01.rainpole.io",
        "adminUserPassword": "VMw@re1!VMw@re1!",
        "useExistingDeployment": false,
        "ipPool": [
            "192.168.11.51",
            "192.168.11.52"
        ],
        "internalClusterCidr": "250.0.0.0/15",
        "vmNamePrefix": "flt-auto01"
    },
    "vcfInstanceName": "San Francisco VCF01",
    "vcfMangementComponentsInfrastructureSpec": {
        "localRegionNetwork": {
            "networkName": "sfo-m01-cl01-vds01-pg-vm-mgmt",
            "subnetMask": "255.255.255.0",
            "gateway": "10.11.10.1"
        },
        "xRegionNetwork": {
            "networkName": "xint-m01-seg01",
            "subnetMask": "255.255.255.0",
            "gateway": "192.168.11.1"
        }
    }
}

Validate your JSON payload using the POST /v1/vcf-management-components/validations API.

Executing this will return a validation id. Copy this id so you can monitor the validation task.

Check the status of the validation task using GET /v1/vcf-management-components/validations/{validationId} until its resultStatus is SUCCEEDED.

Now, submit the same JSON payload to POST /v1/vcf-management-components, and go grab a coffee!
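The validate, poll, and deploy steps above can be sketched in Python. The SDDC Manager FQDN, credentials, spec filename, and polling interval below are placeholders, and the token call assumes the standard SDDC Manager POST /v1/tokens endpoint; treat this as a sketch rather than a finished tool.

```python
import json
import time
import urllib.request

SDDC_MANAGER = "https://sddc-manager.rainpole.io"  # placeholder FQDN


def call_api(method, path, token=None, payload=None):
    """Minimal helper for SDDC Manager REST calls, returning parsed JSON."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(SDDC_MANAGER + path, data=data,
                                 headers=headers, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def poll(fetch_status, interval=60):
    """Poll a status callable until it reaches a terminal state."""
    while True:
        status = fetch_status()
        if status == "SUCCEEDED":
            return status
        if status in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Task ended with status {status}")
        time.sleep(interval)


# Usage sketch (not executed here; field names follow the workflow above):
# token = call_api("POST", "/v1/tokens",
#                  payload={"username": "administrator@vsphere.local",
#                           "password": "placeholder"})["accessToken"]
# spec = json.load(open("fleet-components.json"))  # the payload shown above
# validation = call_api("POST", "/v1/vcf-management-components/validations",
#                       token, spec)
# poll(lambda: call_api("GET",
#      f"/v1/vcf-management-components/validations/{validation['id']}",
#      token)["resultStatus"])
# call_api("POST", "/v1/vcf-management-components", token, spec)
```

The `poll` helper is deliberately generic, so the same loop can watch the validation and the deployment task.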

Once the deployment completes, you should have a VCF Operations instance to manage your fleet, along with a VCF Automation instance for the consumption layer.

Upgrading VCF 5.2 to 9.0 – Part 7 – Upgrade vSphere Cluster

The next step in the upgrade sequence is to upgrade the vSphere cluster to 9.0.

Because the cluster is now managed by vLCM images, you need a vLCM image matching the target version you wish to upgrade to.

  • Log into the vSphere client and navigate to Menu > Lifecycle Manager, and click Create Image.
  • Give the image a name and select the correct target ESX version. Add any vendor/firmware/drivers you need and click Validate, and then Save.

To import the image to SDDC Manager, navigate to Lifecycle Management > Image Management and click Import Image.

  • Select the vCenter, select the image, and click Import.

Once the vLCM image is imported, navigate to Workload Domains > Management Workload Domain > Updates, and click Run Precheck and ensure all prechecks pass. 

Once the pre-check passes, click Configure Update.

On the Introduction pane, review the details and click Next.

On the Select Clusters with Images pane, select the clusters to be upgraded, and click Next.

On the Assign Images pane, select the cluster, and click Assign Image.

On the Assign Image pane, select the desired image and click Assign Image, and click Next when returned to the Assign Images pane.

On the Upgrade Options pane, select the options you want and click Next.

On the Review pane, review the chosen options and click Run Precheck.

The vSphere cluster upgrade pre-check begins.

Once the pre-check completes, click Schedule Update.

On the Review pane, review the settings and click Next.

On the Schedule Update pane, select your maintenance window, select I have reviewed the hardware compatibility and compliance check result and have verified the clusters images are safe to apply, and click Finish.

The vSphere cluster upgrade begins.

Once the upgrade completes, you can move on to the next steps.

Upgrading VCF 5.2 to 9.0 – Part 6 – Transition a vSphere Cluster from vSphere Lifecycle Manager Baselines to Images

The next step of the upgrade is to upgrade the vSphere clusters in the workload domain. VCF 9.0 no longer supports vSphere Lifecycle Manager Baselines (aka VUM) as a method of managing the lifecycle of your clusters, so if you have clusters that are managed using vSphere Lifecycle Manager Baselines, you must transition them to vSphere Lifecycle Manager Images. This can be done using the SDDC Manager API, following the documentation, or (more suitable at larger scale) using the PowerShell script in this KB: https://knowledge.broadcom.com/external/article?articleNumber=385617. I only have a single cluster, so I will use a mixture of manual and scripted steps.

The first step is to create a vLCM image that corresponds to the currently installed ESX version. In my case, I am running VCF 5.2.1, so the installed ESX version is ESXi 8.0 U3b (build 24280767).

  • Log into the vSphere client and navigate to Home > Lifecycle Manager > Image Library.
  • Enter a name and under ESX Versions, select the version corresponding to your running ESX version.
  • If you require vendor add-ons, add them here.
  • Click Validate, and click Save.
  • Next, log in to SDDC Manager and navigate to Lifecycle Management > Image Management and click Import Image. Select the vCenter where you created the image, select the image from the list, and click Import.

The image imports into the SDDC Manager inventory.
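If you want to confirm the import programmatically, a quick check can be sketched in Python. The /v1/personalities endpoint and the personalityName field are assumptions based on the SDDC Manager public API for vLCM image inventory, and the FQDN, token, and image name are placeholders, so verify them against your SDDC Manager version.

```python
import json
import urllib.request


def find_image(personalities, name):
    """Return the first vLCM image entry whose name matches, else None."""
    return next((p for p in personalities
                 if p.get("personalityName") == name), None)


# Usage sketch (not executed here; endpoint and field names are assumptions):
# req = urllib.request.Request(
#     "https://sddc-manager.rainpole.io/v1/personalities",
#     headers={"Authorization": f"Bearer {token}"})
# with urllib.request.urlopen(req) as resp:
#     inventory = json.load(resp)["elements"]
# if find_image(inventory, "ESXi-8.0U3b-transition") is None:
#     print("Image not found in SDDC Manager inventory")
```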

Now launch the PowerShell script from the KB:

.\VcfBaselineClusterTransition.ps1

Choose option 1 to Connect to SDDC Manager and select vCenter. Enter the SDDC Manager FQDN and credentials, and decide whether you want to save the credentials to a JSON file for future use.

Choose the vCenter you want to work against, or select all vCenter instances.

To Check existing cluster(s)’ vLCM image compliance, choose option 3. Enter a cluster id, and choose an image id to check against.

To Transition a vLCM baseline (VUM) cluster to vLCM image management, choose option 4. Enter a cluster id and confirm you have reviewed the image compliance findings. The transition process will begin.

The script will call the SDDC Manager APIs to transition the cluster from baselines to images.

Once the transition process completes, you can proceed with the next step of upgrading the vSphere cluster to vSphere 9.0.