What not to do when your Platform Services Controllers are Load Balanced!

I needed to do some validation around vRealize Operations Manager & vRealize Orchestrator for an upcoming VVD release and a physical lab environment was made available. The environment is a dual region VVD deployment. Upon verifying that I had access to all the components I needed it became obvious there was an issue with SSO in the primary region (SFO). Browsing to the web client for the SFO management vCenter I was seeing this:

As i mentioned this is a VVD deployment and per VVD guidelines there are 2 Platform Services Controllers (PSCs) behind an NSX load balancer per region. Like so: (Diagram from the VMware Validated Design 5.0 Architecture & Design guide)

Like any good (lazy!) IT person the first thing i did was google the error to find the quick fix! That led me to this communities post which had some suggestions around disk space etc. None of which were relevant to my issue. Running the following on the PSCs and vCenters showed that some services were not starting

service-control –status

Restarting the services didn’t help. Next up i checked the usual suspects:

  • NTP
  • DNS
  • SSL Certificates

All of the above looked ok. Next I turned my attention to the load balancer. Because the vCenter Web Client was inaccessible I was not able to access the load balancer settings through the UI so I turned to the NSX API using Postman

To connect to the NSX manager that is associated with the load balancer you need to configure a Postman session with basic authentication and enter the NSX manager admin user & password.

To retrieve information on the load balancer you need to run the following GET:

https://sfo01m01nsx01.sfo01.rainpole.local/api/4.0/edges/edge-1/loadbalancer/config

I wont post the full response from the above command as it’s lengthy but scanning through it I noticed that the condition of each load balancer pool member was disabled. In the immortal words of Bart Simpson:




The response above is from a more targeted API call to /pools/pool-1.

Now I dont know how it got into this state – maybe someone was doing some jenga style doomsday testing, pulling one brick at a time until the tower crashes! – but this certainly looked to be the cause of the issue. So I figured the quickest fix would be to do a PUT API call to NSX with condition enabled for the pool members and I’d be all set. Not so easy!

Running the following PUT appears to work temporarily (running a GET at the same time confirms this)

But the change does not get fully applied and reverts the conditions to disabled after about 30 seconds with the below error:

So to apply the change to the load balancer NSX requires a handoff with the PSC that is is mapped to…in this case its the load balanced PSC that is not functional. So the command fails.

So it was clear I needed to get at least 1 PSC operational before i could attempt to make a change. Time to play with some DNS redirects to “fool” the PSC services into starting.

As my PSCs are setup in HA mode behind a load balancer the SSO endpoint URL is https://sfo01psc01.sfo01.rainpole.local which both PSCs will respond from. So to get my first PSC up I changed the IP for sfo01psc01.sfo01.rainpole.local in DNS to point to the first PSC’s IP.

So now, pings to the load balancer VIP FQDN sfo01psc01.sfo01.rainpole.local respond from the first PSC IP

Next I set a static entry in /etc/hosts on each of my PSCs, and vCenters to do the same as i’ve seen vCenter especially cache DNS entries in it’s local dnsmasq.

Next step was to stop & start all services on each PSC

service-control –stop –all

service-control –start –all

And hey presto the services started! Ran the same on vCenter and the services also started. This allowed me to go in and modify the load balancer pools to set the members to enabled.

Once the load balancer was back as it should be it was just a case of removing the /etc/hosts entries on each VM and reverting the DNS server change to point the load balancer FQDN back to its correct IP address.

For completeness I restarted all the services on each appliances in the above mentioned order

Moral of the story? Dont disable both nodes in a load balancer pool!

Now onwards with the original testing i needed to do!

vRealize Suite Lifecycle Manager Logs: The Easy Way

vRealize Suite Lifecycle Manager (vRSLCM) is a one stop shop for lifecycle management  (LCM) of your VMware vRealize Suite (vRA, vRB, vROPs, vRLI) . VMware Validate Designs leverages this via Cloud Builder for initial SDDC deployment but it also covers upgrade from a single interface, reducing the need to jump between interfaces by bringing all LCM tasks into a single UI. This doesn’t come without its challenges however, as vRSLCM is now responsible for aggregating all the install/upgrade logs and presenting them in a coherent manner to the user…which isn’t always the case. vRSLCM logs activity in /var/log/vlcm/vrlcm-server.log but at best you get something like this

GET http://localhost:8080/suite/status/1c4a2929-e09c-4a22-b9f1-2834ec1bd65c: 200 null

Which let’s face it isnt very helpful…or is it? At first glance its just a job ID but thanks to @leahy_s in VMware CMBU I can now make this job ID give me more information in a much more structured way, similar to tail -f. Here’s how

And now you should have some readable JSON, hopefully with some more info on the error you are hitting

 

VMware Validated Design – Automated Deployment with Cloud Builder – Part 6: Deploy The SDDC

This is part 6 of a series of posts on VMware Cloud Builder. 

In this final post, now that we have passed all validation, we will run the SDDC deployment using VMware Cloud Builder.

Continue reading “VMware Validated Design – Automated Deployment with Cloud Builder – Part 6: Deploy The SDDC”

VMware Validated Design – Automated Deployment with Cloud Builder – Part 5: Cloud Builder Deployment & Environment Validation

This is part 5 of a series of posts on VMware Cloud Builder.

Hopefully you’re still with me!

In this post I will cover the deployment and initial configuration of the VMware Cloud Builder appliance, ingestion of the deployment parameters file, and environment validation.

Continue reading “VMware Validated Design – Automated Deployment with Cloud Builder – Part 5: Cloud Builder Deployment & Environment Validation”

VMware Validated Design – Automated Deployment with Cloud Builder – Part 4: Generating SSL Certificates

This is part 4 of a series of posts on VMware Cloud Builder.

In this post I will cover generating the required SSL certificates for deploying this VMware Validated Design with VMware Cloud Builder.

Friendly warning: This is a long post so maybe get a coffee before reading!

Continue reading “VMware Validated Design – Automated Deployment with Cloud Builder – Part 4: Generating SSL Certificates”

VMware Validated Design – Automated Deployment with Cloud Builder – Part 3: Deployment Parameters File

This is part 3 of a series of posts on VMware Cloud Builder.

In this post I will cover the deployment parameters file.

Continue reading “VMware Validated Design – Automated Deployment with Cloud Builder – Part 3: Deployment Parameters File”

VMware Validated Design – Automated Deployment with Cloud Builder – Part 2: Environment Prerequisites

This is part 2 of a series of posts on VMware Cloud Builder.

In this post I will cover the initial environment prerequisites required before you can deploy your VMware Validated Design SDDC with Cloud Builder. These fall into 5 key areas:

  1. Prerequisites for Virtual Infrastructure Layer Implementation in Region A
  2. Prerequisites for Operations Management Layer Implementation in Region A
  3. Prerequisites for Cloud Management Layer Implementation in Region A
  4. Prerequisites for Business Continuity Layer Implementation in Region A
  5. Generate Certificates for the SDDC Components in Region A

Continue reading “VMware Validated Design – Automated Deployment with Cloud Builder – Part 2: Environment Prerequisites”

VMware Validated Design – Automated Deployment with Cloud Builder – Part 1: Overview

This is the first in a series of posts on VMware Cloud Builder – The automated deployment engine for VMware Validated Design – which delivers consistent and repeatable Software-Defined Datacenter (SDDC) deployments across your regions. Hopefully you will find it useful!

Continue reading “VMware Validated Design – Automated Deployment with Cloud Builder – Part 1: Overview”

VMware Validated Designs 5.0 – What’s New?

VMware Validated Designs is a complete set of prescriptive blueprints on how to deploy a VMware based Software-Defined Datacenter (SDDC). It includes Planning & Preparation guidance, detailed Architecture & Design documentation, design decisions – including justifications & implications for each decision – deployment guidance, upgrade guidance and now automated deployment. All of which is created by a team of VMware architects working behind the scenes with every VMware business unit…all with a view to ensuring that deploying the VMware SDDC is consistent & effortless for customers and partners.

Today saw the release of VMware Validated Designs 5.0. The documentation can be found here. I will delve into some of this in more depth in future posts but here are the highlights of today’s release

EDIT: I missed a major addition in the VMware Validated Designs 5.0 release – The new Document Map. This map provides guidance on how and when to navigate each document to make the documentation flow easier to consume

Continue reading “VMware Validated Designs 5.0 – What’s New?”

Beware VLAN double tagging!

In setting up some additional ESXi hosts in an aforementioned lab we ran into an issue where we could not communicate with the new hosts after setting static IPs and relevant management VLANs on them. The hosts are connected to 2 TOR switches (Cisco 9K Top Of Rack). Investigating on the switch you could see the hosts connected on the expected port on each switch (Ethernet 1/14 on each) by searching the mac address table for the relevant mac

Continue reading “Beware VLAN double tagging!”