My name is Brian O’Connell. I work for VMware as part of the Integrated Systems Business Unit (ISBU) working on VMware Validate Designs (VVD) & VMware Cloud Foundation (vCF). I am an IT geek at heart and love playing with new technologies. I started my career in an IT services role as a Systems administrator for SMB’s and moved to EMC in 2010 where I worked for the EMC vLab team developing content for the ever expanding vLab cloud platform. I moved into working with the EMC Cloud Engineering Team primarily focused on the EMC Enterprise Hybrid Cloud Solution (EHC). I am also a husband to a beautiful wife, father to 2 beautiful daughters and a musician (or failed rock star depending on how you look at it!)
A few weeks back I mentioned on twitter that i was working on automating the VMware Validated Design NSX-V Distributed Firewall Configuration in my lab. (I admit it took longer than i had planned!) Currently this is a manual post deployment step once VMware Cloud Builder has completed the deployment. This will likely be picked up by Cloud Builder in a future release but for now its a manual, and somewhat tedious, but required, step!
Full details on the manual steps required for this configuration can be found here. Please take the time to understand what these rules are doing before implementing them.
So in an effort to make this post configuration step a little less painful i set out to automate it. I’ve played with the NSX-V API in the past and found it much easier to interact with by using PowerNSX, rather than leveraging PostMan and the API directly. PowerNSX is the unofficial, official automation tool for NSX. Hats off to VMware engineers Nick Bradford, Dale Coghlan & Anthony Burke for creating and documenting this tool. Anthony also published a FREE book on Automating NSX for vSphere with PowerNSX. More on that here.
Disclaimer: This script is not officially supported by VMware. Use at your own risk & test in a development/lab environment before using in production.
I’ve posted the script to GitHub here as its a bit lengthy! There may be a more efficient way to do some parts of it and if anyone wants to contribute please feel free!
As with a lot of the scripts i create it is menu based and has 2 main options:
Create DFW exclusions, IP Sets & Security Groups
Create DFW Rules
The reason i split it into 2 distinct operations is to allow you to inspect the exclusion list, IP Sets & Security Groups before creating the firewall rules. This will ensure that you dont lock yourself out of vCenter by creating an incorrect rule.
The script will check for PowerCli and if not found will attempt to install the latest version from the PowerShell Gallery
Currently tested on Windows only
If you dont have internet access you can manually install PowerCli by opening a PowerShell console as administrator and running:
I needed to do some validation around vRealize Operations Manager & vRealize Orchestrator for an upcoming VVD release and a physical lab environment was made available. The environment is a dual region VVD deployment. Upon verifying that I had access to all the components I needed it became obvious there was an issue with SSO in the primary region (SFO). Browsing to the web client for the SFO management vCenter I was seeing this:
Like any good (lazy!) IT person the first thing i did was google the error to find the quick fix! That led me to this communities post which had some suggestions around disk space etc. None of which were relevant to my issue. Running the following on the PSCs and vCenters showed that some services were not starting
Restarting the services didn’t help. Next up i checked the usual suspects:
All of the above looked ok. Next I turned my attention to the load balancer. Because the vCenter Web Client was inaccessible I was not able to access the load balancer settings through the UI so I turned to the NSX API using Postman
To connect to the NSX manager that is associated with the load balancer you need to configure a Postman session with basic authentication and enter the NSX manager admin user & password.
To retrieve information on the load balancer you need to run the following GET:
I wont post the full response from the above command as it’s lengthy but scanning through it I noticed that the condition of each load balancer pool member was disabled. In the immortal words of Bart Simpson:
Now I dont know how it got into this state – maybe someone was doing some jenga style doomsday testing, pulling one brick at a time until the tower crashes! – but this certainly looked to be the cause of the issue. So I figured the quickest fix would be to do a PUT API call to NSX with condition enabled for the pool members and I’d be all set. Not so easy!
Running the following PUT appears to work temporarily (running a GET at the same time confirms this)
But the change does not get fully applied and reverts the conditions to disabled after about 30 seconds with the below error:
So to apply the change to the load balancer NSX requires a handoff with the PSC that is is mapped to…in this case its the load balanced PSC that is not functional. So the command fails.
So it was clear I needed to get at least 1 PSC operational before i could attempt to make a change. Time to play with some DNS redirects to “fool” the PSC services into starting.
As my PSCs are setup in HA mode behind a load balancer the SSO endpoint URL is https://sfo01psc01.sfo01.rainpole.local which both PSCs will respond from. So to get my first PSC up I changed the IP for sfo01psc01.sfo01.rainpole.local in DNS to point to the first PSC’s IP.
So now, pings to the load balancer VIP FQDN sfo01psc01.sfo01.rainpole.local respond from the first PSC IP
Next I set a static entry in /etc/hosts on each of my PSCs, and vCenters to do the same as i’ve seen vCenter especially cache DNS entries in it’s local dnsmasq.
Next step was to stop & start all services on each PSC
service-control –stop –all
service-control –start –all
And hey presto the services started! Ran the same on vCenter and the services also started. This allowed me to go in and modify the load balancer pools to set the members to enabled.
Once the load balancer was back as it should be it was just a case of removing the /etc/hosts entries on each VM and reverting the DNS server change to point the load balancer FQDN back to its correct IP address.
For completeness I restarted all the services on each appliances in the above mentioned order
Moral of the story? Dont disable both nodes in a load balancer pool!
Now onwards with the original testing i needed to do!
vRealize Suite Lifecycle Manager (vRSLCM) is a one stop shop for lifecycle management (LCM) of your VMware vRealize Suite (vRA, vRB, vROPs, vRLI) . VMware Validate Designs leverages this via Cloud Builder for initial SDDC deployment but it also covers upgrade from a single interface, reducing the need to jump between interfaces by bringing all LCM tasks into a single UI. This doesn’t come without its challenges however, as vRSLCM is now responsible for aggregating all the install/upgrade logs and presenting them in a coherent manner to the user…which isn’t always the case. vRSLCM logs activity in /var/log/vlcm/vrlcm-server.log but at best you get something like this
Which let’s face it isnt very helpful…or is it? At first glance its just a job ID but thanks to @leahy_sin VMware CMBU I can now make this job ID give me more information in a much more structured way, similar to tail -f. Here’s how