VMware vSAN – All VMs inaccessible after graceful cluster shutdown restart

When using VMware vSAN 7.0 Update 3 (7U3) and the graceful shutdown (and restart) feature for your entire vSAN cluster, you may experience an issue where all VMs are inaccessible after everything has been powered back on and the hosts have been taken out of maintenance mode.

If you experience this issue, you will also notice that your vSAN datastore appears to be empty (no files or VMs), even though the datastore’s data usage calculation shows that space is still being consumed.
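
You may be able to see the same mismatch from the ESXi shell. The following is only a quick sketch, and the datastore name is just an example (substitute your own vSAN datastore name):

# Browsing the vSAN datastore shows no VM folders or files...
ls /vmfs/volumes/vsanDatastore/
# ...yet the datastore still reports used capacity
df -h | grep -i vsan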

The Problem

As of vSAN 7.0 Update 3, users can gracefully shut down and restart their entire vSAN cluster from the GUI instead of having to use the CLI/SSH. While you can still manually shut down and restart the vSAN cluster, as you’d expect, if there’s an easy way to do it via the GUI, it’ll get used.

Last night I had a call from a customer who had used this feature: when they brought their cluster back up, all the VMs were marked as inaccessible and the datastore appeared to be empty. What was even more odd was that all the vSAN health information pertaining to the disks looked good.

Connecting to troubleshoot this (with my limited experience with vSAN), I attempted the following:

  • Restart the vSAN management services on all ESXi hosts (example commands after this list)
  • Restart the vSAN health service on the vCenter Server Appliance (vCSA), then wait 15 minutes and restart the ESXi vSAN management services again
  • Restart one of the ESXi hosts (to troubleshoot quorum)
  • Troubleshoot networking (the issues occurred after physical maintenance)
    • Check MTUs
    • Check LAGs (for the vSAN storage network)
    • Check communication and traffic
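
For reference, the service restarts and MTU test look roughly like the following from the CLI. This is only a sketch: the vCSA service name can vary between versions, and the vmkernel interface, packet size, and IP address below are placeholders for your own vSAN network values.

# On each ESXi host: restart the vSAN management daemon
/etc/init.d/vsanmgmtd restart

# On the vCSA shell (service name may differ by vCSA version)
service-control --stop vsan-health
service-control --start vsan-health

# From an ESXi host, test the vSAN network MTU to another host’s vSAN vmkernel IP
# (-d = don’t fragment, -s 8972 for a 9000 MTU network, or -s 1472 for 1500 MTU)
vmkping -I vmk1 -d -s 8972 192.168.10.12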

After doing all of the above, the VMs were still inaccessible.

I had a feeling this was related to the shutdown and restart (power-on) process, so I tried to manually recover the vSAN cluster using the following command:

python /usr/lib/vmware/vsan/bin/reboot_helper.py recover

This command returned numerous tracebacks, and ultimately timed out after reporting:

Recovery is not ready, retry after 10s...

The Solution

I was convinced this was related to a bug in the automated scripts, so after adjusting my search, I came across a VMware KB article, “How to handle inconsistent cluster power status in vSAN shutdown workflow”.

I was convinced this would help with our issue; however, the KB didn’t exactly describe the symptoms and errors we had. Scenario 3 was close, but the symptoms were not an exact match.

At this point, I opened a support ticket with VMware GSS, who, after checking, confirmed it was the issue described in the KB.

The shutdown script sets “DOMPauseAllCCPs” to 1 (pausing all functions) and “IgnoreClusterMemberListUpdates” to 1. When you choose to restart and power on the cluster, both of these should be set back to 0.
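
In other words, the shutdown workflow effectively runs the equivalent of the following on each host (a simplified sketch of the effect, not the actual script), and the restart workflow is supposed to reverse it:

esxcfg-advcfg -s 1 /VSAN/DOMPauseAllCCPs
esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates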

In our case, “IgnoreClusterMemberListUpdates” was set back to 0 during the restart and power-on; however, “DOMPauseAllCCPs” was still set to 1.

After setting “DOMPauseAllCCPs” to “0” on all hosts, the VMs were immediately accessible, and the issue was resolved.

To check these variables:

esxcfg-advcfg -g /VSAN/DOMPauseAllCCPs
esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates

To set these variables (to undo what the shutdown script did):

esxcfg-advcfg -s 0 /VSAN/DOMPauseAllCCPs
esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates

When checking or setting these values, you must do so on all vSAN nodes (ESXi hosts) in the vSAN cluster.
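
Since the values need to be checked and corrected on every node, it can be quicker to loop over the hosts from an admin workstation. This is a rough sketch, assuming SSH is enabled on the hosts and using placeholder host names:

# Placeholder host names; replace with your own vSAN nodes
for host in esxi01 esxi02 esxi03; do
  ssh root@"${host}" 'esxcfg-advcfg -s 0 /VSAN/DOMPauseAllCCPs; esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates'
done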
