
VMware vSAN – All VMs inaccessible after graceful cluster shutdown restart

When using the graceful cluster shutdown (and restart) feature in VMware vSAN 7.0 Update 3 (7U3), you may run into an issue where all VMs are inaccessible after everything has been powered back on and the hosts have been taken out of maintenance mode.

If you experience this issue, you will also notice that your vSAN datastore appears to be empty (no files or VMs), even though the datastore's usage calculation shows that space is still being consumed.

The Problem

As of vSAN 7.0 Update 3, users can gracefully shut down and restart their entire vSAN cluster from the GUI instead of having to use the CLI/SSH. While you can still Manually Shut Down and Restart the vSAN Cluster, as one can expect, if there’s an easy way to do it via the GUI, it’ll get used.

Last night I had a call from a customer who used this feature, and when bringing up their cluster, all the VMs were marked as inaccessible and the datastore appeared to be empty. What was even more odd is that all the vSAN health information pertaining to the disks looked good.

Connecting to troubleshoot this (with my limited experience with vSAN), I attempted the following:

  • Restart vSAN Management Services on all ESXi Hosts
  • Restart vSAN Health Services on the vCenter vCSA (then wait 15 minutes and restart the ESXi vSAN Management Services)
  • Restart one of the ESXi hosts (to troubleshoot quorum)
  • Troubleshoot Networking (Issues occurred after physical maintenance)
    • Check MTUs
    • Check LAGs (for vSAN Storage Network)
    • Check Communication and Traffic

After doing all of the above, the VMs still were not accessible.

I had a feeling that this was related to the shutdown and restart (power on) process, so I tried to manually start the vSAN cluster using the following command:

python /usr/lib/vmware/vsan/bin/reboot_helper.py recover

This command returned numerous tracebacks, and ultimately timed out after reporting:

Recovery is not ready, retry after 10s...

The Solution

I was convinced this was related to a bug in the automated scripts, so after adjusting my searching, I came across a VMware KB providing information on How to handle inconsistent cluster power status in vSAN shutdown workflow.

I was convinced this would help with our issue; however, the KB didn’t exactly describe the symptoms and errors we had. Scenario 3 was close, but the symptoms were not an exact match.

At this point, I initiated a VMware Support ticket with VMware GSS, who, after checking, confirmed it was the issue described in the KB.

The shutdown script sets “DOMPauseAllCCPs” to 1 (pausing all functions), and “IgnoreClusterMemberListUpdates” to 1. When you choose to Restart and Power on the cluster, these should get set back to 0.

In our case, “IgnoreClusterMemberListUpdates” was set back to 0 during the restart and power on, however “DOMPauseAllCCPs” was still set to 1.

After setting “DOMPauseAllCCPs” to “0” on all hosts, the VMs were immediately accessible, and the issue was resolved.

To check these variables:

esxcfg-advcfg -g /VSAN/DOMPauseAllCCPs
esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
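
If you have more than a couple of hosts, checking each one by hand gets tedious. Below is a minimal sketch (not from the KB or from GSS) that loops over a list of hosts via SSH and prints both values. It assumes SSH is enabled on the ESXi hosts, that root can log in from your workstation, and that "esxcfg-advcfg -g" prints a line in the form "Value of <option> is <value>". The function names and the SSH_CMD variable are my own and purely illustrative.

```shell
#!/bin/sh
# Hypothetical helper: report both vSAN shutdown flags for each host.
# Assumption: SSH is enabled on the hosts and root login works from here.
# SSH_CMD can be overridden (for example, to wrap ssh with extra options).
SSH_CMD="${SSH_CMD:-ssh}"

# Extract the trailing value from a line like "Value of DOMPauseAllCCPs is 1".
# Assumption: this is the output format of "esxcfg-advcfg -g".
parse_advcfg_value() {
    sed -n 's/^Value of .* is \(.*\)$/\1/p'
}

# Print "host option value" for both options on every host passed as an argument.
check_vsan_flags() {
    for host in "$@"; do
        for opt in DOMPauseAllCCPs IgnoreClusterMemberListUpdates; do
            value=$($SSH_CMD "root@$host" "esxcfg-advcfg -g /VSAN/$opt" < /dev/null | parse_advcfg_value)
            # Fall back to "unknown" if the host did not answer or the output did not parse.
            echo "$host $opt ${value:-unknown}"
        done
    done
}

# Example (hostnames are placeholders):
#   check_vsan_flags esx01.example.local esx02.example.local esx03.example.local
```

Any host that reports a value of 1 for either option after the cluster has been powered back on is a candidate for the fix described in this post.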

To set these variables (to undo what the shutdown script did):

esxcfg-advcfg -s 0 /VSAN/DOMPauseAllCCPs
esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates

When checking or setting these, you must do it on all vSAN nodes (ESXi hosts) in the vSAN Cluster.
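
Since the change has to be made on every node, a small loop can help make sure no host gets missed. The following is only a sketch under the same assumptions as any remote approach: SSH is enabled on the ESXi hosts and root can log in from your workstation. The function name and the SSH_CMD variable are hypothetical, and I would only run something like this once you (or VMware Support) have confirmed that you're hitting this specific issue.

```shell
#!/bin/sh
# Hypothetical helper: set both vSAN shutdown flags back to 0 on each host.
# Assumption: SSH is enabled on the hosts and root login works from here.
SSH_CMD="${SSH_CMD:-ssh}"

# Run "esxcfg-advcfg -s 0" for both options on every host passed as an argument.
# Returns non-zero if any command failed, so a missed host is not silently ignored.
reset_vsan_flags() {
    rc=0
    for host in "$@"; do
        for opt in DOMPauseAllCCPs IgnoreClusterMemberListUpdates; do
            if $SSH_CMD "root@$host" "esxcfg-advcfg -s 0 /VSAN/$opt" < /dev/null > /dev/null; then
                echo "$host $opt reset"
            else
                echo "$host $opt FAILED" >&2
                rc=1
            fi
        done
    done
    return $rc
}

# Example (hostnames are placeholders):
#   reset_vsan_flags esx01.example.local esx02.example.local esx03.example.local
```

After running it, re-check the values on each host with the "-g" commands to confirm both options read 0.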

Stephen Wagner

Stephen Wagner is President of Digitally Accurate Inc., an IT Consulting, IT Services and IT Solutions company. Stephen Wagner is also a VMware vExpert, NVIDIA NGCA Advisor, and HPE Influencer, and also specializes in a number of technologies including Virtualization and VDI.

View Comments

  • HOLY MOLY this article saved my life (well, my infrastructure). You are a GODSEND. I would never have figured out "DOMPauseAllCCPs". That did the trick, after I had used the vcenter UI to shut down, but then lost the orchestration host and had to rebuild it. I tried the python script to recover, and it did not work! Once I set the CCPs option, presto, vSAN came back to life!!

    THANK YOU!
