I’ve noticed in a few situations where an ESXi host is marked as “unresponsive” or “disconnected” inside of vCenter due to issues occurring on that host (or connected hardware). This recently happened again with a customer and is why I’m writing this article at this very moment.
In these situations, usually all normal means of managing, connecting, or troubleshooting the host are unavailable. Usually in cases like this ESXi administrators would simply reset the host.
However, I’ve found hosts can often be rescued without requiring an ungraceful restart or reset.
In these situations, it can be observed that:
- The ESXi host is in a unresponsive to disconnected state to vCenter Server.
- Connecting to the ESXi host directly does not work as it either doesn’t acknowledge HTTPS requests, or comes up with an error.
- Accessing the console of the ESXi host isn’t possible as it appears frozen.
- While the ESXi host is unresponsive, the virtual machines are still online and available on the network.
In the few situations I’ve noticed this occurring, troubleshooting is possible but requires patience. Consider the following:
- When trying to access the ESXi console, give it time after hitting enter or selecting a value. If there’s issues on the host such as commands pending, tasks pending, or memory issues, the console may actually respond if you give it 30 seconds to 5 minutes after selecting an item.
- With the above in mind, attempt to enable console access (preferably console and not SSH). The logins may take some time (30 seconds to 5 minutes after typing in the password), but you might be able to gain troubleshooting access.
- Check the SAN, NAS, and any shared storage… In one instance, there were issues with a SAN and datastore that froze 2 VMs. The Queued commands to the SAN caused the ESXi host to become unresponsive.
- There may be memory issues with the ESXi instance. The VMs are fine, however an agent, driver, or piece of software may be causing the hypervisor layer to become unresponsive.
If there are storage issues, do what you can. In one of the cases above, we had to access the ESXi console, issue a “kill -9” to the VM, and then restart the SAN. We later found out there was issues with the SAN and corrupted virtual machines. The moment the SAN was restarted, the ESXi host became responsive, connected to the vCenter server and could be managed.
In another instance, on an older version of ESXi there was an HPE agentless management driver/service that was consuming the ESXi hosts memory continuously causing the memory to overflow, the host to fill the swap and become unresponsive. Eventually after gracefully shutting down the VMs, I was able to access the console, kill the service, and the host become responsive.