When upgrading VMware ESXi hosts with VMware vCenter and vLCM (vSphere Lifecycle Manager), remediation may fail on hosts that have the NVIDIA vGPU host driver installed.
The error appears in the task list as a general failure. When you monitor the remediation inside vLCM, you’ll see an error stating that a service, module, or VIB is currently in use, which blocks the update or upgrade.
Cause
I suspect this occurs with vGPU release 18.3 (host driver 570.158.02) because the host driver changes version while the GPU monitoring and management daemon does not (it stays at 570.148.06). Since the GPU daemon isn’t touched, its services are not stopped, which keeps the NVIDIA ESXi vGPU host driver loaded in the kernel and prevents the vLCM remediation from completing.
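To see which versions are actually installed, you can print the name and version of each NVIDIA VIB on the host. This is a minimal sketch: the function name `nvidia_vib_versions` is my own, and it assumes that Name and Version are the first two columns of the `esxcli software vib list` output.

```shell
#!/bin/sh
# Hypothetical helper: reads `esxcli software vib list` output on stdin and
# prints "<name> <version>" for every row that mentions NVD (case-insensitive).
# Assumption: Name and Version are the first two whitespace-separated columns.
nvidia_vib_versions() {
    awk 'tolower($0) ~ /nvd/ { print $1, $2 }'
}

# Usage on the ESXi host:
#   esxcli software vib list | nvidia_vib_versions
```

If the suspicion above is right, the daemon row should still show 570.148.06 on an affected host.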
Resolution
I tried a number of different things to resolve this, such as stopping services and re-attempting, then attempting to unload the NVIDIA vGPU kernel driver, but none of these provided a quick fix.
To resolve this issue, I stopped all the NVIDIA services, uninstalled the vGPU host driver and management daemon, restarted the host, checked compliance, and then remediated the host. After that, remediation completed successfully.
Steps to perform these actions:
- Place the host in maintenance mode
- SSH into the ESXi host
- Run the following command to identify the NVIDIA driver and GPU management daemon:
esxcli software vib list | grep -i NVD
- This returns the NVIDIA VIBs, for example:
NVD-VMware_ESXi_8.0.0_Driver
nvdgpumgmtdaemon
- Stop the NVIDIA vGPU and related services using the following commands (some of these may already be stopped):
/etc/init.d/nvdGpuMgmtDaemon stop
/etc/init.d/gpuManager stop
/etc/init.d/xorg stop
- Uninstall the NVIDIA vGPU host driver and management daemon using the following commands (use the VIB names returned by the list command above):
esxcli software vib remove -n NVD-VMware_ESXi_8.0.0_Driver
esxcli software vib remove -n nvdgpumgmtdaemon
- Reboot the host
- Check vLCM compliance (don’t skip this)
- Remediate the host
After performing these steps, you’ll be able to successfully remediate the host, which installs the upgraded NVIDIA vGPU drivers.
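The SSH portion of the steps above can be sketched as a small script. This is only a sketch under assumptions, not a tested tool: the `DRY_RUN` variable and the function names `run`, `stop_nvidia_services`, and `remove_vibs` are my own, and the VIB names in the usage comment are the ones from the example output above, so substitute whatever your host reports. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: stop NVIDIA services and remove NVIDIA VIBs on an ESXi host.
# Assumptions: the host is already in maintenance mode and you are connected
# over SSH. DRY_RUN defaults to 1, so commands are printed but not executed.
DRY_RUN="${DRY_RUN:-1}"

# Print a command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Stop the NVIDIA vGPU and related services; some may already be stopped.
stop_nvidia_services() {
    for svc in nvdGpuMgmtDaemon gpuManager xorg; do
        if [ -x "/etc/init.d/$svc" ]; then
            run "/etc/init.d/$svc" stop || echo "warning: could not stop $svc"
        fi
    done
}

# Remove each VIB given as an argument, in the order given.
remove_vibs() {
    for vib in "$@"; do
        run esxcli software vib remove -n "$vib" || return 1
    done
}

# Usage (append to this file, then run with DRY_RUN=0 to apply changes):
#   stop_nvidia_services
#   remove_vibs NVD-VMware_ESXi_8.0.0_Driver nvdgpumgmtdaemon
```

Rebooting, checking compliance, and remediating are left as manual steps, since the compliance check and remediation happen in vLCM rather than on the host.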