Jul 282023

In May of 2023, NVIDIA released the NVIDIA GPU Manager for VMware vCenter. This appliance allows you to manage your NVIDIA vGPU Drivers for your VMware vSphere environment.

Since the release, I’ve had a chance to deploy it, test it, and use it, and want to share my findings.

In this post, I’ll cover the following (click to skip ahead):

  1. What is the NVIDIA GPU Manager for VMware vCenter
  2. How to deploy and configure the NVIDIA GPU Manager for VMware vCenter
    • Deployment of OVA
    • Configuration of Appliance
  3. Using the NVIDIA GPU Manager to manage, update, and deploy vGPU drivers to ESXi hosts

Let’s get to it!

What is the NVIDIA GPU Manager for VMware vCenter

The NVIDIA GPU Manager is an (OVA) appliance that you can deploy in your VMware vSphere infrastructure (using vCenter and ESXi) to act as a driver (and update) repository for vLCM (vSphere Lifecycle Manager).

In addition to acting as a repo for vLCM, it also installs a plugin on your vCenter that provides a GUI for browsing, selecting, and downloading NVIDIA vGPU host drivers to the local repo running on the appliance. These updates can then be deployed using LCM to your hosts.

In short, this allows you to easily select, download, and deploy specific NVIDIA vGPU drivers to your ESXi hosts using vLCM baselines or images, simplifying the entire process.

Supported vSphere Versions

The NVIDIA GPU Manager supports the following vSphere releases (vCenter and ESXi):

  • VMware vSphere 8.0 (and later)
  • VMware vSphere 7.0U2 (and later)

The NVIDIA GPU Manager supports vGPU driver releases 15.1 and later, including the new vGPU 16 release version.

How to deploy and configure the NVIDIA GPU Manager for VMware vCenter

To deploy the NVIDIA GPU Manager Appliance, we have to download an OVA (from NVIDIA’s website), then deploy and configure it.

See below for the step by step instructions:

Download the NVIDIA GPU Manager

  1. Log on to the NVIDIA Application Hub, and navigate to the “NVIDIA Licensing Portal” (https://nvid.nvidia.com).
  2. Navigate to “Software Downloads” and select “Non-Driver Downloads”
  3. Change Filter to “VMware vCenter” (there is both VMware vSphere, and VMware vCenter, pay attention to select the correct).
  4. To the right of “NVIDIA GPU Manager Plug-in 1.0.0 for VMware vCenter”, click “Download” (see below screenshot).
Screenshot of download link for NVIDIA GPU Manager for VMware vCenter
NVIDIA GPU Manager Download Page

After downloading the package and extracting, you should be left with the OVA, along with Release Notes, and the User Guide. I highly recommend reviewing the documentation at your leisure.

Deploy and Configure the NVIDIA GPU Manager

We will now deploy the NVIDIA GPU Manager OVA appliance:

  1. Deploy the OVA to either a cluster with DRS, or a specific ESXi host. In vCenter either right click a cluster or host, and select “Deploy OVF Template”. Choose the GPU Manager OVA file, and continue with the wizard.NVIDIA GPU Manager OVA Deploy
  2. Configure Networking for the Appliance
    • You’ll need to assign an IP Address, and relevant networking information.
    • I always recommend creating DNS (forward and reverse entries) for the IP.NVIDIA GPU Manager OVA Network Configuration
  3. Finally, power on Appliance.

We must now create a role and service account that the GPU Manager will use to connect to the vCenter server.

While the vCenter Administrator account will work, I highly recommend creating a service account specifically for the GPU Manager that only has the required permissions that are necessary for it to function.

  1. Log on to your vCenter Server
  2. Click on the hamburger menu item on the top left, and open “Administration”.
  3. Under “Access Control” select Roles. vCenter-Roles
  4. Select New to create a new role. We can call it “NVIDIA Update Services”.
  5. Assign the following permissions:
    • Extension Privileges
      • Register Extension
      • Unregister Extension
      • Update Extension
    • VMware vSphere Lifecycle Manager Configuration Priveleges
      • Configure Service
    • VMware vSphere Lifecycle Manager Settings Priveleges
      • Read
    • Certificate Management Privileges
      • Create/Delete (Admins priv)
      • Create/Delete (below Admins priv)
    • ***PLEASE NOTE: The above permissions were provided in the documentation and did not work for me (resulted in an insufficient privileges error). To resolve this, I chose “Select All” for “VMware vSphere Lifecycle Manager”, which resolved the issue.***
  6. Save the Role
  7. On the left hand side, navigate to “Users and Groups” under “Single Sign On”
  8. Change the domain to your local vSphere SSO domain (vsphere.local by default)
  9. Create a new user account for the NVIDIA appliance, as an example you could use “nvidia-svc”, and choose a secure password.
  10. Navigate to “Global Permissions” on the left hand side, and click “Add” to create a new permission.
  11. Set the domain, and choose the new “nvidia-svc” service account we created, and set the role to “NVIDIA Update Services”, and check “Propagate to Children”.
  12. You have now configured the service account.

Now, we will perform the initial configuration of the appliance. To configure the application, we must do the following:

  1. Access the appliance using your browser and the IP you configured above (or FQDN) GPU Manager Account Creation
  2. Create a new password for the administrative “vcp_admin” account. This account will be used to manage the appliance.
    • A secret key will be generated that will allow the password to be reset, if required. Save this key somewhere safe.
  3. We must now register the appliance (and plugin) with our vCenter Server. Click on “REGISTER”. NVIDIA GPU Manager Register
  4. Enter the FQDN or IP of your vCenter server, the NVIDIA Service account (“nvidia-svc” from example), and password.
  5. Once the GPU Manager is registered with your vCenter server, the remainder of the configuration will be completed from the vCenter GPU.
    • The registration process will install the GPU Manager Plugin in to VMware vCenter
    • The registration process will also configure a repository in LCM (this repo is being hosted on the GPU manager appliance).

We must now configure an API key on the NVIDIA Licensing portal, to allow your GPU Manager to download updates on your behalf.

  1. Open your browser and navigate to https://nvid.nvidia.com. Then select “NVIDIA LICENSING PORTAL”. Login using your credentials.
  2. On the left hand side, select “API Keys”.
  3. On the upper right hand, select “CREATE API KEY”.
  4. Give the key a name, and for access type choose “Software Downloads”. I would recommend extending the key validation time, or disabling key expiration. NVIDIA Download API Create Key
  5. The key should now be created.
  6. Click on “view api key”, and record the key. You’ll need to enter this in later in to the vCenter GPU Manager plugin.

And now we can finally log on to the vCenter interface, and perform the final configuration for the appliance.

  1. Log on to the vCenter client, click on the hamburger menu, and select “NVIDIA GPU Manager”.
  2. Enter the API key you created above in to the “NVIDIA Licensing Portal API Key” field, and select “Apply”.
  3. The appliance should now be fully configured and activated. GPU Manager Activated API Key
  4. Configuration is complete.

We have now fully deployed and completed the base configuration for the NVIDIA GPU Manager.

Using the NVIDIA GPU Manager to manage, update, and deploy vGPU drivers to ESXi hosts

In this section, I’ll be providing an overview of how to use the NVIDIA GPU Manager to manage, update, and deploy vGPU drivers to ESXi hosts. But first, lets go over the workflow…

The workflow is a simple one:

  1. Using the vCenter client plugin, you choose the drivers you want to deploy. These get downloaded to the repo on the GPU Manager appliance, and are made available to Lifecycle Manager.
  2. You then use Lifecycle Manager to deploy the vGPU Host Drivers to the applicable hosts, using baselines or images.

As you can see, there’s not much to it, despite all the configuration we had to do above. While it is very simple, it simplifies management quite a bit, especially if you’re using images with Lifecycle Manager.

To choose and download the drivers, load up the plugin, use the filters to filter the list, and select your driver to download.

GPU Manager downloading vGPU Driver
NVIDIA GPU Manager downloading vGPU Driver

As you can see in the example, I chose to download the vGPU 15.3 host driver. Once completed, it’ll be made available in the repo being hosted on the appliance.

Once LCM has a changed to sync with the updated repos, the driver is then made available to be deployed. You can then deploy using baselines or host images.

LCM Image Update with NVIDIA vGPU Driver from NVIDIA GPU Manager
LCM Image Update with NVIDIA vGPU Driver from NVIDIA GPU Manager

In the example above, I added the vGPU 16 (535.54.06) host driver to my clusters update image, which I will then remediate and deploy to all the hosts in that cluster. The vGPU driver was made available from the download using GPU Manager.

Jul 252023

When it comes to virtualized workloads, one thing I commonly see overlooked in the design of the solution, is the placement of workloads. In this post, I want to cover VMware vSphere VM placement rules using the “VM/Host Rules” feature.

This is a feature that I commonly see overlooked and not configured, especially in smaller single cluster environments, however I’ve also seen this happen in very large scale environments as well.

Let’s cover the why, what, who, and how…

VM Workloads

While VMware vSphere does have a number of technologies built in for redundancy, load-balancing, and availability, as part of the larger solution we often find our workloads, specifically 3rd party platforms, with their own solutions that accomplish the same thing.

We need to identify which HA (High Availability) or redundancy solution to use, based on the application, service, and how it works.

For example, using VMware vSphere HA, or High Availability, if vCenter (and/or vCLS) detects a host goes offline, it can restart the workload on other online hosts. There is time associated with the detection and boot time, resulting in a loss of service during this period.

Third party solutions often have their own high availability or redundancy built in to the solution, such as Microsoft Active Directory. In this case with a standard configuration, at any time, any domain controller can respond to a clients request for resources. If one DC goes offline, other DCs can respond to the request resulting in no downtime.

Obviously, in the case of Active Directory Domain Controllers, you’d much prefer to have multiple DCs in your environment, instead of using one with vSphere HA.

Additionally, if you did have multiple domain controllers, you’d want to make sure they aren’t all placed on the same ESXi host. This is where we start to incorporate VM placement in to our solution.

VM Placement

When it comes to 3rd party solutions like mentioned above, we need to identify these workloads and factor them in to the design of the solution we are either implementing, maintaining, or improving.

Example of VM workloads used with VM Placement

A few examples of these workloads with their own load-balancing and availability technologies:

  • Microsoft Windows Active Directory Domain Controllers
  • Microsoft Windows Servers running DNS/DHCP Servers
  • Virtualized Active/Active or Active/Passive Firewall Appliances
  • VMware Horizon UAG (Unified Access Gateway) configured in HA mode
  • Other servers/services that have their own availability systems

As you can see, the applications all have their own special solution for availability, so we must insure the different “nodes” or “instances” are running on different ESXi hosts to avoid a host failure bringing down the entire solution.

Unless otherwise specified by the 3rd party vendor, I would recommend using VM/Host Rules in combination with vSphere DRS and HA.

Configuring VM Placement with VM/Host Rules

To configure these rules, follow the instructions below:

  1. Log on to your VMware vCenter Server
  2. Select a Cluster
  3. Click on the “Configure” tab, and then “VM/Host Rules”
    • Here you can Add/Edit/Delete VM Host Rules
  4. Click on “Add”, and give the rule a new name (Example: Domain Controllers)
  5. For “Type”, select “Separate Virtual Machines”
  6. Click “Add” and select your Domain Controllers and add them to the rule.
Screenshot rule creation for VM placement using VM Host Rules
Domain Controller VM Placement VM Host Rule

After you click “OK”, the rule should now be saved, and DRS will make sure these VMs are now running on separate hosts.

Below you can see another example of a configured system, separating 2 Active/Passive Firewall appliances.

VM placement and VM/Host Rules for Firewall appliances
VM/Host Rules for Firewall Appliances

As you can see, VM placement with VM/Host Rules is very easy to configure and deploy.

Additional Considerations

Note, if you implement these rules and do not have enough hosts to fullfill the requirements, the hosts may fail to be evacuated by DRS when placing in maintenance mode, or remediating with vLCM (Lifecycle Manager).

In this case, you’ll need to manually vMotion the VM’s to other hosts (to violate the rule) or shut some down.

Jul 242023
Picture of an DL360p Gen8 1U Rack Server with IO-PEX40152 Installed

A few months ago, you may have seen my post detailing my experience with ESXi 7.0 on HP Proliant DL360p Gen8 servers. I now have an update as I have successfully loaded ESXi 8.0 on HPE Proliant DL360p Gen8 servers, and want to share my experience.

It wasn’t as eventful as one would have expected, but I wanted to share what’s required, what works, and stability observations.

Please note, this is NOT supported and NOT recommended for production environments. Use the information at your own risk.

A special thank you goes out to William Lam and his post on Homelab considerations for vSphere 8, which provided me with the boot parameter required to allow legacy CPUs.

ESXi on the DL360p Gen8

With the release of vSphere 8.0 Update 1, and all the new features and functionality that come with the vSphere 8 release as a whole, I decided it was time to attempt to update my homelab.

In my setup, I have the following:

  • 2 x HPE Proliant DL360p Gen8 Servers
    • Dual Intel Xeon E5-2660v2 Processors in each server
    • USB and/or SD for booting ESXi
    • No other internal storage
    • NVIDIA A2 vGPU (for use with VMware Horizon)
  • External SAN iSCSI Storage

Since I have 2 servers, I decided to do a fresh install using the generic installer, and then use the HPE addon to install all the HPE addons, drivers, and software. I would perform these steps on one server at a time, continuing to the next if all went well.

I went ahead and documented the configuration of my servers beforehand, and had already had upgraded my VMware vCenter vCSA appliance from 7U3 to 8U1. Note, that you should always upgrade your vCenter Server first, and then your ESXi hosts.

To my surprise the install went very smooth (after enabling legacy CPUs in the installer) on one of the hosts, and after a few days with no stability issues, I then proceeded and upgraded the 2nd host.

I’ve been running with 100% for 25+ days without any issues.

The process – Installing ESXi 8.0

I used the following steps to install VMware vSphere ESXi 8 on my HP Proliant Gen8 Server:

  1. Download the Generic ESXi installer from VMware directly.
    1. Link: Download VMware vSphere
  2. Download the “HPE Custom Addon for ESXi 8.1”.
    1. Link: HPE Custom Addon for ESXi 8.0 U1 June 2023
    2. Other versions of the Addon are here: HPE Customized ESXi Image.
  3. Boot server with Generic ESXi installer media (CD or ISO)
    • IMPORTANT: Press “Shift + o” (Shift key, and letter “o”) to interrupt the ESXi boot loader, and add “AllowLegacyCPU=true” to the kernel boot parameters.
  4. Continue to install ESXi as normal.
    • You may see warnings about using a legacy CPU, you can ignore these.
  5. Complete initial configuration of ESXi host
  6. Mount NFS or iSCSI datastore.
  7. Copy HPE Custom Addon for ESXi zip file to datastore.
  8. Enable SSH on host (or use console).
  9. Place host in to maintenance mode.
  10. Run “esxcli software vib install -d /vmfs/volumes/datastore-name/folder-name/HPE-801.” from the command line.
  11. The install will run and complete successfully.
  12. Restart your server as needed, you’ll now notice that not only were HPE drivers installed, but also agents like the Agentless management agent, and iLO integrations.

After that, everything was good to go… Here you can see version information from one of the ESXi hosts:

ESXi 8 on HPE Proliant DL360p Gen8
VMware ESXi version 8.0.1 Build 21813344 on HPE Proliant DL360p Gen8 Server

What works, and what doesn’t

I was surprised to see that everything works, including the P420i embedded RAID controller. Please note that I am not using the RAID controller, so I have not performed extensive testing on it.

HPE P420i RAID Controller with VMware vSphere ESXi 8
HPE P420i RAID Controller with VMware vSphere ESXi 8

All Hardware health information is present, and ESXi is functioning as one would expect if running a supported version on the platform.

Additional Information

Note that with vSphere 8, VMware is deprecating vLCM baselines. Your focus should be to update your ESXi instances using cluster image based update images. You can incorporate vendor add-ons and components to create a customized image for deployment.

Jul 232023
Azure AD SSO with Horizon

With the release of VMware Horizon 2303, VMware Horizon now supports Hybrid Azure AD Join with Azure AD Connect using Instant Clones and non-persistent VDI.

So what exactly does this mean? It means you can now use Azure SSO using PRT (Primary Refresh Token) to authenticate and access on-premise and cloud based applications and resources.

What else? It allows you to use conditional access!

What is Hybrid Azure AD Join, and why would we want to do it with Azure AD Connect?

Historically, it was a bit challenging when it came to Understanding Microsoft Azure AD SSO with VDI (click to read the post and/or see the video), and special considerations had to be made when an organization wished to implement SSO between their on-prem non-persistent VDI deployment and Azure AD.

Screenshot of a Hybrid Azure AD Joined login
Hybrid Azure AD Joined Login

Azure AD SSO, the old way

The old way to accomplish this was to either implement Azure AD with ADFS, or use Seamless SSO. ADFS was bulky and annoying to manage, and Seamless SSO was actually intended to enable SSO on “downlevel devices” (older operating systems before Windows 10).

For customers without ADFS, I would always recommend using Seamless SSO to enable SSO on non-persistent VDI Instant Clones, until now!

Azure AD SSO, the new way with Azure AD Connect and Azure SSO PRTs

According to the release notes for VMware Horizon 2303:

Hybrid Azure Active Directory for SSO is now supported on instant clone desktop pools. See KB 89127 for details.

This means we can now enable and use Azure SSO with PRTs (Primary Refresh Tokens) using Azure AD Connect and non-persistent VDI Instant Clones.

Azure SSO with PRT and Non-Persistent VDI

This is actually a huge deal because not only does it allow us to use the preferred method for performing SSO with Azure, but it also allows us to start using fancy Azure features like conditional access!

Requirements for Hybrid Azure AD Join with non-persistent VDI and Azure AD Connect

In order to utilize Hybrid Join and PRTs with non-persistent VDI on Horizon, you’ll need the following:

  • VMware Horizon 2303 (or later)
  • Active Directory
  • Azure AD Connect (Implemented, Configured, and Functioning)
    • Azure AD Hybrid Domain Join must be enabled
    • OU and Object filtering must include the non-persistent computer objects and computer accounts
  • Create a VMware Horizon Non-Persistent Desktop Pool for Instant Clones
    • “Allow Reuse of Existing Computer Accounts” must be checked

When you configure this, you’ll notice that after provisioning a desktop pool (or pushing a new snapshot), that there may be a delay for PRTs to be issued. This is expected, however the PRT will be issued eventually, and subsequent desktops shouldn’t experience issues unless you have a limited number available.

*Please note: VMware still notes that ADFS is the preferred way for fast issuance of the PRT.

While VMware does recommend ADFS for performance when issuing PRTs, in my own testing I had no problems or complaints, however when deploying this in production I’d recommend that because of the PRT delay after deploying the pool or a new snapshot, to do this after hours or SSO will not function for some users who immediately get a new desktop.

Additional Considerations

Please note the following:

  • When switching from ADFS to Azure AD Connect, the sign-in process may change for users.
    • You must prepare the users for the change.
  • When using locally stored identifies and/or cached credentials, enabling Azure SSO may change the login process, or cause issues for users signing in.
    • You may have to delete saved credentials in the users persistent profile
    • You may have to adjust GPOs to account for Azure SSO
    • You may have to modify settings in your profile persistent solution
      • Example: “RoamIdentity” on FSLogix
  • I recommend testing before implementing
    • Test Environment
    • Test with new/blank user profiles
    • Test with existing users

If you’re coming from an environment that was previously using Seamless SSO for non-persistent VDI, you can create new test desktop pools that use newly created Active Directory OU containers and adjust the OU filtering appropriately to include the test OUs for synchronization to Azure AD with Azure AD Connect. This way you’re only syncing the test desktop pool, while allowing Seamless SSO to continue to function for existing desktop pools.

How to test Azure AD Hybrid Join, SSO, and PRT

To test the current status of Azure AD Hybrid Join, SSO, and PRT, you can use the following command:

dsregcmd /status

To check if the OS is Hybrid Domain joined, you’ll see the following:

| Device State                                                         |

             AzureAdJoined : YES
          EnterpriseJoined : NO
              DomainJoined : YES
                DomainName : DOMAIN

As you can see above, “AzureADJoined” is “YES”.

Further down the output, you’ll find information related to SSO and PRT Status:

| SSO State                                                            |

                AzureAdPrt : YES
      AzureAdPrtUpdateTime : 2023-07-23 19:46:19.000 UTC
      AzureAdPrtExpiryTime : 2023-08-06 19:46:18.000 UTC
       AzureAdPrtAuthority : https://login.microsoftonline.com/XXXXXXXX-XXXX-XXXXXXX
             EnterprisePrt : NO
    EnterprisePrtAuthority :
                 OnPremTgt : NO
                  CloudTgt : YES
         KerbTopLevelNames : XXXXXXXXXXXXX

Here we can see that “AzureAdPrt” is YES which means we have a valid Primary Refresh Token issued by Azure AD SSO because of the Hybrid Join.

Mar 062023
VMware vSphere 7 Logo

You might ask if/what the procedure is for updating Enhanced Linked Mode vCenter Server Instances, or is there even any considerations that apply?

vCenter Enhanced Link Mode is a feature that allows you to link a total of 15 vCenter Instances in to a single, Single Sign On (SSO) vSphere domain. This allows you to have a single set of credentials to manage all 15 instances, as well as the ability to manage all of them from a single pane of glass.

When it comes to environments with multiple vCenter instance and/or vCSA appliances, this really helps manageability, and visibility.

Enhanced Linked Mode Upgrade Considerations

To answer the question above: Yes, when you’re running Enhanced Linked Mode (ELM) to link multiple vCenter Server, special considerations and requirements exist when it comes to updating or upgrading your vCenter Server instances and vCSA appliances.

Multiple VMware vCenter Server Instances (vCSA) Running in Enhanced Link Mode (ELM)
Multiple VMware vCenter Server Instances (vCSA) Running in Enhanced Link Mode (ELM)

Not only have these procedures been documented in older VMware documentation, but I recently reviewed and confirmed the best practices with VMware GSS while on a support case.

Procedure for updating vCenter with ELM

  1. Configure/Confirm that the vCenter File-Based Backup in VAMI is configured, functioning, and that you are creating valid file based backups.
  2. Create a manual file-based backup with VAMI
  3. Power down all vCenter Instances and vCSA Appliances in your environment
  4. Perform a cold snapshot of all vCenter Instances and vCSA appliances
    • *This is critical* – You need a valid offline snapshot taken of all appliances powered off at the same point in time
  5. Power on the vCenter/vCSA Virtual Machines (VMs)
  6. Perform the update or upgrade

Recovering from a failed Update

IMPORTANT: In the event that an update or upgrade fails, you must revert all vCenter Instances and/or vCSA appliances back to the previous snapshot!

You cannot selectively choose single or individual instances, as this may cause mismatches in data and configuration between the instances as they have databases that are not in sync, and are from different points in time.

Additionally, if you are in a situation where you’re considering or planning to restore previous snapshots to recover from a failed update, you should do so sooner than later. As time progresses, service accounts and identifiers update in the VMware vSphere infrastructure. Delaying the restore too long could cause this information to get out of sync with the ESXi hosts after performing a snapshot restore/revert.

Mar 052023

In this NVIDIA vGPU Troubleshooting Guide, I’ll help show you how to troubleshoot vGPU issues on VMware platforms, including VMware Horizon and VMware Tanzu. This guide applies to the full vGPU platform, so it’s relevant for VDI, AI, ML, and Kubernetes workloads.

This guide will provide common troubleshooting methods, along with common issues and problems associated with NVIDIA vGPU as well as their fixes.

Please note, there are numerous other additional methods available to troubleshoot your NVIDIA vGPU deployment, including 3rd party tools. This is a general document provided as a means to get started learning how to troubleshoot vGPU.


NVIDIA vGPU is a technology platform that includes a product line of GPUs that provide virtualized GPUs (vGPU) for Virtualization environments. Using a vGPU, you can essentially “slice” up a physical GPU and distribute Virtual GPUs to a number of Virtual Machines and/or Kubernetes containers.

Picture of NVIDIA A2 vGPU installed in VMware ESXi Server
NVIDIA vGPU Installed in VMware ESXi Host

These virtual machines and containers can then use these vGPU’s to provide accelerated workloads including VDI (Virtual Desktop Infrastructure), AI (Artificial Intelligence), and ML (Machine Learning).

While the solution works beautifully, when deployed incorrectly or if the solution isn’t maintained, issues can occur requiring troubleshooting and remediation.

At the end of this blog post, you’ll find some additional (external) links and resources, which will assist further in troubleshooting.

Troubleshooting Index

Below, you’ll find a list of my most commonly used troubleshooting methods.

Please click on an item below which will take you directly to the section in this post.

Common Problems Index

Below is a list of problems and issues I commonly see customers experience or struggle with in their vGPU enabled VMware environments.

Please click on an item below which will take you directly to the section in this post.


Using “nvidia-smi”

The NVIDIA vGPU driver comes with a utility called the “NVIDIA System Management Interface”. This CLI program allows you to monitor, manage, and query your NVIDIA vGPU (including non-vGPU GPUs).

Screenshot of "nvidia-smi" command running on VMware ESXi host with NVIDIA GPU
NVIDIA vGPU “nvidia-smi” command

Simply running the command with no switches or flags, allow you to query and pull basic information on your vGPU, or multiple vGPUs.

For a list of available switches, you can run: “nvidia-smi -h”.

Running “nvidia-smi” on the ESXi Host

To use “nvidia-smi” on your VMware ESXi host, you’ll need to SSH in and/or enable console access.

When you launch “nvidia-smi” on the ESXi host, you’ll see information on the physical GPU, as well as the VM instances that are consuming a virtual GPU (vGPU). This usage will also provide information like fan speeds, temperatures, power usage and GPU utilization.

[root@ESXi-Host:~] nvidia-smi
Sat Mar  4 21:26:05 2023
| NVIDIA-SMI 525.85.07    Driver Version: 525.85.07    CUDA Version: N/A      |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A2           On   | 00000000:04:00.0 Off |                  Off |
|  0%   36C    P8     8W /  60W |   7808MiB / 16380MiB |      0%      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|    0   N/A  N/A   2108966    C+G   VM-WS02                          3904MiB |
|    0   N/A  N/A   2108989    C+G   VM-WS01                          3904MiB |

This will aid with troubleshooting potential issues specific to the host or the VM. The following pieces of information are helpful:

  • Driver Version
  • GPU Fan and Temperature Information
  • Power Usage
  • GPU Utilization (GPU-Util)
  • ECC Information and Error Count
  • Virtual Machine VMs assigned a vGPU
  • vGPU Type (C+G means Compute and Graphics)

Additionally, instead of running once, you can issue “nvidia-smi -l x” replacing “x” with the number of seconds you’d like it to auto-loop and refresh.


nvidia-smi -l 3

The above would refresh and loop “nvidia-smi” every 3 seconds.

For vGPU specific information from the ESXi host, you can run:

nvidia-smi vgpu
root@ESXi-Host:~] nvidia-smi vgpu
Mon Mar  6 11:47:44 2023
| NVIDIA-SMI 525.85.07              Driver Version: 525.85.07                 |
| GPU  Name                       | Bus-Id                       | GPU-Util   |
|      vGPU ID     Name           | VM ID     VM Name            | vGPU-Util  |
|   0  NVIDIA A2                  | 00000000:04:00.0             |   0%       |
|      3251713382  NVIDIA A2-4Q   | 2321577  VMWS01              |      0%    |

This command shows information on the vGPU instances currently provisioned.

There are also a number of switches you can throw at this to get even more information on vGPU including scheduling, vGPU types, accounting, and more. Run the following command to view the switches:

nvidia-smi vgpu -h

Another common switch I use on the ESXi host with vGPU for troubleshooting is: “nvidia-smi -q”, which provides lots of information on the physical GPU in the host:

[root@ESXi-HOST:~] nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Sat Mar  4 21:26:18 2023
Driver Version                            : 525.85.07
CUDA Version                              : Not Found
vGPU Driver Capability
        Heterogenous Multi-vGPU           : Supported

Attached GPUs                             : 1
GPU 00000000:04:00.0
    Product Name                          : NVIDIA A2
    Product Brand                         : NVIDIA
    Product Architecture                  : Ampere
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    vGPU Device Capability
        Fractional Multi-vGPU             : Not Supported
        Heterogeneous Time-Slice Profiles : Supported
        Heterogeneous Time-Slice Sizes    : Not Supported
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Enabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : XXXN0TY0SERIALZXXX
    GPU UUID                              : GPU-de23234-3450-6456-e12d-bfekgje82743a
    Minor Number                          : 0
    VBIOS Version                         : 94.07.5B.00.92
    MultiGPU Board                        : No
    Board ID                              : 0x400
    Board Part Number                     : XXX-XXXXX-XXXX-XXX
    GPU Part Number                       : XXXX-XXX-XX
    Module ID                             : 1
    Inforom Version
        Image Version                     : G179.0220.00.01
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Host VGPU
        Host VGPU Mode                    : SR-IOV
        Relaxed Ordering Mode             : N/A
        Bus                               : 0x04
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x25B610DE
        Bus Id                            : 00000000:04:00.0
        Sub System Id                     : 0x157E10DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
                Device Current            : 1
                Device Max                : 4
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : 0 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 16380 MiB
        Reserved                          : 264 MiB
        Used                              : 7808 MiB
        Free                              : 8306 MiB
    BAR1 Memory Usage
        Total                             : 16384 MiB
        Used                              : 1 MiB
        Free                              : 16383 MiB
    Compute Mode                          : Default
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Disabled
        Pending                           : Disabled
    ECC Errors
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 64 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
        GPU Current Temp                  : 37 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 96 C
        GPU Slowdown Temp                 : 93 C
        GPU Max Operating Temp            : 86 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 8.82 W
        Power Limit                       : 60.00 W
        Default Power Limit               : 60.00 W
        Enforced Power Limit              : 60.00 W
        Min Power Limit                   : 35.00 W
        Max Power Limit                   : 60.00 W
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 795 MHz
    Applications Clocks
        Graphics                          : 1770 MHz
        Memory                            : 6251 MHz
    Default Applications Clocks
        Graphics                          : 1770 MHz
        Memory                            : 6251 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1770 MHz
        SM                                : 1770 MHz
        Memory                            : 6251 MHz
        Video                             : 1650 MHz
    Max Customer Boost Clocks
        Graphics                          : 1770 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
        Graphics                          : 650.000 mV
        State                             : N/A
        Status                            : N/A
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2108966
            Type                          : C+G
            Name                          : VM-WS02
            Used GPU Memory               : 3904 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2108989
            Type                          : C+G
            Name                          : VM-WS01
            Used GPU Memory               : 3904 MiB

As you can see, you can pull quite a bit of information in detail from the vGPU, as well as the VM processes.

Running “nvidia-smi” on the VM Guest

You can also run “nvidia-smi” inside of the guest VM, which will provide you information on the vGPU instance that is being provided to that specific VM, along with information on the guest VM’s processes that are utilizing the GPU.

Screenshot of "nvidia-smi" running on guest virtual machine VM
“nvidia-smi” Running on Guest VM

This is helpful for providing information on the guest VM’s usage of the vGPU instance, as well as processes that require GPU usage.

Virtual Machine log files

Each Virtual Machine has a “vmware.log” file inside of the VM’s folder on the datastore.

To identify logging events pertaining to NVIDIA vGPU, you can search for the “vmiop” string inside of the vmware.log file.


cat /vmfs/volumes/DATASTORE/VirtualMachineName/vmware.log | grep -i vmiop

The above will read out any lines inside of the log that have the “vmiop” string inside of them. The “-i” flag instructs grep to ignore case sensitivity.

This logs provide initialization information, licensing information, as well as XID error codes and faults.

ESXi Host log files

Additionally, since the ESXi host is running the vGPU Host Driver (vGPU Manager), it also has logs that pertain and assist with vGPU troubleshooting.

Some commands you can run are:

cat /var/log/vmkernel.log | grep -i vmiop
cat /var/log/vmkernel.log | grep -i nvrm
cat /var/log/vmkernel.log | grep -i nvidia

The above commands will pull NVIDIA vGPU related log items from the ESXi log files.

Using “dxdiag” in the guest VM

Microsoft has a tool called “dxdiag” which provides diagnostic infromation for testing and troubleshooting video (and sound) with DirectX.

I find this tool very handy for quickly verifying

Microsoft DirectX "dxdiag" showing information on vGPU
NVIDIA vGPU with Microsoft DirectX “dxdiag” tool

As you can see:

  • DirectDraw Acceleration: Enabled
  • Direct3D Acceleration: Enabled
  • AGP Texture Acceleration: Enabled
  • DirectX 12 Ultimate: Enabled

The above show that hardware acceleration is fully functioning with DirectX. This is a indicator that things are generally working as expected. If you have a vGPU and one of the first three is showing as disabled, then you have a problem that requires troubleshooting. Additionally, if you do not see your vGPU card, then you have a problem that requires troubleshooting.

Please Note: You may not see “DirectX 12 Ultimate” as this is related to licensing.

Using the “VMware Horizon Performance Monitor”

The VMware Horizon Performance Monitor, is a great tool that can be installed by the VMware Horizon Agent, that allows you to pull information (stats, connection information, etc) for the session. Please note that this is not installed by default, and must be selected when running the Horizon Agent installer.

When it comes to troubleshooting vGPU, it’s handy to use this too to confirm you’re getting H.264 or H.265/HEVC offload from the vGPU instance, and also get information on how many FPS (Frames Per Second) you’re getting from the session.

VMware Horizon Performance Monitor showing vGPU NVIDIA NvEnc HEVC as encoder type
VMware Horizon Performance Tracker with NVIDIA vGPU

Once opening, you’ll change the view above using the specified selector, and you can see what the “Encoder Name” is being used to encode the session.

Examples of GPU Offload “Encoder Name” types:

  • NVIDIA NvEnc HEVC 4:2:0 – This is using the vGPU offload using HEVC
  • NVIDIA NvEnc HEVC 4:4:4 – This is using the vGPU offload using HEVC high color accuracy
  • NVIDIA NvEnc H264 4:2:0 – This is using the vGPU offload using H.264
  • NVIDIA NvEnc H264 4:4:4 – This is using the vGPU offload using H.264 high color accuracy

Examples of Software (CPU) Session “Encoder Name” types:

  • BlastCodec – New VMware Horizon “Blast Codec”
  • h264 4:2:0 – Software CPU encoded h.264

If you’re seeing “NVIDIA NvEnc” in the encoder name, then the encoding is being offloaded to the GPU resulting in optimum performance. If you don’t see it, it’s most likely using the CPU for encoding, which is not optimal if you have a vGPU, and requires further troubleshooting.

NVIDIA vGPU Known Issues

Depending on the version of vGPU that you are running, there can be “known issues”.

When viewing the NVIDIA vGPU Documentation, you can view known issues, and fixes that NVIDIA may provide. Please make sure to reference the documentation specific to the version you’re running and/or the version that fixes the issues you’re experiencing.

Common Problems

There are a number of common problems that I come across when I’m contacted to assist with vGPU deployments.

Please see below for some of the most common issues I experience, along with their applicable fix/workaround.

XID Error Codes

When viewing your Virtual Machine VM or ESXi log file, and experiencing an XID error or XID fault, you can usually look up the error codes.

Typically, vGPU errors will provide an “XiD Error” code, which can be looked up on NVIDIA’s Xid Messages page here: XID Errors :: GPU Deployment and Management Documentation (nvidia.com).

The table on this page allows you to lookup the XID code, find the cause, and also provides information if the issue is realted to “HW Error” (Hardware Error), “Driver Error”, “User App Error”, “System Memory Corruption”, “Bus Error”, “Thermal Issue”, or “FB Corruption”.

An example:

2023-02-26T23:33:24.396Z Er(02) vthread-2108265 - vmiop_log: (0x0): XID 45 detected on physical_chid:0x60f, guest_chid:0xf
2023-02-26T23:33:36.023Z Er(02) vthread-2108266 - vmiop_log: (0x0): Timeout occurred, reset initiated.
2023-02-26T23:33:36.023Z Er(02) vthread-2108266 - vmiop_log: (0x0): TDR_DUMP:0x52445456 0x00e207e8 0x000001cc 0x00000001
2023-02-26T23:33:36.023Z Er(02) vthread-2108266 - vmiop_log: (0x0): TDR_DUMP:0x00989680 0x00000000 0x000001bb 0x0000000f
2023-02-26T23:33:36.023Z Er(02) vthread-2108266 - vmiop_log: (0x0): TDR_DUMP:0x00000100 0x00000000 0x0000115e 0x00000000
2023-02-26T23:33:36.023Z Er(02) vthread-2108266 - vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00001600 0x00000000
2023-02-26T23:33:36.023Z Er(02) vthread-2108266 - vmiop_log: (0x0): TDR_DUMP:0x00002214 0x00000000 0x00000000 0x00000000

2023-02-26T23:33:36.024Z Er(02) vthread-2108266 - vmiop_log: (0x0): TDR_DUMP:0x64726148 0x00736964 0x00000000 0x00000000
2023-02-26T23:33:36.068Z Er(02) vthread-2108265 - vmiop_log: (0x0): XID 43 detected on physical_chid:0x600, guest_chid:0x0

One can see XID code 45, as well as XID code 43, which after looking up on NVIDIA’s document, states:

  • XID 43 – GPU stopped processing
    • Possible Cause: Driver Error
    • Possible Cause: User App Error
  • XID 45 – Preemptive cleanup, due to previous errors — Most likely to see when running multiple cuda applications and hitting a DBE
    • Possible Cause: Driver Error

In the situation above, one can deduce that the issue is either Driver Error, Application Error, or a combination of both. In this specific case, you could try changing drivers to troubleshoot.

vGPU Licensing

You may experience issues in your vGPU deployment due to licensing issues. Depending on how you have you environment configured, you may be running in an unlicensed mode and not be aware.

In the event that the vGPU driver cannot obtain a valid license, it will run for 20 minutes with full capabilities. After that the performance and functionality will start to degrade. After 24 hours it will degrade even further.

Some symptoms of issues experienced when unlicensed:

  • Users experiencing laggy VDI sessions
  • Performance issues
  • Frames per Second (FPS) limited to 15 fps or 3 fps
  • Applications using OpenCL, CUDA, or other accelerated APIs fail

Additionally, some error messages and event logs may occur:

  • Event ID 2, “NVIDIA OpenGL Driver” – “The NVIDIA OpenGL driver has not been able to initialize a connection with the GPU.”
  • AutoCAD/Revit – “Hardware Acceleration is disabled. Software emulation mode is in use.”
  • “Guest is unlicensed”

Please see below for screenshots of said errors:

Additonally, when looking at the Virtual Machine VM vmware.log (inside of the VM’s folder on the ESXi datastore), you may see:

Guest is unlicensed. Cannot allocate more than 0x55 channels!
VGPU message 6 failed, result code: 0x1a

If this occurs, you’ll need to troubleshoot your vGPU licensing and resolve any issues occurring.

vGPU Type (vGPU Profile) mismatch

When using the default (“time-sliced”) vGPU deployment method, only a single vGPU type can be used on virtual machines or containers per physical GPU. Essentially all VMs or containers utilizing the physical GPU must use the same vGPU type.

If the physical GPU card has multiple GPUs (GPU chips), then a different type can be used on each physical GPU chip on the same card. 2 x GPUs on a single card = 2 different vGPU types.

Additionally, if you have multiple cards inside of a single host, the number of vGPU types you can deployed is based off the total number of GPUs across the total number of cards in your host.

If you configure multiple vGPU types and cannot support it, you will have issues starting VMs, as shown below:

Cannot power on VM with vGPU due to insufficient resources
Cannot power on VM with vGPU: Power on Failure, Insuffiecient resources

The error reads as follows:

Power On Failures

vCenter Server was unable to find a suitable host to power on the following virtual machines for the reasons listed below.

Insufficient resources. One or more devices (pciPassthru0) required by VM VDIWS01 are not available on host ESXi-Host.

Additionally, if provisioning via VMware Horizon, you may see: “NVIDIA GRID vGPU Support has detected a mismatch with the supported vGPUs”

Note: If you are using MIG (Multi Instance GPU), this does not apply as different MIG types can be applied to VMs from the same card/GPU.

vGPU or Passthrough with 16GB+ of Video RAM Memory

When attaching a vGPU to a VM, or passing through a GPU to a VM, with 16GB or more of Video RAM (Framebuffer memory), you may run in to a situation where the VM will not boot.

This is because the VM cannot map that large of memory space to be accesible for use.

Please see my blog post GPU or vGPU Passthrough with 16GB+ of video memory, for more information as well as the fix.

vGPU VM Freezes during VMware vMotion

Your users may report issues where their VDI guest VM freezes for a period of time during use. This could be caused due to VMware vMotion moving the virtual machine from one VMware ESXi host to another.

Please see my blog post NVIDIA vGPU VM Freezes during VMware vMotion: vGPU STUN Time for more information.

“ERR!” State

When experiencing issues, you may notice that “nvidia-smi” throws “ERR!” in the view. See the example below:

nvidia-smi showing ERR! error state on VMware ESXi host with vGPU
NVIDIA vGPU “nvidia-smi” reporting “ERR!”

This is an indicator that you’re in a fault or error state, and would recommend checking the ESXi Host log files, and the Virtual Machine log files for XID codes to identify the problem.

vGPU Driver Mismatch

When vGPU is deployed, drivers are installed on the VMware ESXi host (vGPU Manager Driver), as well as the guest VM virtual machine (guest VM driver).

Guest VM vGPU driver mismatch with VMware ESXi host
NVIDIA vGPU Driver Mismatch

These two drivers must be compatible with each other. As per NVIDIA’s Documentation, see below for compatibility:

  • NVIDIA vGPU Manager with guest VM drivers from the same release
  • NVIDIA vGPU Manager with guest VM drivers from different releases within the same major release branch
  • NVIDIA vGPU Manager from a later major release branch with guest VM drivers from the previous branch

Additionally, if you’re using the LTS (Long Term Support Branch), the additional compatibility note applies.

  • NVIDIA vGPU Manager from a later long-term support branch with guest VM drivers from the previous long-term support branch

If you have a vGPU driver mismatch, you’ll likely see Event ID 160 from “nvlddmkm” reporting:

NVIDIA driver version mismatch error: Guest driver is incompatible with host drive.

To resolve this, you’ll need to change drivers on the ESXi host and/or Guest VM to a supported combination.

Upgrading NVIDIA vGPU

When upgrading NVIDIA vGPU drivers on the host, you may experience issues or errors stating that the NVIDIA vGPU modules or services are loaded and in use, stopping your ability to upgrade.

Normally an upgrade would be preformed by placing the host in maintenance mode and running:

esxcli software vib update -d /vmfs/volumes/DATASTORE/Files/vGPU-15/NVD-VGPU-702_525.85.07-1OEM.702.0.0.17630552_21166599.zip

However, this fails due to modules that are loaded and in use by the NVIDIA vGPU Manager Services.

Before attempting to upgrade (or uninstall and re-install), place the host in maintenance mode and run the following command:

/etc/init.d/nvdGpuMgmtDaemon stop

This should allow you to proceed with the upgrade and/or re-install.

VMware Horizon Black Screen

If you experiencing a blank or black screen when connecting to a VDI session with an NVIDIA vGPU on VMware Horizon, it may not even be related to the vGPU deployment.

To troubleshoot the VMware Horizon Black Screen, please review my guide on how to troubleshoot a VMware Horizon Blank Screen.

VM High CPU RDY (High CPU Ready)

CPU RDY (CPU Ready) is a state when a VM is ready and waiting to be scheduled on a physical host’s CPU. In more detail, the VM’s vCPUs are ready to be scheduled on the ESXi host’s pCPUs.

In rare cases, I have observed situations where VMs with a vGPU and high CPU RDY times, experience instability. I believe this is due to timing conflicts with the vGPU’s time slicing, and the VM’s CPU waiting to be scheduled.

To check VM CPU RDY, you can use one of the following methods:

  1. Run “esxtop” from the CLI using the console or SSH
  2. View the hosts performance stats on vCenter
    • Select host, “Monitor”, “Advanced”, “Chart Options”, de-select all, select “Readiness Average %”

When viewing the CPU RDY time in a VDI environment, generally we’d like to see CPU RDY at 3 or lower. Anything higher than 3 may cause latency or user experience issues, or even vGPU issues at higher values.

For your server virtualization environment (non-VDI and no vGPU), CPU Ready times are not as big of a consideration.

vGPU Profiles Missing from VMware Horizon

When using newer GPUs with older versions of VMware Horizon, you may encounter an issue with non-persistent instant clones resulting in a provisioning error.

This is caused by missing vGPU Types or vGPU Profiles, and requires either downloading the latest definitions, or possibly creating your own.

For more information on this issue, please see my post NVIDIA A2 vGPU Profiles Missing from VMware Horizon causing provision failure.

Please see these these additional external links and resources which may assist.

Feb 252023

When using VMware vSphere, you may notice vCenter OVF Import and Datastore File Access Issues, when performing various tasks with OVF Imports, as well as uploading and/or downloading files from datastores.

These issues can cause a number of symptoms including errors, unexpected status codes, and also just simply failing for an undetermined reason.

vCenter File Upload failed error "The Operation failed."
vCenter File Upload: The Operation failed.

The Problem

For this situation, the symptoms will occur when performing one of the following tasks:

  • Cannot Upload File to datastore
  • Cannot Download File from datastore
  • Cannot Import OVF Template
  • Cannot Export OVF Template

An example of errors that the user may see:

  • The operation failed for an undetermined reason.
  • The operation failed.
  • unexpectedStatusCode":0
  • unexpectedStatusCode (0)
  • HTTP 500 Error

See below for some example screenshots of errors you may see.

vCenter Error: "The operation failed for an undetermined reason."
The Operation failed: The Operation failed for an undetermined reason.
Chrome and vCenter report "NET::ERR_CERT_AUTHORITY_INVALID" error

Please note, that this condition can cause other issues and errors as well.

The Solution

When using VMware vSphere, the vCenter server acts as it’s own Root Certification Authority, and uses SSL certificates to facilitate communication and encryption between various services in the solution, as well as the communication between the vCenter Server, ESXi hosts, and any client computers accessing vCenter via the web HTML5 interface.

This Root Certification Authority running on the vCenter Server creates and issues certificates to these services and hosts, which are issues under the Root CA Certificate.

While vCenter automatically handles the certificate trusts between the services, as well as the communicate between the vCenter Server and ESXi hosts (this is automatically setup when adding hosts to vCenter), it cannot automatically make your (client) computer trust the entire certificate authority, as well as all the child certificates.

To resolve this issue, you’ll need to download and install the trusted root CA certificates that belong to your vCenter server:

  1. Open your web browser to the FQDN of your vCenter server (do not go to the login page).
  2. Right click on “Download trusted root CA certificates”, and click on save link as.
    Link to download vCenter trusted root CA Certificates
  3. Save this ZIP file to your computer, and extract the archive file (you must extract it first).
  4. Navigate through the applicable folders (certs/win in my case) and locate the certificates.
    Screenshot of vCenter Server Root CA Certificate files
  5. For each file that has the type of “Security Certificate”, right click on it and choose “Install Certificate”.
  6. Change “Store Location” to “Local Machine”
  7. Choose “Place all certificates in the following store”, click Browse, and select “Trusted Root Certification Authorities”.
    Screenshot to Place in Trusted Root Certification Authorities
  8. Finish the wizard, and you will get the acknowledgement “The import was successful.”
  9. Repeat this for each file in that folder with the type of “Security Certificate”.

You’ll have to close all web browser instances and reload the vCenter UI, however you should now be able to successfully upload and download files from the datastores, and also import and export OVF files.

Additionally, you should no longer receive any SSL errors when connecting to your vCenter server.

Jan 112023
HPE Simplivity Logo

When attempting to log in to your VMware vCenter using the HPE Simplivity Upgrade Manager to perform an upgrade on your Simplivity Infrastructure, the login may fail with Access Denied, Incorrect Credentials, or Incorrect Username and Password.

Despite confirming that the credentials are correct (logging in to the vCenter UI, as well as the vCSA console via SSH), the HPE Simplivity Upgrade Manager will continue to fail on connection.

The Problem

During the login process, the HPE Simplivity Upgrade Manager will not only check the credentials and attempt to logon to the vCenter server, but it will also attempt to pull and validate the SSL certificates (whether trusted or not) on the vCenter server.

During the typical login process, after entering the credentials and clicking “Connect”, the user will be prompted with the SSL certificate information asking to approve the connection. In this specific circumstance the SSL window is not presented.

HPE Simplivity Upgrade Manager Login Failed

Because of the SSL check not being presented, I thought there may have been a chance with trusting the connection, and possibly HPE Simplivity wasn’t able to show the error specific to the SSL check failing.

vCenter Download Trusted Root CA Certificates

When clicking on this, I was presented with an HTTP 404 error (File not found), meaning the certficiates weren’t present, which I felt may be contributing or causing this problem.

The Solution

After doing a quick search, I was able to find a VMware KB 89325 addressing the issue of being Unable to download Trusted Root Certificates for VCSA because it shows 0kb file.

Logging in to the vCSA appliance, I was able to determine that the appliance was missing the certificate symlink to allow the certificate download by running this command:

ls -ltra /etc/vmware-vpx/docRoot

Inside of the directory listing, there was no symlink for certs, which should point to “/var/lib/vmware-vpx/docRoot/certs”.

I went ahead and created the symlink using the following command:

ln -sfn /var/lib/vmware-vpx/docRoot/certs /etc/vmware-vpx/docRoot/certs

When using the “ls -ltra /etc/vmware-vpx/docRoot” command from above, I was now able to verify that the symlink existed:

vCenter DocRoot showing “certs” symbolic link

After creating the symlink, I was able to download the Trusted Root CA zip file (you don’t need to do anything with this file as the download was just a test).

I now went back to the Upgrade Manager to attempt to login, and it was successful.

Jan 082023
VMware vSAN All VMs Inaccessible

When using VMware vSAN 7.0 Update 3 (7U3) and using the graceful shutdown (and restart) of your entire vSAN cluster, you may experience an issue resulting with all VMs inaccessible after everything has been powered back on and the hosts taken out of maintenance mode.

If you experience this issue, you will also notice that your vSAN datastore appears to be empty (files and VMs), however you can see that there is data used on the datastore (data usage calculation).

The Problem

As of vSAN 7.0 Update 3, users can now gracefully shutdown and restart their entire vSAN cluster from the GUI instead of having to use the CLI/SSH. While you can still Manually Shut Down and Restart the vSAN Cluster, as one can expect if there’s any easy way to do it via the GUI, it’ll get used.

Last night I had a customer call who used this feature, and when bringing up their cluster, all the VMs were marked as inaccessible and the datastore appeared to be empty. What was even more odd is that all the vSAN health information pertaining to the disks looked good.

Connecting to troubleshoot this (with my limited experience with vSAN), I attempted the following:

  • Restart vSAN Management Services on all ESXi Hosts
  • Restart vSAN Health Services on the vCenter vCSA (then wait 15 minutes and restart ESXi vSAN Manage Services)
  • Restart one of the ESXi hosts (to troubleshoot quorum)
  • Troubleshoot Networking (Issues occurred after physical maintenance)
    • Check MTUs
    • Check LAGs (for vSAN Storage Network)
    • Check Communication and Traffic

After doing all of the above, the VMs still were not accessible.

I had a feeling that this was related to the shutdown and restart (power on) process, so tried to manually start the vSAN cluster using the following command:

python /usr/lib/vmware/vsan/bin/reboot_helper.py recover

This command returned numerous tracebacks, and ultimately timed out after reporting:

Recovery is not ready, retry after 10s...

The Solution

I was convinced this was related to a bug in the automated scripts, so after adjusting my searching, I came across a VMware KB providing information on How to handle inconsistent cluster power status in vSAN shutdown workflow.

I was convinced this would help our issue, however the KB didn’t exactly describe the symptoms and errors we had. Scenario 3 was close, but symptoms were not exact.

At this point, I initiated a VMware Support ticket with VMware GSS, who after checking, confirmed it was the issue in the KB.

The Shutdown script sets “DOMPauseAllCCPs” to 1 (pausing all functions), and “IgnoreClusterMemberListUpdates” to 1. When you choose to Restart and Power on the cluster, these get set back to 0.

In our case, “IgnoreClusterMemberListUpdates” was set back to 0 during the restart and power on, however “DOMPauseAllCCPs” was still set to 1.

After setting DOMPauseAllCCPs” to “0” on all hosts, the VM’s were immediately accessibly, and the issue was resolved.

To check these variables:

esxcfg-advcfg -g /VSAN/DOMPauseAllCCPs
esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates

To set these variables (to undo what the shutdown script did):

esxcfg-advcfg -s 0 /VSAN/DOMPauseAllCCPs
esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates

When checking or setting these, you must do it on all vSAN nodes (ESXi hosts) in the vSAN Cluster.

Nov 202022

Today I want to talk about Memory Deduplication on ESXi with Transparent Page Sharing (TPS). This is a technology that isn’t widely known about, even amongst IT professionals with significant experience with VMware products.

And you may ask “Memory Deduplication, why aren’t we using this?!?” as it sounds like a pretty cool piece of technology… Well, I’m about to tell you why you’re not (Inter-VM), and share a few examples of where you would want to enable this.

I also want to show you how to enable TPS globally (Inter-VM), and also discuss TPS being used with VMware Horizon and VDI.

What is Transparent Page Sharing (TPS)?

Transparent Page Sharing is the process in which ESXi can provide memory deduplication by storing duplicate memory pages as a single page on the physical memory of the host. This process stops the system from storing redundant memory pages, and thus frees up physical memory for other uses.

If my memory serves me right, this was originally enabled by default in ESX/ESXi version 5, but was later globally disabled due to security vulnerabilities and concerns.

Note, that TPS is still enabled by default from within the same VM, even today with ESXi 8.

Security Concerns

I recall two potential scenarios and security concerns which led to VMware changing the original default behavior for TPS.

  • Scenario 1 included a concern about an attacker gaining access to a VM, and then having the ability to modify the memory contents of another VM.
  • Scenario 2 included a concern where an attacker may be able to get access to encryption keys used on another system.

A quick search led to a KB titled “Security considerations and disallowing inter-Virtual Machine Transparent Page Sharing (2080735)“, which outlines the details of scenario 2, along with stating “This technique works only in a highly controlled system configured in a non-standard way that VMware believes would not be recreated in a production environment”.

With that being said, it sounds like this would be an extremely difficult attack that requires systems to be configured in a non-standard way.

Current status of TPS

Believe it or not, TPS and memory deduplication is still enabled, however it’s only deduplicating pages from within the same VM. TPS is not deduplicating pages from multiple VMs.

Additionally, VMware has given us controls to configure TPS to allow it amongst multiple VMs, or even enable it globally across the ESXi host.

See below for the settings to configure TPS on ESXi via “Advanced Settings”:

A table providing configurable options for Transparent Page Sharing (TPS) on VMware vSphere ESXi
Transparent Page Sharing (TPS) Settings for ESXi Host

The above table was provided by VMware’s “Additional Transparent Page Sharing management capabilities and new default settings (2097593)” KB.

In short, you could enable TPS globally (Inter-VM) by setting “Mem.ShareForceSalting” in “Advanced Settings”, to a value of “0”. You can also use the salting to configure groups of VMs that are allow to share memory pages.

Additionally, you can tweak the behavior of TPS by modifying some of the settings shown below:

TPS Memory Sharing Settings

As you can see you can configure things like the scanning occurrence (Mem.ShareScanTime) of how often the system will check for memory pages that can be shared/deduplicated and other settings.

TPS is enabled, but not working

So, you may have decided to enable TPS in your environment, but you’re noticing that either no, or very little memory pages are being marked as shared.

ESXi Memory Graph showing Memory Deduplication from TPS
TPS Memory Deduplication – Amount of host physical memory that backs shared guest physical memory

In the example above, you’ll notice that on a loaded host, with TPS enabled globally (Inter-VM, amongst all VMs), that the host is only deduplicating 1,052KB of memory.

This is because you will most often only see TPS being heavily utilized on an ESXi host that has over-committed memory, there’s also a chance that you simply don’t have enough memory pages that can be duplicated.

Memory Deduplication, TPS, and VMware Horizon VDI

Because VMware Horizon utilizes the “vmfork” with “Just-in-Time” desktop delivery, non-persistent VDI will benefit from some level of memory deduplication by default when using Instant Clones with non-persistent VDI. This is because non-persistent VDI guests are spawned from a running base image.

Additionally, you can further implement, enable, and configure TPS by configuring some Transparent Page Sharing options inside of the VMware Horizon Administration console.

When creating a Desktop Pool, you can set the “Transparent Page Sharing” open to “Virtual Machine” (Memory dedupe inside of the VM only), “Pool” (Memory dedupe across the Desktop Pool), “Pod” (Dedupe across the pod), or “Global” (Full Inter-VM memory deduplication across the ESXi host).

If you enabled TPS on the ESXi host globally, these settings are null and not used.

TPS Use Cases

So you might be asking when it’s a good time to use TPS?

  • The Homelab – When is a homelab not a good reason to try something? Looking to save some memory and overcommit memory resources? Implement TPS.
  • VDI Environments – On highly dense hosts, you may consider implementing TPS at some level to maximize the utilization of resources, however you must be aware of the security consequences and factor this in when configuring TPS.
  • Environments with no Sensitive Information – It’s hard to imagine, but if you have an environment that doesn’t contain any sensitive information and doesn’t use any security keys, it would be suitable to enable TPS.

I’m sure there’s a number of other use cases, so leave a comment if you can think of one.


In my opinion Transparent Page Sharing is a technology that should not be forgotten and discarded. VMware admins should be aware of it, how to configure it, and what the implications are of using it.

If you are considering enabling TPS in your environment, you must factor in the potential security consequences of doing so.