In this guide I’ll show you how to deploy and install the new Teams for VDI (Virtual Desktop Infrastructure) client, and how to enable Teams Media Optimization on VMware Horizon.
If you haven’t already, review my old post Microsoft Teams VDI Optimization for VMware Horizon, which covered the old client and its optimizations. Like the old client, the new Microsoft Teams app requires special considerations and installation instructions for VDI environments.
You can run the old and new Teams applications side by side in your environment as you transition users.
Switch between New Teams and old Teams on VDI
Let’s cover what the new Microsoft Teams app is about, and how to install it in your VDI deployment.
Please note that while the new Teams client is in GA, VDI support is in public preview.
New Microsoft Teams app VDI optimized with Toggle for new/old version
Ultimately, the new client is way faster and consumes way less memory. And fortunately for us, it supports media optimizations for VDI environments.
My close friend and colleague, Mobile Jon, did a fantastic in-depth Deep Dive into the New Microsoft Teams and its inner workings that I highly recommend reading.
Interestingly enough, it uses the same media optimization channels for VDI as the old client did, so enabling it and/or migrating from the old version is very simple if you’re running VMware Horizon, Citrix, AVD, or Windows 365.
Install New Teams for VDI
While installing the new Teams is fairly simple for non-VDI environments (simply enable the new version in the Teams Admin portal, or use your application manager to deploy the installer), a special method is required to deploy it on your VDI images, whether persistent or non-persistent.
Please Note: At the time of this post, the new Teams client auto-updates, even when using the bulk machine mode with the bootstrapper installer. In the future, Microsoft will be releasing a registry key to disable auto-updates for non-persistent VDI golden images.
You will also need to enable Microsoft Teams Media Optimization for the VDI platform you are using (in my case and example, VMware Horizon).
Considerations for New Teams on VDI
At this time, you cannot disable auto-updates (this can be disabled via a registry key in the future, listed as “coming soon”).
The new Teams client app uses the same VDI media optimization channels as the old Teams client (for VMware Horizon, Citrix, AVD, and W365).
If you have already enabled Media Optimization for Teams on VDI for the old version, you can simply install the client using the special bulk installer for all users as shown below, as the new client uses the existing media optimizations.
While it is recommended to uninstall the old client and install the new client, you can choose to run both versions side by side together, providing an option to your users as to which version they would like to use.
Enable Media Optimization for Microsoft Teams
If you haven’t previously for the old client, you’ll need to enable the Teams Media Optimizations for VDI for your VDI platform.
For VMware Horizon, we’ll create a GPO and set “Enable HTML5 Features” and “Enable Media Optimization for Microsoft Teams” to “Enabled”. If you have already done this for the old Teams app, you can skip this step.
Computer Configuration -> Policies -> Administrative Templates -> VMware View Agent Configuration -> VMware HTML5 Features -> VMware WebRTC Redirection Features -> Enable Media Optimization for Microsoft Teams
When installing the VMware Horizon client on Windows computers, you’ll need to make sure you check and enable the “Media Optimization for Microsoft Teams” option on the installer.
VMware Horizon Client Install with Media Optimization for Microsoft Teams
If you are using a thin client or zero client, you’ll need to make sure you have the required firmware version installed, along with any applicable vendor plugins installed and/or configuration options enabled.
Install New Teams on VDI
We will now install the new Teams app on to both non-persistent images and persistent VDI VM guests:
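For reference, here’s a minimal provisioning run using the “-p” (provision) flag documented in the bootstrapper’s help output further below. Run it from an elevated command prompt on the image; the download path shown is simply where I saved the bootstrapper:

C:\Users\Administrator.DOMAIN\Downloads>teamsbootstrapper.exe -p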
You’ll note that running the command returns a success value of true, and Teams is now installed for all users on this machine.
Additionally, the Teams bootstrapper utility can also remove Teams for all users on this machine by using the “-x” flag. Please see below for all the options for “teamsbootstrapper.exe”:
C:\Users\Administrator.DOMAIN\Downloads>teamsbootstrapper.exe --help
Provisioning program for Microsoft Teams.
Usage: teamsbootstrapper.exe [OPTIONS]
Options:
-p, --provision-admin Provision Teams for all users on this machine.
-x, --deprovision-admin Remove Teams for all users on this machine.
-h, --help Print help
Install New Teams on VMware App Volumes / Citrix App Layering
The New Teams bootstrapper appears to bypass (and does not work with) app packaging and app-attach technologies such as VMware App Volumes and Citrix App Layering.
The New Teams bootstrapper downloads and installs an MSIX app package to the computer running the bootstrapper.
To deploy and install new Teams on VMware App Volumes or Citrix App Layering (or other app packaging technologies), you’ll most likely need to download and import the MSIX package into the application manager and deploy it using that.
Conclusion
It’s great news that we finally have a better performing Microsoft Teams client that supports VDI optimizations. However, it’s unfortunate that we cannot yet disable the auto-update functionality for non-persistent environments, as we do not want update processes running in vain and consuming resources.
I would recommend testing the new Teams client in your VDI environment before deploying to production.
In May of 2023, NVIDIA released the NVIDIA GPU Manager for VMware vCenter. This appliance allows you to manage your NVIDIA vGPU Drivers for your VMware vSphere environment.
Since the release, I’ve had a chance to deploy it, test it, and use it, and want to share my findings.
In this post, I’ll cover what the NVIDIA GPU Manager is, how to deploy and configure it, and how to use it to deploy vGPU drivers to your ESXi hosts.
The NVIDIA GPU Manager is an appliance (delivered as an OVA) that you can deploy in your VMware vSphere infrastructure (using vCenter and ESXi) to act as a driver (and update) repository for vLCM (vSphere Lifecycle Manager).
In addition to acting as a repo for vLCM, it also installs a plugin on your vCenter that provides a GUI for browsing, selecting, and downloading NVIDIA vGPU host drivers to the local repo running on the appliance. These updates can then be deployed using LCM to your hosts.
In short, this allows you to easily select, download, and deploy specific NVIDIA vGPU drivers to your ESXi hosts using vLCM baselines or images, simplifying the entire process.
Supported vSphere Versions
The NVIDIA GPU Manager supports the following vSphere releases (vCenter and ESXi):
VMware vSphere 8.0 (and later)
VMware vSphere 7.0U2 (and later)
The NVIDIA GPU Manager supports vGPU driver releases 15.1 and later, including the new vGPU 16 release version.
How to deploy and configure the NVIDIA GPU Manager for VMware vCenter
To deploy the NVIDIA GPU Manager Appliance, we have to download an OVA (from NVIDIA’s website), then deploy and configure it.
See below for the step by step instructions:
Download the NVIDIA GPU Manager
Log on to the NVIDIA Application Hub, and navigate to the “NVIDIA Licensing Portal” (https://nvid.nvidia.com).
Navigate to “Software Downloads” and select “Non-Driver Downloads”
Change the filter to “VMware vCenter” (there is both VMware vSphere and VMware vCenter, so pay attention to select the correct one).
To the right of “NVIDIA GPU Manager Plug-in 1.0.0 for VMware vCenter”, click “Download” (see below screenshot).
NVIDIA GPU Manager Download Page
After downloading the package and extracting, you should be left with the OVA, along with Release Notes, and the User Guide. I highly recommend reviewing the documentation at your leisure.
Deploy and Configure the NVIDIA GPU Manager
We will now deploy the NVIDIA GPU Manager OVA appliance:
Deploy the OVA to either a cluster with DRS, or a specific ESXi host. In vCenter either right click a cluster or host, and select “Deploy OVF Template”. Choose the GPU Manager OVA file, and continue with the wizard.
Configure Networking for the Appliance
You’ll need to assign an IP Address, and relevant networking information.
I always recommend creating DNS (forward and reverse entries) for the IP.
Finally, power on the appliance.
We must now create a role and service account that the GPU Manager will use to connect to the vCenter server.
While the vCenter Administrator account will work, I highly recommend creating a service account specifically for the GPU Manager that only has the required permissions that are necessary for it to function.
Log on to your vCenter Server
Click on the hamburger menu item on the top left, and open “Administration”.
Under “Access Control” select Roles.
Select New to create a new role. We can call it “NVIDIA Update Services”.
***PLEASE NOTE: The above permissions were provided in the documentation and did not work for me (resulted in an insufficient privileges error). To resolve this, I chose “Select All” for “VMware vSphere Lifecycle Manager”, which resolved the issue.***
Save the Role
On the left hand side, navigate to “Users and Groups” under “Single Sign On”
Change the domain to your local vSphere SSO domain (vsphere.local by default)
Create a new user account for the NVIDIA appliance, as an example you could use “nvidia-svc”, and choose a secure password.
Navigate to “Global Permissions” on the left hand side, and click “Add” to create a new permission.
Set the domain, and choose the new “nvidia-svc” service account we created, and set the role to “NVIDIA Update Services”, and check “Propagate to Children”.
You have now configured the service account.
Now, we will perform the initial configuration of the appliance. To configure the appliance, we must do the following:
Access the appliance using your browser and the IP you configured above (or FQDN)
Create a new password for the administrative “vcp_admin” account. This account will be used to manage the appliance.
A secret key will be generated that will allow the password to be reset, if required. Save this key somewhere safe.
We must now register the appliance (and plugin) with our vCenter Server. Click on “REGISTER”.
Enter the FQDN or IP of your vCenter server, the NVIDIA Service account (“nvidia-svc” from example), and password.
Once the GPU Manager is registered with your vCenter server, the remainder of the configuration will be completed from the vCenter GUI.
The registration process will install the GPU Manager Plugin into VMware vCenter.
The registration process will also configure a repository in LCM (this repo is being hosted on the GPU manager appliance).
We must now configure an API key on the NVIDIA Licensing portal, to allow your GPU Manager to download updates on your behalf.
Open your browser and navigate to https://nvid.nvidia.com. Then select “NVIDIA LICENSING PORTAL”. Login using your credentials.
On the left hand side, select “API Keys”.
On the upper right hand, select “CREATE API KEY”.
Give the key a name, and for access type choose “Software Downloads”. I would recommend extending the key’s validity period, or disabling key expiration.
The key should now be created.
Click on “view api key”, and record the key. You’ll need to enter this later into the vCenter GPU Manager plugin.
And now we can finally log on to the vCenter interface, and perform the final configuration for the appliance.
Log on to the vCenter client, click on the hamburger menu, and select “NVIDIA GPU Manager”.
Enter the API key you created above in to the “NVIDIA Licensing Portal API Key” field, and select “Apply”.
The appliance should now be fully configured and activated.
Configuration is complete.
We have now fully deployed and completed the base configuration for the NVIDIA GPU Manager.
Using the NVIDIA GPU Manager to manage, update, and deploy vGPU drivers to ESXi hosts
In this section, I’ll provide an overview of how to use the NVIDIA GPU Manager to manage, update, and deploy vGPU drivers to ESXi hosts. But first, let’s go over the workflow…
The workflow is a simple one:
Using the vCenter client plugin, you choose the drivers you want to deploy. These get downloaded to the repo on the GPU Manager appliance, and are made available to Lifecycle Manager.
You then use Lifecycle Manager to deploy the vGPU Host Drivers to the applicable hosts, using baselines or images.
As you can see, there’s not much to it, despite all the configuration we had to do above. While it is very simple, it simplifies management quite a bit, especially if you’re using images with Lifecycle Manager.
To choose and download the drivers, load up the plugin, use the filters to filter the list, and select your driver to download.
NVIDIA GPU Manager downloading vGPU Driver
As you can see in the example, I chose to download the vGPU 15.3 host driver. Once completed, it’ll be made available in the repo being hosted on the appliance.
Once LCM has had a chance to sync with the updated repo, the driver is then made available to be deployed. You can then deploy it using baselines or host images.
LCM Image Update with NVIDIA vGPU Driver from NVIDIA GPU Manager
In the example above, I added the vGPU 16 (535.54.06) host driver to my cluster’s update image, which I will then remediate and deploy to all the hosts in that cluster. The vGPU driver was made available from the download using GPU Manager.
With the release of VMware Horizon 2303, VMware Horizon now supports Hybrid Azure AD Join with Azure AD Connect using Instant Clones and non-persistent VDI.
So what exactly does this mean? It means you can now use Azure SSO with a PRT (Primary Refresh Token) to authenticate and access on-premises and cloud-based applications and resources.
What else? It allows you to use conditional access!
What is Hybrid Azure AD Join, and why would we want to do it with Azure AD Connect?
Historically, it was a bit challenging when it came to Understanding Microsoft Azure AD SSO with VDI (click to read the post and/or see the video), and special considerations had to be made when an organization wished to implement SSO between their on-prem non-persistent VDI deployment and Azure AD.
Hybrid Azure AD Joined Login
Azure AD SSO, the old way
The old way to accomplish this was to either implement Azure AD with ADFS, or use Seamless SSO. ADFS was bulky and annoying to manage, and Seamless SSO was actually intended to enable SSO on “downlevel devices” (older operating systems before Windows 10).
For customers without ADFS, I would always recommend using Seamless SSO to enable SSO on non-persistent VDI Instant Clones, until now!
Azure AD SSO, the new way with Azure AD Connect and Azure SSO PRTs
Hybrid Azure Active Directory for SSO is now supported on instant clone desktop pools. See KB 89127 for details.
This means we can now enable and use Azure SSO with PRTs (Primary Refresh Tokens) using Azure AD Connect and non-persistent VDI Instant Clones.
Azure SSO with PRT and Non-Persistent VDI
This is actually a huge deal because not only does it allow us to use the preferred method for performing SSO with Azure, but it also allows us to start using fancy Azure features like conditional access!
Requirements for Hybrid Azure AD Join with non-persistent VDI and Azure AD Connect
In order to utilize Hybrid Join and PRTs with non-persistent VDI on Horizon, you’ll need the following:
VMware Horizon 2303 (or later)
Active Directory
Azure AD Connect (Implemented, Configured, and Functioning)
Azure AD Hybrid Domain Join must be enabled
OU and Object filtering must include the non-persistent computer objects and computer accounts
Create a VMware Horizon Non-Persistent Desktop Pool for Instant Clones
“Allow Reuse of Existing Computer Accounts” must be checked
When you configure this, you’ll notice that after provisioning a desktop pool (or pushing a new snapshot), there may be a delay before PRTs are issued. This is expected; however, the PRT will be issued eventually, and subsequent desktops shouldn’t experience issues unless you have a limited number of desktops available.
*Please note: VMware still notes that ADFS is the preferred way for fast issuance of the PRT.
While VMware does recommend ADFS for performance when issuing PRTs, in my own testing I had no problems or complaints. However, when deploying this in production, because of the PRT delay after deploying the pool or pushing a new snapshot, I’d recommend doing so after hours; otherwise SSO will not function for some users who immediately get a new desktop.
Additional Considerations
Please note the following:
When switching from ADFS to Azure AD Connect, the sign-in process may change for users.
You must prepare the users for the change.
When using locally stored identities and/or cached credentials, enabling Azure SSO may change the login process, or cause issues for users signing in.
You may have to delete saved credentials in the user’s persistent profile
You may have to adjust GPOs to account for Azure SSO
You may have to modify settings in your profile persistence solution
Example: “RoamIdentity” on FSLogix
I recommend testing before implementing
Test Environment
Test with new/blank user profiles
Test with existing users
If you’re coming from an environment that was previously using Seamless SSO for non-persistent VDI, you can create new test desktop pools that use newly created Active Directory OU containers and adjust the OU filtering appropriately to include the test OUs for synchronization to Azure AD with Azure AD Connect. This way you’re only syncing the test desktop pool, while allowing Seamless SSO to continue to function for existing desktop pools.
How to test Azure AD Hybrid Join, SSO, and PRT
To test the current status of Azure AD Hybrid Join, SSO, and PRT, you can use the following command:
dsregcmd /status
To check if the OS is Hybrid Domain joined, you’ll see the following:
+----------------------------------------------------------------------+
| Device State |
+----------------------------------------------------------------------+
AzureAdJoined : YES
EnterpriseJoined : NO
DomainJoined : YES
DomainName : DOMAIN
As you can see above, “AzureADJoined” is “YES”.
Further down the output, you’ll find information related to SSO and PRT Status:
+----------------------------------------------------------------------+
| SSO State |
+----------------------------------------------------------------------+
AzureAdPrt : YES
AzureAdPrtUpdateTime : 2023-07-23 19:46:19.000 UTC
AzureAdPrtExpiryTime : 2023-08-06 19:46:18.000 UTC
AzureAdPrtAuthority : https://login.microsoftonline.com/XXXXXXXX-XXXX-XXXXXXX
EnterprisePrt : NO
EnterprisePrtAuthority :
OnPremTgt : NO
CloudTgt : YES
KerbTopLevelNames : XXXXXXXXXXXXX
Here we can see that “AzureAdPrt” is YES which means we have a valid Primary Refresh Token issued by Azure AD SSO because of the Hybrid Join.
In this NVIDIA vGPU Troubleshooting Guide, I’ll help show you how to troubleshoot vGPU issues on VMware platforms, including VMware Horizon and VMware Tanzu. This guide applies to the full vGPU platform, so it’s relevant for VDI, AI, ML, and Kubernetes workloads.
This guide will provide common troubleshooting methods, along with common issues and problems associated with NVIDIA vGPU as well as their fixes.
Please note, there are numerous other additional methods available to troubleshoot your NVIDIA vGPU deployment, including 3rd party tools. This is a general document provided as a means to get started learning how to troubleshoot vGPU.
NVIDIA vGPU
NVIDIA vGPU is a technology platform that includes a product line of GPUs that provide virtualized GPUs (vGPU) for Virtualization environments. Using a vGPU, you can essentially “slice” up a physical GPU and distribute Virtual GPUs to a number of Virtual Machines and/or Kubernetes containers.
NVIDIA vGPU Installed in VMware ESXi Host
These virtual machines and containers can then use these vGPUs to run accelerated workloads including VDI (Virtual Desktop Infrastructure), AI (Artificial Intelligence), and ML (Machine Learning).
While the solution works beautifully, when deployed incorrectly or if the solution isn’t maintained, issues can occur requiring troubleshooting and remediation.
The NVIDIA vGPU driver comes with a utility called the “NVIDIA System Management Interface”. This CLI program allows you to monitor, manage, and query your NVIDIA vGPU (including non-vGPU GPUs).
NVIDIA vGPU “nvidia-smi” command
Simply running the command with no switches or flags allows you to query and pull basic information on your vGPU, or multiple vGPUs.
For a list of available switches, you can run: “nvidia-smi -h”.
Running “nvidia-smi” on the ESXi Host
To use “nvidia-smi” on your VMware ESXi host, you’ll need to SSH in and/or enable console access.
When you launch “nvidia-smi” on the ESXi host, you’ll see information on the physical GPU, as well as the VM instances that are consuming a virtual GPU (vGPU). The output also provides information like fan speed, temperature, power usage, and GPU utilization.
[root@ESXi-Host:~] nvidia-smi
Sat Mar 4 21:26:05 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.07 Driver Version: 525.85.07 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A2 On | 00000000:04:00.0 Off | Off |
| 0% 36C P8 8W / 60W | 7808MiB / 16380MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2108966 C+G VM-WS02 3904MiB |
| 0 N/A N/A 2108989 C+G VM-WS01 3904MiB |
+-----------------------------------------------------------------------------+
This will aid with troubleshooting potential issues specific to the host or the VM. The following pieces of information are helpful:
Driver Version
GPU Fan and Temperature Information
Power Usage
GPU Utilization (GPU-Util)
ECC Information and Error Count
Virtual Machines (VMs) assigned a vGPU
vGPU Type (C+G means Compute and Graphics)
Additionally, instead of running once, you can issue “nvidia-smi -l x” replacing “x” with the number of seconds you’d like it to auto-loop and refresh.
Example:
nvidia-smi -l 3
The above would refresh and loop “nvidia-smi” every 3 seconds.
For vGPU specific information from the ESXi host, you can run:
nvidia-smi vgpu
[root@ESXi-Host:~] nvidia-smi vgpu
Mon Mar 6 11:47:44 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.07 Driver Version: 525.85.07 |
|---------------------------------+------------------------------+------------+
| GPU Name | Bus-Id | GPU-Util |
| vGPU ID Name | VM ID VM Name | vGPU-Util |
|=================================+==============================+============|
| 0 NVIDIA A2 | 00000000:04:00.0 | 0% |
| 3251713382 NVIDIA A2-4Q | 2321577 VMWS01 | 0% |
+---------------------------------+------------------------------+------------+
This command shows information on the vGPU instances currently provisioned.
There are also a number of switches you can throw at this to get even more information on vGPU including scheduling, vGPU types, accounting, and more. Run the following command to view the switches:
nvidia-smi vgpu -h
Another common switch I use on the ESXi host with vGPU for troubleshooting is: “nvidia-smi -q”, which provides lots of information on the physical GPU in the host:
[root@ESXi-HOST:~] nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Sat Mar 4 21:26:18 2023
Driver Version : 525.85.07
CUDA Version : Not Found
vGPU Driver Capability
Heterogenous Multi-vGPU : Supported
Attached GPUs : 1
GPU 00000000:04:00.0
Product Name : NVIDIA A2
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
vGPU Device Capability
Fractional Multi-vGPU : Not Supported
Heterogeneous Time-Slice Profiles : Supported
Heterogeneous Time-Slice Sizes : Not Supported
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : XXXN0TY0SERIALZXXX
GPU UUID : GPU-de23234-3450-6456-e12d-bfekgje82743a
Minor Number : 0
VBIOS Version : 94.07.5B.00.92
MultiGPU Board : No
Board ID : 0x400
Board Part Number : XXX-XXXXX-XXXX-XXX
GPU Part Number : XXXX-XXX-XX
Module ID : 1
Inforom Version
Image Version : G179.0220.00.01
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x04
Device : 0x00
Domain : 0x0000
Device Id : 0x25B610DE
Bus Id : 00000000:04:00.0
Sub System Id : 0x157E10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Device Current : 1
Device Max : 4
Host Max : N/A
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16380 MiB
Reserved : 264 MiB
Used : 7808 MiB
Free : 8306 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 1 MiB
Free : 16383 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 64 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 37 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 96 C
GPU Slowdown Temp : 93 C
GPU Max Operating Temp : 86 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 8.82 W
Power Limit : 60.00 W
Default Power Limit : 60.00 W
Enforced Power Limit : 60.00 W
Min Power Limit : 35.00 W
Max Power Limit : 60.00 W
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 795 MHz
Applications Clocks
Graphics : 1770 MHz
Memory : 6251 MHz
Default Applications Clocks
Graphics : 1770 MHz
Memory : 6251 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1770 MHz
SM : 1770 MHz
Memory : 6251 MHz
Video : 1650 MHz
Max Customer Boost Clocks
Graphics : 1770 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 650.000 mV
Fabric
State : N/A
Status : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 2108966
Type : C+G
Name : VM-WS02
Used GPU Memory : 3904 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 2108989
Type : C+G
Name : VM-WS01
Used GPU Memory : 3904 MiB
As you can see, you can pull quite a bit of information in detail from the vGPU, as well as the VM processes.
Running “nvidia-smi” on the VM Guest
You can also run “nvidia-smi” inside of the guest VM, which will provide you information on the vGPU instance that is being provided to that specific VM, along with information on the guest VM’s processes that are utilizing the GPU.
“nvidia-smi” Running on Guest VM
This is helpful for providing information on the guest VM’s usage of the vGPU instance, as well as processes that require GPU usage.
Virtual Machine log files
Each Virtual Machine has a “vmware.log” file inside of the VM’s folder on the datastore.
To identify logging events pertaining to NVIDIA vGPU, you can search for the “vmiop” string inside of the vmware.log file.
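For example, you can read the log directly from the ESXi host’s shell; the datastore and VM folder names below are placeholders for your own environment:

cat /vmfs/volumes/DATASTORE/VM-NAME/vmware.log | grep -i vmiop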
The above will read out any lines inside of the log that have the “vmiop” string inside of them. The “-i” flag instructs grep to ignore case sensitivity.
These logs provide initialization information, licensing information, as well as XID error codes and faults.
ESXi Host log files
Additionally, since the ESXi host is running the vGPU Host Driver (vGPU Manager), it also has logs that pertain and assist with vGPU troubleshooting.
Some commands you can run are:
cat /var/log/vmkernel.log | grep -i vmiop
cat /var/log/vmkernel.log | grep -i nvrm
cat /var/log/vmkernel.log | grep -i nvidia
The above commands will pull NVIDIA vGPU related log items from the ESXi log files.
Using “dxdiag” in the guest VM
Microsoft has a tool called “dxdiag” which provides diagnostic information for testing and troubleshooting video (and sound) with DirectX.
I find this tool very handy for quickly verifying that the vGPU is presented to the guest and that DirectX hardware acceleration is functioning.
NVIDIA vGPU with Microsoft DirectX “dxdiag” tool
As you can see:
DirectDraw Acceleration: Enabled
Direct3D Acceleration: Enabled
AGP Texture Acceleration: Enabled
DirectX 12 Ultimate: Enabled
The above shows that hardware acceleration is fully functioning with DirectX. This is an indicator that things are generally working as expected. If you have a vGPU and one of the first three is showing as disabled, then you have a problem that requires troubleshooting. Additionally, if you do not see your vGPU card, then you have a problem that requires troubleshooting.
Please Note: You may not see “DirectX 12 Ultimate” as this is related to licensing.
Using the “VMware Horizon Performance Monitor”
The VMware Horizon Performance Monitor is a great tool, installed with the VMware Horizon Agent, that allows you to pull information (stats, connection information, etc.) for the session. Please note that this is not installed by default, and must be selected when running the Horizon Agent installer.
When it comes to troubleshooting vGPU, it’s handy to use this tool to confirm you’re getting H.264 or H.265/HEVC offload from the vGPU instance, and also to get information on how many FPS (Frames Per Second) you’re getting in the session.
VMware Horizon Performance Tracker with NVIDIA vGPU
Once it’s open, change the view using the selector shown above, and you can see which “Encoder Name” is being used to encode the session.
Examples of GPU Offload “Encoder Name” types:
NVIDIA NvEnc HEVC 4:2:0 – This is using the vGPU offload using HEVC
NVIDIA NvEnc HEVC 4:4:4 – This is using the vGPU offload using HEVC high color accuracy
NVIDIA NvEnc H264 4:2:0 – This is using the vGPU offload using H.264
NVIDIA NvEnc H264 4:4:4 – This is using the vGPU offload using H.264 high color accuracy
Examples of Software (CPU) Session “Encoder Name” types:
BlastCodec – New VMware Horizon “Blast Codec”
h264 4:2:0 – Software CPU encoded h.264
If you’re seeing “NVIDIA NvEnc” in the encoder name, then the encoding is being offloaded to the GPU resulting in optimum performance. If you don’t see it, it’s most likely using the CPU for encoding, which is not optimal if you have a vGPU, and requires further troubleshooting.
NVIDIA vGPU Known Issues
Depending on the version of vGPU that you are running, there can be “known issues”.
When viewing the NVIDIA vGPU Documentation, you can view known issues, and fixes that NVIDIA may provide. Please make sure to reference the documentation specific to the version you’re running and/or the version that fixes the issues you’re experiencing.
Common Problems
There are a number of common problems that I come across when I’m contacted to assist with vGPU deployments.
Please see below for some of the most common issues I experience, along with their applicable fix/workaround.
XID Error Codes
When viewing your Virtual Machine (vmware.log) or ESXi log files and experiencing an XID error or XID fault, you can usually look up the error codes.
The table on this page allows you to look up the XID code, find the cause, and also indicates whether the issue is related to “HW Error” (Hardware Error), “Driver Error”, “User App Error”, “System Memory Corruption”, “Bus Error”, “Thermal Issue”, or “FB Corruption”.
One can see XID code 45, as well as XID code 43, which, after looking them up in NVIDIA’s documentation, state:
XID 43 – GPU stopped processing
Possible Cause: Driver Error
Possible Cause: User App Error
XID 45 – Preemptive cleanup, due to previous errors — Most likely to see when running multiple cuda applications and hitting a DBE
Possible Cause: Driver Error
In the situation above, one can deduce that the issue is either Driver Error, Application Error, or a combination of both. In this specific case, you could try changing drivers to troubleshoot.
vGPU Licensing
You may experience issues in your vGPU deployment due to licensing problems. Depending on how you have your environment configured, you may be running in an unlicensed mode and not be aware.
In the event that the vGPU driver cannot obtain a valid license, it will run for 20 minutes with full capabilities. After that the performance and functionality will start to degrade. After 24 hours it will degrade even further.
Some symptoms of issues experienced when unlicensed:
Users experiencing laggy VDI sessions
Performance issues
Frames per Second (FPS) limited to 15 fps or 3 fps
Applications using OpenCL, CUDA, or other accelerated APIs fail
Additionally, some error messages and event logs may occur:
Event ID 2, “NVIDIA OpenGL Driver” – “The NVIDIA OpenGL driver has not been able to initialize a connection with the GPU.”
AutoCAD/Revit – “Hardware Acceleration is disabled. Software emulation mode is in use.”
“Guest is unlicensed”
Please see below for screenshots of said errors:
vGPU Guest Is Unlicensed
NVIDIA OpenGL Driver Not Found
AutoCAD Hardware Acceleration Disabled
Additionally, when looking at the Virtual Machine’s vmware.log (inside of the VM’s folder on the ESXi datastore), you may see:
Guest is unlicensed. Cannot allocate more than 0x55 channels!
VGPU message 6 failed, result code: 0x1a
If this occurs, you’ll need to troubleshoot your vGPU licensing and resolve any issues occurring.
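As a quick starting point, you can check the license state from inside the guest VM using “nvidia-smi -q”, which includes a licensed product section showing the current license status. On a Windows guest, you can filter the output as in the example below:

nvidia-smi -q | findstr /i license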
vGPU Type (vGPU Profile) mismatch
When using the default (“time-sliced”) vGPU deployment method, only a single vGPU type can be used on virtual machines or containers per physical GPU. Essentially all VMs or containers utilizing the physical GPU must use the same vGPU type.
If the physical GPU card has multiple GPUs (GPU chips), then a different type can be used on each physical GPU chip on the same card. 2 x GPUs on a single card = 2 different vGPU types.
Additionally, if you have multiple cards inside of a single host, the number of vGPU types you can deploy is based on the total number of GPUs across all the cards in your host.
If you configure multiple vGPU types and cannot support it, you will have issues starting VMs, as shown below:
Cannot power on VM with vGPU: Power on Failure, Insufficient resources
The error reads as follows:
Power On Failures
vCenter Server was unable to find a suitable host to power on the following virtual machines for the reasons listed below.
Insufficient resources. One or more devices (pciPassthru0) required by VM VDIWS01 are not available on host ESXi-Host.
Additionally, if provisioning via VMware Horizon, you may see: “NVIDIA GRID vGPU Support has detected a mismatch with the supported vGPUs”
Note: If you are using MIG (Multi Instance GPU), this does not apply as different MIG types can be applied to VMs from the same card/GPU.
vGPU or Passthrough with 16GB+ of Video RAM Memory
When attaching a vGPU to a VM, or passing through a GPU to a VM, with 16GB or more of Video RAM (Framebuffer memory), you may run in to a situation where the VM will not boot.
This is because, by default, the VM cannot map that large a memory space for use.
Your users may report issues where their VDI guest VM freezes for a period of time during use. This could be caused due to VMware vMotion moving the virtual machine from one VMware ESXi host to another.
When experiencing issues, you may notice that “nvidia-smi” throws “ERR!” in the view. See the example below:
NVIDIA vGPU “nvidia-smi” reporting “ERR!”
This is an indicator that you’re in a fault or error state, and I would recommend checking the ESXi host log files and the Virtual Machine log files for XID codes to identify the problem.
vGPU Driver Mismatch
When vGPU is deployed, drivers are installed on the VMware ESXi host (the vGPU Manager driver), as well as on the guest virtual machine (the guest VM driver).
NVIDIA vGPU Driver Mismatch
These two drivers must be compatible with each other. As per NVIDIA’s Documentation, see below for compatibility:
NVIDIA vGPU Manager with guest VM drivers from the same release
NVIDIA vGPU Manager with guest VM drivers from different releases within the same major release branch
NVIDIA vGPU Manager from a later major release branch with guest VM drivers from the previous branch
Additionally, if you’re using the LTS (Long Term Support) branch, the following additional compatibility combination applies:
NVIDIA vGPU Manager from a later long-term support branch with guest VM drivers from the previous long-term support branch
If you have a vGPU driver mismatch, you’ll likely see Event ID 160 from “nvlddmkm” reporting:
NVIDIA driver version mismatch error: Guest driver is incompatible with host driver.
To resolve this, you’ll need to change drivers on the ESXi host and/or Guest VM to a supported combination.
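As a quick sketch for comparing versions: the host vGPU Manager driver version is shown in the header of “nvidia-smi” on the ESXi host (and in the installed VIB list), while the guest driver version can be queried from inside the VM. For example:

esxcli software vib list | grep -i nvidia
nvidia-smi --query-gpu=driver_version --format=csv,noheader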
Upgrading NVIDIA vGPU
When upgrading NVIDIA vGPU drivers on the host, you may experience issues or errors stating that the NVIDIA vGPU modules or services are loaded and in use, stopping your ability to upgrade.
Normally an upgrade would be performed by placing the host in maintenance mode and running:
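As an example only, a typical vGPU Manager upgrade using the offline bundle downloaded from NVIDIA would look something like the following (the path and bundle filename are placeholders):

esxcli software vib update -d /vmfs/volumes/datastore1/NVIDIA-vGPU-Manager-bundle.zip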
However, this fails due to modules that are loaded and in use by the NVIDIA vGPU Manager Services.
Before attempting to upgrade (or uninstall and re-install), place the host in maintenance mode and run the following command:
/etc/init.d/nvdGpuMgmtDaemon stop
This should allow you to proceed with the upgrade and/or re-install.
VMware Horizon Black Screen
If you’re experiencing a blank or black screen when connecting to a VDI session with an NVIDIA vGPU on VMware Horizon, it may not even be related to the vGPU deployment.
To troubleshoot the VMware Horizon Black Screen, please review my guide on how to troubleshoot a VMware Horizon Blank Screen.
VM High CPU RDY (High CPU Ready)
CPU RDY (CPU Ready) is a state when a VM is ready and waiting to be scheduled on a physical host’s CPU. In more detail, the VM’s vCPUs are ready to be scheduled on the ESXi host’s pCPUs.
In rare cases, I have observed situations where VMs with a vGPU and high CPU RDY times, experience instability. I believe this is due to timing conflicts with the vGPU’s time slicing, and the VM’s CPU waiting to be scheduled.
To check VM CPU RDY, you can use one of the following methods:
Run “esxtop” from the CLI using the console or SSH (see the quick example after this list)
View the hosts performance stats on vCenter
Select host, “Monitor”, “Advanced”, “Chart Options”, de-select all, select “Readiness Average %”
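Here’s a quick sketch of checking readiness interactively with esxtop:

esxtop
# press "c" for the CPU view, then "V" to display only virtual machines
# check the "%RDY" column; as a rough rule of thumb, divide %RDY by the VM's vCPU count for a per-vCPU value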
When viewing the CPU RDY time in a VDI environment, generally we’d like to see CPU RDY at 3% or lower. Anything higher than 3% may cause latency or user experience issues, or even vGPU issues at higher values.
For your server virtualization environment (non-VDI and no vGPU), CPU Ready times are not as big of a consideration.
vGPU Profiles Missing from VMware Horizon
When using newer GPUs with older versions of VMware Horizon, you may encounter an issue with non-persistent instant clones resulting in a provisioning error.
This is caused by missing vGPU Types or vGPU Profiles, and requires either downloading the latest definitions, or possibly creating your own.
Today I want to talk about Memory Deduplication on ESXi with Transparent Page Sharing (TPS). This is a technology that isn’t widely known about, even amongst IT professionals with significant experience with VMware products.
And you may ask “Memory Deduplication, why aren’t we using this?!?” as it sounds like a pretty cool piece of technology… Well, I’m about to tell you why you’re not (Inter-VM), and share a few examples of where you would want to enable this.
I also want to show you how to enable TPS globally (Inter-VM), and also discuss TPS being used with VMware Horizon and VDI.
What is Transparent Page Sharing (TPS)?
Transparent Page Sharing is the process in which ESXi can provide memory deduplication by storing duplicate memory pages as a single page on the physical memory of the host. This process stops the system from storing redundant memory pages, and thus frees up physical memory for other uses.
If my memory serves me right, this was originally enabled by default in ESX/ESXi version 5, but was later globally disabled due to security vulnerabilities and concerns.
Note that TPS is still enabled by default within the same VM (intra-VM), even today with ESXi 8.
Security Concerns
I recall two potential scenarios and security concerns which led to VMware changing the original default behavior for TPS.
Scenario 1 included a concern about an attacker gaining access to a VM, and then having the ability to modify the memory contents of another VM.
Scenario 2 included a concern where an attacker may be able to get access to encryption keys used on another system.
With that being said, it sounds like this would be an extremely difficult attack that requires systems to be configured in a non-standard way.
Current status of TPS
Believe it or not, TPS and memory deduplication is still enabled, however it’s only deduplicating pages from within the same VM. TPS is not deduplicating pages from multiple VMs.
Additionally, VMware has given us controls to configure TPS to allow it amongst multiple VMs, or even enable it globally across the ESXi host.
See below for the settings to configure TPS on ESXi via “Advanced Settings”:
Transparent Page Sharing (TPS) Settings for ESXi Host
In short, you can enable TPS globally (Inter-VM) by setting “Mem.ShareForceSalting” in “Advanced Settings” to a value of “0”. You can also use salting to configure groups of VMs that are allowed to share memory pages.
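If you prefer the command line, the same host setting can be changed from the ESXi shell as shown in the example below, and the per-VM grouping (“salting”) is controlled with the “sched.mem.pshare.salt” advanced parameter on each VM:

esxcli system settings advanced set -o /Mem/ShareForceSalting -i 0
esxcli system settings advanced list -o /Mem/ShareForceSalting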
Additionally, you can tweak the behavior of TPS by modifying some of the settings shown below:
TPS Memory Sharing Settings
As you can see, you can configure things like the scan interval (Mem.ShareScanTime), which controls how often the system will check for memory pages that can be shared/deduplicated, along with other settings.
TPS is enabled, but not working
So, you may have decided to enable TPS in your environment, but you’re noticing that few or no memory pages are being marked as shared.
TPS Memory Deduplication – Amount of host physical memory that backs shared guest physical memory
In the example above, you’ll notice that on a loaded host, with TPS enabled globally (Inter-VM, amongst all VMs), the host is only deduplicating 1,052KB of memory.
This is because you will most often only see TPS being heavily utilized on an ESXi host that has over-committed memory; there’s also a chance that you simply don’t have enough duplicate memory pages for sharing to make a difference.
Memory Deduplication, TPS, and VMware Horizon VDI
Because VMware Horizon utilizes “vmfork” with “Just-in-Time” desktop delivery, non-persistent VDI will benefit from some level of memory deduplication by default when using Instant Clones. This is because non-persistent VDI guests are spawned from a running base image.
Additionally, you can further implement, enable, and configure TPS by configuring some Transparent Page Sharing options inside of the VMware Horizon Administration console.
When creating a Desktop Pool, you can set the “Transparent Page Sharing” option to “Virtual Machine” (memory dedupe inside of the VM only), “Pool” (memory dedupe across the Desktop Pool), “Pod” (dedupe across the pod), or “Global” (full Inter-VM memory deduplication across the ESXi host).
If you enabled TPS globally on the ESXi host, these settings are ignored and not used.
TPS Use Cases
So you might be asking when it’s a good time to use TPS?
The Homelab – When is a homelab not a good reason to try something? Looking to save some memory and overcommit memory resources? Implement TPS.
VDI Environments – On highly dense hosts, you may consider implementing TPS at some level to maximize the utilization of resources, however you must be aware of the security consequences and factor this in when configuring TPS.
Environments with no Sensitive Information – It’s hard to imagine, but if you have an environment that doesn’t contain any sensitive information and doesn’t use any security keys, it would be suitable to enable TPS.
I’m sure there’s a number of other use cases, so leave a comment if you can think of one.
Conclusion
In my opinion Transparent Page Sharing is a technology that should not be forgotten and discarded. VMware admins should be aware of it, how to configure it, and what the implications are of using it.
If you are considering enabling TPS in your environment, you must factor in the potential security consequences of doing so.
When either directly passing through a GPU, or attaching an NVIDIA vGPU to a Virtual Machine on VMware ESXi that has more than 16GB of Video Memory, you may run in to a situation where the VM fails to boot with the error “Module ‘DevicePowerOn’ power on failed.”. Special considerations are required when performing GPU or vGPU Passthrough with 16GB+ of video memory.
This issue is specifically caused by memory mapping a GPU or vGPU device that has 16GB of memory or higher, and could involve both the host system (the ESXi host) and/or the Virtual Machine configuration.
In this post, I’ll address the considerations and requirements to passthrough these devices to virtual machines in your environment.
In order of likelihood, the issue is usually VM-configuration related; however, if the recommendations in the “VM Configuration Considerations” section do not resolve the issue, please proceed to reviewing the “ESXi Host Considerations” section.
Please note that if the issue is host related, other errors may be present, or the device may not even be visible to ESXi.
VM GPU and vGPU Configuration Considerations
First and foremost, all new VMs should be created using the “EFI” Firmware type. EFI provides numerous advantages in device access and memory mapping versus the older style “BIOS” firmware types.
VM Firmware type EFI
To do this, create a new virtual machine, navigate to “VM Options”, expand “Boot Options”, and confirm/change the Firmware to “EFI”. I recommend this for all new VMs, not only for VMs accessing GPUs or vGPUs with over 16GB of memory. Please note that you shouldn’t change the firmware type on an existing VM; do this only on a freshly created VM.
When performing GPU or vGPU passthrough with 16GB+ of video memory, you’ll need to create a couple of entries under “Advanced” settings to properly configure access to these PCIe devices and provide the proper environment for memory mapping. The lack of these settings is specifically what causes the “Module ‘DevicePowerOn’ power on failed.” error.
Under the VM settings, head over to “VM Options”, expand “Advanced” and click on “Edit Configuration”, click on “Add Configuration Params”, and add the following entries:
VM GPU and vGPU Memory Settings for 16GB or higher memory mapping
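As an example, for a GPU or vGPU profile with 16GB of framebuffer, the two advanced configuration parameters would look like this:

pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "32"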
You’ll notice that while our GPU or vGPU profile may have 16GB of memory, we need to double that value, and set it for the “pciPassthru.64bitMMIOSizeGB” variable. If your card or vGPU profile had 32GB, you’d set it to “64”.
Additionally if you were passing through multiple GPUs or vGPU devices, you’d need to factor all the memory being mapped, and double the combined amount.
ESXi GPU and vGPU Host Considerations
On most new and modern servers, the host level doesn’t require any special configuration as they are already designed to pass through such devices to the hypervisor properly. However in some special cases, and/or when using older servers, you may need to modify configuration and settings in the UEFI or BIOS.
If setting the VM configuration above still results in the same error (or possibly other errors), then you most likely need to make modifications to the ESXi host’s BIOS/UEFI/RBSU to allow the proper memory mapping of the PCIe device, in our case being the GPU.
This is where things get a bit tricky because every server manufacturer has different settings that will need to be configured.
Look for the following settings, or settings with similar terminology:
“Memory Mapping Above 4G”
“Above 4G Decoding”
“PCI Express 64-Bit BAR Support”
“64-Bit IOMMU Mapping”
Once you find the correct setting or settings, enable them.
Every vendor could be using different terminology and there may be other settings that need to be configured that I don’t have listed above. In my case, I had to go in to a secret “SERVICE OPTIONS” menu on my HPE Proliant DL360p Gen8, as documented here.
After performing the recommendations in this guide, you should now be able to passthrough devices with over 16GB of memory.
When it comes to troubleshooting login times with non-persistent VDI (VMware Horizon Instant Clones), I often find delays associated with printer drivers not being included in the golden image. In this post, I’m going to show you how to add a printer driver to an Instant Clone golden image!
Printing with non-persistent VDI and Instant Clones
In most environments, printers will be mapped for users during logon. If a printer is mapped or added and the driver is not added to the golden image, it will usually be retrieved from the print server and installed, adding to the login process and ultimately leading to a delay.
Due to the nature of non-persistent VDI and Instant Clones, every time the user logs in and gets a new VM, the driver will be downloaded and installed again, creating a redundant process that wastes time and network bandwidth.
To avoid this, we need to inject the required printer drivers into the golden image. You can add numerous drivers, and should include all the drivers that any and all of your users are expected to use.
An important consideration: Try using Universal Print Drivers as much as possible. Universal Printer Drivers often support numerous different printers, which allows you to install one driver to support many different printers from the same vendor.
How to add a printer driver to an instant clone golden image
Below, I’ll show you how to inject a driver into the Instant Clone golden image. Note that this doesn’t actually add a printer, but only installs the printer driver into the Windows operating system so it is available for a printer to be configured and/or mapped.
Let’s get started! In this example we’ll add the HP Universal Driver. These instructions work on both Windows 10 and Windows 11 (as well as Windows Server operating systems):
Click Start, type in “Print Management” and open the “Print Management”. You can also click Start, Run, and type “printmanagement.msc”.
On the left hand side, expand “Print Servers”, then expand your computer name, and select “Drivers”.
Right click on “Drivers” and select “Add Driver”.
When the “Welcome to the Add Printer Driver Wizard” opens, click Next.
Leave the default for the architecture. It should default to the architecture of the golden image.
When you are at the “Printer Driver Selection” stage, click on “Have Disk”.
Browse to the location of your printer driver. In this example, we navigate to the extracted HP Universal Print Driver.
Select the driver you want to install.
Click on Finish to complete the driver installation.
The driver you installed should now appear in the list as it has been installed in to the operating system and is now available should a user add a printer, or have a printer automatically mapped.
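If you’d rather script the driver injection (for example, as part of an automated golden image build), a rough equivalent from an elevated PowerShell prompt looks like the following; the INF path and driver name are examples only and must match your extracted driver package:

pnputil.exe /add-driver "C:\Drivers\HP-UPD\*.inf" /install
Add-PrinterDriver -Name "HP Universal Printing PCL 6"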
Printer Driver installed on Non-Persistent Instance Clone Golden Image
Now seal, snap, and deploy your image, and you’re good to go!
It’s been coming for a while: The requirement to deploy VMs with a TPM module… Today I’ll be showing you the easiest and quickest way to create and deploy Virtual Machines with vTPM on VMware vSphere ESXi!
As most of you know, Windows 11 has a requirement for Secure Boot as well as a TPM module. It’s quite possible we’ll also see this requirement with future Microsoft Windows Server operating systems.
While users struggle to deploy TPM modules on their own workstations to be eligible for the Windows 11 upgrade, ESXi administrators are also struggling with deploying Virtual TPM modules, or vTPM modules on their virtualized infrastructure.
What is a TPM Module?
TPM stands for Trusted Platform Module. A Trusted Platform Module is a piece of hardware (or chip) inside or outside of your computer that provides secured computing features to the computer, system, or server that it’s attached to.
The TPM module provides things like a random number generator, storage of encryption keys and cryptographic information, as well as aiding in secure authentication of the host system.
In a virtualization environment, we need to emulate this physical device with a Virtual TPM module, or vTPM.
What is a Virtual TPM (vTPM) Module?
A vTPM module is a virtualized software instance of a traditional physical TPM module. A vTPM can be attached to Virtual Machines and provide the same features and functionality that a physical TPM module would provide to a physical system.
vTPM modules can be deployed with VMware vSphere ESXi, and can be used to deploy Windows 11 on ESXi.
Deploying vTPM modules requires a Key Provider on the vCenter Server.
Deploying vTPM (Virtual TPM Modules) on VMware vSphere ESXi
In order to deploy vTPM modules (and VM encryption, vSAN Encryption) on VMware vSphere ESXi, you need to configure a Key Provider on your vCenter Server.
Traditionally, this would be accomplished with a Standard Key Provider utilizing a Key Management Server (KMS), however this required a 3rd party KMS server and is what I would consider a complex deployment.
VMware has made this easy as of vSphere 7 Update 2 (7U2), with the Native Key Provider (NKP) on the vCenter Server.
The Native Key Provider allows you to easily deploy technologies such as vTPM modules, VM encryption, and vSAN encryption, and the best part is, it’s all built in to vCenter Server.
Enabling VMware Native Key Provider (NKP)
To enable NKP across your vSphere infrastructure:
Log on to your vCenter Server
Select your vCenter Server from the Inventory List
Select “Key Providers”
Click on “Add”, and select “Add Native Key Provider”
Give the new NKP a friendly name
De-select “Use key provider only with TPM protected ESXi hosts” to allow your ESXi hosts without a TPM to be able to use the native key provider.
In order to activate your new Native Key Provider, you need to click on “Backup” to make sure you have it backed up. Keep this backup in a safe place. After the backup is complete, your NKP will be active and usable by your ESXi hosts.
VMware vCenter with Native Key Provider (NKP) Configured
There’s a few additional things to note:
Your ESXi hosts do NOT require a physical TPM module in order to use the Native Key Provider
Just make sure you disable the checkbox “Use key provider only with TPM protected ESXi hosts”
NKP can be used to enable vTPM modules on all editions of vSphere
If your ESXi hosts have a TPM module, using the Native Key Provider with your hosts TPM modules can provide enhanced security
Onboard TPM module allows keys to be stored and used if the vCenter server goes offline
If you delete the Native Key Provider, you are also deleting all the keys stored with it.
Make sure you have it backed up
Make sure you don’t have any hosts/VMs using the NKP before deleting
You can now deploy vTPM modules to virtual machines in your VMware environment.
When performing a VMware vMotion on a Virtual Machine with an NVIDIA vGPU attached, the VM may freeze during the migration, whereas a VM without a vGPU does not freeze during migration.
So why does adding a vGPU to a VM cause it to freeze during vMotion? This freeze is referred to as the VM Stun Time.
I’m going to explain why this happens, and what you can do to reduce these STUN times.
VMware vMotion
First, let’s start with traditional vMotion without a vGPU attached.
VMware vMotion with vSphere
vMotion allows us to live migrate a Virtual Machine instance from one ESXi host, to another, with (visibly) no downtime. You’ll notice that I put “visibly” in brackets…
When performing a vMotion, vSphere will migrate the VM's memory from the source to the destination host and create checkpoints. It will then continue to copy memory deltas, including blocks changed after the initial copy.
Essentially, vMotion copies the memory of the instance, then performs additional passes to copy the changes made since the previous pass, until everything has been transferred and the instance is running on the destination host.
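To make the iterative pre-copy idea easier to picture, here's a small, purely conceptual Python sketch. It is not VMware's implementation and the numbers are made up; it simply models memory as pages, where each pass copies the delta left over from the previous pass until the remaining delta is small enough for a brief final switchover.
```python
# Conceptual illustration of iterative pre-copy migration (not VMware's implementation).
# Each pass copies the pages dirtied during the previous pass; as long as the guest
# dirties memory more slowly than it can be copied, the delta shrinks every pass.
def migrate(total_pages=1_000_000, dirty_ratio=0.2, switchover_threshold=1_000):
    # dirty_ratio: pages dirtied while a pass runs, as a fraction of pages copied in
    # that pass (guest write rate relative to copy rate). Must be < 1 to converge.
    remaining = total_pages
    passes = 0
    while remaining > switchover_threshold:
        passes += 1
        copied = remaining
        remaining = int(copied * dirty_ratio)  # delta left for the next pass
        print(f"pass {passes}: copied {copied:,} pages, {remaining:,} dirtied meanwhile")
    print(f"switchover: briefly stun, copy final {remaining:,} pages, resume on destination")

migrate()
```
The key intuition is that as long as the guest dirties memory more slowly than vMotion can copy it, the remaining delta shrinks on every pass until only a tiny final transfer is left.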
VMware vMotion with vGPU
For some time, we have had the ability to perform a vMotion with a VM that has a vGPU attached to it.
VMware VMs with vGPU
However, in this situation things work slightly differently. When performing a vMotion, it's not only the system RAM that needs to be transferred, but the GPU's memory (VRAM) as well.
Unfortunately, the checkpoint/delta transfer technology used with the system RAM isn't available for transferring the GPU's memory, which means the VM has to be stunned (frozen) so that the video RAM can be transferred and the instance can then be initialized on the destination host.
STUN Time
The STUN time is essentially the time it takes to transfer the video RAM (framebuffer) from one host to another.
When researching this, you may find examples of the time it takes to transfer various sizes of VRAM. An example would be from VMware's documentation "Using vMotion to Migrate vGPU Virtual Machines":
Expected STUN Times for vMotion with vGPU on 10Gig vMotion NIC
However, it will always vary depending on a number of factors. These factors include:
vMotion Network Speed
vMotion Network Optimization
Multi-NIC vMotion to utilize multiple NICs
Multi-vmk vMotion to optimize and saturate single NICs
Server Load
Network Throughput
The number of VMs that are currently being migrated with vMotion
As you can see, there are a number of things that play into this. If you have a single 10Gig link for vMotion and you're migrating many VMs with a vGPU, it's obviously going to take longer than if you were migrating a single VM with a vGPU.
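As a rough, back-of-the-envelope illustration (my own numbers, not VMware's published figures), the theoretical floor for the stun time is simply the framebuffer size divided by the usable vMotion bandwidth. The Python snippet below assumes a 10Gig vMotion link with roughly 80% usable throughput; real STUN times will be higher once protocol overhead, host load, and competing migrations are factored in.
```python
# Back-of-the-envelope estimate of the minimum time to move a vGPU framebuffer.
# Treat this as a lower bound, not a prediction: overhead, host load, and competing
# vMotion traffic will all push the real STUN time higher.
def stun_floor_seconds(framebuffer_gb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    usable_gbps = link_gbps * efficiency       # assume ~80% of line rate is usable
    return (framebuffer_gb * 8) / usable_gbps  # GB -> gigabits, divided by gigabits/sec

for fb in (2, 8, 16, 32):
    print(f"{fb:>2} GB framebuffer over 10 Gbps: ~{stun_floor_seconds(fb, 10):.1f} s minimum")
```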
Optimizing and Minimizing vGPU STUN Time
There are a number of things we can look at to minimize vGPU STUN times. These include:
Upgrading networking throughput with faster NICs
Optimizing vMotion (Configure multiple vMotion VMK adapters to saturate a NIC)
All of the above can be implemented together, which I would actually recommend.
In short, the faster we migrate the VM, the less the STUN Time will be. Check out my blog post on Optimizing VMware vMotion which includes how to perform the above recommendations.
Whether you're deploying VDI for the first time or troubleshooting Azure AD SSO issues in an existing environment, special consideration must be made for Microsoft Azure AD SSO and VDI.
Microsoft Azure Active Directory has two different methods for handling SSO (Single Sign On), these include SSO via a Primary Refresh Token (PRT) and Azure Seamless SSO. In this post, I’ll explain the differences, and when to use which one.
Microsoft Azure AD SSO and VDI
What does Azure AD SSO do?
Azure AD SSO gives your domain-joined Windows workstations (and Windows Servers) a Single Sign On experience, so that users have an integrated single sign-on experience when accessing Microsoft 365 and/or Office 365.
When Azure AD SSO is enabled and functioning, your users will not be prompted nor have to log on to Microsoft 365 or Office 365 applications or services (including web services) as all this will be handled transparently in the background with Azure AD SSO.
For VDI environments, especially non-persistent VDI (VMware Instant Clones), this is an important function so that users are not prompted to login every time they launch an Office 365 application.
Persistent VDI is not complex and doesn't have any special considerations for Azure AD SSO, as it functions the same way as traditional workstations; non-persistent VDI, however, requires special planning.
Please Note: Organizations often attribute Office 365 login prompts to activation issues when activation is in fact functioning fine; the real cause is that Azure AD SSO is either not enabled, incorrectly configured, or not functioning, which is why users are prompted for login credentials every time they establish a new session with non-persistent VDI. After reading this guide, you should be able to resolve Office 365 login prompts on non-persistent VDI and Instant Clone VMs.
Azure AD SSO methods
There are two different ways to perform Azure AD SSO in an environment that is not using ADFS. These are:
Azure AD SSO via Primary Refresh Token
Azure AD Seamless SSO
Both accomplish the same task, but were created at different times, have different purposes, and are used for different scenarios. We’ll explore this below so that you can understand how each works.
Hybrid Azure AD Joined Login
Fun fact: You can have both Azure AD SSO via PRT and Azure AD Seamless SSO configured at the same time to service your Active Directory domain, devices, and users.
Azure SSO via Primary Refresh Token
When using Azure SSO via Primary Refresh Token, SSO requests are performed by Windows workstations (or Windows Servers) that are Hybrid Azure AD Joined. When a device is Hybrid Azure AD Joined, it is joined both to your on-premises Active Directory domain and registered with your Azure Active Directory.
Azure SSO via Primary Refresh Token requires the Windows instance to be running Windows 10 (or later) or Windows Server 2016 (or later), and the instance must be Hybrid Azure AD joined. If you meet these requirements, SSO with PRT is performed transparently in the background.
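A quick way to confirm that a running Windows instance is actually Hybrid Azure AD joined and holds a PRT is to check the output of "dsregcmd /status" and look at the "AzureAdJoined" and "AzureAdPrt" fields. The small Python sketch below simply wraps that check; it assumes it is run on the Windows instance itself, and the field names should be verified against your own dsregcmd output.
```python
# Minimal sketch: run "dsregcmd /status" on a Windows instance and report whether the
# device is Hybrid Azure AD joined and holds a Primary Refresh Token (PRT).
# Field names are taken from typical dsregcmd output; verify against your own output.
import subprocess

def dsreg_field(output: str, field: str) -> str:
    for line in output.splitlines():
        if line.strip().startswith(field):
            return line.split(":", 1)[1].strip()
    return "UNKNOWN"

output = subprocess.run(["dsregcmd", "/status"], capture_output=True, text=True).stdout
print("DomainJoined  :", dsreg_field(output, "DomainJoined"))
print("AzureAdJoined :", dsreg_field(output, "AzureAdJoined"))
print("AzureAdPrt    :", dsreg_field(output, "AzureAdPrt"))
```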
If you require your non-persistent VDI VMs to be Hybrid Azure AD joined and require Azure AD SSO with PRT, special considerations and steps are required. These include:
Scripts to automatically unjoin non-persistent (Instant Clone) VDI VMs from Azure AD on logoff (a rough sketch follows this list)
Scripts to cleanup old entries on Azure AD
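As a rough illustration of the logoff unjoin script mentioned above (a sketch only, not a drop-in solution): "dsregcmd /leave" is a real Windows command but needs SYSTEM privileges, so in practice this would run as a machine logoff script or scheduled task, and the Azure AD-side cleanup of stale device objects still needs its own process. Validate against Microsoft's current non-persistent VDI guidance before using anything like this.
```python
# Sketch only: unregister a non-persistent VDI instance from Azure AD at logoff.
# "dsregcmd /leave" must run with SYSTEM privileges, so deploy this as a machine
# logoff script or scheduled task rather than a per-user logoff script.
import subprocess

result = subprocess.run(["dsregcmd", "/leave"], capture_output=True, text=True)
print(result.stdout or result.stderr)
```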
If you properly deploy this, it should function. If you don’t require your non-persistent VDI VMs to be Hybrid Azure AD joined, then Azure AD Seamless SSO may be better for your environment.
Azure AD Seamless SSO
Microsoft Azure AD Seamless SSO, once configured and implemented, handles Azure AD SSO requests without requiring the device to be Hybrid Azure AD joined.
Seamless SSO works on Windows instances running Windows 7 (or later, including Windows 10 and Windows 11), and does NOT require the device to be Hybrid joined.
Seamless SSO allows your Windows instances to access Azure related services (such as Microsoft 365 and Office 365) and provides a single sign-on experience.
This may be the easier method to use when deploying non-persistent VDI (VMware Instant Clones), if you want to implement SSO with Azure, but do not have the requirement of Hybrid AD joining your devices.
Additionally, by using Seamless SSO, you do not need to implement the log-off and maintenance scripts mentioned in the section above (for Azure AD SSO via PRT).
To use Azure AD Seamless SSO with non-persistent VDI, you must configure and implement Seamless SSO, as well as perform one of the following to make sure your devices do not attempt to Hybrid AD join:
Exclude the non-persistent VDI computer OU containers from Azure AD Connect synchronization to Azure AD
Implement a registry key on your non-persistent (Instant Clone) golden image to disable Hybrid Azure AD joining.
To disable Hybrid Azure AD join on Windows, create the registry key below on your Windows image:
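For reference, the value commonly documented by Microsoft for blocking automatic Hybrid Azure AD join is a "BlockAADWorkplaceJoin" DWORD set to 1 under HKLM\SOFTWARE\Policies\Microsoft\Windows\WorkplaceJoin; treat that key and value name as something to verify against Microsoft's current documentation for your Windows build. The Python sketch below simply sets that value using the standard winreg module and must be run elevated on the golden image.
```python
# Sketch: set the registry value commonly documented for blocking automatic Hybrid
# Azure AD join (BlockAADWorkplaceJoin = 1). Verify the key and value name against
# Microsoft's current documentation before baking it into a golden image.
# Must be run elevated on the Windows image.
import winreg

key_path = r"SOFTWARE\Policies\Microsoft\Windows\WorkplaceJoin"
with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, key_path, 0, winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "BlockAADWorkplaceJoin", 0, winreg.REG_DWORD, 1)
print(f"Set HKLM\\{key_path}\\BlockAADWorkplaceJoin = 1")
```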
Different methods can be used to implement SSO with Active Directory and Azure AD as stated above. Use the method that will be the easiest to maintain and provide support for the applications and services you need to access. And remember, you can also implement and use both methods in your environment!
After configuring Azure AD SSO, you’ll still be required to implement the relevant GPOs to configure Microsoft 365 and Office 365 behavior in your environment.