May 26, 2024
 
NVIDIA vGPU

When using Omnissa Horizon (formerly VMware Horizon), you may note that NVENC offload is disabled when using RDSH with NVIDIA vGPU. This may also affect other VDI and Application Delivery platforms that use RDSH (Remote Desktop Session Hosts) and NVIDIA vGPU (Virtual GPU).

One of the key benefits of deploying NVIDIA vGPU with Omnissa Horizon is being able to use NVIDIA NVENC (NVIDIA Encoder) to hardware encode your VDI session. This is also known as H.264/H.265 (HEVC)/AV1 offload.

This means that the encoding and compression of the remoted video session is handled by the GPU, instead of the CPU, freeing up resources on the VM guest and host, reducing latency with encoding, and also providing a much better user experience.

The Observation

When deploying NVIDIA vGPU with vApps and Horizon Apps, you’ll note the following in the VMware Horizon Performance Tracker:

VMware Horizon Performance Tracker on RDSH showing software encoder

You can see above that the “Encoder Name” is using “h264 4:2:0”. This means that the CPU software-based encoder is handling the encoding of the H.264 Blast session. While the environment is 3D accelerated, the remoting protocol encoding is not hardware offloaded.

You’ll also note the following:

  • VMware Horizon Agent High CPU Usage
  • “nvidia-smi” on the host and VM does not report the encoder being used

This behavior is expected, because RDS session hosts cannot utilize NVENC. RDSH hosts use a software framebuffer for user environment and desktop delivery, which cannot be used with NVENC.
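If you'd like to script the “nvidia-smi” check rather than eyeball it, here is a minimal Python sketch. It assumes the `encoder.stats.sessionCount` and `encoder.stats.averageFps` query fields are available on your driver (check `nvidia-smi --help-query-gpu`); the function names are my own and purely illustrative.

```python
import csv
import io
import subprocess

# Query fields as listed by `nvidia-smi --help-query-gpu`; confirm on your driver.
QUERY = "name,encoder.stats.sessionCount,encoder.stats.averageFps"


def _to_int(text):
    """Convert a field to int, treating non-numeric values (e.g. N/A) as 0."""
    try:
        return int(text)
    except ValueError:
        return 0


def parse_encoder_stats(csv_text):
    """Turn nvidia-smi CSV output (no header) into a list of dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        name, sessions, fps = (f.strip() for f in fields)
        rows.append({"gpu": name, "sessions": _to_int(sessions), "avg_fps": _to_int(fps)})
    return rows


def nvenc_in_use(rows):
    """True if any GPU reports at least one active encoder session."""
    return any(r["sessions"] > 0 for r in rows)


def query_nvidia_smi():
    """Run nvidia-smi and return its raw CSV output."""
    cmd = ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


# Example (inside the guest or on the host, during an active session):
#   print(nvenc_in_use(parse_encoder_stats(query_nvidia_smi())))
```

On an RDSH host you would expect this to report no encoder sessions even while users are connected, which matches the software encoder shown in the Performance Tracker.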

Solution and/or Workaround

To work around this limitation, you have the option of using VDI desktops (in this case it would be preferable to use non-persistent Instant Clones) to deploy an “Application Pool” with vGPU enabled VMs.

Note that this is a major change to your solution architecture because pushing applications (and desktops) from Windows 10 or Windows 11 Guest VMs is a 1 to 1 relation, versus RDSH which supports many users to one VM.

Using Horizon, you could then push applications (not desktops) from these vGPU enabled Instant Clones, which would support NVENC and hardware offload, as shown in the example below:

VMware Horizon Performance Tracker showing NVIDIA NvEnc Hardware encoder on instant clone

In the image above, you’ll note that the “Encoder Name” is “NVIDIA NvEnc HEVC 4:2:0” showing us that NvEnc hardware offload and encoding is functioning and being used.

Note that using this method to deploy Horizon Apps will require more framebuffer overall; however, this may be partly offset because each individual VM can use a smaller framebuffer, versus the large framebuffer assigned and attached to an RDSH host.
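To get a feel for the framebuffer trade-off, here is a rough back-of-the-envelope sketch in Python. Every number in it (user count, users per RDSH host, profile sizes) is a made-up example for illustration only, not a sizing recommendation; plug in your own values.

```python
import math


def rdsh_framebuffer_gb(users, users_per_host, profile_gb):
    """Total framebuffer when many users share each RDSH host VM."""
    hosts = math.ceil(users / users_per_host)
    return hosts * profile_gb


def vdi_framebuffer_gb(users, profile_gb):
    """Total framebuffer when every user gets their own vGPU enabled VM."""
    return users * profile_gb


# Hypothetical inputs, chosen only to illustrate the arithmetic.
USERS = 40
rdsh_total = rdsh_framebuffer_gb(USERS, users_per_host=10, profile_gb=16)
vdi_total = vdi_framebuffer_gb(USERS, profile_gb=2)

print(f"RDSH: {rdsh_total} GB, 1:1 Instant Clones: {vdi_total} GB")
```

With these example inputs the 1:1 design needs more framebuffer in total, but far less than a naive "one large profile per user" estimate would suggest.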

May 25, 2024
 
VDI Gaming Demo with NVIDIA vGPU and Omnissa Horizon

Here’s a fun quick VDI Gaming Demo with NVIDIA vGPU and Omnissa Horizon 8, using an NVIDIA L4 GPU and the L4-12Q Profile.

This video is just for fun, and is just to show some of the capabilities of the technology, hardware, and software, in this case, with Cloud Gaming.

The NVIDIA vGPU solution provides the ability to “slice” and create multiple Virtual GPU (vGPU) devices for your Virtual Machines and Virtual workloads.

In this video:

  • Quick Introduction to NVIDIA vGPU with Omnissa Horizon 8
  • Validating NVIDIA vGPU functionality (with DirectX Diagnostics, Horizon Performance Tracker)
  • MechWarrior 5 Cloud Gaming
  • Heaven Benchmark

Environment Details:

  • 2 x HPE DL360p Gen8 Servers (2 x 10 Core Procs, 384GB of RAM)
    • 1 Server with NVIDIA A2
    • 1 Server with NVIDIA L4
  • VMware vSphere 8U2
  • Omnissa Horizon 8

Hope you enjoy the video and demo!

May 09, 2024
 
NVIDIA vGPU Network Licensing Token

When deploying NVIDIA vGPU across a VDI environment, I often see IT teams deploy the licensing token directly on the persistent VMs, or on the non-persistent base golden image. This often causes a nightmare when the client activation token must be updated.

I highly recommend considering network placement of the NVIDIA vGPU Licensing Client Configuration token file for your deployments.

In this post we’ll review the Client Configuration Token File, why you’d want to place it on the network, and how to do so.

What is the Client Configuration Token File

The Client Configuration Token File tells the NVIDIA vGPU driver on your VM where to find the licensing server information. This token will point the driver to either the CLS or DLS licensing server and request the applicable license to be issued.

By default, the vGPU driver will check the following location for the token:

C:\Program Files\NVIDIA Corporation\vGPU Licensing\ClientConfigToken\

While this is common, there’s a much better (and easier) method that you can use to deploy the Client Configuration Tokens, using Network Shares, to ease management of these files.

Placing the NVIDIA vGPU Licensing client configuration token on a network share

Using the Windows Registry, along with a GPO (Group Policy Object), you can configure a network location for the NVIDIA Client Configuration Token, so that your systems, whether Persistent or Non-Persistent, will use this location.

In the event of a token change, you can simply remove the old token and place the new configuration token, and all the systems will have immediate access to it, without manually updating individual systems.

Here we’ll use the registry and a GPO to configure the token location:

  1. Using an administrative account, create a folder called “vGPU-Licensing” on your domain SYSVOL share.
    • Example: \\Domain.com\SYSVOL\Domain.com\vGPU-Licensing\
  2. Place your NVIDIA Licensing Client Configuration Token in this folder.
  3. Open “Group Policy Management” and create a new GPO called “VDI-NVIDIA-LicensingToken”
  4. Navigate to: Computer Configuration -> Preferences -> Windows Settings -> Registry
  5. Right Click and select New -> Registry Item
  6. Under the New Registry Window Enter the following:
    • Action: Update
    • Hive: HKEY_LOCAL_MACHINE
    • Key Path: SYSTEM\CurrentControlSet\Services\nvlddmkm\Global\GridLicensing
    • Value Name: ClientConfigTokenPath
    • Value Type: REG_SZ
    • Value Data: \\Domain.com\SYSVOL\Domain.com\vGPU-Licensing
    • Change the network location to match your environment and your setup
  7. After populating the fields, it should be similar to the following example: NVIDIA GPO Registry Client Configuration Token
  8. Hit Apply, then Ok, then link the newly created GPO to the OU where your VDI VM guests are located with NVIDIA vGPU.

That’s it! All we did was create a GPO which configures the Registry value “ClientConfigTokenPath” inside of HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\nvlddmkm\Global\GridLicensing\ and set it to a network share that has the configuration tokens.
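If you want to try the same registry value on a single test VM before rolling out the GPO, here is a small Python sketch that renders an equivalent .reg file. It is only an illustration: the helper name is mine, and the share path is the example from this post, so change it to match your environment.

```python
# Key path and value name are taken from the GPO steps above.
GRID_LICENSING_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services"
    r"\nvlddmkm\Global\GridLicensing"
)
# Example share from this post; replace with your own location.
TOKEN_SHARE = r"\\Domain.com\SYSVOL\Domain.com\vGPU-Licensing"


def render_reg_sz(key_path, value_name, value_data):
    """Render .reg file text that sets a single REG_SZ value."""
    # Backslashes and double quotes must be escaped inside .reg string values.
    escaped = value_data.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "Windows Registry Editor Version 5.00\n\n"
        f"[{key_path}]\n"
        f'"{value_name}"="{escaped}"\n'
    )


reg_text = render_reg_sz(GRID_LICENSING_KEY, "ClientConfigTokenPath", TOKEN_SHARE)
print(reg_text)
```

Save the printed text as a .reg file and import it on the test VM. For production, the GPO remains the better option since it applies to every machine in the OU.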

Please note, the NVIDIA licensing service accesses the network location using the service’s security context (not the user’s context), which is why I chose the SYSVOL share, as the computer accounts have read access to this location (for example, reading the GPOs on boot and user logon).

Additionally, note that the registry key and location may vary if you’re using older versions of the NVIDIA vGPU Driver. The key used in this post is for versions 16.x and 17.x.

May 09, 2024
 
NVIDIA vGPU

You may notice a frozen session or frozen screen with NVIDIA vGPU, Windows 11, and VMware Horizon in your VDI environment.

While I’ve mostly observed this issue using non-persistent Instant Clones with vGPU on Windows 11 23H2, I have also noticed issues and anomalies with persistent VMs as well.

I’ve noticed this issue across multiple customer environments, and was able to replicate it in my own environment. I’ll go over the problem and solution below.

The Problem

This issue occurs due to the combination of hardware being used, the VMware SVGA driver, a secondary “Virtual Display”, and the resolution being set during logon and initialization of the VMware Horizon VDI session.

When a user logs on, the resolutions are set across all virtual displays. There is an issue where due to a timeout (observed in log files), the resolution cannot be set, resulting in a session that either appears to be frozen, or if active, the interactive cursor is actually off-set from the visible display (your mouse is somewhere else, other than where it’s being displayed).

The Solution

In my troubleshooting, I’ve identified the following solutions:

Solution #1

To resolve this issue, disable the “VMware SVGA 3D” Display Adapter in the Windows Device Manager. Simply right-click on “VMware SVGA 3D” under Display adapters and set it to Disabled.

After disabling this Display Adapter, you’ll notice the issue is resolved, and you’ll also notice your VDI sessions are established very quickly (including initializing the resolutions with vGPU).

If you’re using non-persistent VDI (VMware Horizon Instant Clones), you’ll need to perform this on your base image.

Note: By disabling this adapter, you will lose the ability to use the VMware Console on VMware vSphere vCenter. To gain console access, you’ll either need to enable the VMware SVGA 3D adapter in a VDI session, or remove the vGPU adapter.

Solution #2

Another solution is to force the VDI session to use the VMware Horizon Indirect Display Driver.

  1. Open Windows Registry and navigate to the following location: HKLM\Software\Policies\VMware, Inc.\VMware Blast\Config
  2. Create a new Registry String (REG_SZ) called “PixelProviderForceViddCapture” and set it to: 1

Note: If you force the use of the VMware Horizon Indirect Display Driver as your primary display driver, you may run into issues where applications that require the features and capabilities of an NVIDIA GPU fail to detect your NVIDIA vGPU.
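For Solution #2, the registry value can also be set from a script. Below is a minimal Python sketch that builds the matching `reg add` command and runs it inside the Windows guest; the function names are my own, and you should test it on a non-production image first.

```python
import subprocess
import sys

# Key and value come from the steps above.
BLAST_CONFIG_KEY = r"HKLM\Software\Policies\VMware, Inc.\VMware Blast\Config"


def reg_add_command(key, name, data):
    """Build a `reg add` command that writes a REG_SZ value (/f overwrites)."""
    return ["reg", "add", key, "/v", name, "/t", "REG_SZ", "/d", data, "/f"]


def force_vidd_capture():
    """Set PixelProviderForceViddCapture to 1 (needs an elevated prompt)."""
    if sys.platform != "win32":
        raise RuntimeError("Run this inside the Windows guest VM")
    cmd = reg_add_command(BLAST_CONFIG_KEY, "PixelProviderForceViddCapture", "1")
    subprocess.run(cmd, check=True)


# Example: call force_vidd_capture() on your base image, then recompose the pool.
```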

Jan 06, 2024
 
vMotion with vGPU

Normally, any VMs that are NVIDIA vGPU enabled have to be migrated with a manual vMotion to evacuate a host that is being placed into maintenance mode. While we may have grown accustomed to this, there is a better way, with vGPU Enabled VM DRS Evacuation during Maintenance mode!

A new feature introduced with vSphere 7.0 U3f is the ability to configure and allow automatic vMotion of VMs with vGPUs, meaning that DRS can now migrate your VDI and AI/ML vGPU enabled workloads when hosts are placed into maintenance mode. This also allows you to streamline remediation with vLCM when updating vGPU enabled hosts running vGPU enabled VMs.

Additionally, as of vSphere 8.0 U2, DRS can now estimate the STUN times required for vMotion of vGPU enabled VMs, and control whether automatic DRS vMotions are allowed. This STUN time limit can be set by an administrator.

Enable automatic vMotion evacuation of vGPU enabled VMs

To enable the automatic vMotion of vGPU enabled VMs on your vSphere Cluster:

  1. Navigate to your vSphere Cluster.
  2. Click on the “Configure” Tab, and then select “vSphere DRS”, and click “Edit”.
  3. Navigate to the “Advanced Options” tab.
  4. Add “VgpuMMAutomationTimeoutSecs” and set to “-1”.

After performing the above, when you place a host with vGPU enabled Virtual Machines into Maintenance Mode, vSphere DRS will evacuate and migrate the VMs to other hosts in the cluster that have the required hardware.

If you attempt to place a host into Maintenance Mode without enabling automatic vMotion of vGPU enabled VMs, it will fail with the error: “DRS failed to generate a vMotion recommendation for a virtual machine on a host entering Maintenance Mode”.

Enable and Configure vGPU STUN Time Estimate and Limits

If you are running vSphere 8U2 or higher, you can enable vGPU STUN time estimation and limits for DRS on the vGPU enabled cluster. Similar to the instructions above, we can add and configure two variables to the vSphere DRS cluster “Advanced Options”.

To enable STUN time estimation, add PassthroughDrsAutomation and set to “1”.

To override the default vMotion STUN time limit of 100 seconds, add VmDevicesStunTimeTolerated and set it to your preferred maximum number of seconds. Alternatively, you can set this limit Per VM by navigating to the VM in vSphere and adding this variable under the “VM Options” “Advanced Settings” section.
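To keep the three DRS “Advanced Options” from this post straight, here is a small Python sketch that assembles them into a single key/value set. The option names come from the steps above; the helper itself is hypothetical and only builds the values, it does not talk to vCenter.

```python
def drs_vgpu_options(stun_estimation=False, stun_limit_secs=None):
    """Build the DRS Advanced Options described in this post as a dict."""
    # Always allow automatic evacuation of vGPU enabled VMs.
    options = {"VgpuMMAutomationTimeoutSecs": "-1"}
    if stun_estimation:
        # Available on vSphere 8.0 U2 or higher.
        options["PassthroughDrsAutomation"] = "1"
    if stun_limit_secs is not None:
        if stun_limit_secs <= 0:
            raise ValueError("STUN time limit must be a positive number of seconds")
        # Overrides the default limit of 100 seconds.
        options["VmDevicesStunTimeTolerated"] = str(stun_limit_secs)
    return options


# Example: a cluster with STUN estimation and a 60 second limit (illustrative value).
for key, value in drs_vgpu_options(stun_estimation=True, stun_limit_secs=60).items():
    print(f"{key} = {value}")
```

You can enter the resulting keys and values by hand in the “Advanced Options” tab, or feed them into whatever automation tooling you already use for cluster configuration.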

Additional Documentation