Jan 112024
 
ESXi-Host-Decommission

While most of us frequently deploy new ESXi hosts, a question and task not oftenly discussed is how to properly decommission a VMware ESXi host.

Some might be surprised to learn that you cannot simply just power down and remove the host from the vCenter server, as there are a number of steps that must be taken beforehand to ensure a proper successful decommission. Properly decommissioning the ESXi host avoids orphaned objects in the vCenter database, which can sometimes cause problems in the future.

Today we’ll go over how to properly decommission a VMware ESXi host in an environment with VMware vCenter Server.

The Process – How to decommission ESXi

We will detail the process and considerations to decommission an ESXi host. We will assume that you have since migrated all your VMs, templates, and files from the host, and it contains no data that requires backup or migration.

VMware ESXi Host Decommission Procedures
VMware ESXi Host Decommission Procedures

Process in Short:

  1. Enter Maintenance Mode
  2. Remove Host from vDS Switches
  3. Unmount and Detach iSCSI LUNs
  4. Move host from cluster to datacenter as standalone host
  5. Remove Host from Inventory

Please read further for extended procedures and more information.

Enter Maintenance Mode

We enter maintenance mode to confirm that no VMs are running on the host. You can simply right click the host, and enter maintenance mode.

Remove Host from vDS Switches

You must gracefully remove the host from any vDS switches (VMware Distributed Switches) before removing the host from vCenter Server.

You can create a standard vSwitch and migrate vmk (VMware Kernel) adapters from the vDS switch to standard vSwitch, to maintain communication with the vCenter server and other networks.

Please Note: If you are using vDS switches for iSCSI connectivity, you must gracefully develop a plan to deal with this beforehand, either by unmounting/detaching the iSCSI LUNs on the vDS before removing the switch, or gracefully migrating the vmk adapters to a standard vSwitch, using MPIO to avoid losing connectivity during the process.

Unmount and Detach iSCSI LUNs

You can now proceed to unmount and detach iSCSI LUNs from the selected system:

  1. Unmount the iSCSI LUN(s) from the host
  2. Detach the iSCSI LUN(s) from the host

You will unmount only on the selected host to be decommissioned, and then detach the LUNs (again only on the host you are decommissioning).

Move Host from Cluster to Datacenter as standalone host

While this may not be required, I usually do this to let vSphere Cluster Services (HA/DRS) adjust for the host removal, and also deal with reconfiguration of the HA agent on the ESXi Host. You can simply move the host from the cluster to the parent datacenter level.

Remove Host from Inventory

Once the host has been moved and a moment or two have elapsed, you can now proceed to remove the host from inventory.

While the host is powered on and still connected to vCenter, right click on the host and choose “Remove from Inventory”. This will gracefully remove objects from vCenter, and also uninstall the HA agent from the ESXi host.

Host Repurposing

At this point, you can now log directly on to the ESXi host using the local root password, and shutdown the host.

Dec 082023
 
vCenter-Root-CA-Missing

Today we’ll go over how to install the vSphere vCenter Root Certificate on your client system.

Certificates are designed to verify the identity of the systems, software, and/or resources we are accessing. If we aren’t able to verify and authenticate what we are accessing, how do we know that the resource we are sending information to, is really who they are?

Installing the vSphere vCenter Root Certificate on your client system, allows you to verify the identity of your VMware vCenter server, VMware ESXi hosts, and other resources, all while getting rid of those pesky certificate errors.

Certificate warning when connecting to vCenter vCSA
Certificate warning when connecting to vCenter vCSA

I see too many VMware vSphere administrators simply dismiss the certificate warnings, when instead they (and you) should be installing the Root CA on your system.

Why install the vCenter Server Root CA

Installing the vCenter Server’s Root CA, allows your computer to trust, verify, and validate any certificates issued by the vSphere Root Certification authority running on your vCenter appliance (vCSA). Essentially this translates to the following:

  • Your system will trust the Root CA and all certificates issued by the Root CA
    • This includes: VMware vCenter, vCSA VAMI, and ESXi hosts
  • When connecting to your vCenter server or ESXi hosts, you will not be presented with certificate issues
  • You will no longer have vCenter OVF Import and Datastore File Access Issues
    • This includes errors when deploying OVF templates
    • This includes errors when uploading files directly to a datastore
File Upload in vCenter to ESXi host operation failed

In addition to all of the above, you will start to take advantage of certificate based validation. Your system will verify and validate that when you connect to your vCenter or ESXi hosts, that you are indeed actually connecting to the intended system. When things are working, you won’t be prompted with a notification of certificate errors, whereas if something is wrong, you will be notifying of a possible security event.

How to install the vCenter Root CA

To install the vCenter Root CA on your system, perform the following:

  1. Navigate to your VMware vCenter “Getting Started” page.
    • This is the IP or FQDN of your vCenter server without the “ui” after the address. We only want to access the base domain.
    • Do not click on “Launch vSphere Client”.
  2. Right click on “Download trusted root CA certificates”, and click on save link as.
    Link to download vCenter trusted root CA Certificates
  3. Save this ZIP file to your computer, and extract the archive file
    • You must extract the ZIP file, do not open it by double-clicking on the ZIP file.
  4. Open and navigate through the extracted folders (certs/win in my case) and locate the certificates.
    VMware vCenter Root Certificates
  5. For each file that has the type of “Security Certificate”, right click on it and choose “Install Certificate”.
  6. Change “Store Location” to “Local Machine”
    • This makes your system trust the certificate, not just your user profile
  7. Choose “Place all certificates in the following store”, click Browse, and select “Trusted Root Certification Authorities”.
    Screenshot to Place in Trusted Root Certification Authorities
  8. Complete the wizard. If successful, you’ll see: “The import was successful.”.
  9. Repeat this for each file in that folder with the type of “Security Certificate”.

Alternatively, you can use a GPO with Active Directory or other workstation management techniques to deploy the Root CAs to multiple systems or all the systems in your domain.

Jul 252023
 

When it comes to virtualized workloads, one thing I commonly see overlooked in the design of the solution, is the placement of workloads. In this post, I want to cover VMware vSphere VM placement rules using the “VM/Host Rules” feature.

This is a feature that I commonly see overlooked and not configured, especially in smaller single cluster environments, however I’ve also seen this happen in very large scale environments as well.

Let’s cover the why, what, who, and how…

VM Workloads

While VMware vSphere does have a number of technologies built in for redundancy, load-balancing, and availability, as part of the larger solution we often find our workloads, specifically 3rd party platforms, with their own solutions that accomplish the same thing.

We need to identify which HA (High Availability) or redundancy solution to use, based on the application, service, and how it works.

For example, using VMware vSphere HA, or High Availability, if vCenter (and/or vCLS) detects a host goes offline, it can restart the workload on other online hosts. There is time associated with the detection and boot time, resulting in a loss of service during this period.

Third party solutions often have their own high availability or redundancy built in to the solution, such as Microsoft Active Directory. In this case with a standard configuration, at any time, any domain controller can respond to a clients request for resources. If one DC goes offline, other DCs can respond to the request resulting in no downtime.

Obviously, in the case of Active Directory Domain Controllers, you’d much prefer to have multiple DCs in your environment, instead of using one with vSphere HA.

Additionally, if you did have multiple domain controllers, you’d want to make sure they aren’t all placed on the same ESXi host. This is where we start to incorporate VM placement in to our solution.

VM Placement

When it comes to 3rd party solutions like mentioned above, we need to identify these workloads and factor them in to the design of the solution we are either implementing, maintaining, or improving.

Example of VM workloads used with VM Placement

A few examples of these workloads with their own load-balancing and availability technologies:

  • Microsoft Windows Active Directory Domain Controllers
  • Microsoft Windows Servers running DNS/DHCP Servers
  • Virtualized Active/Active or Active/Passive Firewall Appliances
  • VMware Horizon UAG (Unified Access Gateway) configured in HA mode
  • Other servers/services that have their own availability systems

As you can see, the applications all have their own special solution for availability, so we must insure the different “nodes” or “instances” are running on different ESXi hosts to avoid a host failure bringing down the entire solution.

Unless otherwise specified by the 3rd party vendor, I would recommend using VM/Host Rules in combination with vSphere DRS and HA.

Configuring VM Placement with VM/Host Rules

To configure these rules, follow the instructions below:

  1. Log on to your VMware vCenter Server
  2. Select a Cluster
  3. Click on the “Configure” tab, and then “VM/Host Rules”
    • Here you can Add/Edit/Delete VM Host Rules
  4. Click on “Add”, and give the rule a new name (Example: Domain Controllers)
  5. For “Type”, select “Separate Virtual Machines”
  6. Click “Add” and select your Domain Controllers and add them to the rule.
Screenshot rule creation for VM placement using VM Host Rules
Domain Controller VM Placement VM Host Rule

After you click “OK”, the rule should now be saved, and DRS will make sure these VMs are now running on separate hosts.

Below you can see another example of a configured system, separating 2 Active/Passive Firewall appliances.

VM placement and VM/Host Rules for Firewall appliances
VM/Host Rules for Firewall Appliances

As you can see, VM placement with VM/Host Rules is very easy to configure and deploy.

Additional Considerations

Note, if you implement these rules and do not have enough hosts to fullfill the requirements, the hosts may fail to be evacuated by DRS when placing in maintenance mode, or remediating with vLCM (Lifecycle Manager).

In this case, you’ll need to manually vMotion the VM’s to other hosts (to violate the rule) or shut some down.

Jul 242023
 
Picture of an DL360p Gen8 1U Rack Server with IO-PEX40152 Installed

A few months ago, you may have seen my post detailing my experience with ESXi 7.0 on HP Proliant DL360p Gen8 servers. I now have an update as I have successfully loaded ESXi 8.0 on HPE Proliant DL360p Gen8 servers, and want to share my experience.

It wasn’t as eventful as one would have expected, but I wanted to share what’s required, what works, and stability observations.

Please note, this is NOT supported and NOT recommended for production environments. Use the information at your own risk.

A special thank you goes out to William Lam and his post on Homelab considerations for vSphere 8, which provided me with the boot parameter required to allow legacy CPUs.

ESXi on the DL360p Gen8

With the release of vSphere 8.0 Update 1, and all the new features and functionality that come with the vSphere 8 release as a whole, I decided it was time to attempt to update my homelab.

In my setup, I have the following:

  • 2 x HPE Proliant DL360p Gen8 Servers
    • Dual Intel Xeon E5-2660v2 Processors in each server
    • USB and/or SD for booting ESXi
    • No other internal storage
    • NVIDIA A2 vGPU (for use with VMware Horizon)
  • External SAN iSCSI Storage

Since I have 2 servers, I decided to do a fresh install using the generic installer, and then use the HPE addon to install all the HPE addons, drivers, and software. I would perform these steps on one server at a time, continuing to the next if all went well.

I went ahead and documented the configuration of my servers beforehand, and had already had upgraded my VMware vCenter vCSA appliance from 7U3 to 8U1. Note, that you should always upgrade your vCenter Server first, and then your ESXi hosts.

To my surprise the install went very smooth (after enabling legacy CPUs in the installer) on one of the hosts, and after a few days with no stability issues, I then proceeded and upgraded the 2nd host.

I’ve been running with 100% for 25+ days without any issues.

The process – Installing ESXi 8.0

I used the following steps to install VMware vSphere ESXi 8 on my HP Proliant Gen8 Server:

  1. Download the Generic ESXi installer from VMware directly.
    1. Link: Download VMware vSphere
  2. Download the “HPE Custom Addon for ESXi 8.1”.
    1. Link: HPE Custom Addon for ESXi 8.0 U1 June 2023
    2. Other versions of the Addon are here: HPE Customized ESXi Image.
  3. Boot server with Generic ESXi installer media (CD or ISO)
    • IMPORTANT: Press “Shift + o” (Shift key, and letter “o”) to interrupt the ESXi boot loader, and add “AllowLegacyCPU=true” to the kernel boot parameters.
  4. Continue to install ESXi as normal.
    • You may see warnings about using a legacy CPU, you can ignore these.
  5. Complete initial configuration of ESXi host
  6. Mount NFS or iSCSI datastore.
  7. Copy HPE Custom Addon for ESXi zip file to datastore.
  8. Enable SSH on host (or use console).
  9. Place host in to maintenance mode.
  10. Run “esxcli software vib install -d /vmfs/volumes/datastore-name/folder-name/HPE-801.0.0.11.3.1.1-Jun2023-Addon-depot.zip” from the command line.
  11. The install will run and complete successfully.
  12. Restart your server as needed, you’ll now notice that not only were HPE drivers installed, but also agents like the Agentless management agent, and iLO integrations.

After that, everything was good to go… Here you can see version information from one of the ESXi hosts:

ESXi 8 on HPE Proliant DL360p Gen8
VMware ESXi version 8.0.1 Build 21813344 on HPE Proliant DL360p Gen8 Server

What works, and what doesn’t

I was surprised to see that everything works, including the P420i embedded RAID controller. Please note that I am not using the RAID controller, so I have not performed extensive testing on it.

HPE P420i RAID Controller with VMware vSphere ESXi 8
HPE P420i RAID Controller with VMware vSphere ESXi 8

All Hardware health information is present, and ESXi is functioning as one would expect if running a supported version on the platform.

Additional Information

Note that with vSphere 8, VMware is deprecating vLCM baselines. Your focus should be to update your ESXi instances using cluster image based update images. You can incorporate vendor add-ons and components to create a customized image for deployment.

Mar 062023
 
VMware vSphere 7 Logo

You might ask if/what the procedure is for updating Enhanced Linked Mode vCenter Server Instances, or is there even any considerations that apply?

vCenter Enhanced Link Mode is a feature that allows you to link a total of 15 vCenter Instances in to a single, Single Sign On (SSO) vSphere domain. This allows you to have a single set of credentials to manage all 15 instances, as well as the ability to manage all of them from a single pane of glass.

When it comes to environments with multiple vCenter instance and/or vCSA appliances, this really helps manageability, and visibility.

Enhanced Linked Mode Upgrade Considerations

To answer the question above: Yes, when you’re running Enhanced Linked Mode (ELM) to link multiple vCenter Server, special considerations and requirements exist when it comes to updating or upgrading your vCenter Server instances and vCSA appliances.

Multiple VMware vCenter Server Instances (vCSA) Running in Enhanced Link Mode (ELM)
Multiple VMware vCenter Server Instances (vCSA) Running in Enhanced Link Mode (ELM)

Not only have these procedures been documented in older VMware documentation, but I recently reviewed and confirmed the best practices with VMware GSS while on a support case.

Procedure for updating vCenter with ELM

  1. Configure/Confirm that the vCenter File-Based Backup in VAMI is configured, functioning, and that you are creating valid file based backups.
  2. Create a manual file-based backup with VAMI
  3. Power down all vCenter Instances and vCSA Appliances in your environment
  4. Perform a cold snapshot of all vCenter Instances and vCSA appliances
    • *This is critical* – You need a valid offline snapshot taken of all appliances powered off at the same point in time
  5. Power on the vCenter/vCSA Virtual Machines (VMs)
  6. Perform the update or upgrade

Recovering from a failed Update

IMPORTANT: In the event that an update or upgrade fails, you must revert all vCenter Instances and/or vCSA appliances back to the previous snapshot!

You cannot selectively choose single or individual instances, as this may cause mismatches in data and configuration between the instances as they have databases that are not in sync, and are from different points in time.

Additionally, if you are in a situation where you’re considering or planning to restore previous snapshots to recover from a failed update, you should do so sooner than later. As time progresses, service accounts and identifiers update in the VMware vSphere infrastructure. Delaying the restore too long could cause this information to get out of sync with the ESXi hosts after performing a snapshot restore/revert.

Feb 252023
 
vCenter-Root-CA-Missing

When using VMware vSphere, you may notice vCenter OVF Import and Datastore File Access Issues, when performing various tasks with OVF Imports, as well as uploading and/or downloading files from datastores.

These issues can cause a number of symptoms including errors, unexpected status codes, and also just simply failing for an undetermined reason.

vCenter File Upload failed error "The Operation failed."
vCenter File Upload: The Operation failed.

The Problem

For this situation, the symptoms will occur when performing one of the following tasks:

  • Cannot Upload File to datastore
  • Cannot Download File from datastore
  • Cannot Import OVF Template
  • Cannot Export OVF Template

An example of errors that the user may see:

  • The operation failed for an undetermined reason.
  • The operation failed.
  • unexpectedStatusCode":0
  • unexpectedStatusCode (0)
  • HTTP 500 Error
  • NET::ERR_CERT_AUTHORITY_INVALID

See below for some example screenshots of errors you may see.

vCenter Error: "The operation failed for an undetermined reason."
The Operation failed: The Operation failed for an undetermined reason.
Chrome and vCenter report "NET::ERR_CERT_AUTHORITY_INVALID" error
“NET::ERR_CERT_AUTHORITY_INVALID”

Please note, that this condition can cause other issues and errors as well.

The Solution

When using VMware vSphere, the vCenter server acts as it’s own Root Certification Authority, and uses SSL certificates to facilitate communication and encryption between various services in the solution, as well as the communication between the vCenter Server, ESXi hosts, and any client computers accessing vCenter via the web HTML5 interface.

This Root Certification Authority running on the vCenter Server creates and issues certificates to these services and hosts, which are issues under the Root CA Certificate.

While vCenter automatically handles the certificate trusts between the services, as well as the communicate between the vCenter Server and ESXi hosts (this is automatically setup when adding hosts to vCenter), it cannot automatically make your (client) computer trust the entire certificate authority, as well as all the child certificates.

To resole this issue, you’ll need to follow my guide on How to Install the vSphere vCenter Root Certificate on your computer you are using to connect to the vCenter interface.

After installing the vCenter Root CA on your system, the issue will be resolved.

Nov 202022
 

Today I want to talk about Memory Deduplication on ESXi with Transparent Page Sharing (TPS). This is a technology that isn’t widely known about, even amongst IT professionals with significant experience with VMware products.

And you may ask “Memory Deduplication, why aren’t we using this?!?” as it sounds like a pretty cool piece of technology… Well, I’m about to tell you why you’re not (Inter-VM), and share a few examples of where you would want to enable this.

I also want to show you how to enable TPS globally (Inter-VM), and also discuss TPS being used with VMware Horizon and VDI.

What is Transparent Page Sharing (TPS)?

Transparent Page Sharing is the process in which ESXi can provide memory deduplication by storing duplicate memory pages as a single page on the physical memory of the host. This process stops the system from storing redundant memory pages, and thus frees up physical memory for other uses.

If my memory serves me right, this was originally enabled by default in ESX/ESXi version 5, but was later globally disabled due to security vulnerabilities and concerns.

Note, that TPS is still enabled by default from within the same VM, even today with ESXi 8.

Security Concerns

I recall two potential scenarios and security concerns which led to VMware changing the original default behavior for TPS.

  • Scenario 1 included a concern about an attacker gaining access to a VM, and then having the ability to modify the memory contents of another VM.
  • Scenario 2 included a concern where an attacker may be able to get access to encryption keys used on another system.

A quick search led to a KB titled “Security considerations and disallowing inter-Virtual Machine Transparent Page Sharing (2080735)“, which outlines the details of scenario 2, along with stating “This technique works only in a highly controlled system configured in a non-standard way that VMware believes would not be recreated in a production environment”.

With that being said, it sounds like this would be an extremely difficult attack that requires systems to be configured in a non-standard way.

Current status of TPS

Believe it or not, TPS and memory deduplication is still enabled, however it’s only deduplicating pages from within the same VM. TPS is not deduplicating pages from multiple VMs.

Additionally, VMware has given us controls to configure TPS to allow it amongst multiple VMs, or even enable it globally across the ESXi host.

See below for the settings to configure TPS on ESXi via “Advanced Settings”:

A table providing configurable options for Transparent Page Sharing (TPS) on VMware vSphere ESXi
Transparent Page Sharing (TPS) Settings for ESXi Host

The above table was provided by VMware’s “Additional Transparent Page Sharing management capabilities and new default settings (2097593)” KB.

In short, you could enable TPS globally (Inter-VM) by setting “Mem.ShareForceSalting” in “Advanced Settings”, to a value of “0”. You can also use the salting to configure groups of VMs that are allow to share memory pages.

Additionally, you can tweak the behavior of TPS by modifying some of the settings shown below:

TPS Memory Sharing Settings

As you can see you can configure things like the scanning occurrence (Mem.ShareScanTime) of how often the system will check for memory pages that can be shared/deduplicated and other settings.

TPS is enabled, but not working

So, you may have decided to enable TPS in your environment, but you’re noticing that either no, or very little memory pages are being marked as shared.

ESXi Memory Graph showing Memory Deduplication from TPS
TPS Memory Deduplication – Amount of host physical memory that backs shared guest physical memory

In the example above, you’ll notice that on a loaded host, with TPS enabled globally (Inter-VM, amongst all VMs), that the host is only deduplicating 1,052KB of memory.

This is because you will most often only see TPS being heavily utilized on an ESXi host that has over-committed memory, there’s also a chance that you simply don’t have enough memory pages that can be duplicated.

Memory Deduplication, TPS, and VMware Horizon VDI

Because VMware Horizon utilizes the “vmfork” with “Just-in-Time” desktop delivery, non-persistent VDI will benefit from some level of memory deduplication by default when using Instant Clones with non-persistent VDI. This is because non-persistent VDI guests are spawned from a running base image.

Additionally, you can further implement, enable, and configure TPS by configuring some Transparent Page Sharing options inside of the VMware Horizon Administration console.

When creating a Desktop Pool, you can set the “Transparent Page Sharing” open to “Virtual Machine” (Memory dedupe inside of the VM only), “Pool” (Memory dedupe across the Desktop Pool), “Pod” (Dedupe across the pod), or “Global” (Full Inter-VM memory deduplication across the ESXi host).

If you enabled TPS on the ESXi host globally, these settings are null and not used.

TPS Use Cases

So you might be asking when it’s a good time to use TPS?

  • The Homelab – When is a homelab not a good reason to try something? Looking to save some memory and overcommit memory resources? Implement TPS.
  • VDI Environments – On highly dense hosts, you may consider implementing TPS at some level to maximize the utilization of resources, however you must be aware of the security consequences and factor this in when configuring TPS.
  • Environments with no Sensitive Information – It’s hard to imagine, but if you have an environment that doesn’t contain any sensitive information and doesn’t use any security keys, it would be suitable to enable TPS.

I’m sure there’s a number of other use cases, so leave a comment if you can think of one.

Conclusion

In my opinion Transparent Page Sharing is a technology that should not be forgotten and discarded. VMware admins should be aware of it, how to configure it, and what the implications are of using it.

If you are considering enabling TPS in your environment, you must factor in the potential security consequences of doing so.

Oct 302022
 
vGPU nvidia-smi GPU Link Info

If you’re like me, you want to make sure that your environment is as optimized as possible. I recently noticed that my NVIDIA A2 vGPU was reporting the vGPU PCIe Link Speed and Generation that the card was using was below what it was supposed to be running at on my VMware vSphere ESXi host.

I needed to find out if this was being reported incorrectly, if there was an issue, or something else effecting this. In my case, the specific GPU I was using is supposed to support PCIe Gen4, and has a physical connector supporting 4x, my host has PCIe Gen3 slots, so I should at least be getting Gen3 speeds.

NVIDIA A2 vGPU

The Problem

When running the command “nvidia-smi -q”, the GPU was reporting that it was only running at PCIe Gen 1 speeds, as shown below:

        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
                Device Current            : 1
                Device Max                : 4
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 8x

Something else worth noting, is that the card states that it supports a 16x interface, when it actually only physical has a 8x connector. I believe they use this chip on another board that has multiple GPUs on a single board that actually supports 16x.

You could say I was quite puzzled. Why would the card only be running at PCIe Generation 1 speeds? I thought it could be any of the scenarios below:

  • Dynamic mode that alternates when required (possibly for power savings)
  • Hardware issue
  • Hardware Limitation (I’m using this in an older server)
  • Software issues
  • Configuration issue

Unfortunately, when searching the internet, I couldn’t find many references to this metric, however I did find references from other user’s copy/pastes of “nvidia-smi -q” which had the same values (running PCIe Gen1), even with beefier and more high-end cards.

The Solution

After some more searching, I finally came across an NVIDIA technical document titled “Useful nvidia-smi Queries” that states that the current PCIe Generation Link speed “may be reduced when the GPU is not in use”. This confirms that it’s dynamic and adjusts when needed.

Finally, I decided to give some games a shot in a couple of the VMs, and to my surprise when running a game, the “Device Current” and “Current” PCIe Generation changed to PCIe Gen3 (note that my server isn’t capable of PCIe Gen4, which is the cards maximum), as shown below:

        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
                Device Current            : 3
                Device Max                : 4
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 8x

In conclusion, if you notice this in your environment, do not be alarmed as this is completely normal and expected behavior.

Sep 042022
 

When either directly passing through a GPU, or attaching an NVIDIA vGPU to a Virtual Machine on VMware ESXi that has more than 16GB of Video Memory, you may run in to a situation where the VM fails to boot with the error “Module ‘DevicePowerOn’ power on failed.”. Special considerations are required when performing GPU or vGPU Passthrough with 16GB+ of video memory.

This issue is specifically caused by memory mapping a GPU or vGPU device that has 16GB of memory or higher, and could involve both the host system (the ESXi host) and/or the Virtual Machine configuration.

In this post, I’ll address the considerations and requirements to passthrough these devices to virtual machines in your environment.

In the order of occurrence, it’s usually VM configuration related, however if the recommendations in the “VM Configuration Considerations” section do not resolve the issue, please proceed to reviewing the “ESXi Host Considerations” section.

Please note that if the issue is host related, other errors may be present, or the device may not even be visible to ESXi.

VM GPU and vGPU Configuration Considerations

First and foremost, all new VMs should be created using the “EFI” Firmware type. EFI provides numerous advantages in device access and memory mapping versus the older style “BIOS” firmware types.

VM Firmware type EFI

To do this, create a new virtual machine, navigate to “VM Options”, expand “Boot Options”, and confirm/change the Firmware to “EFI”. I recommend this for all new VMs, and not only for VMs accessing GPUs or vGPUs with over 16GB of memory. Please note that you shouldn’t change an existing VM, and should do this on a fresh new VM.

With performing GPU or vGPU Passthrough with 16GB+ of video memory, you’ll need to create a couple of entries under “Advanced” settings to properly configure access to these PCIe devices and provide the proper environment for memory mapping. The lack of these settings is specifically what causes the “Module ‘DevicePowerOn’ power on failed.” error.

Under the VM settings, head over to “VM Options”, expand “Advanced” and click on “Edit Configuration”, click on “Add Configuration Params”, and add the following entries:

pciPassthru.use64bitMMIO=”TRUE”
pciPassthru.64bitMMIOSizeGB=32

Example below:

VM GPU and vGPU Memory Settings for 16GB or higher memory mapping

You’ll notice that while our GPU or vGPU profile may have 16GB of memory, we need to double that value, and set it for the “pciPassthru.64bitMMIOSizeGB” variable. If your card or vGPU profile had 32GB, you’d set it to “64”.

Additionally if you were passing through multiple GPUs or vGPU devices, you’d need to factor all the memory being mapped, and double the combined amount.

ESXi GPU and vGPU Host Considerations

On most new and modern servers, the host level doesn’t require any special configuration as they are already designed to pass through such devices to the hypervisor properly. However in some special cases, and/or when using older servers, you may need to modify configuration and settings in the UEFI or BIOS.

If setting the VM Configuration above still results in the same error (or possibly other errors), than you most likely need to make modifications to the ESXi hosts BIOS/UEFI/RBSU to allow the proper memory mapping of the PCIe device, in our case being the GPU.

This is where things get a bit tricky because every server manufacturer has different settings that will need to be configured.

Look for the following settings, or settings with similar terminology:

  • “Memory Mapping Above 4G”
  • “Above 4G Decoding”
  • “PCI Express 64-Bit BAR Support”
  • “64-Bit IOMMU Mapping”

Once you find the correct setting or settings, enable them.

Every vendor could be using different terminology and there may be other settings that need to be configured that I don’t have listed above. In my case, I had to go in to a secret “SERVICE OPTIONS” menu on my HPE Proliant DL360p Gen8, as documented here.

After performing the recommendations in this guide, you should now be able to passthrough devices with over 16GB of memory.

Additional Resources:

Sep 042022
 

With VMware ESXi 6.5 and 6.7 going End of Life on October 15th, 2022, many of you are looking for options to update hosts in your homelab, especially in my case putting ESXi 7.0 on HP Proliant DL360p Gen8 servers.

As far as support goes, HPE last provided a custom installer for ESXi for versions 6.5 U3 which was released December of 2019. This was the “last Pre-Gen9 custom image” released, as ESXi 7.0 on the DL360p Gen8 is totally unsupported.

Update: Check out my post covering ESXi 8.0 on HPE Proliant DL360p Gen8 servers!

ESXi 6.7 or higher on the Gen8 Servers

The jump from 6.5 to 6.7 was a little easier, as you could use the 6.5 custom installer, and then upgrade to 6.7. For the most part, as long as the hardware itself was supported, you were in pretty good shape.

Additionally, with the HPE vibsdepot loaded in to VMware Update Manager (now known as Lifecycle Manager), you could also keep all the HPE drivers and agents up to date.

ESXi 7.0 on the Gen8 Servers

Some were lucky enough to upgrade their current installs to 7 with no or limited problems, however the general consensus online was to expect problems. There were some major driver changes, which I think at one point led to an advisory to perform a fresh install, even if you had a fully supported configuration with newer generation servers such as the Proliant Gen9 and Gen10 servers, when upgrading from older versions.

In my setup, I have the following:

  • 2 x HPE Proliant DL360p Gen8 Servers
    • Dual Intel Xeon E5-2660v2 Processors in each server
    • USB and/or SD for booting ESXi
    • No other internal storage
  • External SAN iSCSI Storage

Concerns and Considerations

My main concern is to not only have a stable and functioning ESXi 7 instance, but I also, if possible would like to have the HPE drivers, agents, and integrations with iLO.

You must consider that while this is completely unsupported, you do need to make sure that the components of your current configuration are supported, such as the processor and PCIe cards, even if the server as a whole is not supported.

Make sure you reference your hardware on the VMware Compatibility Guide (HCL).

In my case, my processors were supported, however my RAID controller was not. So theoretically, since I’m not using my RAID controllers, I should be fine.

The process – Installing ESXi 7.0

I was able to install ESXi 7.0 on my HPE Proliant Gen8 servers, by performing the following steps.

  1. Download the Generic ESXi installer from VMware directly.
    1. Link: Download VMware vSphere
  2. Download the “HPE Custom Addon for ESXi 7.0”.
    1. Link: HPE Custom Addon for ESXi 7.0 U3 for July 2022
  3. Boot server, install using the Generic Installer downloaded above.
  4. Mount NFS or iSCSI datastore.
  5. Copy HPE Custom Addon for ESXi zip file to datastore.
  6. Enable SSH on host (or use console).
  7. Place host in to maintenance mode.
  8. Run “esxcli software vib install -d /vmfs/volumes/datastore-name/folder-name/HPE-703.0.0.10.9.1.5-Jul2022-Addon-depot.zip” from the command line.
  9. The install will run and complete successfully.
  10. Restart your server as needed, you’ll now notice that not only were HPE drivers installed, but also agents like the Agentless management agent, and iLO integrations.

You’ll now have a functioning instance.

HP Proliant DL360p Gen8 running ESXi 7.0

In my case everything was working, except for the “Smart Array P420i” RAID Controller, which I don’t use anyways.

Additionally, if you have a vCenter instance running, make sure that you add the HPE vibsdepot repo to your Lifecycle Manager. After you add the repo, create a baseline, and attach the baseline to the host, go ahead and proceed to scan, stage, and remediate the server which will then further update all the HPE specific drivers and software.