We’ve all been in the situation where we need to install a driver or vib file, or check “esxtop”. Many advanced administration tasks on ESXi need to be performed via shell access, and to do this you need either a console on the physical ESXi host, an SSH session, or the Remote vCLI.
In this blog post, I’m providing a quick how-to on enabling SSH on an ESXi host in your VMware infrastructure using the vCenter flash-based web administration interface. This will allow you to perform the tasks above, as well as use the “esxcli” command which is frequently needed.
This method should work on all vCenter versions up to 6.7, and ESXi versions up to 6.7.
How to Enable SSH on an ESXi Host Server
Log on to your vCenter server.
On the left hand “Navigator” pane, select the ESXi host.
On the right hand pane, select the “Configure” tab, then “Security Profile” under “System”.
Scroll down and look for “Services” further to the right and select “Edit”.
In the “Edit Security Profile” window, select and highlight “SSH” and then click “Start”.
This method can also be used to stop, restart, and change the startup policy to enable or disable SSH starting on boot.
Congratulations, you can now SSH in to your ESXi host!
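Once the service is started, a quick way to confirm the host is actually accepting SSH connections is to test whether TCP port 22 answers. The following is a minimal sketch using only the Python standard library; the function name and the default timeout are illustrative choices, not part of any VMware tooling.

```python
import socket


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `ssh_port_open("esxi01.example.local")` (a placeholder host name) should return True after SSH is started and False after it is stopped.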
I can’t tell you how excited I am that after many years, I’ve finally purchased and gotten my hands on an Nvidia GRID K1 GPU. This card will be used in my homelab to learn and demo Nvidia GRID accelerated graphics on VMware Horizon View. In this post I’ll outline the details, installation, configuration, and thoughts. And of course I’ll have plenty of pictures below!
The focus will be to use this card both with vGPU and with 3D accelerated vSGA inside an HPe server running ESXi 6.5 and VMware Horizon View 7.8.
Please Note: Some, most, or all of what I’m doing is not officially supported by Nvidia, HPe, and/or VMware. I am simply doing this to learn and demo, and there was a real possibility that it may not have worked since I’m not following the vendor HCL (Hardware Compatibility lists). If you attempt to do this, or something similar, you do so at your own risk.
For some time I’ve been trying to source either an Nvidia GRID K1/K2 or an AMD FirePro S7150 to get started with a simple homelab/demo environment. One of the reasons for the time it took was I didn’t want to spend too much on it, especially with the chances it may not even work.
Essentially, I have 3 Servers:
HPe DL360p Gen8 (Dual Proc, 128GB RAM)
HPe DL360p Gen8 (Dual Proc, 128GB RAM)
HPe ML310e Gen8 v2 (Single Proc, 32GB RAM)
For the DL360p servers, while the servers are beefy enough, have enough power (dual redundant power supplies), and resources, unfortunately the PCIe slots are half-height. In order for me to use a dual-height card, I’d need to rig something up to have an eGPU (external GPU) outside of the server.
As for the ML310e, it’s an entry level tower server. While it does support dual-height (dual slot) PCIe cards, it only has a single 350W power supply, misses some fancy server technologies (I’ve had issues with VT-d, etc), and only a single processor. I should be able to install the card, however I’m worried about powering it (it has no 6pin PCIe power connector), and having ESXi be able to use it.
Finally, I was worried about cooling. The GRID K1 and GRID K2 are typically passively cooled and meant to be installed in to rack servers with fans running at jet engine speeds. If I used the DL360p with an external setup, this would cause issues. If I used the ML310e internally, I had significant doubts that cooling would be enough. The ML310e did have the plastic air baffles, but only had one fan for the expansion cards area, and of course not all the air would pass through the GRID K1 card.
Because of a limited budget, and the possibility I may not even be able to get it working, I didn’t want to spend too much. I found an eBay user local in my city who had a couple Grid K1 and Grid K2 cards, as well as a bunch of other cool stuff.
We spoke and he decided to give me a wicked deal on the Grid K1 card. I thought this was a fantastic idea as the power requirements were significantly less (more likely to work on the ML310e) on the K1 card at 130 W max power, versus the K2 card at 225 W max power.
We set a time and a place to meet. Preemptively I ran out to a local supply store to purchase an LP4 power adapter splitter, as well as a LP4 to 6pin PCIe power adapter. There were no available power connectors inside of the ML310e server so this was needed. I still thought the chances of this working were slim…
I also decided to go ahead and download the Nvidia GRID Software Package. This includes the release notes, user guide, ESXi vib driver (includes vSGA, vGPU), as well as guest drivers for vGPU and pass through. The package also includes the GRID vGPU Manager. The driver I used was from: https://www.nvidia.com/Download/driverResults.aspx/144909/en-us
To install, I copied over the vib file “NVIDIA-vGPU-kepler-VMware_ESXi_6.5_Host_Driver_367.130-1OEM.6220.127.116.1198673.vib” to a datastore, enabled SSH, and then ran the following command to install:
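A minimal sketch of what such an install command typically looks like, assuming the usual `esxcli software vib install -v` form, which expects the full absolute path to the vib file. The datastore and file names below are placeholders, not the values used here.

```python
import shlex


def vib_install_command(datastore, vib_file):
    """Build the esxcli command line that installs a vib stored on a datastore."""
    # esxcli needs the full path; datastores are mounted under /vmfs/volumes.
    vib_path = f"/vmfs/volumes/{datastore}/{vib_file}"
    # shlex.join quotes the path if it contains spaces or other shell characters.
    return shlex.join(["esxcli", "software", "vib", "install", "-v", vib_path])


# "datastore1" and "example-driver.vib" are hypothetical names for illustration.
print(vib_install_command("datastore1", "example-driver.vib"))
```

The printed line can be pasted into the SSH session on the host.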
The command completed successfully and I shut down the host. Now I waited to meet.
We finally met and the transaction went smoothly in a parking lot (people were staring at us as I handed him cash, and he handed me a big brick of something folded inside of grey static wrap). The card looked like it was in beautiful shape, and we had a good but brief chat. I’ll definitely be purchasing some more hardware from him.
Installing the card in the ML310e was difficult and took some time and care. First I had to remove the plastic air baffle. Then I had issues getting it inside of the case as the back bracket was 1cm too long to be able to put the card in. I had to finesse it and slide it in on an angle, but finally got it installed. The back bracket (front side of case) on the other side slid in to the blue plastic case bracket. This was nice as the ML310e was designed for extremely long PCIe expansion cards and has a bracket on the front side of the case to help support and hold the card up as well.
For power I disconnected the DVD-ROM (who uses those anyways, right?), and connected the LP4 splitter and the LP4 to 6pin power adapter. I finally hooked it up to the card.
I laid the cables out nicely and then re-installed the air baffle. Everything was snug and tight.
Please see below for pictures of the Nvidia GRID K1 installed in the ML310e Gen8 V2.
Powering on the server was a tense moment for me. A few things could have happened:
Server won’t power on
Server would power on but hang & report health alert
Nvidia GRID card could overheat
Nvidia GRID card could overheat and become damaged
Nvidia GRID card could overheat and catch fire
Server would boot but not recognize the card
Server would boot, recognize the card, but not work
Server would boot, recognize the card, and work
With great suspense, the server powered on as per normal. No errors or health alerts were presented.
I logged in to iLO on the server, and watched the server perform a BIOS POST, and start its boot to ESXi. Everything was looking well and normal.
After ESXi booted and the server came online in vCenter, I went to the server and confirmed the GRID K1 was detected. I went ahead and configured 2 GPUs for vGPU, and 2 GPUs for 3D vSGA.
I restarted the X.org service (required when changing the options above), and proceeded to add a vGPU to a virtual machine I already had configured and was using for VDI. You do this by adding a “Shared PCI Device”, selecting “NVIDIA GRID vGPU”, and I chose to use the highest profile available on the K1 card called “grid_k180q”.
After adding and selecting ok, you should see a warning telling you that you must allocate and reserve all resources for the virtual machine. Click “ok” and continue.
Power On and Testing
I went ahead and powered on the VM. I used the vSphere VM console to install the Nvidia GRID driver package (included in the driver ZIP file downloaded earlier) on the guest. I then restarted the guest.
After restarting, I logged in via Horizon, and could instantly tell it was working. The next step was to disable the VMware SVGA display adapter in the “Device Manager” and restart the guest again.
Upon restarting again, to see if I had full 3D acceleration, I opened DirectX diagnostics by clicking on “Start” -> “Run” -> “dxdiag”.
It worked! Now it was time to check the temperature of the card to make sure nothing was overheating. I enabled SSH on the ESXi host, logged in, and ran the “nvidia-smi” command.
According to this, the different GPUs ranged from 33C to 50C which was PERFECT! In further testing under stress, I haven’t gotten a core to go above 56C. The ML310e still has an option in the BIOS to increase fan speed, which I may test in the future if the temps get higher.
With “nvidia-smi” you can see the 4 GPUs, power usage, temperatures, memory usage, GPU utilization, and processes. This is the main GPU manager for the card. There are some other flags you can use for relevant information.
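For repeated temperature checks it can help to script the comparison rather than reading the table by eye. The sketch below assumes the installed nvidia-smi build supports the `--query-gpu=index,temperature.gpu --format=csv,noheader` query options, which print one `index, temperature` line per GPU. The sample text is made up to mirror the range mentioned above and was not captured from this host.

```python
def parse_gpu_temps(csv_text):
    """Map GPU index -> temperature in Celsius from 'index, temperature' lines."""
    temps = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps


def gpus_over(temps, limit_c):
    """Return the sorted indexes of GPUs hotter than limit_c."""
    return sorted(i for i, t in temps.items() if t > limit_c)


# Illustrative sample only (hypothetical readings for the four K1 GPUs).
sample = """0, 33
1, 41
2, 45
3, 50"""
temps = parse_gpu_temps(sample)
print(temps)
print("GPUs over 56C:", gpus_over(temps, 56))
```

The 56C limit is simply the highest value seen under stress in this setup; pick whatever threshold suits your card and cooling.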
Overall I’m very impressed, and it’s working great. While I haven’t tested any games, it’s working perfectly for videos, music, YouTube, and multi-monitor support on my 10ZiG 5948qv. I’m using 2 displays with both running at 1920×1080 for resolution.
I’m looking forward to doing some tests with this VM while continuing to use vGPU. I will also be doing some testing utilizing 3D Accelerated vSGA.
The two coolest parts of this project are:
3D Acceleration and Hardware h.264 Encoding on VMware Horizon
Getting a GRID K1 working on an HPe ML310e Gen8 v2
Leave a comment and let me know what you think! Or leave a question!
Unable to boot ESXi from USB or SD Card on HPe Proliant Server
After installing HPe iLO Amplifier on your network and updating iLO 4 firmware to 2.60 or 2.61, you may notice that your HPe Proliant Servers may fail to boot ESXi from a USB drive or SD-Card.
This was occurring on 2 ESXi Hosts. Both were HPe Proliant DL360p Gen8 Servers. One server was using an internal USB drive for ESXi, while the other was using an HPe branded SD Card.
The issue started occurring on both hosts after a planned InfoSight implementation. Both hosts’ iLO controller firmware was upgraded to 2.61, iLO Amplifier was deployed (and the servers added), and the amplifier was connected to an HPe InfoSight account.
Update – May 24th 2019: As an HPe partner, I have been working with HPe, the product manager, and development team on this issue. HPe has provided me with a fix to test that I have been able to verify fully resolves this issue! Stay tuned for more information!
Update – June 5th 2019: Great news! As Bob Perugini (WW Product Manager at HPe) put it: “HPE is happy to announce that this issue has been fixed in latest version of iLO Amplifier Pack, v1.40. To download iLO Amplifier Pack v1.40, go to http://www.hpe.com/servers/iloamplifierpack and click “download”.” Scroll to the bottom of the post for more information!
mboot.c32: attempted DOS system call INT 21 0d00 E8004391 boot:
This issue may occur intermittently, on the majority of boots, or on all boots. Re-installing ESXi on the media, as well as replacing the USB/SD Card, has no effect. Installation will be successful, however the issue is still experienced on boot.
HPe technical support was unable to determine the root of the issue. We found the source of the issue and reported it to HPe technical support and are waiting for an update.
The Issue and Fix
This issue occurs because the HPe iLO Amplifier is running continuous server inventory scans while the hosts are booting. When one inventory completes, it restarts another scan.
The following can be noted:
iLO Amplifier inventory percentage resets back to 0% and starts again numerous times during the server boot
Inventory scan completes, only to restart again numerous times during the server boot
Inventory scan resets back to 0% during numerous different phases of BIOS initialization and POST.
We noticed that once the HPe iLO Amplifier Virtual Machine was powered off, not only did the servers boot faster, but they also booted 100% successfully each time. Powering on the iLO Amplifier would cause the ESXi hosts to fail to boot once again.
I’d also like to note that on the host using the SD-Card, the failed boot would actually completely lock up iLO, and would require physical intervention to disconnect and reconnect the power to the server. We were unable to restart the server once it froze (this did not happen to the host using the USB drive).
There are some settings on the HPe iLO amplifier to control performance and intervals of inventory scans, however we noticed that modifying these settings did not alter or stop the issue, and had no effect.
As a temporary workaround, make sure your iLO amplifier is powered off during any maintenance to avoid hosts freezing/failing to boot.
I attempted downgrading to the earliest supported iLO version, 2.54, and the issue still occurs.
I also upgraded to the newest version 2.62 which presented some new issues.
On the first boot, the BIOS reported memory access issues on Processor 1 socket 1, then another error reporting memory access issues on Processor 1 socket 4.
I disconnected the power cables, reconnected, and restarted the server. This boot, the server didn’t even detect the bootable USB stick.
Again, after shutting down the iLO Amplifier, the server booted properly and the issue disappeared.
Update – May 24th 2019
As an HPe partner, I have been working with HPe, the product manager, and development team on this issue. HPe has provided me with a fix to test that I have been able to verify fully resolves this issue! Stay tuned for more information!
Update – June 5th 2019 – ITS FIXED!!!
Great news, the issue is now fixed! As Bob Perugini (WorldWide Product Manager at HPe) put it:
HPE is happy to announce that this issue has been fixed in latest version of iLO Amplifier Pack, v1.40.
Here’s what’s new in iLO Amplifier Pack v1.40:
Available as a VMware ESXi appliance and as a Hyper-V appliance (Hyper-V is new)
VMware tools have been added to the ESXi appliance
Ability to schedule the time of the daily transmission of Active Health System (AHS) data to InfoSight
Ability to opt-in and allow the IP address and hostname of the server to be transmitted to InfoSight and displayed
Test connectivity button to help verify iLO Amplifier Pack has successfully connected to InfoSight
Allow user authentication credentials for the proxy server when connecting to InfoSight
Added ability to specify IP address or hostname for the HPE RDA connection when connection to InfoSight
Ability to send updated AHS data “now” for an individual server
Ability to stage firmware and driver updates to the iLO Repository and then deploy the staged updates at a later date or time (HPE Gen10 servers only)
Allow the firmware and driver updates of servers whose iLO has been configured in CNSA (Commercial National Security Algorithm) mode (HPE Gen10 servers only)
On an ESXi host when performing a manual unmap on your storage datastore, you may notice a very large (hidden) file on the datastore root called “.asyncUnmapFile”. This file could be taking up terabytes of space, and you aren’t able to delete this file.
Typically the asyncUnmapFile is used by the UNMAP feature on ESXi hosts to deal with unmapping and unallocating storage blocks on the SAN. When you run a manual UNMAP, this file should be created and should appear to be using “0” (no) space (even if it is). When an UNMAP completes, this file should disappear and be automatically removed by the function. If an UNMAP is interrupted, this file will not be deleted, allowing you to restart the process, and upon a full successful completion it should then be deleted.
Some time ago, I had an issue when performing a manual UNMAP, where the ESXi host became unresponsive (due to memory issues). The command appeared to be completed, however I believe it caused potential issues or corruption on the iSCSI datastore. In subsequent runs, the UNMAP appeared to be functioning and working, however I didn’t realize that the asyncUnmapFile had grown to around 1.5TB.
This was noticed during a SAN storage audit, where we saw that the virtual pool on the SAN was using up way more storage than it should be on the datastore.
When we identified the file was this large and causing issues, we attempted to perform 2 UNMAPs (different reclaim sizes) to see if it would be automatically cleared afterwards. It had no effect and the file was unchanged.
We also tried to modify the permissions on the file, however when trying to delete it, it would report that the file or folder was not found, or that it does not exist. This was concerning as we were worried about potential datastore corruption.
It was also noticed that in the hostd.log and vmkernel.log we saw some errors where the host believed that the blocks on the datastore had already been freed: “on volume labeled ‘iSCSIDatastore01’ already freed by another host: This may be a non-issue”
Unfortunately with all the research we did, we couldn’t find a clear-cut solution. With worries that the datastore may be corrupted, we needed to do something.
A decision was ultimately made to Storage vMotion all the VMs (Virtual Machines) to another datastore on a separate storage pool, delete the now empty LUN, and recreate it from scratch. After this, we used Storage vMotion again to move the VMs back.
Instantly I noticed that the VMs on that datastore were running faster (it’s only been 12 hours, so I’ll be adding an update in a few days to confirm). We no longer have the file on any of our datastores.
If anyone has further insight in to this issue, please leave a comment!
So, what happens in a worst-case scenario where your backup system fails, you don’t have any VM snapshots, and the last thing standing in the way of complete data loss is your SAN storage system’s LUN snapshots?
Well, first you fire whoever purchased and implemented the backup system, then secondly you need to start restoring the VM (or VMs) from your SAN LUN snapshots.
While I’ve never had to do this in the past (all the disaster recovery solutions I’ve designed and sold have been tested and function), I’ve always been curious what the process is and would be like. Today I decided to try it out and develop a procedure for restoring a VM from SAN Storage LUN snapshot.
For this test I pretended a VM was corrupt on my VMware vSphere cluster and then restored it to a previous state from a LUN snapshot on my HPe MSA 2040 (identical for the HPe MSA 2050, and MSA 2052) Dual Controller SAN.
To accomplish the restore, we’ll need to create a host mapping on the SAN for the LUN snapshot to a new LUN number available to the hosts. We then need to add and mount the VMFS volume (residing on the snapshot) to the host(s) while assigning it a new signature, and then vMotion the VM from the snapshot’s VMFS to the original datastore.
Important Notes (Read first):
When mounting a VMFS volume from a SAN snapshot, you MUST RE-SIGNATURE THE SNAPSHOT VMFS volume. Not doing so can cause problems.
The snapshot cannot be mapped as read only, VMFS volumes must be marked as writable in order to be mounted on ESXi hosts.
You must follow the proper procedure to gracefully dismount and detach the VMFS volume and storage device before removing the snapshot’s host mapping on the SAN.
We use Storage vMotion to perform a high-speed move and recovery of the VM. If you’re not licensed for Storage vMotion, you can use the datastore file browser and copy/move from the snapshot VMFS volume to live production VMFS volume, however this may be slower.
During this entire process you do not touch, modify, or change any settings on your existing active production LUNs (or LUN numbers).
Restoring a VM from a SAN LUN snapshot will restore a crash consistent copy of the VM. The VM when recovered will believe a system crash occurred and power was lost. This is NOT a graceful application consistent backup and restore.
Please read your SAN documentation for the procedure to access SAN snapshots, and create host mappings. With the MSA 2040 I can do this live during production, however your SAN may be different and your hosts may need to be powered off and disconnected while SAN configuration changes are made.
Pro tip: You can also power on and initialize the VM from the snapshot before initiating the storage vMotion. This will allow you to get production services back online while you’re moving the VM from the snapshot to production VMFS volumes.
I’m not responsible if you damage, corrupt, or cause any damage or issues to your environment if you follow these procedures.
We are assuming that you have already either deleted the damaged VM, or removed it from your inventory and renamed the VM’s folder on the live VMFS datastore to change the name (for example, renaming the folder from “SRV01” to “SRV01.bad”). If you renamed the damaged VM, make sure you have enough space for the new restored VM as well.
Mount the VMFS volume on the LUN snapshot to the ESXi host(s)
Identify the VM you want to recover, write it down.
Identify the datastore that the VM resides on, write it down.
Identify the SAN and identify the LUN number that the VMFS datastore resides on, write it down.
Identify the LUN Snapshot unique name/id/number and write it down, confirm the timestamp to make sure it will contain a valid recovery point.
Log on to the SAN and create a host mapping to present the snapshot (you recorded above) to the hosts using a new and unused LUN number.
Log on to your ESXi host and navigate to configuration, then storage adapters.
Select the iSCSI initiator and click the “Rescan Storage Adapters” button to rescan all iSCSI LUNs.
VMware ESXi Host Rescan Storage Adapter
Ensure both check boxes are checked and hit “Ok”, then wait for the scan to complete (as shown in the “Recent Tasks” window).
VMware ESXi Host Rescan Storage Adapter Window for VMFS Volume and Devices
Now navigate to the “Datastores” tab under configuration, and click on the “Create a new Datastore” button as shown below.
VMware ESXi Host Add Datastore Window
Continue with “VMFS” selected and select continue.
In the next window, you’ll see your existing datastores, as well as your new datastore (from the snapshot). You can leave the “Datastore name” as is since this value will be ignored. In this window you’re going to select the new VMFS datastore from the snapshot. Make sure you confirm this by looking at the LUN number, as well as the value under “SnapshotVolume”. It is critical that you select the snapshot in this window (it should be the new LUN number you added above).
Select next and continue.
On the next window “Mount Option”, you need to change the radio button to “Assign a new signature”. This is critical! This will assign a new signature to differentiate it from your existing real production datastore so that the ESXi hosts don’t confuse it.
Continue with the wizard and complete the mount process. At this point ESXi will resignature the VMFS volume and rename it to “snap-OriginalVolumeNameHere”.
You can now browse the VMFS datastore residing on the LUN snapshot and do anything you’d normally be able to do with a normal datastore.
Copy/Move/vMotion the VM from the snapshot VMFS volume to your production VMFS volume
Note: The next steps are only if you are licensed for storage vMotion. If you aren’t you’ll need to use the copy or move function in the file browsing area to copy or move the VMs to your live production VMFS datastores:
Now we’ll go to the vCenter/ESXi host storage area in the web client, and using the “Files” tab, we’ll browse the snapshot’s VMFS datastore that we just mounted.
Locate the folder for the VM(s) you want to recover, open the folder, right click on the vmx file for the VM and select “Register VM”. Repeat this for any of the VMs you want to recover from the snapshot. Complete the wizard for each VM you register and add it to a host.
Go back to your “Hosts and VMs” view, where you’ll now see the VMs are added.
Select and right click on the VM you want to move from the snapshot datastore to your production live datastore, and select “Migrate”.
In the vMotion migrate wizard, select “Change Storage only”.
Continue through the wizard, and storage vMotion the VM from the snapshot VMFS to your production VMFS volume. Wait for the vMotion to complete.
After the storage vMotion is complete, boot the VM and confirm everything is functioning.
Gracefully unmount, detach, and remove the snapshot VMFS from the ESXi host, and then remove the host mapping from the SAN
On each of your ESXi hosts that have access to the SAN, go to the “Datastores” section under the ESXi hosts configuration, right click on the snapshot VMFS datastore, and select “Unmount”. You’ll need to repeat this on each ESXi host that may have automounted the snapshot’s VMFS volume.
On each of your ESXi hosts that have access to the SAN, go to the “Storage Devices” section under the ESXi hosts configuration and identify (by LUN number) the “disk” that is the snapshot LUN. Select and highlight the snapshot LUN disk, select “All Actions” and select “Detach”. Repeat this on each host.
Double check and confirm that the snapshot VMFS datastore (and disk object) have been unmounted and detached from each ESXi host.
You can now log in to your SAN and remove the host mapping for the snapshot-to-LUN. We will no longer present the snapshot LUN to any of the hosts.
Back to the ESXi hosts, navigate to “Storage Adapters”, select the “iSCSI Initiator Adapter”, and click the “Rescan Storage Adapters”. Repeat this for each ESXi host.
Did you know that you can monitor and manage your VMware vSphere environment (ESXi hosts, cluster, and VMs) remotely with the “VMware vSphere Mobile Watchlist” app on your Android phone? Well, you can!
The VMware vSphere Mobile Watchlist (VMware Watchlist) Android App
For some time now, I’ve been using this neat little app from VMware (available for download here) to monitor and manage my vSphere cluster remotely. You can use the app while directly on your LAN, or via VPN (I use it with OpenVPN to connect to my Sophos UTM). I’ve even used it while on airplanes using the on board in-flight WiFi.
The reason why I’m posting about this is because I’ve never actually heard anyone talk about the app (which I find strange), so I’m assuming others aren’t aware of its existence either.
The app runs extremely well on my Samsung S8+, Samsung S9+, and Samsung Tab E LTE tablet. I haven’t run in to any issues or app crashes.
Let’s take a look at the app
vSphere Mobile Watchlist Login Prompt
The above screen is where you initially log in. I use my Active Directory credentials (since I have my vCenter server integrated with AD).
vSphere Mobile Watchlist Hosts and VMs
In the default view (shown above), you can view a brief summary of your ESXi hosts, as well as a list of virtual machines running.
vSphere Mobile Watchlist Host Information
After selecting an ESXi host, you can view the host’s resources, details, related objects, as well as flip over to view host options.
vSphere Mobile Watchlist Host Options
Under host options, you can Enter Maintenance mode, reboot the host, shutdown the host, disconnect the host, or view the hosts’ sensor data.
vSphere Mobile Watchlist Host Sensor Data (Fans)
Checking the HPe Proliant DL360p Gen8 fan sensor data with VMware Watchlist.
vSphere Mobile Watchlist Host Sensor Data (Temperature)
Checking the HPe Proliant DL360p Gen8 temperature sensor data with VMware Watchlist. While not shown above, you can select individual items to pull the actual temperature values. Please Note that the temperature values are missing a decimal (Example: 2100 = 21.00 Celsius).
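The missing decimal is easy to correct if you ever record these readings elsewhere. A minimal sketch, assuming (as in the example above) that the raw value is always in hundredths of a degree Celsius; the function name is just illustrative.

```python
def raw_to_celsius(raw_value):
    """Convert a raw sensor reading in hundredths of a degree to Celsius."""
    return raw_value / 100


# 2100 is the example value from the text above.
print(f"{raw_to_celsius(2100):.2f} Celsius")
```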
vSphere Mobile Watchlist VM Information View
When selecting a VM (Virtual Machine) from the default view, you can view the VM’s Resources (CPU, Memory, and Storage), Details (IP Addresses, DNS hostnames, Guest OS, VMware Tools Status), related objects, and a list of other VMs running on the same host.
vSphere Mobile Watchlist VM Options
Flipping over to the VM options, we have the ability to power off, suspend, reset, shutdown, or gracefully restart the VM. We also have some snapshot functionality to take a snapshot, or manage VM snapshots.
In my environment I have two HPe DL360p Gen8 Servers and the sensor data is fully supported (I used the HPe custom ESXi install image which includes host drivers).
Running Veeam Backup and Replication, a Microsoft Windows Server Domain Controller may boot in to safe mode and directory services restore mode.
About a week ago, I loaded up Veeam Backup and Replication in to my test environment. It’s a fantastic product, and it’s working great, however today I had a little bit of an issue with a DC running Windows Server 2016 Server Core.
I woke up to a notification that the backup failed due to a VSS snapshot issue. Now I know that VSS can be a little picky at times, so I decided to restart the guest VM. Upon restarting, she came back up, was pingable, and appeared to be running fine, however the backup kept failing with new errors, the event log was looking very strange on the server, and numerous services that were set to automatic were not starting up.
This specific server was installed using Server Core mode, so it has no GUI and is administered via command prompt over RDP, or via remote management utilities. After RDP’ing in to the server, I noticed the “Safe Mode” branding on each corner of the display, which was very odd. I restarted the server again, this time trying to start Active Directory Services manually via services.msc.
Event ID: 16652
General Description: The domain controller is booting to directory services restore mode.
This surprised me (and scared me for that matter). I immediately started searching the internet to find out what would have caused this…
To my relief, I read numerous sites that advise that when an active backup is running on a guest VM which is a domain controller, Veeam activates directory services restore mode temporarily, so in the event of a restore, it will boot to this mode automatically. In my case, the switch was not changed back during the backup failure.
Running the following command in a command prompt, verifies that the safeboot switch is set to dsrepair enabled:
To disable directory services restore mode, type the following in a command prompt:
bcdedit /deletevalue safeboot
Restart the server and the issue should be resolved!
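If you want to script the check before and after the fix, the sketch below scans saved boot configuration output for a `safeboot` entry. It assumes the listing comes from running plain `bcdedit` in an elevated prompt and that a set flag appears as a `safeboot` line followed by its value; the sample text is an illustrative excerpt, not output from this server.

```python
from typing import Optional


def safeboot_value(bcdedit_output: str) -> Optional[str]:
    """Return the value of the 'safeboot' entry, or None if it is not set."""
    for line in bcdedit_output.splitlines():
        parts = line.split()
        # Entries are printed as "<name> <value>" separated by whitespace.
        if len(parts) >= 2 and parts[0].lower() == "safeboot":
            return parts[1]
    return None


# Illustrative excerpt only; real output contains many more entries.
sample = """Windows Boot Loader
-------------------
identifier              {current}
safeboot                DsRepair
"""
print(safeboot_value(sample))
```

A result of None after running the delete command would indicate the flag has been cleared.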
With the news of VMware vSphere 6.7 being released today, a lot of you are looking for the download links for the 6.7 download (including vSphere 6.7, ESXi 6.7, etc…). I couldn’t find it myself, but after doing some scouring through alternative URLs, I came across the link.
I run a Sophos UTM firewall appliance in my VMware vSphere environment and noticed the other day that I was getting warnings on the space used on the ESXi host for the thin-provisioned vmdk file for the guest VM. I thought “Hey, this is weird”, so I enabled SSH and logged in to check my volumes. Everything looked fine and my disk usage was great! So what gives?
After spending some more time troubleshooting and not finding much, I thought to myself “What if it’s not unmapping unused blocks from the vmdk to the host ESXi machine?”. What is unmapping, you ask? When files get deleted in a guest VM, the free blocks aren’t automatically “unmapped” and released back to the host hypervisor in some cases.
Two things need to happen:
The guest VM has to release these blocks (notify the hypervisor that it’s not using them, making the vmdk smaller)
The host has to reclaim these and issue the unmap command to the storage (freeing up the space on the SAN/storage itself)
On a side note: In ESXi 6.5 and when using VMFS version 6 (VMFS6), the datastores can be configured for automatic unmapping. You can still kick it off manually, but many administrators would prefer it to happen automatically in the background with low priority (low I/O).
Most of my guest VMs automatically do the first step (releasing the blocks back to the host). On Windows this occurs with the defrag utility which issues trim commands and “trims” the volumes. On Linux this occurs with the fstrim command. All my guest VMs do this automatically with the exception being the Sophos UTM appliance.
First, a warning: Enable SSH on the Sophos UTM at your own risk. You need to know what you are doing. This may also pose a security risk, and SSH should be disabled once you are finished. You’ll need to “su” to root once you log in with the “loginuser” account.
This fix not only applies to the Sophos UTM, but to most other Linux-based guest virtual machines.
Now to fix the issue, I used the “df” command which provides a list of the filesystems, their mount points, and storage free for those filesystems. I’ve included an example below (this wasn’t the full list):
You’ll need to run the fstrim command on every mountpoint for file systems “/dev/sdaX” (X means you’ll be doing this for multiple mountpoints). In the example above, you’ll need to run it on “/”, “/boot”, “/var/storage”, “/var/log”, “/tmp”, and any other mountpoints that use “/dev/sdaX” filesystems.
fstrim / -v
fstrim /var/storage -v
Again, you’ll repeat this for all mount points for your /dev/sdaX storage (X is replaced with the volume number). The command above only works with mountpoints, and not the actual device mappings.
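Picking the mount points out of the “df” listing by hand is easy to get wrong, so here is a minimal sketch that does it from saved “df” output and prints one fstrim command per mount point, in the same form as the commands above. The sample listing is made up for illustration (the sizes and partition numbers are not from the UTM), and the parser assumes the standard six-column “df” layout.

```python
def sda_mountpoints(df_output):
    """Return the mount points of /dev/sdaN filesystems listed in df output."""
    mountpoints = []
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 6 and fields[0].startswith("/dev/sda"):
            # Re-join in case the mount point itself contains spaces.
            mountpoints.append(" ".join(fields[5:]))
    return mountpoints


def fstrim_commands(mountpoints):
    """Build one verbose fstrim command per mount point."""
    return [f"fstrim {mp} -v" for mp in mountpoints]


# Hypothetical df output for illustration only.
sample_df = """Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sda6        5412452 2832960   2281512  56% /
udev             3059712      72   3059640   1% /dev
/dev/sda1         338875   15755    301536   5% /boot
/dev/sda5       98447760 2589744  90810992   3% /var/storage
"""
for command in fstrim_commands(sda_mountpoints(sample_df)):
    print(command)
```

Entries such as udev or tmpfs are skipped because they are not backed by the vmdk.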
The host-side unmap could run for hours, possibly days, depending on your “reclaim-unit” value (the number of VMFS blocks the command unmaps per iteration). In this example I chose 8, but most people use something larger like 100 or 200 to reduce the load and the time the command takes to complete (lower values work through free space in smaller chunks, so the command takes longer to execute).
I let this run for 2 hours on a 10TB datastore, however it may take way longer (possibly 6+ hours, to days).
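As a sketch of the host-side step, assuming it is done with `esxcli storage vmfs unmap`, where `-l` takes the datastore label and `-n` takes the reclaim unit, the helper below builds the command line for a given datastore. The datastore label is a placeholder, and the default of 200 is simply one of the larger values mentioned above.

```python
import shlex


def unmap_command(volume_label, reclaim_unit=200):
    """Build the esxcli command that reclaims free space on a VMFS datastore."""
    if reclaim_unit < 1:
        raise ValueError("reclaim_unit must be a positive integer")
    return shlex.join([
        "esxcli", "storage", "vmfs", "unmap",
        "-l", volume_label,        # datastore (volume) label
        "-n", str(reclaim_unit),   # VMFS blocks unmapped per iteration
    ])


# "datastore1" is a hypothetical label; 8 is the reclaim unit mentioned above.
print(unmap_command("datastore1", 8))
```

Run the printed command over SSH on an ESXi host that has the datastore mounted.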
Finally, not only are we left with a smaller vmdk file, but we’ve released the space back to the SAN as well!
Last night I updated my VMware VDI environment to VMware Horizon 7.4.0. For the most part the upgrade went smoothly, however I discovered an issue (probably unrelated to the upgrade itself, and more so just previously overlooked). When connecting with Google Chrome to VMware Horizon HTML Access via the UAG (Unified Access Gateway), an error pops up after pressing the button saying “Failed to connect to the connection server”.
This error pops up ONLY when using Chrome, and ONLY when connecting through the UAG. If you use a different browser (Firefox, IE), this issue will not occur. If you connect using Chrome to the connection server itself, this issue will not occur. It took me hours to find out what was causing this as virtually nothing popped up when searching for a solution.
Finally I stumbled across a VMware document that mentions that View Connection Server instances and security servers that reside behind a gateway (such as a UAG, or Access Point) must be aware of the address that browsers use to connect to the gateway for HTML Access.
On a side note, I also deleted my VMware Unified Access Gateway VMs and deployed the updated version that ships with Horizon 7.4.0. This means I deployed VMware Unified Access Gateway 3.2.0. There was an issue importing the configuration from the export backup I took from the previous version, so I had to configure from scratch (installing certificates, configuring URLs, etc…). Be aware of this issue when importing configuration.