Jan 182018
 

The Problem

I run a Sophos UTM firewall appliance in my VMware vSphere environment and noticed the other day that I was getting warnings on the space used on the ESXi host for the thin-provisioned vmdk file for the guest VM. I thought “Hey, this is weird”, so I enabled SSH and logged in to check my volumes. Everything looked fine and my disk usage was great! So what gives?

After spending some more time troubleshooting and not finding much, I thought to myself “What if it’s not unmapping unused blocks from the vmdk to the host ESXi machine?”. What is unampping you ask? When files get deleted in a guest VM, the free blocks aren’t automatically “unmapped” and released back to the host hypervisor in some cases.

Two things need to happen:

  1. The guest VM has to release these blocks (notify the hypervisor that it’s not using them, making the vmdk smaller)
  2. The host has to reclaim these and issue the unmap command to the storage (freeing up the space on the SAN/storage itself)

On a side note: In ESXi 6.5 and when using VMFS version 6 (VMFS6), the datastores can be configured for automatic unmapping. You can still kick it off manually, but many administrators would prefer it to happen automatically in the background with low priority (low I/O).

Most of my guest VMs automatically do the first step (releasing the blocks back to the host). On Windows this occurs with the defrag utility which issues trim commands and “trims” the volumes. On linux this occurs with the fstrim command. All my guest VMs do this automatically with the exception being the Sophos UTM appliance.

The fix

First, a warning: Enable SSH on the Sophos UTM at your own risk. You need to know what you are doing, this also may pose a security risk and should be disabled once your are finished. You’ll need to “su” to root once you log in with the “loginuser” account.

This fix not only applies to the Sophos UTM, but most other linux based guest virtual machines.

Now to fix the issue, I used the “df” command which provides a list of the filesystems, their mount points, and storage free for those fileystems. I’ve included an example below (this wasn’t the full list):

hostname:/root # df
Filesystem                       1K-blocks     Used Available Use% Mounted on
/dev/sda6                          5412452  2832960   2281512  56% /
udev                               3059712       72   3059640   1% /dev
tmpfs                              3059712      100   3059612   1% /dev/shm
/dev/sda1                           338875    15755    301104   5% /boot
/dev/sda5                         98447760 13659464  79513880  15% /var/storage
/dev/sda7                        129002700  4624468 117474220   4% /var/log
/dev/sda8                          5284460   274992   4717988   6% /tmp
/dev                               3059712       72   3059640   1% /var/storage/chroot-clientlessvpn/dev


You’ll need to run the fstrim command on every mountpoint for file systems “/dev/sdaX” (X means you’ll be doing this for multiple mountpoints). In the example above, you’ll need to run it on “/”, “/boot”, “/var/storage”, “/var/log”, “/tmp”, and any other mountpoints that use “/dev/sdaX” filesytems.

Two examples:

fstrim / -v

fstrim / -v

 

 

fstrim /var/storage -v

fstrim /var/storage -v

 

 

Again, you’ll repeat this for all mount points for your /dev/sdaX storage (X is replaced with the volume number). The command above only works with mountpoints, and not the actual device mappings.

Time to release the unused blocks to the SAN:

The above completes the first step of releasing the storage back to the host. Now you can either let the automatic unmap occur slowly overtime if you’re using VMFS6, or you can manually kick it off. I decided to manually kick it off using the steps I have listed at: https://www.stephenwagner.com/2017/02/07/vmfs-unmap-command-on-vsphere-6-5-with-vmfs-6-runs-repeatedly/

You’ll need to use esxcli to do this. I simply enabled SSH on my ESXi hosts temporarily.

Please note: Using the unmap command on ESXi hosts is very storage I/O intensive. Do this during maintenance window, or at a time of low I/O as this will perform MAJOR I/O on your hosts…

I issue the command (replace “DATASTORENAME” with the name of your datastore):

esxcli storage vmfs unmap --volume-label=DATASTORENAME --reclaim-unit=8

This could run for hours, possibly days depending on your “reclaim-unit” size (this is the block size of the unit you’re trying to reclaim from the VMFS file-system). In this example I choose 8, but most people do something larger like 100, or 200 to reduce the load and time for the command to complete (lower values look for smaller chunks of free space, so the command takes longer to execute).

I let this run for 2 hours on a 10TB datastore, however it may take way longer (possibly 6+ hours, to days).

Finally, not only are we are left with a smaller vmdk file, but we’ve released the space back to the SAN as well!

Jan 142018
 

The Problem

In the latest updates and versions of Microsoft Office 2016, I found a bug where when a user adds a new on-premise Microsoft Exchange 2016 account, it will repeatedly prompt for a username and password and ultimately fail if you hit cancel (no matter how many times you enter credentials). This was on the internal LAN on a domain joined workstation.

I did the usual checks:

  • Check Virtualdirectory configuration on Exchange
  • Check Virtualdirectory configuration on IIS (Internet Information Services)
  • Check Autodiscover DNS entries, InternalURL and ExternalURL configuration
  • Check for SCP inside of domain

All the of the above came back fine and were configured properly.

I have numerous other Outlook 2016 clients configured and working (installed as older versions, but have been updated), so I used those to troubleshoot (same scenario, domain joined on internal LAN and external WAN). After spending 10 hours ripping apart everything, confirming configuration, I noticed that when using the “Test Email Autoconfiguration…” (holding CTRL while right clicking on Outlook tray icon), that the e-mail clients had a skewed order for checking autodiscovery.

The e-mail clients were actually trying to authenticate with Office365 before my own on-premise Exchange Server (domain SCP or autodiscover records). This is absolutely bizarre! After spending 2 hours googling (I couldn’t find anything), I finally stumbled across this document and found an interesting piece of information:

https://support.microsoft.com/en-ca/help/3211279/outlook-2016-implementation-of-autodiscover

“Outlook uses a set of heuristics to determine whether the user account provided comes from Office 365. If Outlook determines confidently that you are an O365 user, a try is made to retrieve the Autodiscover payload from the known O365 endpoints (typically https://autodiscover-s.outlook.com/autodiscover/autodiscover.xml or https://autodiscover-s.partner.outlook.cn/autodiscover/autodiscover.xml). If this step does not retrieve a payload, Outlook moves to step 5.”

WTF?!?!?

So while this doesn’t explain why this happened, it explains what’s happening. I believe this is what’s happening as my working clients are trying to Autodisocver with Office365 first…

I went ahead an created a registry value to control the policy for “ExcludeExplicitO365Endpoint“. After configuring the registry key, I noticed that Autodiscover was now functioning properly and checking SCP and autodiscover DNS records first. I have no idea why the “heuristics” determined I was an Office365 user, but I’m not (I do have access to Office365 as a partner, but don’t use it and don’t have it configured). This may effect other partners, or users that utilize some O365 services…

The Fix

To fix this issue, create a text file and copy/paste this text below.

Windows Registry Editor Version 5.00

[HKEY_CURRENT_USER\Software\Microsoft\Office\16.0\Outlook\AutoDiscover]
"ExcludeExplicitO365Endpoint"=dword:00000001

Then save it, and rename it as ExcludeExplicitO365Endpoint.reg and run it (this will import the applicable registry key). ONLY DO THIS if you are using an Exchange On-Premise account, and not a Office365 or hosted exchange account.

After this, the issue was completely fixed. If you know what you’re doing, you can also use Outlook GPO settings and deploy this to a vast number of systems using Group Policy.

Jan 092018
 
HPe iLo Registered to Remote Support Insight Online

Many months ago, I configured the HPe Insight Online – Direct Connect on all my HPe Proliant DL360p Gen8 servers running VMware vSphere 6.5. This service is available with active support contracts (warranties), and allows your servers to “phone home” to HPe for free. This allows service and health information to be broadcast to your HPe passport and support account, to pro-actively manage, monitor, and maintain your servers. Information on the service can be found at https://www.hpe.com/ca/en/services/remote-it-support.html.

This is all pretty cool, but does it work? Read below!

I woke up this morning to notifications from my own monitoring system that a fan failure had occurred on one of my HPe Proliant server ESXi hosts. All my servers have fan redundancy so the server continued to run without problems. Scrolling through my other overnight e-mails, I also see e-mails from HPe acknowledging a support case that I had created. I had long since forgot that I configured Insight Online direct connect, so it actually took a few minutes for me to put two and two together. The server by itself took care of everything!

After reviewing all these e-mails, logging in to the HPe support portal, I had realized that the server by itself had:

  1. Identified a fan failure
  2. Sent diagnostic data off to HPe support
  3. Created an HPe support ticket and case
  4. HPe support engineers looked up the serial and part number of the server, and assigned a replacement part for the fan to be dispatched to me

I called in to HPe support, mentioned this was the first time this had ever happened and asked if there was anything additional I needed to provide. All the engineer asked, was whether I wanted an engineer to replace the part, or if I was comfortable replacing the part myself (of course I want to replace it myself). That was it!

This is VERY interesting and cool technology. I can see this being extremely valuable for customers who have 4 hour response contracts with their HPe equipment.

I’ve provided some screenshots below to show the process.

HPe Case Management E-Mail

HPe iLo Registered to Remote Support Insight Online

eRS Active Health Report Sent

HPe Remote Support Direct Connect Service Event

HPe Insight Online Automated Case

Jan 062018
 

Last night I updated my VMware VDI envionrment to VMware Horizon 7.4.0. For the most part the upgrade went smooth, however I discovered an issue (probably unrelated to the upgrade itself, and more so just previously overlooked). When connecting with Google Chrome to  VMware Horizon HTML Access via the UAG (Unified Access Gateway), an error pops up after pressing the button saying “Failed to connected to the connection server”.

The Problem:

This error pops up ONLY when using Chrome, and ONLY when connecting through the UAG. If you use a different browser (Firefox, IE), this issue will not occur. If you connect using Chrome to the connection server itself, this issue will not occur. It took me hours to find out what was causing this as virtually nothing popped up when searching for a solution.

Finally I stumbled across a VMware document that mentions on View Connection Server instances and security servers that reside behind a gateway (such as a UAG, or Access Point), the instance must be aware of the address in which browsers will connect to the gateway for HTML access.

The VMware document is here: https://docs.vmware.com/en/VMware-Horizon-7/7.0/com.vmware.horizon-view.installation.doc/GUID-FE26A9DE-E344-42EC-A1EE-E1389299B793.html

To resolve this:

On the view connection server, create a file called “locked.properties” in “install_directory\VMware\VMware View\Server\sslgateway\conf\”.

If you have a single UAG/Access Point, populate this file with:

portalHost=view-gateway.example.com

If you have multiple UAG/Access Points, populate the file with:

portalHost.1=view-gateway-1.example.com
portalHost.2=view-gateway-2.example.com

Restart the server

The issue should now be resolved!

On a side note, I also deleted my VMware Unified Access Gateways VMs and deployed the updated version that ship with Horizon 7.4.0. This means I deployed VMware Unified Access Gateway 3.2.0. There was an issue importing the configuration from the export backup I took from the previous version, so I had to configure from scratch (installing certificates, configuring URLs, etc…), be aware of this issue importing configuration.