Feb 072017
 

With vSphere 6.5 came VMFS 6, and with VMFS 6 came the auto unmap feature. This is a great feature, and very handy for those of you using thin provisioning on your datastores hosted on storage that supports VAAI. However, you still have the ability to perform a manual UNMAP at high priority, even with VMware vSphere 7 and vSphere 8.

A while back, I noticed something interesting when running the manual unmap command for the first time. It isn’t well documented, but I thought I’d share for those of you who are doing a manual LUN unmap for the first time. This document will also provide you with the command to perform a manual unmap on a VMFS datastore.

Reason:

Automatic unmap (auto space reclamation) is on, however you want to speed it up or have a large chunk of block’s you want unmapped immediately, and don’t want to wait for the auto feature.

Problem:

I wasn’t noticing any unmaps were occurring automatically and I wanted to free up some space on the SAN, so I decided to run the old command to forcefully run the unmap to free up some space:

esxcli storage vmfs unmap --volume-label=DATASTORENAME --reclaim-unit=200

(The above command runs a manual unmap on a datastore)

After kicking it off, I noticed it wasn’t completing as fast as I thought it should be. I decided to enable SSH on the host and took a look at the /var/log/hostd.log file. To my surprise, it wasn’t stopping at a 200 block reclaim, it just kept cycling running over and over (repeatedly doing 200 blocks):

2017-02-07T14:12:37.365Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:37.978Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:38.585Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:39.191Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:39.808Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:40.426Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:41.050Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:41.659Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:42.275Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-9XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX
2017-02-07T14:12:42.886Z info hostd[XXXXXXXX] [Originator@XXXX sub=Libs opID=esxcli-fb-XXXX user=root] Unmap: Async Unmapped 200 blocks from volume XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXX

That’s just a small segment of the logs, but essentially it just kept repeating the unmap/reclaim over and over in 200 block segments. I waited hours, tried to issue a “CTRL+C” to stop it, however it kept running.

I left it to run overnight and it did eventually finish while I was sleeping. I’m assuming it attempted to unmap everything it could across the entire datastore. Initially I thought this command would only unmap the specified block size.

When running this command, it will continue to cycle in the block size specified until it goes through the entire LUN. Be aware of this when you’re planning on running the command.

Essentially, I would advise not to manually run the unmap command unless you’re prepared to unmap and reclaim ALL your unused allocated space on your VMFS 6 datastore. In my case I did this because I had 4TB of deleted data that I wanted to unmap immediately, and didn’t want to wait for the automatic unmap.

I thought this may have been occurring because the automatic unmap function was on, so I tried it again after disabling auto unmap. The behavior was the same and it just kept running.

If you are tempted to run the unmap function, keep in mind it will continue to scan the entire volume (despite what block count you set). With this being said, if you are firm on running this, choose a larger block count (200 or higher) since smaller blocks will take forever (tested with a block size of 1 and after analyzing the logs and rate of unmaps, it would have taken over 3 months to complete on a 9TB array).

Update May 11th 2018: When running the manual unmap command with smaller “reclaim-unit” values (such as 1), your host may become unresponsive due to a memory overflow. vMotion’s will cease to function, and your ESXi host may need a restart to become fully functional. I’ve experienced this behavior twice. I highly suggest that if you perform this command, you do so while the host is in maintenance mode, and that your restart the host after a successful unmap sweep.

Feb 062017
 

Had a nasty little surprise with one of my clients this afternoon. Two days ago I updated their Sophos UTM (UTM220) to version 9.410-6 without any issues.

However, today I started to receive notifications that services were crashing (specifically ACC device agent).

After receiving a few of these, I logged in to check it out. Immediately there was no visible errors on the UTM itself, but after some further digging, I noticed these event logs in the “System Messages” log file:

2017:02:06-17:09:32 mail partitioncleaner[7918]: automatic cleaning for partition /tmp started (inodes: 0/100 blocks: 100/85)

2017:02:06-17:09:32 mail partitioncleaner[7918]: stopping deletion: can’t delete more files

Looks like a potential storage problem? Yes it was, but slightly more complicated.

I enabled SSH on the UTM and issued the “df” command (show’s volume usage), and found that the /tmp volume was 100% full.

Doing a “ls” and “ls -hl”, I found there were 25+ files that were around 235MB in size called: “AV-malware-names-XXXX-XXXXXX”.

Restarting the unit clears those files, however they come back shortly (I noticed it would add one every 5-10 minutes).

After some further digging (still haven’t heard back from Sophos on the support case), I came across some other users experiencing the same issues. While no one found a permanent resolution, they did mention this had to do with the Avira AV engine or possibly the dual scan engine.

Checking the UTM, I noticed that we had the E-Mail scanning configured for dual scan.

Solution (temporary workaround):

I went ahead and configured the E-Mail scanner (the only scanner I had that was using dual scan) to use single scan only. I then restarted the UTM. In my environment the default setting for single scanning is set to “Sophos”.

I am now sitting here with 30 minutes of uptime and absolutely no “AV-malware-names-XXXX-XXXXXX” files created.

I will post an update when I hear back from Sophos support.

Hope this helps someone else!

 

Update (after original post):

I heard back from Sophos support, this is a known bug in 9.410. The current official workaround is to change to single scan and use the AVIRA engine instead of the Sophos engine.

Update #2:

Received notification this morning of a new firmware update available (Version: 9.411003 – Maintenance Release). While I haven’t installed it, it appears from the Bugfixes notes that it was released to fix this issue:

 Fix [NUTM-6804]: [AWS] Update breaks HVM standalone installations
Fix [NUTM-6747]: [Email] SAVI scanner coredumps permanently in MailProxy after update to 9.410
Fix [NUTM-6802]: [Web] New coredumps from httpproxy after update to v9.410

Update #3:

I noticed that this bug was interrupting some mailflow on my Sophos UTM, as well as some of my clients. I went ahead and as an emergency situation, installed 9.411-3.

Things were fine for around 10 hours until I started to receive notification of the HTTP proxy failing and requiring restart. Logging in to the UTM, it was very unresponsive, sometimes completely unresponsive for around 10 minutes. Web browsing was not functioning at all on the internal network behind the UTM.

This issue still hasn’t been resolved. Hopefully we see a stable working fix sometime soon.

Jan 272017
 

Greetings everyone!

I had my first predicted disk failure occur on my HPE MSA 2040. As always, it was a breeze contacting HPE support to get the drive replaced (since my unit has a 4 hour response warranty).

However, with this being my first drive swap I came across something worth mentioning. Typically in RAID arrays when a disk fails, you simply swap out the failed disk and it starts rebuilding, this is NOT the case if you have an HPE MSA 2040 that’s fully loaded with no spares configured.

If you have global spares, the moment the disk is failed, it will automatically rebuild on to available configured spares.

If you don’t have any global spares (my case), the replacement disk is marked as unused and available. You must set this disk as a spare in the SMU for the rebuild to start.

One additional note, if you do have spares and a disk fails, when you replace the disk that failed it will not automatically rebuild that disk back from the spare. You must force fail (pull out) the spare disk for it to start rebuilding on the freshly replaced disk. Always confirm current redundancy levels and activity before forcefully failing any disks!

As per HPE’s MSA 1040/2040 Best Practices document:

Source: https://h50146.www5.hpe.com/products/storage/whitepaper/pdfs/4AA4-6892ENW.pdf

Dec 082016
 

So you just completed your migration from an earlier version of vSphere up to vSphere 6.5 (particularly vCenter 6.5 Virtual Appliance). When trying to log in to the vSphere web client, you receive numerous “The VMware enhanced authentication plugin has updated it’s SSL certificate in Firefox. Please restart Firefox.”. You’ll usually see 2 of these messages in a row on each page load.

You’ll also note that the “Enhanced Authentication Plugin” doesn’t function after the install (it won’t pull your Active Directory authentication information).

To resolve this:

Uninstall all vSphere plugins from your workstation. I went ahead and uninstalled all vSphere related software on my workstation, this includes the deprecated vSphere C# client application, all authentication plugins, etc… These are all old.

Open up your web browser and point to your vCenter server (https://vCENTERSERVERNAME), and download the “Trusted root CA certificates” from VMCA (VMware certificate authority).

Download and extract the ZIP file. Navigate through the extracted contents to the windows certs. These root CA certificates need to be installed to your “Trusted Root Certification Authorities” store on your system, and make sure you skip the “Certificate Revocation List” file which ends in a “.r0”.

To install them, right click, choose “Install Certificate”, choose “Local Machine”, yes to UAC prompt, then choose “Place all certificates in the following store”, browse, and select “Trusted Root Certification Authorities”, and finally finish. Repeat for each of the certificates. Your workstation will now “trust” all certificates issued by your VMware Certificate Authority (VMCA).

You can now re-open your web browser, download the “Enhanced Authentication Plugin” from your vCenter instance, and install. After restarting your computer, the plugin should function and the messages will no longer appear.

Leave a comment!

Dec 072016
 

Well, I start writing this post minutes after completing my first vSphere 6.0 upgrade to vSphere 6.5, and as always with VMware products it went extremely smooth (although with any upgrade there are minor hiccups).

Thankfully with the evolution of virtualization technology, upgrades such as the upgrade to vSphere 6.5 is such a massive change to your infrastructure, yet the process is extremely simplified, can be easily rolled out, and in the event of problems has very simple clear paths to revert back and re-attempt. Failed upgrades usually aren’t catastrophic, and don’t even affect production environments.

Whenever I do these vSphere upgrades, I find it funny how you’re making such massive changes to your infrastructure with each click and step, yet the thought process and understanding behind it is so simple and easy to follow. Essentially, after one of these upgrades you look back and think: “Wow, for the little amount of work I did, I sure did accomplish a lot”. It’s just one of the beauties of virtualization, especially holding true with VMware products.

To top it all off you can complete the entire upgrade/migration without even powering off any of your virtual machines. You could do this live, during business hours, in a production environment… How cool is that!

Just to provide some insights in to my environment, here’s a list of the hardware and configuration:

-2 X HPE Proliant DL360p Gen8 Servers (each with dual processors, and each with 128GB RAM, no local storage)

-1 X HPE MSA2040 Dual Controller SAN (each host has multiple connections to the SAN via 10Gb DAC iSCSI, 1 connection to each of the dual controllers)

-VMware vSphere 6.0 running on Windows Virtual Machine (Windows Server 2008 R2)

-VMware Update Manager (Running on the same server as the vCenter Server)

-VMware Data Protection (2 x VMware vDP Appliances, one as a backup server, one as a replication target)

-VMware ESXi 6.0 installed on to SD-cards in the servers (using HPE Customized ESXi installation)

One of the main reasons why I was so quick to adopt and migrate to vSphere 6.5, was I was extremely interested in the prospect of migrating a Windows based vCenter instance, to the new vCenter 6.5 appliance. This is handy as it simplifies the environment, reduces licensing costs and requirements, and reduces time/effort on server administration and maintenance.

First and foremost, following the recommended upgrade path (you have to specifically do the upgrades and migrations for all the separate modules/systems in a certain order), I had to upgrade my vDP appliances first. For vDP to support vCenter 6.5, you must upgrade your vDP appliances to 6.1.3. As with all vDP upgrades, you must shut down the appliance, mark all the data disks as dependent, take a snapshot, and mount the upgrade ISO, and then boot and initiate the upgrade from the appliance web interface. After you complete the upgrade and confirm the appliance is functioning, you shut down the appliance, remove the snapshot, mark all data disks as independent (except the first Virtual disk, you only mark virtual disk 2+ and up as independent), and you’re done your upgrade.

A note on a problem I dealt with during the upgrade process for vDP to version 6.1.3 (appliance does not detect mounted ISO image) can be found here: http://www.stephenwagner.com/?p=1107

Moving on to vCenter! VMware did a great job with this. You load up the VMware Migration Assistant tool on your source vCenter server, load up the migration/installation application on a separate computer (the workstation you’re using), and it does the rest. After prepping the destination vCenter appliance, it exports the data from the source server, copies it to the destination server, shuts down the source VM, and then imports the data to the destination appliance and takes over the role. It’s the coolest thing ever watching this happen live. Upon restart, you’ve completed your vCenter Server migration.

A note on a problem I dealt with during the migration process (which involved exporting VMware Update Manager from the source server) can be found here: http://www.stephenwagner.com/?p=1115

And as for the final step, it’s now time to upgrade your ESXi hosts to version 6.5. As always, this is an easy task with VMware Update Manager, and can be easily and quickly rolled out to multiple ESXi hosts (thanks to vMotion and DRS). After downloading your ESXi installation ISO (in my case I use the HPE customized image), you upload it in to your new VMware Update Manager instance, add it to an upgrade baseline, and then attach the baseline to your hosts. To push this upgrade out, simply select the cluster or specific host (depending on if you want to rollout to a single host, or multiple at once), and remediate! After a couple restarts the upgrade is done.

A note on a problem I dealt with during ESXi 6.5 upgrade (conflicting VIBs marking image as incompatible when deploying HPE customized image) can be found here: http://www.stephenwagner.com/?p=1120

After all of the above, the entire environment is now running on vSphere 6.5! Don’t forget to take a backup before and after the upgrade, and also upgrade your VM hardware versions to 6.5 (VM compatibility version), and upgrade VMware tools on all your VMs.

Make sure to visit https://YOURVCENTERSERVER to download the VMware Certificate Authority (VMCA) root certificates, and add them to the “Trusted Root Certification Authorities” on your workstation so you can validate all the SSL certs that vCenter uses. Also, note that the vSphere C# client (the windows application) has been deprecated, and you now must use the vSphere Web Client, or the new HTML5 web client.

Happy Virtualizing! Leave a comment!