Nov 17, 2018
 

When running VMware vSphere 6 or vSphere 7 and ESXi on your hosts with VMFS6, you may notice that auto unmap (space reclamation) is not working even though it is enabled. In addition, you’ll find that manual unmap functions still work.

Why is UNMAP not working

This is because your storage array (SAN) may use an unmap granularity larger than 1MB. VMFS version 6 requires an unmap granularity of 1MB and does not support automatic unmap on arrays with a larger granularity.

For example, on the HPE MSA 2040 the page size when using virtual storage is 4MB, hence auto unmap is not supported and does not work. You can still manually perform unmap on arrays with block/page sizes larger than 1MB.
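As a quick illustration of the rule above, here is a minimal Python sketch that compares an array's unmap granularity against the 1MB limit. The function and constant names are made up for this example, and the 1MB threshold is simply the figure quoted in this post, so treat it as a rough check and not an official VMware tool.

```python
# Hypothetical helper: names are illustrative, not part of any VMware or HPE API.
MIB = 1024 * 1024

# Assumption: the 1MB limit for VMFS 6 automatic unmap, as described in this post.
VMFS6_MAX_AUTO_UNMAP_GRANULARITY = 1 * MIB


def auto_unmap_supported(array_granularity_bytes):
    """Return True if the array's unmap granularity fits within the 1MB limit."""
    if array_granularity_bytes <= 0:
        raise ValueError("granularity must be a positive number of bytes")
    return array_granularity_bytes <= VMFS6_MAX_AUTO_UNMAP_GRANULARITY


# An array with a 1MB granularity passes the check.
print(auto_unmap_supported(1 * MIB))
# The MSA 2040 virtual storage page size is 4MB, so the check fails.
print(auto_unmap_supported(4 * MIB))
```

If the check fails for your array, manual unmap is still an option, as covered in the links below.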

Additional Information and Resources

Perform manual VMFS unmap on vSphere 6.5 and 6.7 with VMFS 6 – https://www.stephenwagner.com/2017/02/07/vmfs-unmap-command-vsphere-6-5-with-vmfs-6-auto/

vSphere 6.5, 6.7 and VMFS 6 – Change storage reclaim priority from low to medium or high – https://www.stephenwagner.com/2017/02/08/vsphere-vmfs-6-change-storage-reclaim-priority-low-medium-high/

Release unused space on host and guest filesystems with thin-provisioned Sophos UTM appliance (SW) – https://www.stephenwagner.com/2018/01/18/release-unused-space-vmdk-thin-provisioned-sophos-utm/

Nov 04, 2018
 

This weekend I came across a big issue with my HPE MSA 2040 where one of the SAN controllers became unresponsive, and appeared to have failed because it would not boot.

It all started when I decided to clean the MSA SAN. I try to clean the components once or twice a year to remove dust and make sure it’s not getting all jammed up. Sometimes I’ll shut the entire unit down and remove the individual components, other times I’ll remove them while operating. Because of the redundancies and since I have two controllers, I can remove and clean each controller individually at separate times.

Please Note: When dusting equipment with fans, never allow the fans to spin up with compressed air as this can generate current which can damage components. Never allow compressed air flow to spin up fans.

After cleaning out the power supplies, it came time to clean the controllers.

The Problem

As always, I logged in to the SMU to shut down controller A (storage). It shut down, and the blue LED illuminated to indicate it was safe for removal. I then removed it, cleaned it, and re-inserted it. The controller came back online, and ownership of the applicable disk groups was successfully moved back. Controller A was done. I then did the same for controller B: I logged in and shut down controller B (storage). It shut down just like controller A, and the blue safe-to-remove LED illuminated. I was able to remove it, clean it, and re-insert it.

However, controller B did not come back online.

After inserting controller B, the status light was flashing (as if it was booting). I waited 20 minutes with no change. The SMU on controller B was responding to HTTPS requests, however you could not log on due to the error “system is initializing”. SSH was functioning and you could log in and issue commands, however any command to get information would return “Please wait while this information is pulled from the MC controller”, and ultimately fail. The SMU on controller A would report a controller fault on controller B, and not provide any other information (including port status on controller B).

I then tried to re-seat the controller with the array still running. Gave it plenty of time with no effect.

I then removed the failed controller, shut down the unit, powered it back on (only with controller A), and re-inserted Controller B. Again, no effect.

The Fix

At this point I’m thinking the controller may have failed or died during the cleaning process. I was just about to call HPE support for a replacement until I noticed the “Power LED” light inside of the failed controller would flash every 5 seconds while removed.

This made me start to wonder if there was an issue writing the cache to the compact flash card, or if the controller was still running off battery power but had completely frozen.

I tried these 3 things on the failed controller while it was unplugged and removed:

  1. I left the controller untouched for 1 hour out of the array (to maybe let it finish whatever it was doing while on battery power)
  2. There’s an unlabeled button on the back of the controller. As a last resort (thinking it was a reset button), I pressed and held it for 20 seconds, waited a minute, then briefly pressed it for 1 second while it was out of the unit.
  3. I removed the Compact Flash card from the controller for 1 minute, then re-inserted it. I was hoping this would fail the cache copy if it was stuck in the process of writing cache to compact flash.

I then re-inserted the controller, and it booted fine! It was now functioning and working (and came up very fast). Looking at the logs, there was no record of what occurred between the first shutdown and the final boot. I hope this post helps someone else with the same issue. It can save you a support ticket, and time with a controller down.

Disclaimer

PLEASE NOTE: I could not find any information on the unlabeled button on the controller, and it’s hard to know exactly what it does. Perform this at your own risk (make sure you have a backup). Since I have 2 controllers, and my MSA 2040 was running fine on Controller A, I felt comfortable doing this, as if this did reset controller B, the configuration would replicate back from controller A. I would not do this in a single controller environment.

Update – 24 Hours later

After I got everything up and running, I checked the logs of the unit and couldn’t find anything on controller B that looked out of the ordinary. However, 24 hours later, I logged back in and noticed some new events showed up from the day before (from the day I had the issues):

MSA 2040 Code 549

You’ll notice the event log with severity error:

Recovery from internal processor fault detected on controller.
Code 549

One thing that’s very odd is that I know for a fact the time is wrong on the error severity log entry. This could be due to the daylight saving time change we had last night at midnight. Either way, it appears that it finally did detect that the Storage controller was in an error state and logged it, but it still would have been nice to get some more information.

On a final note, the unit has been running perfectly for over 24 hours.

Update – April 2nd 2019

Well, in March a new firmware update was released for the MSA. I went to upgrade and the same issue as above occurred. At one point during the firmware update process, a step failed and was retried, succeeding on the fourth attempt.

The firmware update log (note the repeated attempts):

Updating system configuration files
System configuration complete
Loading SC firmware.
STATUS: Updating Storage Controller firmware.
Waiting 5 seconds for SC to shutdown.
Shutdown of SC successful.
Sending new firmware to SC.
Updating SC Image:Remaining size 6263505
Updating SC Image:Remaining size 5935825
Updating SC Image:Remaining size 5608145
Updating SC Image:Remaining size 5280465
Updating SC Image:Remaining size 4952785
Updating SC Image:Remaining size 4625105
Updating SC Image:Remaining size 4297425
Updating SC Image:Remaining size 3969745
Updating SC Image:Remaining size 3642065
Updating SC Image:Remaining size 3314385
Updating SC Image:Remaining size 2986705
Updating SC Image:Remaining size 2659025
Updating SC Image:Remaining size 2331345
Updating SC Image:Remaining size 2003665
Updating SC Image:Remaining size 1675985
Updating SC Image:Remaining size 1348305
Updating SC Image:Remaining size 1020625
Updating SC Image:Remaining size 692945
Updating SC Image:Remaining size 365265
Updating SC Image:Remaining size 37585
Waiting for Storage Controller to complete programming.
Please wait...
Please wait...
Please wait...
Please wait...
Storage Controller has completed programming.
Got an error (138) on firmware packet
CAPI error: Firmware Update failed. Controller needs to reboot.
Waiting 5 seconds for SC to shutdown.
Shutdown of SC successful.
Sending new firmware to SC.
Updating SC Image:Remaining size 6263505
Updating SC Image:Remaining size 5935825
Updating SC Image:Remaining size 5608145
Updating SC Image:Remaining size 5280465
Updating SC Image:Remaining size 4952785
Updating SC Image:Remaining size 4625105
Updating SC Image:Remaining size 4297425
Updating SC Image:Remaining size 3969745
Updating SC Image:Remaining size 3642065
Updating SC Image:Remaining size 3314385
Updating SC Image:Remaining size 2986705
Updating SC Image:Remaining size 2659025
Updating SC Image:Remaining size 2331345
Updating SC Image:Remaining size 2003665
Updating SC Image:Remaining size 1675985
Updating SC Image:Remaining size 1348305
Updating SC Image:Remaining size 1020625
Updating SC Image:Remaining size 692945
Updating SC Image:Remaining size 365265
Updating SC Image:Remaining size 37585
Waiting for Storage Controller to complete programming.
Please wait...
Please wait...
Storage Controller has completed programming.
Got an error (138) on firmware packet
CAPI error: Firmware Update failed. Controller needs to reboot.
Waiting 5 seconds for SC to shutdown.
Shutdown of SC successful.
Sending new firmware to SC.
Updating SC Image:Remaining size 6263505
Updating SC Image:Remaining size 5935825
Updating SC Image:Remaining size 5608145
Updating SC Image:Remaining size 5280465
Updating SC Image:Remaining size 4952785
Updating SC Image:Remaining size 4625105
Updating SC Image:Remaining size 4297425
Updating SC Image:Remaining size 3969745
Updating SC Image:Remaining size 3642065
Updating SC Image:Remaining size 3314385
Updating SC Image:Remaining size 2986705
Updating SC Image:Remaining size 2659025
Updating SC Image:Remaining size 2331345
Updating SC Image:Remaining size 2003665
Updating SC Image:Remaining size 1675985
Updating SC Image:Remaining size 1348305
Updating SC Image:Remaining size 1020625
Updating SC Image:Remaining size 692945
Updating SC Image:Remaining size 365265
Updating SC Image:Remaining size 37585
Waiting for Storage Controller to complete programming.
Please wait...
Please wait...
Storage Controller has completed programming.
Got an error (138) on firmware packet
CAPI error: Firmware Update failed. Controller needs to reboot.
Waiting 5 seconds for SC to shutdown.
Shutdown of SC successful.
Sending new firmware to SC.
Updating SC Image:Remaining size 6263505
Updating SC Image:Remaining size 5935825
Updating SC Image:Remaining size 5608145
Updating SC Image:Remaining size 5280465
Updating SC Image:Remaining size 4952785
Updating SC Image:Remaining size 4625105
Updating SC Image:Remaining size 4297425
Updating SC Image:Remaining size 3969745
Updating SC Image:Remaining size 3642065
Updating SC Image:Remaining size 3314385
Updating SC Image:Remaining size 2986705
Updating SC Image:Remaining size 2659025
Updating SC Image:Remaining size 2331345
Updating SC Image:Remaining size 2003665
Updating SC Image:Remaining size 1675985
Updating SC Image:Remaining size 1348305
Updating SC Image:Remaining size 1020625
Updating SC Image:Remaining size 692945
Updating SC Image:Remaining size 365265
Updating SC Image:Remaining size 37585
Waiting for Storage Controller to complete programming.
Please wait...
Please wait...
Storage Controller has completed programming.
Updating SC Image:Remaining size 0
Storage Controller has been successfully updated.
STATUS: Current CPLD firmware is up-to-date.
CPLD update not required.
==========================================
Software Component Load Summary:
MC Software:    SUCCESSFUL
SC Software:    SUCCESSFUL
EC Software:    NOT ATTEMPTED
CPLD Software:  NOT ATTEMPTED
==========================================
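To make the retries in a log like the one above easier to spot, here is a small Python sketch that counts how many times the firmware was sent to the SC and which error codes came back. The function name and the shortened sample are my own, and the line matching is based only on the messages seen in this particular log, so other firmware versions may word things differently.

```python
import re

# Hypothetical helper: the name and return format are illustrative only.
def summarize_sc_update(log_text):
    """Count SC firmware send attempts and collect error codes from an update log."""
    attempts = 0
    error_codes = []
    succeeded = False
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        # Each attempt starts with this message in the log above.
        if line == "Sending new firmware to SC.":
            attempts += 1
        # Failed attempts report a numeric error code, e.g. (138).
        match = re.match(r"Got an error \((\d+)\) on firmware packet", line)
        if match:
            error_codes.append(int(match.group(1)))
        if line == "Storage Controller has been successfully updated.":
            succeeded = True
    return {"attempts": attempts, "errors": error_codes, "succeeded": succeeded}


# Shortened sample mirroring the log above: three failed attempts, then success.
failed_attempt = (
    "Sending new firmware to SC.\n"
    "Storage Controller has completed programming.\n"
    "Got an error (138) on firmware packet\n"
    "CAPI error: Firmware Update failed. Controller needs to reboot.\n"
)
good_attempt = (
    "Sending new firmware to SC.\n"
    "Storage Controller has completed programming.\n"
    "Storage Controller has been successfully updated.\n"
)
summary = summarize_sc_update(failed_attempt * 3 + good_attempt)
print(summary)
```

Run against the sample, this reports four attempts and three occurrences of error 138, which matches what the full log shows.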

During the Storage Controller restarting process, the controller never came back up. I removed the controller for 1 hour and re-inserted it, but the above fix did not work. I then tried again after 2 hours of disconnection.

At this point I contacted HPE, who is sending a replacement controller.

The following day (after the controller had been removed for 12 hours), I re-inserted it again and it actually booted up. It was working with the new firmware, and it then did a PFU (Partner Firmware Update) of controller A.

While it is working now, I’m still going to replace the controller as I believe something is not functioning correctly.