Nov 17 2018
 

When running VMware vSphere 6.5 (or later) and ESXi on your hosts with VMFS6, you may notice that auto unmap (space reclamation) is not working even though it is enabled. At the same time, you’ll find that manual unmap functions still work.

Why

This is because your storage array (SAN) may have an unmap granularity larger than 1MB. VMFS version 6 (source) requires an unmap granularity of 1MB, and does not support automatic unmap on arrays with a larger granularity.

For example, on the HPe MSA 2040 the page size when using virtual storage is 4MB, hence auto unmap is not supported and does not work. You can still perform manual unmap functions on arrays with block/page sizes larger than 1MB.
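To make the rule above concrete, here is a minimal Python sketch of the decision. It assumes the 1MB limit exactly as described in this post. The function names are hypothetical and purely for illustration. The only real command referenced is the manual `esxcli storage vmfs unmap -l <datastore>` call, which the sketch builds as an argument list but does not execute.

```python
from typing import List

MIB = 1024 * 1024

# Assumption taken from this post: VMFS6 automatic unmap only works when the
# array's unmap granularity is 1MB; larger granularities are not supported.
VMFS6_AUTO_UNMAP_MAX_GRANULARITY = 1 * MIB


def auto_unmap_supported(array_granularity_bytes: int) -> bool:
    """Return True if the array granularity fits within the 1MB limit."""
    if array_granularity_bytes <= 0:
        raise ValueError("granularity must be a positive number of bytes")
    return array_granularity_bytes <= VMFS6_AUTO_UNMAP_MAX_GRANULARITY


def manual_unmap_command(datastore_label: str) -> List[str]:
    """Build (but do not run) the manual unmap command for a datastore."""
    return ["esxcli", "storage", "vmfs", "unmap", "-l", datastore_label]


def reclamation_plan(datastore_label: str, array_granularity_bytes: int) -> str:
    """Describe which reclamation method applies to this datastore."""
    if auto_unmap_supported(array_granularity_bytes):
        return "auto unmap should work on " + datastore_label
    command = " ".join(manual_unmap_command(datastore_label))
    return "auto unmap unsupported, run manually: " + command


if __name__ == "__main__":
    # "MSA-Datastore" is a made-up datastore label; 4MB is the MSA 2040
    # virtual storage page size mentioned above.
    print(reclamation_plan("MSA-Datastore", 4 * MIB))
```

On an array like the MSA 2040 with 4MB pages, the sketch falls through to the manual command, which you would then run on the ESXi host yourself.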

Additional Information and Resources

Perform manual VMFS unmap on vSphere 6.5 and 6.7 with VMFS 6 – https://www.stephenwagner.com/2017/02/07/vmfs-unmap-command-vsphere-6-5-with-vmfs-6-auto/

vSphere 6.5, 6.7 and VMFS 6 – Change storage reclaim priority from low to medium or high – https://www.stephenwagner.com/2017/02/08/vsphere-vmfs-6-change-storage-reclaim-priority-low-medium-high/

Release unused space on host and guest filesystems with thin-provisioned Sophos UTM appliance (SW) – https://www.stephenwagner.com/2018/01/18/release-unused-space-vmdk-thin-provisioned-sophos-utm/

Nov 04 2018
 

This weekend I came across a big issue with my HPe MSA 2040 where one of the SAN controllers became unresponsive, and appeared to have failed because it would not boot.

It all started when I decided to clean the MSA SAN. I try to clean the components once or twice a year to remove dust and make sure it’s not getting all jammed up. Sometimes I’ll shut the entire unit down and remove the individual components, other times I’ll remove them while operating. Because of the redundancies and since I have two controllers, I can remove and clean each controller individually at separate times.

Please Note: When dusting equipment with fans, never allow compressed air to spin up the fans, as this can generate current which can damage components.

After cleaning out the power supplies, it came time to clean the controllers.

The Problem

As always, I logged in to the SMU to shut down controller A (storage). It shut down, and the blue LED illuminated to indicate it was safe for removal. I then proceeded to remove it, clean it, and re-insert it. The controller came back online, and ownership of the applicable disk groups was successfully moved back. Controller A was now complete. I continued to do the same for controller B: I logged in to shut down controller B (storage). It shut down just like controller A, and the blue LED removable light illuminated. I was able to remove it, clean it, and re-insert it.

However, controller B did not come back online.

After inserting controller B, the status light was flashing (as if it was booting). I waited 20 minutes with no change. The SMU on controller B was responding to HTTPS requests, however you could not log on due to the error “system is initializing”. SSH was functioning and you could log in and issue commands, however any command to get information would return “Please wait while this information is pulled from the MC controller”, and ultimately fail. The SMU on controller A would report a controller fault on controller B, and not provide any other information (including port status on controller B).

I then tried to re-seat the controller with the array still running. Gave it plenty of time with no effect.

I then removed the failed controller, shut down the unit, powered it back on (only with controller A), and re-inserted controller B. Again, no effect.

The Fix

At this point I’m thinking the controller may have failed or died during the cleaning process. I was just about to call HPe support for a replacement until I noticed the “Power LED” light inside of the failed controller would flash every 5 seconds while removed.

This made me start to wonder if there was an issue writing the cache to the compact flash card, or if the controller was still running off battery power but had completely frozen.

I tried these 3 things on the failed controller while it was unplugged and removed:

  1. I left the controller untouched for 1 hour out of the array (to maybe let it finish whatever it was doing while on battery power)
  2. There’s an unlabeled button on the back of the controller. As a last resort (thinking it was a reset button), I pressed and held it for 20 seconds, waited a minute, then briefly pressed it for 1 second while it was out of the unit.
  3. I removed the Compact Flash card from the controller for 1 minute, then re-inserted it. I was hoping this would fail the cache copy if it was stuck in the process of writing cache to compact flash.

I then re-inserted the controller, and it booted fine! It was now functioning and working (and came up very fast). Looking at the logs, there is no record of what occurred between the first shutdown and the final boot. I hope this post helps someone else with the same issue; it could save you a support ticket and time with a controller down.

Disclaimer

PLEASE NOTE: I could not find any information on the unlabeled button on the controller, and it’s hard to know exactly what it does. Perform this at your own risk (make sure you have a backup). Since I have 2 controllers, and my MSA 2040 was running fine on Controller A, I felt comfortable doing this, as if this did reset controller B, the configuration would replicate back from controller A. I would not do this in a single controller environment.

Update – 24 Hours later

After I got everything up and running, I checked the logs of the unit and couldn’t find anything on controller B that looked out of the ordinary. However, 24 hours later, I logged back in and noticed some new events showed up from the day before (from the day I had the issues):

MSA 2040 Code 549

You’ll notice the event log with severity error:

Recovery from internal processor fault detected on controller.
Code 549

One thing that’s very odd is that I know for a fact the time is wrong on the error severity log entry. This could be because we had a daylight saving time change last night at midnight. Either way, it appears that it finally did detect that the storage controller was in an error state and logged it, but it still would have been nice to have some more information.

On a final note, the unit has been running perfectly for over 24 hours.