You may encounter a situation where you’re unable to connect to the management interface or NIC on your HPE MSA array. When this condition occurs, you are not able to ping the NIC, and the SMU (web interface) will not load.
When you visibly look at the array, the AMBER warning light may or may not be flashing.
If you have a dual controller setup, and connect to the SMU on the other controller, you may see numerous log entries where the management NIC port status changes repeatedly from up to down.
I’ve witnessed this issue occur on 2 separate HPE MSA 2040 storage arrays (both with dual controllers).
When you physically look at the management NICs on the controller in question, you’ll notice that the port status LED indicator turns on, and turns off repeatedly. The link status keeps changing from up to down (as reflected in the logs).
Restarting the unit will have no effect. Changing the network cable will have no effect.
To resolve this issue, you must play with the network cable and re-seat it a few times (possibly half-way if possible a couple times as sketchy as that sounds).
If you can get the link status up, and disconnect/reconnect the cable before the light turns off, the connection will stay up. It will continue to function and survive restarts until sometime in the future when you disconnect it and reconnect it.
Replacing the controller may also fix it, however in the first instance I observed this, the replacement controller exhibited the same behavior months later in the future.
This weekend I came across a big issue with my HPE MSA 2040 where one of the SAN controllers became unresponsive, and appeared to had failed because it would not boot.
It all started when I decided to clean the MSA SAN. I try to clean the components once or twice a year to remove dust and make sure it’s not getting all jammed up. Sometimes I’ll shut the entire unit down and remove the individual components, other times I’ll remove them while operating. Because of the redundancies and since I have two controllers, I can remove and clean each controller individually at separate times.
Please Note: When dusting equipment with fans, never allow the fans to spin up with compressed air as this can generate current which can damage components. Never allow compressed air flow to spin up fans.
After cleaning out the power supplies, it came time to clean the controllers.
As always, I logged in to the SMU to shutdown controller A (storage). I shut it down, the blue LED illuminated it was safe for removal. I then proceeded to remove it, clean it, and re-insert it. The controller came back online, and ownership of the applicable disk groups were successfully moved back. Controller A was now completed successfully. I continued to do the same for controller B: I logged in to shutdown controller B (storage). It shut down just like controller A, the blue LED removable light illuminated. I was able to remove it, clean it, and re-insert it.
However, controller B did not come back online.
After inserting controller B, the status light was flashing (as if it was booting). I waited 20 minutes with no change. The SMU on controller B was responding to HTTPS requests, however you could not log on due to the error “system is initializing”. SSH was functioning and you could log in and issue commands, however any command to get information would return “Please wait while this information is pulled from the MC controller”, and ultimately fail. The SMU on controller A would report a controller fault on controller B, and not provide any other information (including port status on controller B).
I then tried to re-seat the controller with the array still running. Gave it plenty of time with no effect.
I then removed the failed controller, shutdown the unit, powered it back on (only with controller A), and re-inserted Controller B. Again, no effect.
At this point I’m thinking the controller may have failed or died during the cleaning process. I was just about to call HPE support for a replacement until I noticed the “Power LED” light inside of the failed controller would flash every 5 seconds while removed.
This made me start to wonder if there was an issue writing the cache to the compact flash card, or if the controller was still running off battery power but had completely frozen.
I tried these 3 things on the failed controller while it was unplugged and removed:
I left the controller untouched for 1 hour out of the array (to maybe let it finish whatever it was doing while on battery power)
There’s an unlabeled button on the back of the controller. As a last resort (thinking it was a reset button), I pressed and held it for 20 seconds, waited a minute, then briefly pressed it for 1 second while it was out of the unit.
I removed the Compact Flash card from the controller for 1 minute, then re-inserted it. In hoping this would fail the cache copy if it was stuck in the process of writing cache to compact flash.
I then re-inserted the controller, and it booted fine! It was not functioning and working (and came up very fast). Looking at the logs, it has no record of what occurred between the first shutdown, and final boot. I hope this post helps someone else with the same issue, it can save you a support ticket, and time with a controller down.
PLEASE NOTE: I could not find any information on the unlabeled button on the controller, and it’s hard to know exactly what it does. Perform this at your own risk (make sure you have a backup). Since I have 2 controllers, and my MSA 2040 was running fine on Controller A, I felt comfortable doing this, as if this did reset controller B, the configuration would replicate back from controller A. I would not do this in a single controller environment.
Update – 24 Hours later
After I got everything up and running, I checked the logs of the unit and couldn’t find anything on controller B that looked out of ordinary. However, 24 hours later, I logged back in and noticed some new events showed up from the day before (from the day I had the issues):
You’ll notice the event log with severity error:
Recovery from internal processor fault detected on controller.
One thing that’s very odd is that I know for a fact the time is wrong on the error severity log entry, this could be due to the fact we had a daylight savings time change last night at midnight. Either way it appears that it finally did detect that the Storage controller was in an error state and logged it, but it still would have been nice for some more information.
On a final note, the unit has been running perfectly for over 24 hours.
Update – April 2nd 2019
Well, in March a new firmware update was released for the MSA. I went to upgrade and the same issue as above occurred. During the firmware upgrade, at one point of the firmware update process a step had failed and repeated 4 times until successful.
During the Storage Controller restarting process, the controller never came back up. I removed the controller 1 hour, re-inserted and the above fix did not work. I then tried it after 2 hours of disconnection.
At this point I contacted HPE, who is sending a replacement controller.
The following day (12 hours of controller removed), I re-inserted it again and it actually booted up, was working with the new firmware, and then did a PFU (Partner Firmware Update) of controller A.
While it is working now, I’m still going to replace the controller as I believe something is not functioning correctly.
So, what happens in a worst-case scenario where your backup system fails, you don’t have any VM snapshots, and the last thing standing in the way of complete data loss is your SAN storage systems LUN snapshots?
Well, first you fire whoever purchased and implemented the backup system, then secondly you need to start restoring the VM (or VMs) from your SAN LUN snapshots.
While I’ve never had to do this in the past (all the disaster recovery solutions I’ve designed and sold have been tested and function), I’ve always been curious what the process is and would be like. Today I decided to try it out and develop a procedure for restoring a VM from SAN Storage LUN snapshot.
For this test I pretended a VM was corrupt on my VMware vSphere cluster and then restored it to a previous state from a LUN snapshot on my HPE MSA 2040 (identical for the HPE MSA 2050, and MSA 2052) Dual Controller SAN.
To accomplish the restore, we’ll need to create a host mapping on the SAN for the LUN snapshot to a new LUN number available to the hosts. We then need to add and mount the VMFS volume (residing on the snapshot) to the host(s) while assigning it a new signature and then vMotion the VM from the snapshot’s VMFS to original datastore.
Important Notes (Read first):
When mounting a VMFS volume from a SAN snapshot, you MUST RE-SIGNATURE THE SNAPSHOT VMFS volume. Not doing so can cause problems.
The snapshot cannot be mapped as read only, VMFS volumes must be marked as writable in order to be mounted on ESXi hosts.
You must follow the proper procedure to gracefully dismount and detach the VMFS volume and storage device before removing the snapshot’s host mapping on the SAN.
We use Storage vMotion to perform a high-speed move and recovery of the VM. If you’re not licensed for Storage vMotion, you can use the datastore file browser and copy/move from the snapshot VMFS volume to live production VMFS volume, however this may be slower.
During this entire process you do not touch, modify, or change any settings on your existing active production LUNs (or LUN numbers).
Restoring a VM from a SAN LUN snapshot will restore a crash consistent copy of the VM. The VM when recovered will believe a system crash occurred and power was lost. This is NOT a graceful application consistent backup and restore.
Please read your SAN documentation for the procedure to access SAN snapshots, and create host mappings. With the MSA 2040 I can do this live during production, however your SAN may be different and your hosts may need to be powered off and disconnected while SAN configuration changes are made.
Pro tip: You can also power on and initialize the VM from the snapshot before initiating the storage vMotion. This will allow you to get production services back online while you’re moving the VM from the snapshot to production VMFS volumes.
I’m not responsible if you damage, corrupt, or cause any damage or issues to your environment if you follow these procedures.
We are assuming that you have already either deleted the damaged VM, or removed it from your inventory and renamed the VMs folder on the live VMFS datastore to change the name (example, renaming the folder from “SRV01” to “SRV01.bad”. If you renamed the damaged VM, make sure you have enough space for the new restored VM as well.
Mount the VMFS volume on the LUN snapshot to the ESXi host(s)
Identify the VM you want to recover, write it down.
Identify the datastore that the VM resides on, write it down.
Identify the SAN and identify the LUN number that the VMFS datastore resides on, write it down.
Identify the LUN Snapshot unique name/id/number and write it down, confirm the timestamp to make sure it will contain a valid recovery point.
Log on to the SAN and create a host mapping to present the snapshot (you recorded above) to the hosts using a new and unused LUN number.
Log on to your ESXi host and navigate to configuration, then storage adapters.
Select the iSCSI initator and click the “Rescan Storage Adapters” button to rescan all iSCSI LUNs.
VMware ESXi Host Rescan Storage Adapter
Ensure both check boxes are checked and hit “Ok”, wait for the scan to complete (as shown in the “Recent Tasks” window.
VMware ESXi Host Rescan Storage Adapter Window for VMFS Volume and Devices
Now navigate to the “Datastores” tab under configuration, and click on the “Create a new Datastore” button as shown below.
VMware ESXi Host Add Datastore Window
Continue with “VMFS” selected and select continue.
In the next window, you’ll see your existing datastores, as well as your new datastore (from the snapshot). You can leave the “Datastore name” as is since this value will be ignored. In this window you’re going to select the new VMFS datastore from the snapshot. Make sure you confirm this by looking at the LUN number, as well as the value under “SnapshotVolume”. It is critical that you select the snapshot in this window (it should be the new LUN number you added above).
Select next and continue.
On the next window “Mount Option”, you need to change the radio button to and select “Assign a new signature”. This is critical! This will assign a new signature to differentiate it from your existing real production datastore so that the ESXi hosts don’t confuse it.
Continue with the wizard and complete the mount process. At this point ESXi will resignture the VMFS volume and rename it to “snap-OriginalVolumeNameHere”.
You can now browse the VMFS datastore residing on the LUN snapshot and do anything you’d normally be able to do with a normal datastore.
Copy/Move/vMotion the VM from the snapshot VMFS volume to your production VMFS volume
Note: The next steps are only if you are licensed for storage vMotion. If you aren’t you’ll need to use the copy or move function in the file browsing area to copy or move the VMs to your live production VMFS datastores:
Now we’ll go to the vCenter/ESXi host storage area in the web client, and using the “Files” tab, we’ll browse the snapshots VMFS datastore that we just mounted.
Locate the folder for the VM(s) you want to recover, open the folder, right click on the vmx file for the VM and select “Register VM”. Repeat this for any of the VMs you want to recover from the snapshot. Complete the wizard for each VM you register and add it to a host.
Go back to you “Hosts and VMs” view, you’ll now see the VMs are added.
Select and right click on the VM you want to move from the snapshot datastore to your production live datastore, and select “Migrate”.
In the vMotion migrate wizard, select “Change Storage only”.
Continue to the wizard, and storage vMotion the VM from the snapshot VMFS to your production VMFS volume. Wait for the vMotion to complete.
After the storage vMotion is complete, boot the VM and confirm everything is functioning.
Gracefully unmount, detach, and remove the snapshot VMFS from the ESXi host, and then remove the host mapping from the SAN
On each of your ESXi hosts that have access to the SAN, go to the “Datastores” section under the ESXi hosts configuration, right click on the snapshot VMFS datastore, and select “Unmount”. You’ll need to repeat this on each ESXi host that may have automounted the snapshot’s VMFS volume.
On each of your ESXi hosts that have access to the SAN, go to the “Storage Devices” section under the ESXi hosts configuration and identify (by LUN number) the “disk” that is the snapshot LUN. Select and highlight the snapshot LUN disk, select “All Actions” and select “Detach”. Repeat this on each host.
Double check and confirm that the snapshot VMFS datastore (and disk object) have been unmounted and detached from each ESXi host.
You can now log in to your SAN and remove the host mapping for the snapshot-to-LUN. We will not longer present the snapshot LUN to any of the hosts.
Back to the ESXi hosts, navigate to “Storage Adapters”, select the “iSCSI Initiator Adapter”, and click the “Rescan Storage Adapters”. Repeat this for each ESXi host.
There’s a new and easier way to find the latest firmware for your HPE MSA SAN!
A new website setup by HPE allows you to find the latest firmware for your HPE MSA 2050/2052, MSA 1050, MSA 2040/2042/1040, and/or MSA P2000 G3. This site will include the last 3 generations of SANs in the MSA product line.
For the server, we purchased another HPE Proliant DL360p Gen8 (with 2 X 10 Core Processors, and 128Gb of RAM, exact same as our existing server), however I won’t be getting that in to this blog post.
Now for storage, we decided to pull the trigger and purchase an HPE MSA 2040 Dual Controller SAN. We purchased it as a CTO (Configure to Order), and loaded it up with 4 X 1Gb iSCSI RJ45 SFP+ modules (there’s a minimum requirement of 1 4-pack SFP), and 24 X HPE 900Gb 2.5inch 10k RPM SAS Dual Port Enterprise drives. Even though we have the 4 1Gb iSCSI modules, we aren’t using them to connect to the SAN. We also placed an order for 4 X 10Gb DAC cables.
To connect the SAN to the servers, we purchased 2 X HPE Dual Port 10Gb Server SFP+ NICs, one for each server. The SAN will connect to each server with 2 X 10Gb DAC cables, one going to Controller A, and one going to Controller B.
HPE MSA 2040 Configuration
I must say that configuration was an absolute breeze. As always, using intelligent provisioning on the DL360p, we had ESXi up and running in seconds with it installed to the on-board 8GB micro-sd card.
I’m completely new to the MSA 2040 SAN and have actually never played with, or configured one. After turning it on, I immediately went to HPE’s website and downloaded the latest firmware for both the drives, and the controllers themselves. It’s a well known fact that to enable iSCSI on the unit, you have to have the controllers running the latest firmware version.
Turning on the unit, I noticed the management NIC on the controllers quickly grabbed an IP from my DHCP server. Logging in, I found the web interface extremely easy to use. Right away I went to the firmware upgrade section, and uploaded the appropriate firmware file for the 24 X 900GB drives. The firmware took seconds to flash. I went ahead and restarted the entire storage unit to make sure that the drives were restarted with the flashed firmware (a proper shutdown of course).
While you can update the controller firmware with the web interface, I chose not to do this as HPE provides a Windows executable that will connect to the management interface and update both controllers. Even though I didn’t have the unit configured yet, it’s a very interesting process that occurs. You can do live controller firmware updates with a Dual Controller MSA 2040 (as in no downtime). The way this works is, the firmware update utility first updates Controller A. If you have a multipath (MPIO) configuration where your hosts are configured to use both controllers, all I/O is passed to the other controller while the firmware update takes place. When it is complete, I/O resumes on that controller and the firmware update then takes place on the other controller. This allows you to do online firmware updates that will result in absolutely ZERO downtime. Very neat! PLEASE REMEMBER, this does not apply to drive firmware updates. When you update the hard drive firmware, there can be ZERO I/O occurring. You’d want to make sure all your connected hosts are offline, and no software connection exists to the SAN.
Anyways, the firmware update completed successfully. Now it was time to configure the unit and start playing. I read through a couple quick documents on where to get started. If I did this right the first time, I wouldn’t have to bother doing it again.
I used the wizards available to first configure the actually storage, and then provisioning and mapping to the hosts. When deploying a SAN, you should always write down and create a map of your Storage area Network topology. It helps when it comes time to configure, and really helps with reducing mistakes in the configuration. I quickly jaunted down the IP configuration for the various ports on each controller, the IPs I was going to assign to the NICs on the servers, and drew out a quick diagram as to how things will connect.
Since the MSA 2040 is a Dual Controller SAN, you want to make sure that each host can at least directly access both controllers. Therefore, in my configuration with a NIC with 2 ports, port 1 on the NIC would connect to a port on controller A of the SAN, while port 2 would connect to controller B on the SAN. When you do this and configure all the software properly (VMWare in my case), you can create a configuration that allows load balancing and fault tolerance. Keep in mind that in the Active/Active design of the MSA 2040, a controller has ownership of their configured vDisk. Most I/O will go through only to the main controller configured for that vDisk, but in the event the controller goes down, it will jump over to the other controller and I/O will proceed uninterrupted until your resolve the fault.
First part, I had to run the configuration wizard and set the various environment settings. This includes time, management port settings, unit names, friendly names, and most importantly host connection settings. I configured all the host ports for iSCSI and set the applicable IP addresses that I created in my SAN topology document in the above paragraph. Although the host ports can sit on the same subnets, it is best practice to use multiple subnets.
Jumping in to the storage provisioning wizard, I decided to create 2 separate RAID 5 arrays. The first array contains disks 1 to 12 (and while I have controller ownership set to auto, it will be assigned to controller A), and the second array contains disk 13 to 24 (again ownership is set to auto, but it will be assigned to controller B). After this, I assigned the LUN numbers, and then mapped the LUNs to all ports on the MSA 2040, ultimately allowing access to both iSCSI targets (and RAID volumes) to any port.
I’m now sitting here thinking “This was too easy”. And it turns out it was just that easy! The RAID volumes started to initialize.
VMware vSphere Configuration
At this point, I jumped on to my vSphere demo environment and configured the vDistributed iSCSI switches. I mapped the various uplinks to the various portgroups, confirmed that there was hardware link connectivity. I jumped in to the software iSCSI imitator, typed in the discovery IP, and BAM! The iSCSI initiator found all available paths, and both RAID disks I configured. Did this for the other host as well, connected to the iSCSI target, formatted the volumes as VMFS and I was done!
I’m still shocked that such a high performance and powerful unit was this easy to configure and get running. I’ve had it running for 24 hours now and have had no problems. This DESTROYS my old storage configuration in performance, thankfully I can keep my old setup for a vDP (VMWare Data Protection) instance.
HPE MSA 2040 Pictures
I’ve attached some pics below. I have to apologize for how ghetto the images/setup is. Keep in mind this is a test demo environment for showcasing the technologies and their capabilities.
Update: HPE has updated the MSA product line and the 2040 has now been replaced by the HPE MSA 2050 SAN Dual Controller SAN. There are now also SSD Cache models such as the HPE MSA 2052 Dual Controller SAN.
Privacy & Cookies Policy