Well, for the longest time I have been running a vSphere 4.x cluster (1 X ML350 G5, 2 X DL360 G5) off of a pair of HP MSA20’s connected to a SuperMicro server running Lio-Target as a iSCSI Target on CentOS 6.
This configuration has worked perfectly for me for almost a year, requiring absolutely no maintenance/work at all.
Recently I moved, so I had to move all my servers, storage units, etc… When I got in to the new place, and went to power everything up, I noticed that my first drive had failed upon initializing one of the MSA20 units.. I replaced this drive, let it rebuild, and thought this would be the end of the issue, but I was incorrect. (Just so everyone knows, these units had been on continuously for 8+ months before turning them off to move).
For months since this happened and a successful rebuild, at times of high I/O (backing up to a NFS share using GhettoVCB), the logical drive in the array just disappears. I have each MSA20 connected to it’s own HP SmartArray 6400/6402 controller. When the logical drive disappears, I notice that the “Drive Failure” LED on the SA 640x controller illuminates. When this happens, I have to shut off all physical servers, the storage server (running Lio-Target), and the MSA’s, and restart everything.
Sometimes it is worse than others (example I’ve been dealing with this issue non-stop all weekend with no sleep). Even under low I/O I’ll be starting the VMs and it will just lose it again. While other times, I can run it for weeks as long as I keep the I/O to a minimum.
I’ve read numerous other articles and posts of other people having the same issue. These people have had HP replace every single item inside of the MSA20 unit (except the drives) and the issue still occurs. This made me start thinking.
This weekend, I NEEDED to get a backup done. While doing this, the issue occurred, and it got to the point where I couldn’t even start the VMs, let alone back them up. I figured that since other people have had this issue, and since replacing all hardware hasn’t fixed it (I even moved a RAID array from one MSA20 to the other I have with no affect), then it has to be the drives themselves.
There are two possibilities, either the drives are failed, and the MSA20 isn’t reporting them as failed (which I’ve seen happen), or the way the MSA20 creates the RAID array has issues. I ran a ADU report and carefully read the entire report. Absolutely no issues reading/writing to the drives, and the MSA20 has no events in it’s log. It HAS to be the way the RAID array is created on the disks.
Please do not try this unless you know exactly what you’re doing. This is very dangerous and you can lose all your data. This is just me reporting on my experience and usage.
In desperation I thought to myself this all happened when a drive failed and I put a new disk in the array. Since I couldn’t back any of my data up, or let alone even start the VMs, I decided to start behaving dangerously. I kept all my vSphere hosts offline, and just turned on the MSA20 units and the SuperMicro server that is attached to them. I then proceeded to remove a drive, re-insert it, let it rebuild over 3 hours, then when healthy and rebuilt, do it to the next drive. I have a RAID 5 Array containing 4 X 500GB disks, so this actually took me a day to do (had to remove/re-insert, rebuild, then next drive).
After finally removing and rebuilding each drive in the array, I finally decided to boot up the vSphere servers, and run a backup of my 12 VMs. Not only did everything seem faster, but I completed the backup without any problems. This shows that there’s a very high chance that the RAID stuff on the drives is either corrupt, damaged, or just wasn’t implemented very nicely. Rebuilding each drive seemed to fix this. I’ll report in a few weeks to let you know for sure that it’s resolved!
Hope this helps someone out there!