Feb 20, 2013

Recently it was time to refresh a client’s disaster recovery solution. We were getting ready to retire our dependence on a 5-year-old HP MSL2024 with an LTO-4 tape drive and implement a new HP MSL2024 library with a SAS LTO-6 tape drive. We need to use tape since a full backup runs over 6TB.

The server connected to all this equipment is an HP ProLiant DL360 G6 with an HP Smart Array P800 controller. The P800 already has an HP StorageWorks MSA60 unit attached to it with 12 drives.

Documentation for the P800 mentioned tape drive support. While I know the P800 is only capable of 3Gb/sec, this is more than enough, and chances are the hard drives will be maxed out reading anyway.

Anyway, the client approved the purchase, and we brought in the hardware and installed it. First we had to install Backup Exec 2012 (since only the 2012 SP1a HCL specifies support for LTO-6), which was messy, but we did it. Then we re-configured all of our backup jobs, since the old jobs migrated horribly.

Our first backup failed. I tried again numerous times, only to get these errors:

  • Storage device “HP 07” reported an error on a request to rewind the media.
  • Final error: 0xe00084f0 – The device timed out.
  • Storage device “HP 07” reported an error on a request to write data to media.
  • Storage device “HP 6” reported an error on a request to write data to media.
  • PvlDrive::DisableAccess() – ReserveDevice failed, offline device
  • ERROR = 0x0000001F (ERROR_GEN_FAILURE)

Also, every time the backup failed, the library and the tape drive would disappear from the computer’s Device Manager. Essentially, the device would lose its connection. Even the HP MSL2024 web interface would state that the SAS port was disconnected after a backup job failed. To resolve this, you’d have to restart the library and restart the Backup Exec services. One interesting thing: when this occurred, my company’s monitoring and management software would report that a RAID failure had occurred at the customer’s site until the MSL was restarted (this was kinda cool).
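
If you’d rather not stare at Device Manager waiting for the drive to vanish, you can poll for it programmatically. Below is a minimal Python sketch using the third-party wmi package and the Win32_TapeDrive WMI class; the poll interval is arbitrary, and this is just an illustration of the idea, not an HP- or Symantec-supplied tool.

```python
# Crude watchdog: poll WMI for tape drives and log the moment Windows
# stops enumerating them (the failure signature described above).
# Requires the "wmi" package (pip install wmi) on the media server.
import time
import wmi

POLL_SECONDS = 60  # arbitrary; how often to re-check

def tape_drive_present(conn):
    """True if Windows still enumerates at least one tape drive."""
    return len(conn.Win32_TapeDrive()) > 0

def main():
    conn = wmi.WMI()
    was_present = tape_drive_present(conn)
    while True:
        present = tape_drive_present(conn)
        if was_present and not present:
            print(time.strftime("%Y-%m-%d %H:%M:%S"),
                  "tape drive disappeared from Device Manager!")
        was_present = present
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()
```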


I immediately called HP support. They mentioned the library was at firmware 5.80 and asked me to try updating it. We did, and the update failed since the firmware file didn’t match its checksum; I was told this was not important, as 5.90 doesn’t contain any major changes. We then spent 6 hours on the phone disabling Insight agents, checking drivers, etc… Finally, the tech decided to replace the tape drive.

Since LTO-6 is brand-new technology, even with a 4-hour response contract it took HP around 2 weeks to replace the drive, since none were in Canada. During this time, I called twice more. The second tech told me that, at the moment, no HP controllers supported the HP LTO-6 tape drives (you’re kidding me, right?), and the third said he couldn’t provide me any information, as there was nothing in the documentation specifying which controllers were compatible. All three techs mentioned that having the P800 controller in the server host both the MSA60 and the MSL2024 was probably causing the issues.

We received the new tape drive, tested it, and the backups still failed. I sent the replacement drive back (it was a repaired unit) and kept the original brand-new one. After this I tried numerous things and googled for days. I was just about to quote the client a new controller card when I finally decided to give HP another call.

On this call, the tech escalated the issue to the engineers. Later that night I received an e-mail stating that library firmware 5.90 is required to support the LTO-6 tape drives. I was shocked, angry, etc… It turns out that library firmware 5.80 was recalled due to major issues a while back.

Since LTT couldn’t load the firmware, I downloaded it manually and flashed it via the MSL2024 web interface. After this I restarted the Backup Exec services, performed an inventory, and ran a small backup (around 130GB). Keep in mind that when the backups originally failed, the size didn’t matter; the backup would simply fail just before it completed.
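
Restarting the Backup Exec services by hand gets old quickly, so scripting it helps. Here’s a rough Python sketch that assumes the service display names all start with “Backup Exec” (verify on your install) and that it’s run from an elevated prompt; it needs the psutil package.

```python
# Restart every Windows service whose display name starts with "Backup Exec".
# Run from an elevated prompt; requires psutil (pip install psutil).
import subprocess
import psutil

def backup_exec_services():
    """Collect the short names of services that look like Backup Exec's."""
    return [svc.name() for svc in psutil.win_service_iter()
            if svc.display_name().startswith("Backup Exec")]

def main():
    names = backup_exec_services()
    # Stop everything first (/y also stops dependent services),
    # then start them all again.
    for name in names:
        subprocess.run(["net", "stop", name, "/y"], check=False)
    for name in names:
        subprocess.run(["net", "start", name], check=False)

if __name__ == "__main__":
    main()
```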

The backup completed! Later that night I ran a full backup of 5TB (2 servers and 2 MSA60s), and it completed 100% successfully. Even with the MSA60 under extreme load, maxing out the drives, this did not in any way impede the performance of the LTO-6 tape drive/library.


So please, if you’re having this issue, consider the following:

1. The tape library must be at firmware version 5.90 to support LTO-6 tape drives. Always, always, always make sure you have the latest firmware.

2. I have a working configuration of a P800 controlling both an HP MSA60 and an HP MSL2024 backup library, and it’s working 100%.

3. Make sure you have Backup Exec 2012 SP1a installed, as it’s required for LTO-6 compatibility (make sure you read about the major changes before upgrading to 2012; I can’t stress this enough!!!)


I hope this helps some of you out there, as this consumed my life for numerous weeks.

Nov 22, 2012

Just something I wanted to share in case anyone else ran into this issue…

At a specific client we have 2 X MSA60 units attached via Smart Array P800 controllers to 2 X DL360 G6 servers. These combos of server, controller, and storage unit were purchased just after HP originally released them.

I’m writing about a specific condition in which, after a drive fails in RAID 5, numerous (and I mean over 70,000) event log entries appear in the event viewer during the rebuild, stating: “Surface analysis has repaired an inconsistent stripe on logical drive 1 connected to array controller P800 located in server slot 2. This repair was conducted by updating the parity data to match the data drive contents.”
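
If you want to see how bad it is on your own server, you can count the entries instead of scrolling through the event viewer. Here is a quick Python sketch using the wmi package; the match string is taken from the entry quoted above, and scanning Win32_NTLogEvent over tens of thousands of records will be slow.

```python
# Count the surface-analysis repair entries in the System event log.
# Requires the "wmi" package (pip install wmi). Scanning Win32_NTLogEvent
# on a log with tens of thousands of records is slow; this is a one-off
# diagnostic, not something to schedule.
import wmi

MARKER = "Surface analysis has repaired an inconsistent stripe"

def main():
    conn = wmi.WMI()
    count = 0
    for event in conn.Win32_NTLogEvent(Logfile="System"):
        if event.Message and MARKER in event.Message:
            count += 1
    print("Surface-analysis repair events:", count)

if __name__ == "__main__":
    main()
```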


On one of these arrays, shortly after a successful rebuild, while the event viewer was spitting these errors out, another drive failed. At this point the RAID array went offline, and the entire array and all its contents were unrecoverable. Keep in mind this occurred after the rebuild, while a surface scan was in progress. In this specific case we rebuilt the array, restored from backup, and all was good. When I mentioned this to HP support techs, they said it was safe to ignore these messages, as they were merely informational (I didn’t feel this was the case). After creating the new RAID array on this specific unit, we never saw these messages on that unit again.

On the other MSA60 unit, however, we regularly received these messages (we always keep the firmware of the MSA60 unit and the P800 controller up to date). Again, I asked HP support numerous times, and they said we could safely ignore them. Recently, during a power outage, the P800 controller flagged its cache batteries as failed; at the same time a drive failed, and we were yet again presented with these errors after the rebuild. After getting the drive replaced, I contacted HP again and finally insisted that they investigate the event log errors. This time, new errors about parity were presenting themselves in the event viewer.

After putting me on hold for some time, they came back and mentioned that these errors are probably caused by the RAID array having been created with a very early firmware version. They recommended deleting the logical array and re-creating it with the latest firmware to avoid any data loss. I specifically asked if there was a chance the array could fail due to these errors and the fact that it was created with an early firmware version, and they confirmed it. I went ahead, created backups, deleted the array, re-created it, restored the backup, and the errors are no longer present.
 

I just wanted to create this blog post, as I see numerous people searching for the meaning of these errors; hopefully it sheds some light, helps a few of you out, and helps you avoid future catastrophic problems!

Jun 03, 2012

Well, for the longest time I have been running a vSphere 4.x cluster (1 X ML350 G5, 2 X DL360 G5) off of a pair of HP MSA20s connected to a SuperMicro server running Lio-Target as an iSCSI target on CentOS 6.

This configuration has worked perfectly for me for almost a year, requiring absolutely no maintenance/work at all.

Recently I moved, so I had to move all my servers, storage units, etc… When I got into the new place and went to power everything up, I noticed that the first drive had failed upon initializing one of the MSA20 units. I replaced the drive, let it rebuild, and thought this would be the end of the issue, but I was incorrect. (Just so everyone knows, these units had been on continuously for 8+ months before I turned them off to move.)

For months since that failure and successful rebuild, at times of high I/O (backing up to an NFS share using GhettoVCB), the logical drive in the array has just disappeared. I have each MSA20 connected to its own HP Smart Array 6400/6402 controller. When the logical drive disappears, I notice that the “Drive Failure” LED on the SA 640x controller illuminates. When this happens, I have to shut off all the physical servers, the storage server (running Lio-Target), and the MSA20s, and restart everything.
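
On CentOS 6, the Smart Array 6400-series controllers use the cciss driver, which exposes logical drives as /dev/cciss/cXdY device nodes, so the storage server can at least catch the disappearance the moment it happens. Here is a trivial Python sketch; the device path is an assumption, so adjust it for your controller and logical drive numbering.

```python
# Watch the Smart Array logical drive node and log the moment it vanishes.
# On CentOS 6 the SA 6400-series uses the cciss driver, which exposes
# logical drives as /dev/cciss/cXdY; the node below is an assumption,
# adjust it for your controller and logical drive numbering.
import os
import time

DEVICE = "/dev/cciss/c0d0"  # first logical drive on the first controller
POLL_SECONDS = 30

def main():
    while True:
        if not os.path.exists(DEVICE):
            print(time.strftime("%Y-%m-%d %H:%M:%S"), DEVICE, "is gone!")
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()
```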

Sometimes it is worse than others (for example, I’ve been dealing with this issue non-stop all weekend with no sleep). Even under low I/O, I’ll be starting the VMs and the array will just lose the logical drive again; other times, I can run for weeks as long as I keep the I/O to a minimum.

I’ve read numerous articles and posts from other people having the same issue. These people have had HP replace every single item inside the MSA20 unit (except the drives), and the issue still occurs. This got me thinking.

This weekend, I NEEDED to get a backup done. While doing this, the issue occurred, and it got to the point where I couldn’t even start the VMs, let alone back them up. I figured that since other people have had this issue, and since replacing all the hardware hasn’t fixed it (I even moved a RAID array from one MSA20 to my other one with no effect), it had to be the drives themselves.

There are two possibilities: either the drives have failed and the MSA20 isn’t reporting them as failed (which I’ve seen happen), or the way the MSA20 creates the RAID array has issues. I ran an ADU report and carefully read the entire thing: absolutely no issues reading from or writing to the drives, and the MSA20 has no events in its log. It HAS to be the way the RAID array is created on the disks.
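
For anyone wanting to run the same sort of health check from the CentOS side, hpacucli (HP’s Array Configuration Utility CLI) can dump the controller configuration, and it’s easy to scan the output for anything not reporting OK. A rough Python sketch follows; the exact output format varies between hpacucli versions, so treat the parsing as an assumption.

```python
# Dump the Smart Array configuration with hpacucli and flag any component
# whose status line doesn't read OK. The "Status:" line format is an
# assumption based on common hpacucli versions.
import subprocess

def main():
    out = subprocess.run(
        ["hpacucli", "ctrl", "all", "show", "config", "detail"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if "Status:" in line and "OK" not in line:
            print("suspect:", line.strip())

if __name__ == "__main__":
    main()
```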

Please do not try this unless you know exactly what you’re doing. This is very dangerous and you can lose all your data. This is just me reporting on my experience and usage.

In desperation, I thought to myself: this all started when a drive failed and I put a new disk in the array. Since I couldn’t back any of my data up, let alone even start the VMs, I decided to start behaving dangerously. I kept all my vSphere hosts offline and turned on just the MSA20 units and the SuperMicro server attached to them. I then proceeded to remove a drive, re-insert it, and let it rebuild over 3 hours; then, once it was healthy and rebuilt, I did the same to the next drive. I have a RAID 5 array containing 4 X 500GB disks, so this actually took me a day (remove/re-insert, rebuild, then the next drive).
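
If you attempt the same procedure, hpacucli can also tell you when the logical drive is healthy again, so you don’t pull the next disk too early. Here is a minimal sketch, assuming controller slot 1 and logical drive 1; adjust both for your setup.

```python
# Poll hpacucli until the logical drive reports OK again, so you know it's
# safe to pull and re-seat the next disk. The slot and logical drive
# numbers are assumptions for illustration; adjust them for your setup.
import subprocess
import time

SLOT = "1"           # controller slot ("hpacucli ctrl all show" lists them)
LOGICAL_DRIVE = "1"  # logical drive number
POLL_SECONDS = 300   # MSA20 rebuilds take hours, so poll gently

def logical_drive_status():
    """Return hpacucli's one-line status for the logical drive."""
    out = subprocess.run(
        ["hpacucli", "ctrl", "slot=" + SLOT,
         "ld", LOGICAL_DRIVE, "show", "status"],
        capture_output=True, text=True, check=True,
    ).stdout
    return " ".join(out.split())

def main():
    while True:
        status = logical_drive_status()
        print(time.strftime("%H:%M:%S"), status)
        if "OK" in status:
            print("Rebuild complete; safe to move on to the next drive.")
            break
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()
```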

After removing and rebuilding each drive in the array, I finally decided to boot up the vSphere servers and run a backup of my 12 VMs. Not only did everything seem faster, but the backup completed without any problems. This suggests a very high chance that the on-disk RAID metadata was either corrupt, damaged, or just not written very well, and that rebuilding each drive fixed it. I’ll report back in a few weeks to let you know for sure that it’s resolved!


Hope this helps someone out there!

Jan 26, 2012

Well, for all you people out there considering extending your MSA20 RAID array or transforming the RAID type, but who are concerned about how long it will take…

I recently added a 250GB drive to a RAID 5 array consisting of 9 X 250GB disks. Adding the disk took less than 8 hours (it could actually have been WAY less). Extending the logical partition afterwards took no time at all.

One thing I do have to caution about, though: I did a test transformation converting a RAID 5 array to RAID 6. It started off fast, but once it hit 25% it sat there, only increasing 1% every 1-2 days. After 4 days I finally killed the transformation. PLEASE NOTE: there is a chance a damaged drive had something to do with this; it will need further testing. Also, just so you are aware, you CANNOT cancel a transformation. I stopped mine by simply turning off the unit, and ALL data was destroyed. If you start a transformation, you NEED to let it complete.

ALWAYS ensure you have a COMPLETE backup before doing these types of things to a RAID array!

Sep 22, 2010

Just a note for you guys. I was searching for this and it was incredibly hard to find any information, so I thought I’d create something that could be easily indexed and found when searched for.

Symptoms:

You try to upgrade the processor in a DL360 G5 to a quad-core processor and receive the message:

“The revision of the Intel (R) 5000 series chipset on the system board does not support the installed processor type. System halted!”

Cause:

There are two separate motherboard revisions that shipped with the DL360 G5. The first revision only supports 50xx and 51xx dual-core Xeon processors and shipped with the earlier units.

The second revision supports 54xx quad-core Xeon processors.

Part Numbers:

412199-001 – Only supports 50xx and 51xx dual-core processors.

436066-001 – Supports up to 54xx quad-core processors.

435949-001 – (Not confirmed) Supports up to 54xx quad-core processors. (Updated March 22nd, 2011)
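
If you want to check which board revision you have before ordering a CPU, the board identifiers can sometimes be read from WMI without opening the chassis. Below is a rough Python sketch using the Win32_BaseBoard class; whether HP actually populates PartNumber with the spare part number on the G5 is an assumption on my part, so treat the label on the board itself as authoritative.

```python
# Read the system board identifiers from WMI and compare them against the
# part numbers listed above. Requires the "wmi" package (pip install wmi).
# Whether HP populates PartNumber with the spare part number on the G5 is
# an assumption; the label on the board itself is authoritative.
import wmi

QUAD_CORE_BOARDS = {"436066-001", "435949-001"}

def main():
    conn = wmi.WMI()
    for board in conn.Win32_BaseBoard():
        part = (board.PartNumber or "").strip()
        print("Product:", board.Product, "| PartNumber:", part or "n/a")
        if part in QUAD_CORE_BOARDS:
            print("Looks like the quad-core-capable revision.")
        elif part:
            print("Not in the known quad-core list; check the board label.")

if __name__ == "__main__":
    main()
```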