Nov 22 2012
 

Just something I wanted to share in case anyone else ran into this issue…

At a specific client we have 2 x MSA60 units attached via Smart Array P800 controllers to 2 x DL360 G6 servers. This combination of servers, controllers, and storage units was purchased just after HP originally released them.

I’m writing about a specific condition in which after a drive fails in RAID 5, during rebuild, numerous (and I mean over 70,000) event log entries in the event viewer state: “Surface analysis has repaired an inconsistent stripe on logical drive 1 connected to array controller P800 located in server slot 2. This repair was conducted by updating the parity data to match the data drive contents.”
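For anyone wondering what that message means mechanically, here is a minimal Python sketch of generic RAID 5 parity. It is only an illustration under simplifying assumptions: the function names and the tiny two-byte blocks are hypothetical, and this is not the P800 firmware's actual implementation. In RAID 5 the parity block of a stripe is the XOR of its data blocks, so a stripe is "inconsistent" when the stored parity no longer matches that XOR, and "updating the parity data to match the data drive contents" amounts to recomputing it.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity):
    """A stripe is consistent when the stored parity equals the XOR of its data blocks."""
    return xor_blocks(data_blocks) == parity


def repair_parity(data_blocks):
    """Repair by trusting the data blocks and recomputing the parity from them."""
    return xor_blocks(data_blocks)


def rebuild_missing(surviving_data_blocks, parity):
    """Reconstruct one missing data block from the surviving blocks plus parity."""
    return xor_blocks(list(surviving_data_blocks) + [parity])


# Hypothetical stripe: three data blocks of two bytes each.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
good_parity = xor_blocks(data)
stale_parity = b"\x00\x00"  # pretend the parity on disk is out of date

print(stripe_is_consistent(data, stale_parity))  # inconsistent stripe
print(stripe_is_consistent(data, repair_parity(data)))  # consistent after repair
print(rebuild_missing([data[0], data[2]], good_parity) == data[1])
```

Note that this kind of repair assumes the data blocks are the correct side of the mismatch. A rebuild after a drive failure depends on the parity being right, which is why a flood of these messages around a rebuild made me uneasy.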

 

On one of these arrays, shortly after a successful rebuild and while the event viewer was still spitting out these errors, another drive failed. At this point the RAID array went offline, and the entire array and all its contents were unrecoverable. Keep in mind this occurred after the rebuild, while a surface scan was in progress. In this specific case we rebuilt the array, restored from backup, and all was good. When I mentioned this to HP support techs, they said it was safe to ignore these messages, as they were purely informational (I didn’t feel this was the case). After creating the new RAID array on this specific unit, we never saw these messages on that unit again.

On the other MSA60 unit, however, we regularly received these messages (we always keep the firmware of the MSA60 unit and the P800 controller up to date). Again, I asked HP support numerous times, and they said we could safely ignore them. Recently, during a power outage, the P800 controller flagged its cache batteries as failed; at the same time a drive failed, and we were yet again presented with these errors after the rebuild. After getting the drive replaced, I contacted HP again and finally insisted that they investigate the event log errors. This time, new errors about parity were also appearing in the event viewer.

After putting me on hold for some time, they came back and mentioned that these errors were probably caused by the RAID array having been created with a very early firmware version. They recommended deleting the array and re-creating it with the latest firmware to avoid any data loss. I specifically asked whether the array could fail because of these errors and the fact that it was created with an early firmware version, and they confirmed that it could. I went ahead, created backups, deleted the array, re-created it, and restored the backup; the errors are no longer present.

 

I just wanted to create this blog post, as I see numerous people are searching for the meaning of these errors, and wanted to shed some light and maybe help a few of you out, to help you avoid any future catastrophic problems!

  2 Responses to “MSA60 during rebuild: “Surface analysis has repaired an inconsistent stripe on logical drive 1 connected to array controller P800 located in server slot 2. This repair was conducted by updating the parity data to match the data drive contents.””

  1. Hi Stephen,

    I’ve read some of your blog posts about the MSA60, and I think perhaps you could advise me on the problem I’m experiencing. At work we have one MSA60 with six 750 GB SATA disks in RAID5 (5 disks + 1 hot spare). Recently the MSA60 detected 5 bad blocks on disk 3, so I decided to fail that disk and rebuild the RAID onto disk 2, which was the hot spare. The rebuild succeeded, but after several minutes the array became degraded again and many bad blocks (a lot more than on disk 3) were reported on disk 2. So I decided to set disk 3 as hot spare and fail disk 2, and rebuilt the RAID again, which succeeded. Then I ordered replacements for disks 2 and 3 from our usual supplier, and they sent two 1 TB SATA disks. I unplugged disk 2 and replaced it with a new one, set disk 2 as hot spare, and then set disk 3 to failed. The automatic rebuild started but failed after 8 hours, and then disks 2 and 3 disappeared from Storage Manager. So I restarted the server, after which both disks appeared again and the automatic rebuild started again, but again it failed after a few hours, with disks 2 (new) and 3 disappearing in the same way.

    MSA60 connected via Adaptec RAID Controller 3085 to Windows 2003 Server Std.

    Hope you could help.

    Thanks

    Rafa.

  2. Hi Rafa,

    First and foremost, I would have to say that I recommend only using HP SmartArray controllers (that are “supported”) with the HP MSA60.

    With RAID5 + hot spare, the hot spares should always be the last disks of the array. For example, with 6 disks (1 being a hot spare), the RAID data should be on disks 1 through 5, with disk 6 being the hot spare.

    I’m somewhat confused by your mention of actually “setting” numerous disks as hot spares, as this doesn’t really make sense.

    Typically, in a RAID 5 + hot spare setup, if a disk in the array were to fail, the hot spare (in this case disk 6) would initialize and the array would rebuild the data from the damaged disk onto it. This would re-establish redundancy and protection. One would then replace the failed disk, the array would rebuild the data that should be on the disk that failed, and then the hot spare would be de-activated and made available again for another failure event. There should be no manual setting of hot spares once the array is created, unless of course you’re adding more hot spares or doing some type of maintenance on the array.

    If you could clarify that, I may be able to offer some assistance; however, I would recommend backing up, getting your hands on a “supported” SmartArray controller, creating a new RAID array, and then restoring the data back onto the MSA60.

    Stephen
