Well, recently one of the servers I monitor and maintain in a remote oil town started throwing a Windows event log warning:
Event ID: 129
Description: Reset to device, \Device\RaidPort0, was issued.
The server is an HP ML350p Gen8 (Windows Server 2008 R2) running the latest firmware and management software. It has two RAID arrays (one RAID 1 and one RAID 5) and a total of six disks.
Researching this error, I read that most people had it occur when running the latest HP WBEM providers, as well as anti-virus software. In our case, I tried downgrading the providers to an older version, but the warning still occurred. While we do have anti-virus, it’s not actively scanning (only weekly scheduled scans).
In the process of troubleshooting, I noticed under the HP System Management Homepage that one of the drives in the RAID 1 array had the following stats:
Hard Read Errors: 150
Recovery Read Errors: 7
Total Seeks: 0
Seek Errors: 0
In my experience, these numbers are very high. None of the other drives had anything close to this: in four years of running, only one other disk had logged a read error (a single one), while this disk had tons. For some reason the drive was still reporting as operational, when I’d expect it to be marked as a predicted failure, or failed.
While all the online documentation pointed towards software locking the array, from my own experience I think the array was actually stuck waiting on a read operation: this single disk’s read retries were pushing the I/O past a timeout threshold in the driver, which then issued the reset to recover the operation.
I called up HPE support and mentioned I’d like to have the drive replaced. The support engineer consulted her senior engineer, who reviewed the evidence I presented (along with ADU reports and Active Health monitoring reports) and concurred that the drive should be replaced.
Replacing the drive resolved the issue. I’m also noticing a performance increase on the array.
Make sure to always check the stats on the individual components of your RAID arrays, even if everything appears to be operating soundly. In this case, that was both the issue and the solution.
Since the time of this original post, I’ve discovered that this is a very generic error/warning that can be caused by a whole range of different issues.
Essentially it’s a warning that something isn’t right with the storage subsystem. It could be a hardware failure, a hardware misconfiguration, issues with systems the hardware depends on (such as the issue you posted in your comment), failed drives, backplanes, even malfunctioning drivers…
I had a similar error come up on a Windows 2012 terminal server, and it had me stumped for days. File Explorer was slow, and the RDP client took minutes to load and would ultimately fail. I reviewed the Windows system logs for the very first instance of this warning, and right before it was an Event ID 7 error: “CdRom0 has a bad block”. I initially ignored this error, thinking it was harmless. When I noticed it again on a second viewing a couple of days later, I went to the server and ejected a disc that had been sitting in the drive for some time. Upon doing that, the 129 warnings stopped and the system responded as expected.
First time I’ve ever solved a problem of that nature simply by ejecting a CD from the CD ROM drive.
Hopefully this helps some people who were as lost as I was.
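The triage step above — find the very first Event ID 129 and look at what was logged just before it — is easy to script once you have the System log exported. Here’s a minimal sketch in Python, assuming a hypothetical CSV export from Event Viewer with "TimeCreated", "Id", and "Message" columns (adjust the column names to match your actual export):

```python
# Sketch: locate the first Event ID 129 in a System log export and show
# the entries logged immediately before it. Assumes a CSV exported from
# Event Viewer with "TimeCreated", "Id", and "Message" columns -- these
# column names are assumptions; adjust them to match your export.
import csv

def first_129_with_context(path, context=3):
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Event Viewer often exports newest-first; sort oldest-first.
    rows.sort(key=lambda r: r["TimeCreated"])
    for i, row in enumerate(rows):
        if row["Id"] == "129":
            # Return the first 129 plus the events right before it --
            # that's where the real culprit (like the Event ID 7) hides.
            return rows[max(0, i - context):i + 1]
    return []
```

You can of course do the same thing by hand in Event Viewer by filtering on Event ID 129, sorting oldest-first, and reading upward from the first hit; the script just saves scrolling on a box with years of logs.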
Thanks for the heads-up! One drive had 18,000 read errors, 4 write errors, and 2 rebuilds, but was still listed as “OK” in SSA/iLO. Crazy!!