Double Faults

Any RAID system is designed around the concept of using redundancy to provide safety. Other than RAID level 1 (mirrors), they do this by striping N data bits and 1 parity bit across N+1 drives. If you lose a drive, you're fine, because you still have all the data; replace the failed drive, and the RAID controller or host software will recalculate the missing bits -- this is the rebuild process. Indeed, this is why most RAIDs have hot spares in them, so that the rebuild can begin immediately, without waiting for a human to swap out the failed drive. If you lose a second drive before or during the rebuild, you're dead meat -- you lose all the data on the entire RAID set.
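
(For the curious: the parity idea really is just XOR. Here is a tiny Python sketch of my own -- nothing to do with any particular controller, block contents invented -- showing how any one missing block can be rebuilt from the rest, and why losing two is fatal.)

    # Minimal sketch of XOR parity, the idea behind parity-striped RAID rebuilds.
    # Block size and data are made up for illustration.

    def xor_blocks(blocks):
        """XOR a list of equal-length byte blocks together."""
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                result[i] ^= b
        return bytes(result)

    # N data blocks plus one parity block across N+1 "drives"
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(data)
    stripe = data + [parity]

    # Lose any single "drive": the missing block is the XOR of the survivors.
    lost = 1
    survivors = [blk for i, blk in enumerate(stripe) if i != lost]
    rebuilt = xor_blocks(survivors)
    assert rebuilt == stripe[lost]

    # Lose a second drive before the rebuild finishes and there is no longer
    # enough information left to recover -- the double fault this document is about.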

RAID manufacturers will tell you that such a double fault is extremely unlikely, since the high mean time between failures of modern disk drives makes it highly improbable that a second drive will fail during that brief interval. It is mathematically unlikely that you would ever see a double fault during your entire career.

I have seen it happen ten times.

The good news is that none of those ten times were real, and that we were able to recover all of the data, every single time.

The manufacturers are not lying. Indeed, I want to emphasize that although I am naming RAID controllers in this document, I have no beef with any of these controllers or their manufacturers. They have all performed extremely well, operating without a hitch for years on end. It has been my observation that glitches occur much more often in the parts of the RAID external to the controller (or host), such as the disk trays, cables, or other ancillary bits. Bear this in mind if you are tempted to "roll your own" RAID or to buy from the cheapest vendor without regard to quality of the subsidiary mechanical and electrical parts.
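
They are not lying about the arithmetic, either. Here is a back-of-the-envelope version of it, with numbers I have made up purely for illustration (they are not vendor figures). The catch is that this math assumes the drives fail independently, which is exactly what a power hit or a flaky tray violates.

    # Back-of-the-envelope odds of a second failure during a rebuild window.
    # The MTBF, rebuild time, and drive count are illustrative guesses only.
    mtbf_hours = 500_000        # assumed per-drive mean time between failures
    rebuild_hours = 10          # assumed rebuild window
    surviving_drives = 8        # drives that must stay alive during the rebuild

    # Rough independent-failure approximation: each survivor fails during the
    # window with probability ~ rebuild_hours / mtbf_hours.
    p_second_failure = surviving_drives * rebuild_hours / mtbf_hours
    print(f"~{p_second_failure:.5f} chance per rebuild")   # ~0.00016, roughly 1 in 6000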

Mylex DAC-960

In most cases, double faults on DAC-960 systems seem to be the result of momentary tray failures at power-up or some similar glitch: the controller probes quickly, before the drives are once again ready, and marks them failed. I have seen as many as ten "failed" drives in one of these RAIDs.

Whether it is this simple case of a power hit or the like confusing the tray electronics or the controller, or something else entirely, I have just gone into the Configuration menu and changed all the drives listed as D (for dead) back to O (for online). This is a case where keeping track of which disk is the spare can really pay off: if the spare is one of the D drives, you must mark it S (for spare), not O. I've had to do this mass re-statusing a few times, and it has always worked fine.

I've had an instance where a drive failed, and during the rebuild, another drive "failed". In the DAC-960's Toolkit menu, the original spare will now be part of the RAID set, the original failed drive will be marked D and not part of a RAID set, and the second failed drive will be marked D but will still be part of the RAID set. In this situation, I was able to use the Configuration menu to mark the second failed drive O (online), then reboot the host and copy the data off. There were some real errors on the drive, so I had the folks doing the actual copying leave those files until last, then copy them one at a time, to keep from triggering a flood of errors. The data did all come back uncorrupted. This may actually have been a "real" double fault, but fortunately we were able to recover.
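
For what it's worth, here is a rough Python sketch of the kind of two-pass copy we did largely by hand (the paths are hypothetical; the real job was ordinary copy commands and a list on paper):

    # Two-pass copy off a shaky drive: copy everything that copies cleanly
    # first, then retry the troublesome files one at a time, last.
    # Source and destination paths are hypothetical.
    import os, shutil

    src, dst = "/mnt/shaky_raid", "/mnt/safe_disk"
    failed = []

    for root, dirs, files in os.walk(src):
        for name in files:
            path = os.path.join(root, name)
            target = os.path.join(dst, os.path.relpath(path, src))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            try:
                shutil.copy2(path, target)
            except OSError as err:
                failed.append((path, err))   # save the bad ones for the second pass

    # Second pass: the files that threw I/O errors, one at a time.
    for path, err in failed:
        print("retrying", path, "(first error was:", err, ")")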

CMD CRD-5500

This was a scary one. It failed a drive and rebuilt just fine. I pulled the failed drive and inserted a replacement. While I was in the menus, getting to where you mark the new drive as a hot spare, all the "0"s (the RAID set number) turned to "NA" and the beeping began. It turned out that the replacement drive had its terminator jumpers set wrong, which fouled up the two other drives on its SCSI bus. All was well after pulling that drive. It was a bit confusing because we had recently been recabling, and because in our setup the drives on the same bus were not physically neighbors. Since the failure doesn't say "Hey, I can't see bus #3" but just drops the whole RAID set, you don't realize at first that you've lost a SCSI bus. It takes probing each drive using the menus to realize that three are offline, and that they're all on the same bus.
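
If you do end up probing drive by drive, even a scrap of bookkeeping makes that pattern jump out. A toy sketch in Python (the drive list is invented):

    # Group offline drives by bus to spot a whole-bus failure.
    from collections import defaultdict

    drives = [
        {"id": 0, "bus": 1, "status": "online"},
        {"id": 1, "bus": 3, "status": "offline"},
        {"id": 2, "bus": 2, "status": "online"},
        {"id": 3, "bus": 3, "status": "offline"},
        {"id": 4, "bus": 3, "status": "offline"},
    ]

    offline_by_bus = defaultdict(list)
    for d in drives:
        if d["status"] != "online":
            offline_by_bus[d["bus"]].append(d["id"])

    for bus, ids in offline_by_bus.items():
        if len(ids) > 1:
            print(f"bus {bus}: drives {ids} all offline -- suspect the bus, not the drives")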

Ciprico 6700

After a power hit, we had a Ciprico 6700 come back with a dead drive. The rebuild ran into problems, saying that another drive was having lots of errors. Ultimately we ended up just telling it to continue the rebuild; unfortunately, the Ciprico GUI popped up a little "Continue? Yes/No" error box for every error. We had to click Yes every few seconds, in shifts around the clock, for a couple of days. However, it did rebuild. We then marked the second drive as bad, pulled it, and rebuilt onto its replacement without incident.

Network Appliance

These are nice boxes -- when a drive fails, they light up an obvious "failed" LED, and even send email to Network Appliance (if you have them set up to do so)!

Watch out for the environmental monitoring units in the "new" trays in the F540 (and probably other) models. If they are not hooked up right (I didn't do it!), the wrong "failed drive" red LED will light, leading you to pull the wrong drive. If you do this during the rebuild, bingo! double fault. In my case, shutting down, replacing the wrongly pulled drive, and rebooting didn't fix it; the original failed drive was by then whacking out its bus in some way, and the whole RAID set went away. I then pulled the (real) original failed drive, rebooted, and the RAID set came back.

On a NetApp F330, I've also seen a tray failure lead to a high disk failure rate, indeed, failing drives so fast that each rebuild could barely finish. We ended up copying data off rather frantically, and getting NetApp to fly us a replacement tray. It never actually double-faulted, but it came close. It was an example of how record-keeping helps, because after about the third failure, we were seeing the all-in-one-tray pattern.

Sun SSA-210

Your basic tray failure. It nailed one drive in each half of a mirror, which wasn't good, plus a single-drive filesystem to boot. After a tray replacement, only the first "failed" drive was still marked bad. I did have to call Sun with a question about software configuration; the Veritas volume manager showed one of the spares in a bit of an odd state, but it was actually fine.

Conclusion

I want to emphasize a few things. First and foremost is backups. We actually had backups in almost all cases I've cited, though sometimes a week or more old. Nonetheless, when you're looking at 250GB of dead RAID, it is unbelievably comforting to know that if you aren't successful at resuscitating it, then after considerable spinning of tapes, your users' data will be back.

Second, don't take too much hope from this document. If your RAID claims to be double-faulted or dead, I suspect it probably is. I claim only extreme luck, not skill. Even if you appear to have revived it, your filesystem or data may be corrupt.

Realize also that none of these procedures are approved by the manufacturers. They are smart people; listen to them. If you have support on your hardware, call them. Don't start screwing around with this stuff unless you absolutely have to AND you have given up on the data. You are very likely to make things worse, and possibly cause hardware damage.

Good recordkeeping is critical. Always make notes about drive failures: which controller and ID they are, and in which tray. Look out for patterns. I have seen a tray failure start slow and get worse, and noticing the pattern made all the difference. We like to mark the current spare drive(s) with a small sticker, and when a drive fails, to mark it "failed" as well (while rebuilding), and double-check it before pulling it. It's often possible to help verify which drive failed by watching the lights during the rebuild, since all the live data disks and one spare will be active, and one data (now failed) drive will be dark. If this quiet drive isn't the one you think failed, check again.
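
Any format for the notes will do; here is a toy Python sketch of the kind of log and pattern check I mean (the fields and entries are made up):

    # Toy failure log: record the date, controller, drive ID, and tray for
    # each failure, then flag trays that keep showing up.
    from collections import Counter
    from datetime import date

    failures = [
        {"when": date(1998, 3, 2),  "controller": "A", "id": 5,  "tray": 2},
        {"when": date(1998, 3, 9),  "controller": "A", "id": 8,  "tray": 2},
        {"when": date(1998, 3, 11), "controller": "A", "id": 11, "tray": 2},
    ]

    tray_counts = Counter(f["tray"] for f in failures)
    for tray, count in tray_counts.items():
        if count >= 3:
            print(f"tray {tray} has {count} failures in the log -- suspect the tray, not the drives")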

Keep the manuals for all your RAIDs. Keep the service telephone numbers in a findable place, and make sure you have whatever serial numbers or PO numbers you might need to make a service call.

When you're talking to service reps on the phone, take it easy on them. A double fault is a panic-inducing situation, and you are likely to be impaired both in judgement and in interpersonal skills. That is not only hard on other people, it imperils your data. Calm down. Turn off the audible alarm, step outside the noisy computer room, and take a breath. If you need to eat something, do so -- ten extra minutes without their probably-dead data will not kill your users at this point, and real food can really clear your head. Unfortunately, management sometimes seems to prefer frantic running-around during a crisis to seeing you sit down and eat a sandwich with the RAID controller manual propped up on the table! ("Why don't you already know how to do this?") I have no FAQ about how to fix management!

I make no warranty as to the accuracy or helpfulness of this document. In fact, since it's written from my memory of these events, I want to emphasize that it is probably somewhat inaccurate -- these were ten very stressful situations for me! I have deliberately *not* attempted to document the RAID controllers or their interfaces, just to indicate the general situations I encountered and approximately what I tried. You are entirely responsible for your actions. These are extremely dangerous procedures, to be attempted only in extreme situations.

Good luck!

My thanks to J and to M; you know who you are.