delphij's Chaos


16 Mar 2005

RAID-5 Disaster on InfoTrend-based TOYOU SATA disk array

And finally, it happened.

On Monday, one of our servers, running FreeBSD, panicked with “ffs_clusteralloc: map mismatch”. This message is quite uncommon and should only appear when something has gone wrong with the memory or the hard disk.

I suggested that our administrator reboot and run “fsck” on the volume, since it has a very small block size and cannot be checked in the background (I have made a change to FreeBSD to warn about this, and it was MFC’ed to 5-STABLE, so you will get the feature in 5.4-RELEASE). Unfortunately, the check was so slow that we started to suspect a hardware problem.

Then, it happened.

The SATA-based InfoTrend firmware did not drop the bad disk. Instead, it attempted to reallocate the bad blocks so that they were replaced with good ones. Unfortunately, our SNMP monitoring system was not watching for SNMP trap messages, so we got no warning about the disk event.

I have decided to move the data from the bad array to a good array. The lessons we have learned from this are:

  1. InfoTrend’s SCSI series can drop disks remotely, but the SATA ones cannot; the only way is to physically pull the disk out.
  2. Because the disk array tends to relocate blocks (on the SCSI series, the behavior is to drop the disk immediately, so you are relatively safe from getting bad data), one *MUST* rely on SNMP traps and not only on SNMP watches.
  3. Recovering data with “Clone” is not wise. We spent about 2 hours cloning bad data, and we are still trying to save the data, since the cloned data is not guaranteed to be correct. The correct solution is to remove the bad disk and rebuild the RAID.
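
To illustrate point 2: a watch only sees what it polls for, while a trap is pushed by the array when something happens, so something has to be listening for it. Below is a minimal sketch of such a listener in Python, assuming the array is configured to send traps to this host on the standard SNMP trap port, UDP 162 (binding to it usually needs root). The function names are made up for illustration, and the sketch only records that a datagram arrived; decoding the trap contents is left to a real SNMP library or trap daemon.

```python
import socket
import time


def open_trap_socket(host="0.0.0.0", port=162):
    """Open a UDP socket on the SNMP trap port (162 by convention)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def listen_for_traps(sock, handle, max_packets=None):
    """Call handle(sender_ip, payload) for every datagram received on sock.

    Runs forever unless max_packets is given.
    """
    seen = 0
    while max_packets is None or seen < max_packets:
        payload, (sender_ip, _port) = sock.recvfrom(65535)
        handle(sender_ip, payload)
        seen += 1


def log_trap(sender_ip, payload):
    """Hypothetical handler: only note that something arrived, undecoded."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    print(f"{stamp} trap from {sender_ip}: {len(payload)} bytes")


if __name__ == "__main__":
    listen_for_traps(open_trap_socket(), log_trap)
```

In practice the handler would page someone or feed the monitoring system instead of printing, but even a bare log line like this would have shown that the array was reporting disk events.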

With inexpensive hardware, one must be more careful :-)