RAID Woes: A Tale of Sector Repairs and Drive Failures

I am running a 9505 16-port 3ware card with 12 Western Digital 300 GB drives. In /var/log/messages I found:
Nov 4 10:48:53 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x23268EDA.
Nov 4 10:50:19 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x21D88B0F.
Nov 4 10:51:27 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x1E71DB4A.
Nov 4 10:51:33 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x23289645.
Nov 4 11:02:03 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x2111C31E.
Nov 4 11:06:00 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x2219A8F3.
Nov 4 11:08:52 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x1437A499.
Nov 4 11:09:05 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x23455701.
Nov 4 11:09:10 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x23455749.
Nov 4 11:10:02 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x241F28D3.
Nov 4 11:11:54 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x20B6CCC9.
Nov 4 11:12:13 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x22277DFD.
Nov 4 11:12:13 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x22277D80.
Nov 4 11:12:13 her2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit:unit=0, port=11.
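
With this many repairs in under half an hour, it helps to see how they are spread across ports. Here is a minimal sketch that tallies the "Sector repair completed" messages per port. The function name count_sector_repairs is my own, and it assumes the log lines look exactly like the ones above.

```shell
# Count "Sector repair completed" AEN messages per port in a log file.
# Hypothetical helper; assumes the 3w-9xxx message format shown above.
# Usage: count_sector_repairs /var/log/messages
count_sector_repairs() {
    grep 'Sector repair completed' "$1" \
        | sed -n 's/.*port=\([0-9][0-9]*\).*/\1/p' \
        | sort -n | uniq -c \
        | awk '{ print "port " $2 ": " $1 " repairs" }'
}
```

A port that collects most of the repairs is the obvious candidate for a closer look with smartctl.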

In tw_cli, the drive on port 11 was reported as failed:
p11 DEVICE-ERROR u0 298.09 GB 625142448 WD-WCAPD3118453

I tried to test the drive via
/usr/sbin/smartctl -t long -d 3ware,11 /dev/twa0
/usr/sbin/smartctl -t offline -d 3ware,11 /dev/twa0
/usr/sbin/smartctl -t conveyance -d 3ware,11 /dev/twa0
/usr/sbin/smartctl -t short -d 3ware,11 /dev/twa0

But
/usr/sbin/smartctl -a -d 3ware,11 /dev/twa0

did not show many good signs:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure        90%             6702  396331771
# 2  Conveyance offline  Completed: read failure        90%             6702  396331771
# 3  Extended offline    Completed: read failure        90%             6695  396331771
# 4  Extended offline    Completed: read failure        90%             6695  396331771
# 5  Extended offline    Completed: read failure        90%             6694  396331773
# 6  Extended offline    Completed: read failure        70%             6684  99434677
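
The controller prints LBAs in hex while smartctl prints them in decimal, so the two logs are awkward to compare by eye. Here is a small sketch for converting between them and for estimating how far into the disk an LBA sits. The function names are my own. Whether the controller and the drive count sectors in the same address space is an assumption I have not verified.

```shell
# Controller AEN messages print LBAs in hex; smartctl prints them in decimal.
# Hypothetical helpers for comparing the two.
hex_to_dec() { printf '%d\n' "$1"; }     # hex_to_dec 0x23268EDA  prints 589729498
dec_to_hex() { printf '0x%X\n' "$1"; }   # dec_to_hex 396331771   prints 0x179F8AFB

# Rough position of an LBA on the disk as a whole-number percentage.
# $1 = LBA, $2 = total sector count (tw_cli reported 625142448 for this drive).
lba_percent() {
    echo $(( $1 * 100 / $2 ))
}
```

For example, lba_percent 0x23268EDA 625142448 prints 94, meaning the first repaired sector in the log lies near the end of the drive.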

Values like Multi_Zone_Error_Rate and Offline_Uncorrectable as well as Current_Pending_Sector promised nothing good.
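
To keep an eye on just those three attributes, the attribute table can be filtered. This is a sketch that assumes the usual smartctl -A layout, with the attribute name in the second column and the raw value in the tenth. The function name show_sector_attrs is made up.

```shell
# Print the name and raw value of the three attributes mentioned above.
# Assumes the standard `smartctl -A` table: name in column 2, raw value in column 10.
# Usage: /usr/sbin/smartctl -A -d 3ware,11 /dev/twa0 | show_sector_attrs
show_sector_attrs() {
    awk '$2 ~ /^(Current_Pending_Sector|Offline_Uncorrectable|Multi_Zone_Error_Rate)$/ { print $2, $10 }'
}
```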

In tw_cli I then removed the drive in question from the unit:
maint remove c0 p11

The smartctl tests still failed right away. I rescanned the drives in tw_cli:
maint rescan c0

The failed drive was found, and soon after the 3ware controller grabbed it automatically and started the rebuild.

It did so successfully. The Current_Pending_Sector value decreased back to 0, and the drive array seems to be functioning
normally right now.

During one of those pesky spurious rebuilds, which happen on both 9550SX-16ML controllers that I am aware
of, the drive failed again, this time with an ECC-ERROR. Apparently it was not enough of a failure to make the rebuild
fail. A
maint rescan c0
in tw_cli after the rebuild had finished cleared this error. It's noteworthy that the spurious rebuild slowed to a
crawl after the ECC-ERROR.

When I replaced the failed drive, things went back to normal, and the system has been fine ever since.