3ware 16 port RAID-5 with 300GB Western Digital


I am running a 3ware 9505 16-port card with 12 Western Digital 300 GB drives. In /var/log/messages I found:

Nov 4 10:48:53 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x23268EDA.
Nov 4 10:50:19 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x21D88B0F.
Nov 4 10:51:27 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x1E71DB4A.
Nov 4 10:51:33 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x23289645.
Nov 4 11:02:03 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x2111C31E.
Nov 4 11:06:00 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x2219A8F3.
Nov 4 11:08:52 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x1437A499.
Nov 4 11:09:05 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x23455701.
Nov 4 11:09:10 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x23455749.
Nov 4 11:10:02 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x241F28D3.
Nov 4 11:11:54 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x20B6CCC9.
Nov 4 11:12:13 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x22277DFD.
Nov 4 11:12:13 her2 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x22277D80.
Nov 4 11:12:13 her2 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit:unit=0, port=11.
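
For a longer log it can help to count the sector repairs per port and to convert the hex LBAs to decimal, so they can be compared with what smartctl reports. The following is only a minimal sketch that assumes the AEN line layout quoted above; the names parse_aen and summarize are made up for this example and are not part of any 3ware tool.

```python
import re
from collections import Counter

# Matches the AEN part of a 3w-9xxx kernel message, e.g.
# "AEN: WARNING (0x04:0x0023): Sector repair completed:port=11, LBA=0x23268EDA."
AEN_RE = re.compile(
    r"AEN: (?P<level>\w+) \((?P<code>0x[0-9A-Fa-f]+:0x[0-9A-Fa-f]+)\): "
    r"(?P<text>[^:]+):(?P<detail>.*)$"
)


def parse_aen(line):
    """Return a dict describing one AEN line, or None if the line does not match."""
    match = AEN_RE.search(line.strip())
    if match is None:
        return None
    # The detail part looks like "port=11, LBA=0x23268EDA." or "unit=0, port=11."
    fields = {}
    for part in match.group("detail").rstrip(".").split(","):
        if "=" in part:
            key, value = part.strip().split("=", 1)
            fields[key] = value
    return {
        "level": match.group("level"),
        "code": match.group("code"),
        "text": match.group("text").strip(),
        "port": int(fields["port"]) if "port" in fields else None,
        # int(..., 16) accepts the leading "0x"
        "lba": int(fields["LBA"], 16) if "LBA" in fields else None,
    }


def summarize(lines):
    """Count sector repairs per port and collect the ports of degraded units."""
    repairs = Counter()
    degraded_ports = []
    for line in lines:
        event = parse_aen(line)
        if event is None:
            continue
        if event["text"] == "Sector repair completed":
            repairs[event["port"]] += 1
        elif event["text"] == "Degraded unit":
            degraded_ports.append(event["port"])
    return repairs, degraded_ports


if __name__ == "__main__":
    with open("/var/log/messages") as log:
        repairs, degraded_ports = summarize(log)
    print("sector repairs per port:", dict(repairs))
    print("ports of degraded units:", degraded_ports)
```

A burst of repairs on a single port shortly before a "Degraded unit" message, as in the excerpt above, points at one drive rather than at the controller or cabling in general.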

In tw_cli, the drive on port 11 was reported as failed:

p11 DEVICE-ERROR u0 298.09 GB 625142448 WD-WCAPD3118453

I tried to test the drive via

/usr/sbin/smartctl -t long -d 3ware,11 /dev/twa0
/usr/sbin/smartctl -t offline -d 3ware,11 /dev/twa0
/usr/sbin/smartctl -t conveyance -d 3ware,11 /dev/twa0
/usr/sbin/smartctl -t short -d 3ware,11 /dev/twa0

But

/usr/sbin/smartctl -a -d 3ware,11 /dev/twa0

did not show many good signs:

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 6702 396331771
# 2 Conveyance offline Completed: read failure 90% 6702 396331771
# 3 Extended offline Completed: read failure 90% 6695 396331771
# 4 Extended offline Completed: read failure 90% 6695 396331771
# 5 Extended offline Completed: read failure 90% 6694 396331773
# 6 Extended offline Completed: read failure 70% 6684 99434677

The values of attributes such as Multi_Zone_Error_Rate, Offline_Uncorrectable and Current_Pending_Sector did not look good either.
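
To see at a glance where the self-tests failed, the self-test log shown earlier can be reduced to the distinct failing LBAs. This is a rough sketch, assuming the column layout printed above and handling only entries whose status starts with "Completed"; the helper names parse_selftest_line and failing_lbas are invented for illustration.

```python
import re

# One entry of the smartctl self-test log, e.g.
# "# 1 Short offline Completed: read failure 90% 6702 396331771"
SELFTEST_RE = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>.+?)\s+(?P<status>Completed.*?)\s+"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def parse_selftest_line(line):
    """Return a dict for one self-test log entry, or None if the line does not match."""
    match = SELFTEST_RE.match(line.strip())
    if match is None:
        return None
    lba = match.group("lba")
    return {
        "num": int(match.group("num")),
        "test": match.group("test"),
        "status": match.group("status"),
        "remaining_percent": int(match.group("remaining")),
        "lifetime_hours": int(match.group("hours")),
        # Entries without an error carry "-" in the LBA column.
        "lba_of_first_error": int(lba) if lba.isdigit() else None,
    }


def failing_lbas(lines):
    """Return the sorted distinct LBAs at which a self-test hit a read failure."""
    lbas = set()
    for line in lines:
        entry = parse_selftest_line(line)
        if entry is None or "read failure" not in entry["status"]:
            continue
        if entry["lba_of_first_error"] is not None:
            lbas.add(entry["lba_of_first_error"])
    return sorted(lbas)


if __name__ == "__main__":
    log = [
        "# 1 Short offline Completed: read failure 90% 6702 396331771",
        "# 5 Extended offline Completed: read failure 90% 6694 396331773",
        "# 6 Extended offline Completed: read failure 70% 6684 99434677",
    ]
    print(failing_lbas(log))
```

For the log above this yields three distinct LBAs, two of them only two sectors apart, which suggests at least two separate bad regions on the disk.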

In tw_cli I then removed the drive in question from the unit:

maint remove c0 p11

The smartctl tests still failed right away. I rescanned the drives in tw_cli:

maint rescan c0

The failed drive was found, and soon after the 3ware controller grabbed it automatically and started the rebuild.

It did so successfully. The Current_Pending_Sector value decreased back to 0, and the drive array seems to be functioning
normally right now.

During one of those pesky spurious rebuilds, which happen on both 9550SX-16ML controllers that I am aware
of, the drive failed again, this time with an ECC-ERROR. Apparently that was not enough of a failure to make the rebuild
fail. A

maint rescan c0

in tw_cli after the rebuild had finished cleared this error. It is noteworthy that the spurious rebuild slowed to a
crawl after the ECC-ERROR.

When I replaced the failed drive, things went back to normal, and the system has been fine ever since.