crash and data loss on supermicro x7dwn+/xeon5420/adaptec 52445/lvm/ext4
I reported this as a bug to bugzilla.kernel.org as #16081, but since
we're running Debian I thought asking around for help and discussion
would be advisable.
The current situation is:
We have a Supermicro X7DWN+ board (Intel 5400 chipset, Xeon 5420 CPU, 8GB RAM)
with 24 1TB SATA disks attached to an Adaptec 52445 controller and a
Tandberg tape library attached to an LSI SAS1068E SAS controller. The
system runs on Lenny with a recompiled (no options changed) Linux
We use Bacula 5.0.2 (backported to Lenny) as our backup software, and so
far it works quite well. The only problem is that after writing around 10TiB
of data to the disks, the machine crashes. This has happened twice, and
after the second time both filesystems containing the backup-diskpool
(9TiB LVM-Volumes with ext4 filesystems) were completely garbled. One fs
now looks like this:
The other one is not mountable anymore:
[88397.252831] EXT4-fs (dm-1): ext4_check_descriptors: Checksum for
group 1 failed (49189!=48621)
[88397.252856] EXT4-fs (dm-1): group descriptors corrupted!
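In case it helps anyone search their own kernel logs for the same failure, here is a minimal Python sketch that extracts the device, group number and the two checksum values from messages like the one above. The function name and the pattern are my own assumptions, derived only from the two log lines quoted here; other kernel versions may word the message differently.

```python
import re

# Pattern derived from the message quoted above (an assumption, not taken
# from the kernel source). \s+ tolerates the line wrap added by mail clients.
CHECKSUM_RE = re.compile(
    r"EXT4-fs \((?P<dev>[^)]+)\): ext4_check_descriptors: Checksum for\s+"
    r"group (?P<group>\d+) failed \((?P<left>\d+)!=(?P<right>\d+)\)"
)


def find_checksum_failures(log_text):
    """Return (device, group, left_value, right_value) for each match."""
    return [
        (m.group("dev"), int(m.group("group")),
         int(m.group("left")), int(m.group("right")))
        for m in CHECKSUM_RE.finditer(log_text)
    ]


sample = """[88397.252831] EXT4-fs (dm-1): ext4_check_descriptors: Checksum for
group 1 failed (49189!=48621)
[88397.252856] EXT4-fs (dm-1): group descriptors corrupted!"""

for failure in find_checksum_failures(sample):
    print(failure)
```

Feeding it the complete dmesg output would show whether only group 1 is affected or whether many groups fail in the same way.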
One thing to note is that with Supermicro's current BIOS 1.2b for this
board, the machine crashes with an MCE (machine check exception) after a
fair amount of network and disk I/O (around 2-5TiB, I believe). This does
not happen with their BIOS version 1.1b, which is installed at the moment.
I'm at a loss here, as I really don't know what's causing these crashes
and also don't really know how I can debug this any further. Does
anybody have any hints for me?
memtest86 runs fine for hours, by the way, and the machine doesn't have
heat problems (at least the IPMI-console doesn't say so, and the fans
are all fine).
More info on the system can be found at