I recently posted the below message to linux-raid, but perhaps it should
have gone here first... Perhaps Neil Brown will have some bright ideas.
Common factors on all three pieces of hardware seeing the problem seem
to have been:
Lenny 2.6.26 kernel
md with lvm snapshots
4 or more cores
I have a box with a relatively simple setup:
sda + sdb are 1TB SATA drives attached to an Intel ICH10.
Three partitions on each drive, three md raid1s built on top of these:
md2 LVM PV
During a resync about a week ago, processes seemed to deadlock on I/O;
the machine was still alive, but with a load of 100+. A USB drive
happened to be mounted, so I managed to save /var/log/kern.log. At the
time of the problem, the monthly RAID check was in progress. On reboot,
a rebuild commenced, and the same deadlock seemed to occur between
roughly 2 minutes and 15 minutes after boot.
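
For anyone wanting to confirm that a load of 100+ really is tasks stuck
on I/O, here is a minimal Python sketch that lists processes in
uninterruptible sleep (state 'D') by reading /proc/<pid>/stat. The
function names parse_stat and blocked_tasks are my own, and the
proc_root parameter exists only so the code can be pointed at a test
directory. It looks at processes only, not individual threads.

```python
import os

def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state

def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for processes in uninterruptible sleep (state 'D')."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the file was unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)

if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

A long list that never shrinks between runs would be consistent with the
deadlock described above, though a busy but healthy box can also show a
few 'D' tasks at any instant.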
At that point, the server was running on a Dell PE R300 (12G RAM,
quad-core), with an LSI SAS controller and 2x 500G SATA drives. I
shifted all the data onto a spare box (Dell PE R210, ICH10R, 1x1TB
drive, 8G RAM, quad-core+HT). As it had only a single drive, I created
the md RAID1s with just a single drive in each. The original box was
taken offline with the idea of debugging it "soon".
This morning, I added in a second 1TB drive, and during the resync
(approx 1 hour in), the deadlock occurred again. The resync had
stopped, and any attempt to write to md2 would deadlock the process in
question. I think it was doing an rsnapshot backup to a USB drive at the
time the initial problem occurred - this creates an LVM snapshot device
on top of md2 for the duration of the backup for each filesystem backed
up (there are two at the moment), and I suppose this results in lots of
read-copy-update operations - the mounting of the snapshots shows up in
the logs as the fs-mounts, and subsequent orphan_cleanups. As the
snapshot survives the reboot, I assume this is what triggers the
subsequent lockup after the machine has rebooted.
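
To tell a stopped resync from a merely slow one, one option is to
compare two readings of /proc/mdstat taken some time apart. The sketch
below is only an illustration: the names sync_progress and
stalled_arrays are made up here, and the regular expression assumes the
usual progress line of the form "resync = 8.5% (done/total)".

```python
import re

# An array stanza starts with a line like "md2 : active raid1 sdb3[1] sda3[0]".
ARRAY_RE = re.compile(r"^(md\S+)\s*:")
# Progress line, e.g. "[=>....]  resync =  8.5% (85/1000) finish=... speed=...".
SYNC_RE = re.compile(
    r"(resync|recovery|check|reshape)\s*=\s*([\d.]+)%\s*\((\d+)/(\d+)\)"
)

def sync_progress(mdstat_text):
    """Map array name -> (action, percent, blocks_done, blocks_total)."""
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        array = ARRAY_RE.match(line)
        if array:
            current = array.group(1)
            continue
        sync = SYNC_RE.search(line)
        if sync and current:
            progress[current] = (
                sync.group(1),
                float(sync.group(2)),
                int(sync.group(3)),
                int(sync.group(4)),
            )
    return progress

def stalled_arrays(before, after):
    """Arrays whose blocks_done counter did not move between two readings."""
    return sorted(
        name
        for name, (_, _, done, _) in after.items()
        if name in before and before[name][2] == done
    )
```

Reading /proc/mdstat, waiting a minute, reading it again and passing
both results to stalled_arrays should return ['md2'] in the situation
described above, assuming the counters really have stopped advancing.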
I got a couple of 'echo w > /proc/sysrq-trigger' sets of output this
time... Edited copies of kern.log are attached - looks like it's
barrier-related. I'd guess the combination of the LVM CoW snapshot and
the RAID resync is tickling this bug.
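
In case it helps anyone reproduce this, the same capture can be
scripted. This is a rough Python sketch, not anything I ran at the time:
trigger_sysrq and blocked_state_dumps are made-up names, writing to
/proc/sysrq-trigger needs root, and the marker string is my assumption
about the header line the kernel logs before a blocked-task dump.

```python
def trigger_sysrq(command="w", path="/proc/sysrq-trigger"):
    """Write a single SysRq command character ('w' = show blocked tasks)."""
    if len(command) != 1:
        raise ValueError("SysRq commands are a single character")
    with open(path, "w") as fh:
        fh.write(command)

def blocked_state_dumps(log_text, marker="SysRq : Show Blocked State"):
    """Split a kernel log into dumps, each starting at a marker line.

    The marker default is an assumption; adjust it to match kern.log.
    """
    dumps = []
    for line in log_text.splitlines():
        if marker in line:
            dumps.append([line])      # start a new dump
        elif dumps:
            dumps[-1].append(line)    # everything up to the next marker
    return dumps

if __name__ == "__main__":
    trigger_sysrq("w")
    with open("/var/log/kern.log") as fh:
        for dump in blocked_state_dumps(fh.read()):
            print("\n".join(dump))
            print("-" * 40)
```

Each dump simply runs until the next marker or the end of the log, so
unrelated kernel messages logged afterwards will be included and need
trimming by hand.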
Any thoughts? Maybe this is related to Debian bug #584881 -
... since the kernel is essentially the same.
I can do some debugging on this out-of-office-hours, or can probably
resurrect the original hardware to debug that too.
I think vger binned the first version of this email (with the logs
attached) - so apologies if you've ended up with two copies of this email...
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53 http://seoss.co.uk/ +44-(0)1273-808309