ext3 file system I/O blocks until reboot
Hi all,
We have a server that has a 580GB ext3 file system on it. Until recently we ran around 15 virtual servers from this file system. It was fine for at least a few months, then the file system would periodically become inaccessible, getting more frequent as time went on. Eventually we wouldn't even get through a 15-hour period without having to reboot the server.

When the I/O got blocked, all processes accessing files on /var/lib/vservers (its mount point) would get stuck waiting for I/O to complete ("D" state) and I couldn't find any way to revive it apart from rebooting the server. I tried sending various signals (TERM and KILL) to some kernel threads but that didn't help at all. The "kjournald" process also got stuck in the "D" state.

The server is running kernel 2.6.22.19 with the Linux-Vserver patch vs2.2.0.7, DRBD 8.2.6 and the Areca RAID driver updated to 1.20.0X.15-80603, which was the latest available from Areca at the time. The OS is Debian etch.

As part of troubleshooting the problem I'd taken DRBD out of the mix, tried updating the RAID driver in the kernel, replaced the RAID card with another one with slightly later firmware, and also replaced the power supply with a known-good one at the same time and disabled the swap space. None of that helped.

What did help was copying the files from the existing file system to a newly formatted ext3 file system. The newly formatted file system is only around 320GB, but is also set up the same as the existing one (both are hardware RAID-6, running on the same host, same controller, same physical disks, etc).

When the file system would become inaccessible, there were no notices from the kernel about any issue at all. We have a serial console on this server and nothing was captured by the serial console when this happened, nor is there anything in the system logs (which should have been writable all this time as they are not on the broken file system).
I used 'dd' to check if I could read from the underlying device files that the file system was on (/dev/sdc1 and /dev/drbd1), and there was no problem doing that. I didn't test writes to these devices since I don't know of any safe way to do so, but using the SysRq feature, an emergency sync would not complete, nor would an emergency umount, so I assume writes were out of the question. Doing an 'ls' on /var/lib/vservers just left me with yet another process stuck in the "D" state.

A forced fsck of the file system (using a fresh build of e2fsprogs 1.41.3 with the matching libraries) provides no hint of any problems.

The root file system is an ext3 file system as well, and there were no problems reading/writing to that file system while the ext3 file system on /var/lib/vservers was inaccessible. The root file system is also on the same RAID card, physical disks, etc.

One reason I've not moved to a newer kernel yet is that there isn't a stable linux-vserver patch for anything newer than 2.6.22.19, so I'm kind of stuck with that kernel until there is. I made a start on backporting the ext3 code from 2.6.26.5 to 2.6.22.19 but it's not something I trust myself to get right, so I'd rather avoid that approach unless there is another way of doing that.

So my questions are:

Are there any further diagnostics I can perform on the old file system to try and track down the problem? If so, what are they?

Is this a known bug/problem with ext3 or something related to it?

Is it likely that one of the 3 or so deadlocks that have been fixed in kernels since 2.6.22.19 would have cured this problem, or would these deadlocks have taken down the whole box and not just affected the one file system?

Or even this bug: http://bugzilla.kernel.org/show_bug.cgi?id=10882 (the softlockup part; I think not though, because I was able to copy everything off that file system and on to a new one without having any lockups or any other complaints from the kernel).

Thanks.

--
Regards,
Robert Davidson.
Obsidian Consulting Group.
Ph. 03-9355-7844
E-Mail: support@obsidian.com.au

_______________________________________________
Ext3-users mailing list
Ext3-users@redhat.com
https://www.redhat.com/mailman/listinfo/ext3-users
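As one possible extra diagnostic for the "D" state symptom described above, here is a minimal Python sketch that lists processes in uninterruptible sleep along with the kernel symbol they are waiting in. It is only an illustration, not something from the thread: the function names are made up for this example, it assumes a Linux /proc that exposes /proc/<pid>/stat and /proc/<pid>/wchan (wchan may be empty or unreadable depending on kernel configuration and privileges), and it looks at processes only, not individual threads.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """Yield (pid, comm, wchan) for each process in state 'D'."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return  # no /proc here (not Linux, or not mounted)
    for entry in sorted(entries):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self'
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or stat was unreadable
        if state != "D":
            continue
        try:
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip() or "?"
        except OSError:
            wchan = "?"  # assumption: wchan may not be available
        yield pid, comm, wchan


if __name__ == "__main__":
    for pid, comm, wchan in blocked_tasks():
        print(f"{pid:>7}  {comm:<20}  {wchan}")
```

Run while the file system is hung, the wchan column might hint at where kjournald and the other stuck processes are sleeping. Reading /proc does not touch /var/lib/vservers, so the script itself should not add to the pile of blocked processes.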
ext3 file system I/O blocks until reboot
On Mon, Oct 20, 2008 at 13:35:54 +1100, Robert Davidson <rdavidson@obsidian.com.au> wrote:
>
> So my questions are:
>
> Are there any further diagnostics I can perform on the old file system
> to try and track down the problem? If so, what are they?
>
> Is this a known bug/problem with ext3 or something related to it?

I saw stuff like this happening starting with later 2.6.20 kernels that wasn't fixed until the 2.6.24 kernels. (See bug 235043.) I wasn't using VM's, so it might not be the same as the bug you are seeing. I do remember seeing some other similar problems people were having that didn't appear to be the same bug as I had when I did bugzilla searches. So you might want to do your own bugzilla search to see what you can find.

I have also been getting disk IO lockups in F10, but in a more limited set of circumstances. (Memory pressure on an X86_64 system.)
ext3 file system I/O blocks until reboot
Bruno Wolff III wrote:
> I saw stuff like this happening starting with later 2.6.20 kernels that
> wasn't fixed until the 2.6.24 kernels. (See bug 235043.) I wasn't using
> VM's, so it might not be the same as the bug you are seeing. I do remember
> seeing some other similar problems people were having that didn't appear
> to be the same bug as I had when I did bugzilla searches. So you might
> want to do your own bugzilla search to see what you can find.
>
> I have also been getting disk IO lockups in F10, but in a more limited set
> of circumstances. (Memory pressure on an X86_64 system.)
>

Hi Bruno,

I've had a look through bugzilla but couldn't find any similar bugs (the closest I can find is 439548 but I doubt very much that that's it). Your bug 235043 does sound rather different, since it sounds like new processes would be able to access the file system without a problem, whereas on my system any new attempt to read (writing wasn't tested) just resulted in one more process stuck in the "D" state.

I might try taking a byte-for-byte copy of the FS and see if I can find a way to reliably reproduce the problem on a similar server.

--
Regards,
Robert Davidson.
Obsidian Consulting Group.
Ph. 03-9355-7844
E-Mail: support@obsidian.com.au
ext3 file system I/O blocks until reboot
On Tue, Oct 21, 2008 at 11:40:06 +1100, Robert Davidson <rdavidson@obsidian.com.au> wrote:
>
> I've had a look through bugzilla but couldn't find any similar bugs (the
> closest I can find is 439548 but I doubt very much that that's it). Your
> bug 235043 does sound rather different since it sounds like new
> processes would be able to access the file system without a problem,
> whereas on my system any new attempt to read (writing wasn't tested)
> just resulted in one more process stuck in the "D" state.

For a while. Eventually everything would lock up.
ext3 file system I/O blocks until reboot
Probably too late anyway, but:

On Mon, 20 Oct 2008, Robert Davidson wrote:
> The "kjournald" process also got stuck in the "D" state.

Did you try a SysRq-w to show all blocked tasks? Or even -d, or -t. You mentioned /var/log was on a different filesystem, so this information might make it to the disks. If not, your serial console should catch it. Maybe then we'll find out *why* these processes are in "D" state.

Christian.
--
BOFH excuse #25: Decreasing electron flux
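For cases where sending the key combination over a serial console is awkward, the same SysRq requests can usually be issued from a root shell by writing the key letter to /proc/sysrq-trigger. The following Python sketch illustrates that; it assumes a kernel built with magic SysRq support that exposes /proc/sysrq-trigger, and the helper name `trigger_sysrq` and the allow-list are inventions for this example. Only the three keys suggested above are accepted, so a typo cannot trigger something destructive such as an immediate reboot.

```python
import sys

SYSRQ_TRIGGER = "/proc/sysrq-trigger"

# Keys suggested in the post above, with the meaning given in the kernel's
# SysRq documentation. Anything else is refused by this sketch.
SYSRQ_KEYS = {
    "w": "dump tasks in uninterruptible (blocked) state",
    "t": "dump all current tasks and their information",
    "d": "show all locks that are held (needs lock debugging support)",
}


def trigger_sysrq(key, path=SYSRQ_TRIGGER):
    """Write a single SysRq key to the trigger file (requires root)."""
    if key not in SYSRQ_KEYS:
        raise ValueError(f"refusing SysRq key {key!r}; allowed: {sorted(SYSRQ_KEYS)}")
    with open(path, "w") as f:
        f.write(key)


if __name__ == "__main__":
    # Usage (as root): python sysrq.py w
    for key in sys.argv[1:]:
        print(f"SysRq-{key}: {SYSRQ_KEYS.get(key, 'not allowed by this script')}")
        trigger_sysrq(key)
```

The output goes to the kernel log rather than to the script, so it would need to be collected from the serial console or from `dmesg` afterwards.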
ext3 file system I/O blocks until reboot
Christian Kujau wrote:
> Probably too late anyway, but:
>
> On Mon, 20 Oct 2008, Robert Davidson wrote:
>> The "kjournald" process also got stuck in the "D" state.
>
> Did you try a SysRq-w to show all blocked tasks? Or even -d, or -t.
> You mentioned /var/log was on a different filesystem, so this
> information might make it to the disks. If not, your serial console
> should catch it. Maybe then we'll find out *why* these processes are in
> "D" state.

Hi Christian,

Not too late - this is an ongoing problem still. I'm currently trying to see if I can get some newer vserver patches so I can build a newer kernel and try that. Currently I'm stuck with 2.6.22.19.

I've tried doing various SysRq requests. None of them would give me anything back on the serial console, but it seems that may have been my own fault for having the console logging set too low. I've fixed that up now. In any case, the responses you'd expect to see from the kernel for the various SysRq commands never made it into the logs.

About a month ago when the server last had problems, I made a new ext3 filesystem and copied everything from the old filesystem to the new one. I thought that worked, but then last night we lost the same filesystem again and had to reboot. After copying everything off the original filesystem (also ext3) I ran a forced fsck.ext3 on it and it didn't find any problems.

--
Regards,
Robert Davidson.
Obsidian Consulting Group.
Ph. 03-9355-7844
E-Mail: support@obsidian.com.au
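On the point about console logging being set too low: the current console log level can be read from /proc/sys/kernel/printk, which holds four integers. A small Python sketch of how one might check it follows. The function names are hypothetical, and the assumption that SysRq task dumps are logged at the "info" level (6) is mine rather than something stated in the thread, so treat the threshold as illustrative.

```python
PRINTK_PATH = "/proc/sys/kernel/printk"

# Field order as described in the kernel's sysctl documentation.
PRINTK_FIELDS = (
    "console_loglevel",
    "default_message_loglevel",
    "minimum_console_loglevel",
    "default_console_loglevel",
)

KERN_INFO = 6  # assumption: level at which the SysRq dumps are logged


def parse_printk(text):
    """Parse the contents of /proc/sys/kernel/printk into a dict."""
    values = [int(v) for v in text.split()]
    if len(values) != len(PRINTK_FIELDS):
        raise ValueError(f"expected 4 integers, got {values!r}")
    return dict(zip(PRINTK_FIELDS, values))


def reaches_console(message_level, console_loglevel):
    """A message is shown on the console only if its level is numerically
    lower (higher priority) than console_loglevel."""
    return message_level < console_loglevel


if __name__ == "__main__":
    try:
        with open(PRINTK_PATH) as f:
            levels = parse_printk(f.read())
    except OSError:
        print(f"cannot read {PRINTK_PATH}")
    else:
        shown = reaches_console(KERN_INFO, levels["console_loglevel"])
        print(levels)
        print("info-level messages reach the console:", shown)
```

This only explains missing output on the serial console. Messages filtered from the console are normally still recorded in the kernel log buffer, so it would not by itself explain the SysRq responses being absent from the system logs as well.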