Questions regarding journal replay
Today, I had to uncleanly shutdown one of our machines due to an error
in 2.6.28.3. Durin the boot sequence, the ext4 partition /home experienced a journal replay. /home looks like this: /dev/mapper/volg1-logv1 on /home type ext4 (rw,noexec,nodev,noatime,errors=remount-ro) Filesystem Size Used Avail Use% Mounted on /dev/mapper/volg1-logv1 2,4T 1,4T 1022G 58% /home Filesystem Inodes IUsed IFree IUse% Mounted on /dev/mapper/volg1-logv1 19519488 8793310 10726178 46% /home The journal replay too quite a while. About 800 seconds. # dumpe2fs -h /dev/mapper/volg1-logv1 dumpe2fs 1.41.3 (12-Oct-2008) Filesystem volume name: <none> Last mounted on: <not available> Filesystem UUID: 032613d3-6035-4872-bc0a-11db92feec5e Filesystem magic number: 0xEF53 Filesystem revision #: 1 (dynamic) Filesystem features: has_journal resize_inode dir_index filetype needs_recovery extent sparse_super large_file uninit_bg Filesystem flags: signed_directory_hash Default mount options: (none) Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 19519488 Block count: 624605184 Reserved block count: 0 Free blocks: 267655114 Free inodes: 10726118 First block: 0 Block size: 4096 Fragment size: 4096 Reserved GDT blocks: 875 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 1024 Inode blocks per group: 32 Filesystem created: Tue May 8 21:04:31 2007 Last mount time: Thu Feb 5 11:08:27 2009 Last write time: Thu Feb 5 11:08:27 2009 Mount count: 12 Maximum mount count: -1 Last checked: Sat Dec 27 23:16:47 2008 Check interval: 0 (<none>) Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 128 Journal inode: 8 First orphan inode: 17529831 Default directory hash: tea Directory Hash Seed: 44337061-e542-44bb-afb9-40597ccf1c6d Journal backup: inode blocks Journal size: 128M Questions: ========== * Why does it take so long? * What happens during that time? * Is my journal maybe too big? -- Ralf Hildebrandt Ralf.Hildebrandt@charite.de Charite - Universitätsmedizin Berlin Tel. +49 (0)30-450 570-155 Geschäftsbereich IT | Abt. Netzwerk Fax. +49 (0)30-450 570-962 Hindenburgdamm 30 | 12200 Berlin _______________________________________________ Ext3-users mailing list Ext3-users@redhat.com https://www.redhat.com/mailman/listinfo/ext3-users |
Questions regarding journal replay
Yesterday Ralf Hildebrandt said:
The journal replay too quite a while. About 800 seconds. Were there any other background iops on the underlying volume devices? Like maybe raid reconstruction? ../C _______________________________________________ Ext3-users mailing list Ext3-users@redhat.com https://www.redhat.com/mailman/listinfo/ext3-users |
Questions regarding journal replay
* Curtis Doty <Curtis@GreenKey.net>:
> Yesterday Ralf Hildebrandt said: > >> The journal replay too quite a while. About 800 seconds. >> > > Were there any other background iops on the underlying volume > devices? Like maybe raid reconstruction? I don't think so. The machine never powered off... -- Ralf Hildebrandt Ralf.Hildebrandt@charite.de Charite - Universitätsmedizin Berlin Tel. +49 (0)30-450 570-155 Geschäftsbereich IT | Abt. Netzwerk Fax. +49 (0)30-450 570-962 Hindenburgdamm 30 | 12200 Berlin _______________________________________________ Ext3-users mailing list Ext3-users@redhat.com https://www.redhat.com/mailman/listinfo/ext3-users |
Questions regarding journal replay
* Ralf Hildebrandt <Ralf.Hildebrandt@charite.de>:
> * Curtis Doty <Curtis@GreenKey.net>: > > Yesterday Ralf Hildebrandt said: > > > >> The journal replay too quite a while. About 800 seconds. > >> > > > > Were there any other background iops on the underlying volume > > devices? Like maybe raid reconstruction? > > I don't think so. The machine never powered off... Again, 2.6.28.7 failed us and now we're encountering another journal replay. Taking ages. This sucks. Questions: How can I find out (during normal operation) HOW MUCH of the journal is actually in use? How can I resize the journal to be smaller, thus making a journal replay faster? -- Ralf Hildebrandt Ralf.Hildebrandt@charite.de Charite - Universitätsmedizin Berlin Tel. +49 (0)30-450 570-155 Geschäftsbereich IT | Abt. Netzwerk Fax. +49 (0)30-450 570-962 Hindenburgdamm 30 | 12200 Berlin _______________________________________________ Ext3-users mailing list Ext3-users@redhat.com https://www.redhat.com/mailman/listinfo/ext3-users |
Questions regarding journal replay
Ralf Hildebrandt wrote:
> * Ralf Hildebrandt <Ralf.Hildebrandt@charite.de>: >> * Curtis Doty <Curtis@GreenKey.net>: >>> Yesterday Ralf Hildebrandt said: >>> >>>> The journal replay too quite a while. About 800 seconds. >>>> >>> Were there any other background iops on the underlying volume >>> devices? Like maybe raid reconstruction? >> I don't think so. The machine never powered off... > > Again, 2.6.28.7 failed us and now we're encountering another journal > replay. Taking ages. This sucks. > > Questions: > > How can I find out (during normal operation) HOW MUCH of the > journal is actually in use? > > How can I resize the journal to be smaller, thus making a journal > replay faster? > It'd be better to get to the bottom of the problem ... maybe iostat while it's happening to see if IO is actually happening; run blktrace to see where IO is going, do a few sysrq-t's to see where threads are at, etc. Can you find a way to reproduce this at will? Journal replay should *never* take this long, AFAIK. -Eric _______________________________________________ Ext3-users mailing list Ext3-users@redhat.com https://www.redhat.com/mailman/listinfo/ext3-users |
Questions regarding journal replay
* Eric Sandeen <sandeen@redhat.com>:
> Ralf Hildebrandt wrote: > > * Ralf Hildebrandt <Ralf.Hildebrandt@charite.de>: > >> * Curtis Doty <Curtis@GreenKey.net>: > >>> Yesterday Ralf Hildebrandt said: > >>> > >>>> The journal replay too quite a while. About 800 seconds. > >>>> > >>> Were there any other background iops on the underlying volume > >>> devices? Like maybe raid reconstruction? > >> I don't think so. The machine never powered off... > > > > Again, 2.6.28.7 failed us and now we're encountering another journal > > replay. Taking ages. This sucks. > > > > Questions: > > > > How can I find out (during normal operation) HOW MUCH of the > > journal is actually in use? > > > > How can I resize the journal to be smaller, thus making a journal > > replay faster? > > > > It'd be better to get to the bottom of the problem ... maybe iostat > while it's happening to see if IO is actually happening; run blktrace to > see where IO is going, do a few sysrq-t's to see where threads are at, etc. We had 24GB of reading from the journal device (or 12GB if it's 512byte blocks). I wonder why? > Can you find a way to reproduce this at will? Yes. My users will kill me, though. > Journal replay should *never* take this long, AFAIK. Amen -- Ralf Hildebrandt Ralf.Hildebrandt@charite.de Charite - Universitätsmedizin Berlin Tel. +49 (0)30-450 570-155 Geschäftsbereich IT | Abt. Netzwerk Fax. +49 (0)30-450 570-962 Hindenburgdamm 30 | 12200 Berlin _______________________________________________ Ext3-users mailing list Ext3-users@redhat.com https://www.redhat.com/mailman/listinfo/ext3-users |
Questions regarding journal replay
On Wed, Feb 25, 2009 at 10:31:42AM -0600, Eric Sandeen wrote:
> > It'd be better to get to the bottom of the problem ... maybe iostat > while it's happening to see if IO is actually happening; run blktrace to > see where IO is going, do a few sysrq-t's to see where threads are at, etc. > > Can you find a way to reproduce this at will? > > Journal replay should *never* take this long, AFAIK. Indeed. The journal is 128 megs, as I recall. So even if the journal was completely full, if it's taking 800 seconds, that's a write rate of 0.16 Mb/S (164 kb/second). That is indeed way too slow. I assume this wasn't your boot partition, so the journal replay was being done by e2fsck, right? Or are you guys skipping e2fsck and the journal replay was happening when you mounted the partition? If the journal replay is happening via e2fsck, is fsck running any other filesystem checks in parallel? Also, what is the geometry of your raid? How many disks, what RAID level, and what is the chunk size? The journal replay is done a filesystem block at a time, so it could be that it's turning into a large number of read-modify-writes, which is trashing your performance if the chunk size is really large. The other thing that might explain the performan problem is if the somehow the number of multiple outstanding requests allowed by the hard drive has been clamped down to a very small number, and so a large number of small read/write requests is really killing performance. The system dmesg log might have some hidden clues about that. - Ted _______________________________________________ Ext3-users mailing list Ext3-users@redhat.com https://www.redhat.com/mailman/listinfo/ext3-users |
Questions regarding journal replay
Ralf Hildebrandt wrote:
> * Eric Sandeen <sandeen@redhat.com>: >> Ralf Hildebrandt wrote: >>> * Ralf Hildebrandt <Ralf.Hildebrandt@charite.de>: >>>> * Curtis Doty <Curtis@GreenKey.net>: >>>>> Yesterday Ralf Hildebrandt said: >>>>> >>>>>> The journal replay too quite a while. About 800 seconds. >>>>>> >>>>> Were there any other background iops on the underlying volume >>>>> devices? Like maybe raid reconstruction? >>>> I don't think so. The machine never powered off... >>> Again, 2.6.28.7 failed us and now we're encountering another journal >>> replay. Taking ages. This sucks. >>> >>> Questions: >>> >>> How can I find out (during normal operation) HOW MUCH of the >>> journal is actually in use? >>> >>> How can I resize the journal to be smaller, thus making a journal >>> replay faster? >>> >> It'd be better to get to the bottom of the problem ... maybe iostat >> while it's happening to see if IO is actually happening; run blktrace to >> see where IO is going, do a few sysrq-t's to see where threads are at, etc. > > We had 24GB of reading from the journal device (or 12GB if it's > 512byte blocks). I wonder why? 24GB of reading from the journal device (during that 800s of replay during mount?), and your journal is 128M ... well that's odd. You say journal device; is this an external journal? I didn't think so from your first email, but is it? >> Can you find a way to reproduce this at will? > > Yes. My users will kill me, though. No spare box, eh :( >> Journal replay should *never* take this long, AFAIK. > > Amen > so let's figure it out :) -Eric _______________________________________________ Ext3-users mailing list Ext3-users@redhat.com https://www.redhat.com/mailman/listinfo/ext3-users |
Questions regarding journal replay
* Theodore Tso <tytso@mit.edu>:
> On Wed, Feb 25, 2009 at 10:31:42AM -0600, Eric Sandeen wrote: > > > > It'd be better to get to the bottom of the problem ... maybe iostat > > while it's happening to see if IO is actually happening; run blktrace to > > see where IO is going, do a few sysrq-t's to see where threads are at, etc. > > > > Can you find a way to reproduce this at will? > > > > Journal replay should *never* take this long, AFAIK. > > Indeed. The journal is 128 megs, as I recall. So even if the journal > was completely full, if it's taking 800 seconds, that's a write rate > of 0.16 Mb/S (164 kb/second). That is indeed way too slow. The problem seems to be with the external journal which I recently changed to. It's a 32GB partition. My timings seem to indicate that ALL OF IT was being replayed > I assume this wasn't your boot partition, so the journal replay was > being done by e2fsck, right? Yes > Or are you guys skipping e2fsck and the journal replay was happening > when you mounted the partition? Both. We tried both ways :) > If the journal replay is happening via e2fsck, is fsck running any > other filesystem checks in parallel? No, it's running alone. > Also, what is the geometry of your raid? How many disks, what RAID > level, and what is the chunk size? The journal replay is done a > filesystem block at a time, so it could be that it's turning into a > large number of read-modify-writes, which is trashing your performance > if the chunk size is really large. The RAID is made up from one logical volume, consisting of two drives sda and sdb, each containing 6 disks in a hardware RAID5 setup. > The other thing that might explain the performan problem is if the > somehow the number of multiple outstanding requests allowed by the > hard drive has been clamped down to a very small number, and so a > large number of small read/write requests is really killing > performance. The system dmesg log might have some hidden clues about > that. dmesg is silent -- Ralf Hildebrandt Ralf.Hildebrandt@charite.de Charite - Universitätsmedizin Berlin Tel. +49 (0)30-450 570-155 Geschäftsbereich IT | Abt. Netzwerk Fax. +49 (0)30-450 570-962 Hindenburgdamm 30 | 12200 Berlin _______________________________________________ Ext3-users mailing list Ext3-users@redhat.com https://www.redhat.com/mailman/listinfo/ext3-users |
Questions regarding journal replay
* Eric Sandeen <sandeen@redhat.com>:
> >> It'd be better to get to the bottom of the problem ... maybe iostat > >> while it's happening to see if IO is actually happening; run blktrace to > >> see where IO is going, do a few sysrq-t's to see where threads are at, etc. > > > > We had 24GB of reading from the journal device (or 12GB if it's > > 512byte blocks). I wonder why? > > 24GB of reading from the journal device (during that 800s of replay > during mount?), and your journal is 128M ... well that's odd. After my initial report I removed the journal and created an external journal on a 32GB partition. Hoping it would be faster, since accoriding to the docs. the journal size is limited to 128MB. > You say journal device; is this an external journal? I didn't think so > from your first email, but is it? It is now. # dumpe2fs -h /dev/cciss/c0d0p6 dumpe2fs 1.41.3 (12-Oct-2008) Filesystem volume name: journal_device Last mounted on: <not available> Filesystem UUID: 1a7063f5-8965-40f2-9feb-e37d6ac467e9 Filesystem magic number: 0xEF53 Filesystem revision #: 1 (dynamic) Filesystem features: journal_dev Default mount options: (none) Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 0 Block count: 8488436 Reserved block count: 0 Free blocks: 0 Free inodes: 0 First block: 0 Block size: 4096 Fragment size: 4096 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 0 Inode blocks per group: 0 Filesystem created: Thu Feb 5 14:05:36 2009 Last mount time: n/a Last write time: Thu Feb 5 14:15:26 2009 Mount count: 0 Maximum mount count: 30 Last checked: Thu Feb 5 14:05:36 2009 Check interval: 15552000 (6 months) Next check after: Tue Aug 4 15:05:36 2009 Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 256 Required extra isize: 28 Desired extra isize: 28 Default directory hash: half_md4 Directory Hash Seed: fddb247a-97df-4582-bfcd-816ef8c17ab2 Journal block size: 4096 Journal length: 8488436 Journal first block: 2 Journal sequence: 0x0027c611 Journal start: 2 Journal number of users: 1 Journal users: 032613d3-6035-4872-bc0a-11db92feec5e -- Ralf Hildebrandt Ralf.Hildebrandt@charite.de Charite - Universitätsmedizin Berlin Tel. +49 (0)30-450 570-155 Geschäftsbereich IT | Abt. Netzwerk Fax. +49 (0)30-450 570-962 Hindenburgdamm 30 | 12200 Berlin _______________________________________________ Ext3-users mailing list Ext3-users@redhat.com https://www.redhat.com/mailman/listinfo/ext3-users |
| All times are GMT. The time now is 09:35 PM. |
VBulletin, Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.
Content Relevant URLs by vBSEO ©2007, Crawlability, Inc.