Old 03-12-2013, 10:56 PM
Vincent Caron
 
ext4 and extremely slow filesystem traversal

Hello list,

I'm having trouble with the daily backup of a modest filesystem, which
tends to take more than 10 hours. I have ext4 all over the place on ~200
servers and have never run into such a problem.

The filesystem capacity is 300 GB (19.6M inodes) with 196 GB (9.3M
inodes) used. It's mounted with 'defaults,noatime'. It sits on a hardware
RAID array through plain LVM slices. The RAID array is a RAID5 running on
5x 500GB SATA disks, with a battery-backed (RAM) cache and a write-back
cache policy. To be precise, it's an Areca 1231.

The hardware RAID array uses 64kB stripes, and I've configured the
filesystem with 4kB blocks and stride=16. It also has 0 reserved blocks.
In other words, the fs was created with 'mkfs -t ext4 -E stride=16 -m 0
-L volname /dev/vgX/Y'. I'm attaching the mke2fs.conf for reference too.
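
(For completeness, telling ext4 the full stripe width as well would look
something like the sketch below. The stripe_width figure is an
assumption, based on the 64kB being the per-disk chunk of a 5-disk
RAID5 with 4 data disks.)

# Hedged sketch: if 64kB is the per-disk chunk, the usual alignment
# math for a 5-disk RAID5 (4 data disks) works out to:
#   stride       = chunk / block       = 64KiB / 4KiB = 16
#   stripe_width = stride * data disks = 16 * 4       = 64
mkfs -t ext4 -b 4096 -E stride=16,stripe_width=64 -m 0 -L volname /dev/vgX/Y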

Everything is running Debian Squeeze with its 2.6.32 kernel (amd64
flavour), on a server with 4 cores and 4 GB of RAM.

I ran a tiobench tonight on an idle instance (I have two identical
systems - hardware, software, data - with exactly the same problem).
I've attached the results as plain text to protect them from line
wrapping. They look fine to me.

When I try to back up the problematic filesystem with tar, rsync or
whatever tool traverses the whole filesystem, things are awful. I know
that this filesystem has *lots* of directories, most with few or no
files in them. Tonight I ran a simple 'find /path/to/vol -type d |pv
-bl' (which counts directories as they are found); I stopped it more
than 2 hours later: it was not done, and had already counted more than
2M directories. IO stats showed 1000 read calls/sec with avq=1 and
avio=5 ms. CPU is at 2%, so it is totally I/O bound. This looks like
the worst-case random read workload to me.
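
(For reference, figures of that shape can be watched with sysstat's
iostat; exact column names vary by tool, and the device name below is a
placeholder.)

# Extended per-device statistics every 5 seconds: reads/s, average
# queue size and average service time, among others
iostat -x 5 /dev/sdX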

I even tried a hack which tries to sort directories while traversing
the filesystem to no avail.

Right now I don't even know how to analyze my filesystem further.
Sorry for not being able to describe it more accurately. I'm looking
for any advice or direction to improve this situation - while keeping
ext4, of course.

PS: I did ask the developers not to abuse the filesystem that way, and
told them that in 2013 it's okay to have 10k+ files per directory... No
success, so I guess I'll have to work around it.

filer:/srv/painfulvol/bench# tiobench --size 10000
Run #1: /usr/bin/tiotest -t 8 -f 1250 -r 500 -b 4096 -d . -T

Unit information
================
File size = megabytes
Blk Size = bytes
Rate = megabytes per second
CPU% = percentage of CPU used during the test
Latency = milliseconds
Lat% = percent of requests that took longer than X seconds
CPU Eff = Rate divided by CPU% - throughput per cpu load

Sequential Reads
File Blk Num Avg Maximum Lat% Lat% CPU
Identifier Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
---------------------------- ------ ----- --- ------ ------ --------- ----------- -------- -------- -----
2.6.32-5-amd64 10000 4096 1 215.82 42.21% 0.017 1384.29 0.00000 0.00000 511
2.6.32-5-amd64 10000 4096 2 129.51 48.53% 0.057 5115.46 0.00020 0.00000 267
2.6.32-5-amd64 10000 4096 4 89.80 66.26% 0.168 6697.64 0.00043 0.00000 136
2.6.32-5-amd64 10000 4096 8 77.11 113.3% 0.394 6750.12 0.00102 0.00000 68

Random Reads
File Blk Num Avg Maximum Lat% Lat% CPU
Identifier Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
---------------------------- ------ ----- --- ------ ------ --------- ----------- -------- -------- -----
2.6.32-5-amd64 10000 4096 1 0.79 0.302% 4.951 58.56 0.00000 0.00000 260
2.6.32-5-amd64 10000 4096 2 0.41 0.328% 17.165 174.55 0.00000 0.00000 126
2.6.32-5-amd64 10000 4096 4 0.80 1.024% 18.848 358.64 0.00000 0.00000 78
2.6.32-5-amd64 10000 4096 8 0.82 1.801% 35.989 808.74 0.00000 0.00000 45

Sequential Writes
File Blk Num Avg Maximum Lat% Lat% CPU
Identifier Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
---------------------------- ------ ----- --- ------ ------ --------- ----------- -------- -------- -----
2.6.32-5-amd64 10000 4096 1 243.70 78.53% 0.014 492.80 0.00000 0.00000 310
2.6.32-5-amd64 10000 4096 2 186.89 150.9% 0.037 1969.62 0.00000 0.00000 124
2.6.32-5-amd64 10000 4096 4 113.90 209.8% 0.122 6303.26 0.00137 0.00000 54
2.6.32-5-amd64 10000 4096 8 88.32 336.6% 0.307 9451.83 0.00285 0.00000 26

Random Writes
File Blk Num Avg Maximum Lat% Lat% CPU
Identifier Size Size Thr Rate (CPU%) Latency Latency >2s >10s Eff
---------------------------- ------ ----- --- ------ ------ --------- ----------- -------- -------- -----
2.6.32-5-amd64 10000 4096 1 107.11 101.4% 0.009 0.06 0.00000 0.00000 106
2.6.32-5-amd64 10000 4096 2 173.32 337.2% 0.010 0.04 0.00000 0.00000 51
2.6.32-5-amd64 10000 4096 4 224.92 921.3% 0.011 0.76 0.00000 0.00000 24
2.6.32-5-amd64 10000 4096 8 206.05 1598.% 0.012 1.00 0.00000 0.00000 13
[defaults]
base_features = sparse_super,filetype,resize_inode,dir_index,ext_attr
blocksize = 4096
inode_size = 256
inode_ratio = 16384

[fs_types]
ext3 = {
features = has_journal
}
ext4 = {
features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
inode_size = 256
}
ext4dev = {
features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
inode_size = 256
options = test_fs=1
}
small = {
blocksize = 1024
inode_size = 128
inode_ratio = 4096
}
floppy = {
blocksize = 1024
inode_size = 128
inode_ratio = 8192
}
news = {
inode_ratio = 4096
}
largefile = {
inode_ratio = 1048576
blocksize = -1
}
largefile4 = {
inode_ratio = 4194304
blocksize = -1
}
hurd = {
blocksize = 4096
inode_size = 128
}
_______________________________________________
Ext3-users mailing list
Ext3-users@redhat.com
https://www.redhat.com/mailman/listinfo/ext3-users
 
Old 03-13-2013, 01:52 AM
"Theodore Ts'o"
 
ext4 and extremely slow filesystem traversal

On Wed, Mar 13, 2013 at 12:56:15AM +0100, Vincent Caron wrote:
>
> I even tried a hack which tries to sort directories while traversing
> the filesystem to no avail.

Did you sort the results from readdir() by inode number - i.e., as
the following LD_PRELOAD hack does?

https://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/tree/contrib/spd_readdir.c?h=maint

> Right now I don't even know how to analyze my filesystem further.
> Sorry for not being able to describe it more accurately. I'm looking
> for any advice or direction to improve this situation - while keeping
> ext4, of course.

Try running "e2fsck -fv /dev/XXX" and send me the output.

Also useful would be the output of "e2freefrag /dev/XXX" and "dumpe2fs -h"

- Ted

 
Old 03-13-2013, 08:19 AM
Vincent Caron
 
ext4 and extremely slow filesystem traversal

On 13/03/2013 03:52, Theodore Ts'o wrote:
>
> Did you sort the results from readdir() by inode number - i.e., as
> the following LD_PRELOAD hack does?
>
> https://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/tree/contrib/spd_readdir.c?h=maint

I don't think I tried this specific hack; I'm having a go right now.
Is it still useful if each directory only holds a few inodes?


>> Right now I don't even know how to analyze my filesystem further.
>> Sorry for not being able to describe it more accurately. I'm looking
>> for any advice or direction to improve this situation - while keeping
>> ext4, of course.
>
> Try running "e2fsck -fv /dev/XXX" and send me the output.
>
> Also useful would be the output of "e2freefrag /dev/XXX" and "dumpe2fs -h"

Information attached. Dumpe2fs said: dumpe2fs 1.42.5 (29-Jul-2012).

Thanks for your help!
Device: /dev/vg-raid3/xyz
Blocksize: 4096 bytes
Total blocks: 78643200
Free blocks: 26099944 (33.2%)

Min. free extent: 4 KB
Max. free extent: 557632 KB
Avg. free extent: 468 KB

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range : Free extents Free Blocks Percent
4K... 8K- : 52301 52301 0.20%
8K... 16K- : 36907 87967 0.34%
16K... 32K- : 36063 187694 0.72%
32K... 64K- : 41123 433236 1.66%
64K... 128K- : 26411 579943 2.22%
128K... 256K- : 16236 745344 2.86%
256K... 512K- : 6073 513906 1.97%
512K... 1024K- : 4378 774518 2.97%
1M... 2M- : 699 244092 0.94%
2M... 4M- : 392 280118 1.07%
4M... 8M- : 289 420979 1.61%
8M... 16M- : 308 906478 3.47%
16M... 32M- : 289 1520835 5.83%
32M... 64M- : 67 668899 2.56%
64M... 128M- : 479 13748140 52.67%
128M... 256M- : 16 875692 3.36%
256M... 512M- : 39 3920394 15.02%
512M... 1024M- : 1 139408 0.53%
e2fsck 1.41.12 (17-May-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

9345910 inodes used (47.54%)
12447 non-contiguous files (0.1%)
34 non-contiguous directories (0.0%)
# of inodes with ind/dind/tind blocks: 0/0/0
Extent depth histogram: 9345889/11
52543256 blocks used (66.81%)
0 bad blocks
1 large file

3633315 regular files
5712586 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
--------
9345901 files
 
Old 03-13-2013, 10:59 AM
Vincent Caron
 
ext4 and extremely slow filesystem traversal

On 13/03/2013 10:19, Vincent Caron wrote:
>> > Did you sort the results from readdir() by inode number - i.e., as
>> > the following LD_PRELOAD hack does?
>> >
>> > https://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/tree/contrib/spd_readdir.c?h=maint
> I don't think I tried this specific hack; I'm having a go right now.
> Is it still useful if each directory only holds a few inodes?

Same slowness. I ran:

filer:~# gcc -shared -fPIC -ldl -o spd_readdir.so spd_readdir.c
filer:~# LD_PRELOAD=./spd_readdir.so find /srv/vol -type d |pv -bl

I stopped the experiment at +54min with 845k directories found (which
gives roughly the same rate of 1M directories/hour, and I know there
are more than 2M of them).

 
Old 03-13-2013, 07:33 PM
"Theodore Ts'o"
 
ext4 and extremely slow filesystem traversal

On Wed, Mar 13, 2013 at 10:19:52AM +0100, Vincent Caron wrote:
>
> 3633315 regular files
> 5712586 directories
> 0 character device files
> 0 block device files
> 0 fifos
> 0 links
> 0 symbolic links (0 fast symbolic links)
> 0 sockets
> --------
> 9345901 files (really in-use inodes)

Wow. You have more directories than regular files! Given that there
are no hard links, that implies that you have at least 2,079,271
(= 5,712,586 - 3,633,315) directories which are ***empty***.

The inline data feature (which is still in testing and isn't something
I can recommend for production use yet) is probably the best hope for
you. But probably the best thing you can do is to harangue your
developers and ask what the heck they are doing....
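
(If and when that feature ships, enabling it should look something like
the sketch below; treat the exact feature name and tool support as
assumptions until it appears in a released e2fsprogs.)

# Assumed future usage: store tiny files and directories inside the
# inode itself, avoiding a separate 4KiB block per near-empty directory
mkfs -t ext4 -O inline_data -L volname /dev/vgX/Y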

- Ted

 
Old 03-13-2013, 07:49 PM
Vincent Caron
 
ext4 and extremely slow filesystem traversal

On 13/03/2013 21:33, Theodore Ts'o wrote:
> Wow. You have more directories than regular files! Given that there
> are no hard links, that implies that you have at least 2,079,271
> directories which are ***empty***.

Awful, isn't it? I knew the directories were abused, but didn't know
that 'e2fsck -v' would display the exact figures (since I never waited
the 5+ hours it takes to scan the whole filesystem). Nice to know.


> The inline data feature (which is still in testing and isn't something
> I can recommend for production use yet) is probably the best hope for
> you. But probably the best thing you can do is to harrague your
> developers to ask what the heck they are doing....

Indeed, these filers are storing live and sensitive data and are
conservatively running stable OS and well known kernels.

Thanks for your advice, I'll actively work with the devs in order to
refactor their filesystem layout.

 
Old 03-13-2013, 07:52 PM
"Theodore Ts'o"
 
ext4 and extremely slow filesystem traversal

On Wed, Mar 13, 2013 at 09:49:20PM +0100, Vincent Caron wrote:
> On 13/03/2013 21:33, Theodore Ts'o wrote:
> > Wow. You have more directories than regular files! Given that there
> > are no hard links, that implies that you have at least 2,079,271
> > directories which are ***empty***.
>
> Awful, isn't it? I knew the directories were abused, but didn't know
> that 'e2fsck -v' would display the exact figures (since I never waited
> the 5+ hours it takes to scan the whole filesystem). Nice to know.

To be clear, that's at least two million empty directories, assuming
that all of the other directories have but a single file in them(!). In
reality you probably have a lot more than 2 million empty
directories....

- Ted

 
Old 03-13-2013, 08:29 PM
"Peter Grandi"
 
ext4 and extremely slow filesystem traversal

> I'm having trouble with the daily backup of a modest filesystem,
> which tends to take more than 10 hours. [ ... ] with 196 GB
> (9.3M inodes) used.

That is roughly 1M inodes/hour and 20GB/hour, or nearly 300 inodes/s
and nearly 6MB/s. Those are very good numbers for a high random-IOPS
load, and as seen later, that is what you have.
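
(Spelling out the arithmetic, as a quick sanity check:)

# 9.3M inodes and 196GB read over a roughly 10-hour backup window
echo 'scale=2; 9300000 / (10 * 3600)' | bc     # ~258 inodes/s
echo 'scale=2; 196 * 1024 / (10 * 3600)' | bc  # ~5.6 MiB/s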

> It's mounted with 'defaults,noatime'.

That helps.

> It sits on a hardware RAID array through plain LVM slices.

That's the pointless default... but it does not particularly slow
things down here.

> The RAID array is a RAID5 running on 5x 500GB SATA disks, with a
> battery-backed (RAM) cache and a write-back cache policy. To be
> precise, it's an Areca 1231. The hardware RAID array uses 64kB
> stripes, and I've configured the filesystem with 4kB blocks and
> stride=16.

Striping and alignment are not relevant on reads, but the stride
matters a great deal for metadata parallelism, and here the filesystem
stride is set to 64KiB (16 blocks of 4KiB). If the array's 64kB figure
is the full 4-wide stripe, the per-disk stride is actually 16KiB; but
since 64KiB is an integral multiple of that, it should be about as
good. And since the backup performance is pretty good, that seems to
be the case.

> It also has 0 reserved blocks.

That's usually a truly terrible setting (20% is a much better
value), but your filesystem is not very full anyhow.

> When I try to back up the problematic filesystem with tar, rsync
> or whatever tool traverses the whole filesystem, things are
> awful.

Rather, they are pretty good. Each 500GB SATA disk can usually do
somewhat fewer than 100 random IOPS, there are 4 disks in each stripe
when reading, and you are getting nearly 300 inodes/s and 5MB/s, quite
close to the maximum. On random loads with smallish records, typical
rotating disks deliver transfer rates of 0.5MB/s to 1.5MB/s, and you
are getting rather more than that (mostly thanks to the ~20KiB of data
per inode).

You are getting pretty good delivery from 'ext4' and a very low
random-IOPS storage system on a highly randomized workload:

> I know that this filesystem has *lots* of directories, most
> with few or no files in them.

That's a really bad idea.

> Tonight I ran a simple 'find /path/to/vol -type d |pv -bl'
> (which counts directories as they are found); I stopped it more
> than 2 hours later: it was not done, and had already counted more
> than 2M directories.

That's the usual 1M inodes/hour.

> [ ... ] I'm looking for any advice or direction to improve
> this situation - while keeping ext4, of course.

Well, any system administrator would tell you the same: your backup
workload and your storage system are mismatched, and the best solution
is probably to use 146GB SAS 15K RPM disks for the same capacity (or
more). Or perhaps recent enterprise-level SSDs.

The "small file" problem is ancient, and I call it the
"mailstore" problems from its typical incarnation:

http://www.sabi.co.uk/blog/12-thr.html#120429

> PS: I did ask the developers not to abuse the filesystem
> that way,

The "I use the filesystem as a DBMS" attitude is really very
common among developers. It is cost-free to them, and backup (and
system) administrators bear the cost when the filesystem fills
up. Because at the beginning everything looks fine. Designing
stuff that seems cheap and fast at the beginning even if it
becomes very bad after some time is a good way to look like a
winner in most organizations.

> and told them that in 2013 it's okay to have 10k+ files per directory...

It's not; it is a very bad idea. In 2013, just like in 1973 or in
1993, it is a much better idea to use simple indexed files to keep a
collection of smallish records.
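
(One minimal illustration, using sqlite3 purely as a stand-in for
"simple indexed files"; the file, table and key names are made up.)

# Millions of smallish records kept in one indexed file instead of
# millions of directories
sqlite3 records.db "CREATE TABLE IF NOT EXISTS rec (key TEXT PRIMARY KEY, val BLOB);"
sqlite3 records.db "INSERT OR REPLACE INTO rec VALUES ('item-0001', x'00ff');"
sqlite3 records.db "SELECT length(val) FROM rec WHERE key = 'item-0001';"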

Directories are a classification system, not a database indexing
system. Here is an amusing report of the difference between the
two:

http://www.sabi.co.uk/blog/anno05-4th.html#051016

> No success, so I guess I'll have to work around it.

As a backup administrator you can't do much better in your situation.
You are already getting nearly the best performance possible for
whole-tree scans of very random small records on a low random-IOPS
storage layer.

 
Old 03-13-2013, 08:50 PM
Vincent Caron
 
ext4 and extremely slow filesystem traversal

On 13/03/2013 22:29, Peter Grandi wrote:
>> > It also has 0 reserved blocks.
> That's usually a truly terrible setting (20% is a much better
> value), but your filesystem is not very full anyhow.

This filesystem has no files owned by root and won't have any, so I
thought that in this case -m0 would be a good idea.
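
(For what it's worth, the setting can be changed later without a
reformat, even on a mounted filesystem; 5% is the historical default.)

# Restore a reserved-blocks percentage on an existing filesystem
# (the device name is a placeholder)
tune2fs -m 5 /dev/vgX/Y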

Thanks a lot for your detailed insight into the various performance
figures; I hadn't done the math to realize that this inode reading
rate was actually *good*.

Fortunately the client is technically savvy, and pointing him at this
mailing-list thread will help him make the right decision.

 
Old 03-14-2013, 12:05 AM
"Theodore Ts'o"
 
ext4 and extremely slow filesystem traversal

On Wed, Mar 13, 2013 at 09:49:20PM +0100, Vincent Caron wrote:
> On 13/03/2013 21:33, Theodore Ts'o wrote:
> > Wow. You have more directories than regular files! Given that there
> > are no hard links, that implies that you have at least 2,079,271
> > directories which are ***empty***.
>
> Awful, isn't it? I knew the directories were abused, but didn't know
> that 'e2fsck -v' would display the exact figures (since I never waited
> the 5+ hours it takes to scan the whole filesystem). Nice to know.

Just as a note, e2fsck -v can sometimes get this information much more
quickly than the alternatives, since it can scan the file system in
inode order instead of in essentially random order.

As an aside, if you just want a rough count of the number of
directories, you can get that by grabbing the information out of
dumpe2fs.

Group 624: (Blocks 20447232-20479999) [ITABLE_ZEROED]
Checksum 0xd3f5, unused inodes 4821
Block bitmap at 20447232 (+0), Inode bitmap at 20447248 (+16)
Inode table at 20447264-20447775 (+32)
24103 free blocks, 4821 free inodes, 435 directories, 4821 unused inodes
^^^^^^^^^^^^^^^
Free blocks: 20455889, 20455898-20479999
Free inodes: 5115180-5120000

Dumpe2fs doesn't actually sum the number of directories, and you won't
be able to differentiate regular files from symlinks, device nodes,
etc., but if you just want the number of directories, you can get it
out of dumpe2fs without having to wait for e2fsck to complete. You can
even do this with a mounted file system, though of course the number
won't necessarily be completely accurate in that case.

(You can get the number of inodes in use by subtracting the number of
free inodes from the number of inodes in the file system. If you then
subtract the number of directories, you get the split between
non-directory and directory inodes.)
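
(A sketch of that summing, for the record; the device name is a
placeholder as above.)

# Sum the per-group directory counts from the dumpe2fs output
dumpe2fs /dev/XXX 2>/dev/null | grep -o '[0-9]* directories' |
    awk '{ sum += $1 } END { print sum, "directories" }'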

- Ted

 
