Old 03-13-2010, 02:05 AM
Shridhar Daithankar
 
Default Btrfs more than twice as fast compared to ext4

Hi,

Just wanted to share an interesting experience I had today.

Check http://ghodechhap.net/btrfs.performance.txt
--
Regards
Shridhar
 
Old 03-15-2010, 05:01 AM
Nilesh Govindarajan
 

On 03/13/2010 08:35 AM, Shridhar Daithankar wrote:

> Hi,
>
> Just wanted to share an interesting experience I had today.
>
> Check http://ghodechhap.net/btrfs.performance.txt


Great. Has a stable version been released?

--
Nilesh Govindarajan
Site & Server Administrator
www.itech7.com
 
Old 03-15-2010, 09:14 AM
Nathan Wayde
 

On 13/03/10 03:05, Shridhar Daithankar wrote:

> Hi,
>
> Just wanted to share an interesting experience I had today.
>
> Check http://ghodechhap.net/btrfs.performance.txt


Maybe you're looking for http://docs.python.org/library/filecmp.html

One cannot help but think that you took a disk-bound process and turned
it into a CPU-bound one. Since you are only interested in which files
differ, you could have used `cmp` instead of `md5sum`. The latter is
overkill, and I'd assume that calling an external command that many
times can't be very nice either.


Here are some comparisons. They use /usr/lib; I figured 75000 files
should be a good test. I made this deliberately as unfair and
incomparable as possible, because I wanted to show the potential
overhead of calling md5sum that many times.


[[ky] ~]# }} time python -c "import filecmp; print len(filecmp.dircmp('/usr/lib', '/temp/lib').diff_files)"

80

real 2m24.240s
user 0m10.123s
sys 0m10.669s

That looks reasonable on this crappy 5400 rpm (SATA) laptop hard disk
with ext4.
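
A minimal sketch of a recursive, byte-for-byte variant of the dircmp one-liner above. As far as the filecmp documentation goes, `dircmp.diff_files` only covers the two top-level directories, and dircmp compares shallowly (by os.stat() signature) by default, so a full-tree content check needs a walk plus `filecmp.cmp(..., shallow=False)`. The function name `differing_files` is made up for illustration.

```python
import filecmp
import os


def differing_files(left_root, right_root):
    """Return sorted relative paths of files under left_root that are
    missing from, or differ in content from, the same path under right_root."""
    different = []
    for dirpath, _dirnames, filenames in os.walk(left_root):
        for name in filenames:
            left = os.path.join(dirpath, name)
            rel = os.path.relpath(left, left_root)
            right = os.path.join(right_root, rel)
            # shallow=False forces a content comparison instead of
            # trusting the size and mtime reported by os.stat().
            same = (os.path.isfile(left) and os.path.isfile(right)
                    and filecmp.cmp(left, right, shallow=False))
            if not same:
                different.append(rel)
    return sorted(different)
```

Something like `print(len(differing_files('/usr/lib', '/temp/lib')))` would be the rough equivalent of the command above. Files that exist only on the right-hand side are not reported by this sketch.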


Note that the test below is pretty much just to see how much time
calling md5sum takes. /tmp/a is a 1-byte file (it contains the character
a, to give md5sum as simple a job as possible). /tmp is a tmpfs, not that
it matters, as /tmp/a most likely remains cached the entire time.


[[ky] ~]# }} time find /temp/lib -type f | wc -l
75272

real 0m0.532s
user 0m0.140s
sys 0m0.383s

[[ky] ~]# }} time find /temp/lib -type f -exec md5sum /tmp/a \;

real 2m6.781s
user 0m2.200s
sys 0m15.409s

The disk-status light didn't come on at all during those 2 minutes;
meanwhile I could hear my CPU fan going crazy the whole time (1.6 GHz).
I should note that the light remained on the entire time during the
filecmp run, and the CPU stayed low (800 MHz) for most of that time as well.
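
The same kind of measurement can be sketched in Python: time a number of child-process launches against the same number of in-process MD5 calls on a 1-byte payload. This is only an illustration of per-process overhead under the assumption that the command passed in does almost no work; `time_spawns` and `time_in_process` are hypothetical helper names, and the numbers will vary by machine.

```python
import hashlib
import subprocess
import time


def time_spawns(command, runs):
    """Return the seconds taken to launch `command` `runs` times, one after another."""
    start = time.perf_counter()
    for _ in range(runs):
        # Output is discarded so only process start-up and exit are timed.
        subprocess.run(command, stdout=subprocess.DEVNULL, check=True)
    return time.perf_counter() - start


def time_in_process(runs, payload=b"a"):
    """Return the seconds taken to hash `payload` `runs` times without spawning."""
    start = time.perf_counter()
    for _ in range(runs):
        hashlib.md5(payload).hexdigest()
    return time.perf_counter() - start
```

For example, `time_spawns(["md5sum", "/tmp/a"], 1000)` against `time_in_process(1000)` gives a rough idea of how much of the find -exec run is spent just starting processes.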
 
Old 03-15-2010, 11:48 PM
Shridhar Daithankar
 

On Monday 15 March 2010 15:44:35 Nathan Wayde wrote:
> On 13/03/10 03:05, Shridhar Daithankar wrote:
> > Hi,
> >
> > Just wanted to share an interesting experience I had today.
> >
> > Check http://ghodechhap.net/btrfs.performance.txt
>
> Maybe you're looking for http://docs.python.org/library/filecmp.html
>
> One cannot help but think that you took a disk-bound process and turned
> it into a cpu-bound one. Since you're just interested in which files are
> different you should have just used `cmp` instead of `md5sum`
> the latter is just overkill and I'd assume calling an external command
> that many times can't be very nice either.
>
> here are some comparisons, they use /usr/lib - i figured 75000 files
> should be a good test... I made this as deliberately
> unfair/in-comparable as possible, I wanted to show the potential
> overhead of calling md5sum that many times.

I didn't know of cmp, thanks. I tried the same thing with cmp in loops, and it
agrees with your comment: it is totally I/O bound, not CPU bound at all.

However, even in the md5sum case I/O was high too; the disk light was on all
the time. Maybe the difference is down to CPU speed.

But as far as file system performance goes, the overhead should be identical
for both the runs, no?

Besides, I need to run the comparison (or rather, verification of file
contents) many times over during the application life cycle, and I cannot
afford to bring in another copy from disk. The working set is expected to be
30-40GB at a time; 3GB is just the test setup.

With md5sum, I can store the checksum in a database and verify against one copy only.
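
A rough sketch of that store-then-verify idea, assuming an SQLite database and a made-up `checksums` table; the thread does not say which database or schema the application uses, so all names here are hypothetical.

```python
import hashlib
import os
import sqlite3


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_tree(db, root):
    """Store one (relative path, md5) row per file under root."""
    db.execute("CREATE TABLE IF NOT EXISTS checksums "
               "(path TEXT PRIMARY KEY, md5 TEXT NOT NULL)")
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            db.execute("INSERT OR REPLACE INTO checksums VALUES (?, ?)",
                       (os.path.relpath(full, root), md5_of(full)))
    db.commit()


def verify_tree(db, root):
    """Return relative paths that are missing or whose MD5 no longer matches."""
    bad = []
    rows = db.execute("SELECT path, md5 FROM checksums ORDER BY path").fetchall()
    for rel, expected in rows:
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or md5_of(full) != expected:
            bad.append(rel)
    return bad
```

After `record_tree(sqlite3.connect("sums.db"), root)` has run once, later calls to `verify_tree` only need to read the one copy on disk.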

And finally, cmp is terrible on timings. Running md5sum is a lot faster, about
3 times in the best case.

shridhar@bheem /mnt1/shridhar/tmp/importtest.big$ time for i in `find . -type f`;do cmp "$i" "/data/shridhar/tmp/4/$i";done

real 21m30.137s
user 0m27.665s
sys 1m21.581s
shridhar@bheem /data/shridhar/tmp/4$ time for i in `find . -type f`;do cmp "$i" "/mnt1/shridhar/tmp/importtest.big/$i";done

real 6m26.988s
user 0m40.721s
sys 1m28.371s
shridhar@bheem /mnt1/shridhar/tmp/importtest.big$ time for i in `find . -type f`;do cmp "$i" "/data/shridhar/tmp/4/$i";done

real 16m27.541s
user 0m37.281s
sys 1m23.995s

So when the source file system is btrfs, it is still a couple of times faster
at least.
--
Regards
Shridhar
 
Old 03-16-2010, 08:11 AM
Nathan Wayde
 

On 16/03/10 00:48, Shridhar Daithankar wrote:

> [...]
> But as far as file system performance goes, the overhead should be identical
> for both the runs, no?

I'm not too sure about that. I'm guessing there is less seeking going on
with Btrfs. Some file systems (reiserfs + reiser4, IIRC) are very good
with many small files, better than the ext* file systems; this may be
another case of that.



> Besides, I need to run the comparison(rather verification of file contents)
> many times over during the application life-cycle and I cannot afford to bring
> in another copy from disk. The working set is expected to be 30-40GB at a
> time, 3GB is just test setup.
>
> With md5sum, I can store it in database and verify it on one copy only.


Fair enough.


> And finally, it is terrible on timings. Running md5sum is lot faster, about 3
> times in the best case.
> [...]

wow, that's slow!


> So when the source file system is btrfs, it is still couple of times faster at
> least.

I still think you could achieve better times by not calling the external
command that many times.
Since you're already gonna store the checksums in a database, I'd just
write a proper program in python or something.


Or even just a shell script, but you might want to refrain from
for .. in `find ..`: it's the slowest, and it relies on your filenames
not having spaces in them.


[[ky] ~]# }} time find /usr/bin -type f -print0 | xargs -0 md5sum > /tmp/1
real 0m3.633s

[[ky] ~]# }} time find /usr/bin -type f -exec md5sum "{}" \; > /tmp/2
real 0m10.196s
[[ky] ~]# }} time for i in `find /usr/bin -type f`;do md5sum "$i";done > /tmp/3

real 0m11.245s

This last version missed a file because its name contains spaces; as a
result, file 3 was inconsistent with files 1 and 2.


[[ky] ~]# }} diff /tmp/{1,2}
[[ky] ~]# }} diff /tmp/{3,2}
3054a3055
> 0c5d8f10aa0731671a00961f059dc46e /usr/bin/New SMB and DCERPC features in Impacket.pdf


That was a test against just 4008 files, so you can imagine the time
savings with 50000+ files.
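
For the "proper program in python" route, a minimal sketch that hashes a whole tree inside one process, so no external command is started per file and names with spaces need no special handling. The generator name `md5sum_lines` is made up; the output imitates md5sum's "digest, two spaces, path" layout but does not reproduce md5sum's escaping of unusual filenames.

```python
import hashlib
import os


def md5sum_lines(root, chunk_size=1 << 20):
    """Yield one 'digest  path' line per regular file under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files, roughly like find -type f.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            digest = hashlib.md5()
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(chunk_size), b""):
                    digest.update(chunk)
            yield "%s  %s" % (digest.hexdigest(), path)
```

Writing the result out is then a loop such as `for line in md5sum_lines('/usr/bin'): print(line)`.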
 
Old 03-16-2010, 11:53 AM
Shridhar Daithankar
 

On Tuesday 16 March 2010 14:41:41 Nathan Wayde wrote:
> On 16/03/10 00:48, Shridhar Daithankar wrote:
> > [...]
> > But as far as file system performance goes, the overhead should be
> > identical for both the runs, no?
>
> I'm not too sure about that. I'm guessing there is less seeking going on
> with Btrfs. Some files systems (reiserfs + reiserfs4 IIRC) are very good
> with many small files, better than the ext*fs, this may be another case
> of that.

Yes, btrfs does have tail packing, i.e. storing the inode and the file data
together in a single block. However, all the files I had in the tree were
50-55K in size, and that definitely does not fit in a block.

> I still think you could achieve better times by not calling the external
> command that many times.
> Since you're already gonna store the checksums in a database, I'd just
> write a proper program in python or something.

The application I am developing already has copy/copytree and md5sum built in.
I mmap the whole file and do memcpy/memcmp/md5sum in a single pass. That is
already a bit faster than native cp, which uses write and buffer management.
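
The application's own code is not shown in this thread, so the following is only a Python analogue of the idea, not the actual implementation: map the source file once, then feed the same mapping to both the MD5 digest and the copy, so the data is read from disk a single time. `copy_with_md5` is a hypothetical name.

```python
import hashlib
import mmap
import os


def copy_with_md5(src_path, dst_path):
    """Copy src_path to dst_path via a memory map; return the MD5 hex digest."""
    digest = hashlib.md5()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        if os.fstat(src.fileno()).st_size == 0:
            # mmap cannot map an empty file; the empty copy already exists.
            return digest.hexdigest()
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as view:
            digest.update(view)  # hash straight from the mapped pages
            dst.write(view)      # write the same pages to the destination
    return digest.hexdigest()
```

The returned digest can be stored at copy time and checked later against a fresh hash of the destination.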

I changed/refactored the tree copy code and created a new tree, and I wanted
to verify outside the application that the tree copy had gone well. Hence the
find/md5sum. This was a one-time exercise only, but the results were drastic
enough to be published.

--
Regards
Shridhar
 
