Old 01-17-2012, 10:15 AM
Samuel Thibault
 
Bug#656142: ITP: duff -- Duplicate file finder

Lars Wirzenius, on Tue 17 Jan 2012 10:45:20 +0000, wrote:
> On Tue, Jan 17, 2012 at 10:30:20AM +0100, Samuel Thibault wrote:
> > Lars Wirzenius, on Tue 17 Jan 2012 09:12:58 +0000, wrote:
> > > real user system max RSS elapsed cmd
> > > (s) (s) (s) (KiB) (s)
> > > 3.2 2.4 5.8 62784 5.8 hardlink --dry-run files > /dev/null
> > > 1.1 0.4 1.6 15424 1.6 rdfind files > /dev/null
> > > 1.9 0.2 2.2 9904 2.2 duff-0.5/src/duff -r files > /dev/null
> >
> > And fdupes on the same set of files?
>
> real user system max RSS elapsed cmd
> (s) (s) (s) (KiB) (s)
> 3.1 2.4 5.5 62784 5.5 hardlink --dry-run files > /dev/null
> 1.1 0.4 1.6 15392 1.6 rdfind files > /dev/null
> 1.3 0.9 2.2 13936 2.2 fdupes -r -q files > /dev/null
> 1.9 0.2 2.1 9904 2.1 duff-0.5/src/duff -r files > /dev/null
>
> Someone should run the benchmark on a large set of data, preferably
> on various kinds of real data, rather than my small synthetic data set.

On my PhD work directory, with various stuff in it (500MiB, 18000 files,
big but also small files (svn/git checkouts etc)), everything being in
cache already (no disk I/O):

hardlink --dry-run . > /dev/null 0,55s user 0,18s system 99% cpu 0,734 total
rdfind . > /dev/null 0,68s user 0,19s system 99% cpu 0,877 total
fdupes -q -r . > /dev/null 2> /dev/null 0,80s user 0,90s system 99% cpu 1,708 total
~/src/duff-0.5/src/duff -r . > /dev/null 1,53s user 0,08s system 99% cpu 1,610 total

Samuel


 
Old 01-17-2012, 11:41 AM
Roland Mas
 
Bug#656142: ITP: duff -- Duplicate file finder

Samuel Thibault, 2012-01-17 12:03:41 +0100 :

[...]

> I'm not sure I understand what you mean exactly. If you have even
> just a hundred files of the same size, you will need ten thousand file
> comparisons!

I'm sure that can be optimised. Read all 100 files in parallel,
comparing blocks at the same offset. You need to perform 99 comparisons
on each block for as long as the blocks are identical; when one of the 99
doesn't match, you can split your set of files at this offset into at
least 2 equivalence classes, which you treat as separate subsets from
then on. A subset with only one file can be eliminated from the rest of
the scan, and even if only multiple-file subsets remain, the number of
comparisons to be performed in later steps is reduced by at least one.
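
(A minimal sketch of that splitting scheme, in illustrative Python; this is
not code from any of the tools discussed. It assumes all paths passed in have
the same file size, and ignores file-descriptor limits and I/O errors.)

    def find_identical(paths, block_size=64 * 1024):
        # Partition same-size files into groups of identical files by reading
        # them in lockstep, block by block, and splitting classes on mismatch.
        classes = [[open(p, 'rb') for p in paths]]   # start with one candidate class
        identical = []
        while classes:
            group = classes.pop()
            by_block = {}                            # block contents -> files producing it
            for f in group:
                by_block.setdefault(f.read(block_size), []).append(f)
            for block, members in by_block.items():
                if len(members) < 2:                 # a singleton can never be a duplicate
                    members[0].close()
                elif block == b'':                   # all reached EOF together: identical
                    identical.append([f.name for f in members])
                    for f in members:
                        f.close()
                else:                                # still matching; keep reading
                    classes.append(members)
        return identical

As soon as a class shrinks to a single file, that file stops being read, which
is where the saving over naive pairwise comparison comes from.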

Roland.
--
Roland Mas

You can tune a filesystem, but you can't tuna fish.
-- in the tunefs(8) manual page.


 
Old 01-17-2012, 12:02 PM
Samuel Thibault
 
Bug#656142: ITP: duff -- Duplicate file finder

Samuel Thibault, on Tue 17 Jan 2012 12:15:16 +0100, wrote:
> Lars Wirzenius, on Tue 17 Jan 2012 10:45:20 +0000, wrote:
> > On Tue, Jan 17, 2012 at 10:30:20AM +0100, Samuel Thibault wrote:
> > > Lars Wirzenius, on Tue 17 Jan 2012 09:12:58 +0000, wrote:
> > > > real user system max RSS elapsed cmd
> > > > (s) (s) (s) (KiB) (s)
> > > > 3.2 2.4 5.8 62784 5.8 hardlink --dry-run files > /dev/null
> > > > 1.1 0.4 1.6 15424 1.6 rdfind files > /dev/null
> > > > 1.9 0.2 2.2 9904 2.2 duff-0.5/src/duff -r files > /dev/null
> > >
> > > And fdupes on the same set of files?
> >
> > real user system max RSS elapsed cmd
> > (s) (s) (s) (KiB) (s)
> > 3.1 2.4 5.5 62784 5.5 hardlink --dry-run files > /dev/null
> > 1.1 0.4 1.6 15392 1.6 rdfind files > /dev/null
> > 1.3 0.9 2.2 13936 2.2 fdupes -r -q files > /dev/null
> > 1.9 0.2 2.1 9904 2.1 duff-0.5/src/duff -r files > /dev/null
> >
> > Someone should run the benchmark on a large set of data, preferably
> > on various kinds of real data, rather than my small synthetic data set.
>
> On my PhD work directory, with various stuff in it (500MiB, 18000 files,
> big but also small files (svn/git checkouts etc)), everything being in
> cache already (no disk I/O):
>
> hardlink -t --dry-run . > /dev/null 1,06s user 0,46s system 99% cpu 1,538 total
> rdfind . > /dev/null 0,68s user 0,19s system 99% cpu 0,877 total
> fdupes -q -r . > /dev/null 2> /dev/null 0,80s user 0,90s system 99% cpu 1,708 total
> ~/src/duff-0.5/src/duff -r . > /dev/null 1,53s user 0,08s system 99% cpu 1,610 total

And with nothing in cache, SSD hard drive:

hardlink -t --dry-run . > /dev/null 1,86s user 1,23s system 12% cpu 24,260 total
rdfind . > /dev/null 1,18s user 1,31s system 8% cpu 27,837 total
fdupes -q -r . > /dev/null 2> /dev/null 1,30s user 2,13s system 11% cpu 29,820 total
~/src/duff-0.5/src/duff -r . > /dev/null 1,88s user 0,47s system 16% cpu 13,949 total

(Yes, the user time is different, and the measurements are stable. Also
note that I have added -t to hardlink, otherwise it takes file timestamps
into account.)

I guess duff gets a clear win because it does not systematically compute
the checksum of every file with a matching size, but for big files first
reads just a few bytes.
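
(For what it's worth, that kind of size-then-sample-then-checksum pipeline
looks roughly like the following Python sketch; illustrative only, not duff's
actual implementation, and the 4096-byte sample size is an assumption.)

    import hashlib
    import os
    from collections import defaultdict

    SAMPLE_BYTES = 4096          # assumed sample size; the real tools may differ

    def duplicate_groups(paths):
        # Cheapest test first: group by size, then by the first few bytes,
        # and only checksum files that still look alike after both steps.
        by_size = defaultdict(list)
        for p in paths:
            by_size[os.path.getsize(p)].append(p)

        groups = []
        for same_size in by_size.values():
            if len(same_size) < 2:
                continue
            by_sample = defaultdict(list)
            for p in same_size:
                with open(p, 'rb') as f:
                    by_sample[f.read(SAMPLE_BYTES)].append(p)
            for candidates in by_sample.values():
                if len(candidates) < 2:
                    continue
                by_digest = defaultdict(list)
                for p in candidates:
                    h = hashlib.sha1()
                    with open(p, 'rb') as f:
                        for chunk in iter(lambda: f.read(1 << 20), b''):
                            h.update(chunk)
                    by_digest[h.hexdigest()].append(p)
                groups += [g for g in by_digest.values() if len(g) > 1]
        return groups

Most non-duplicates fall out at the size or sample stage, so the expensive
full read and checksum only happens for files that are already likely to match.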

samuel


 
Old 01-17-2012, 12:05 PM
Samuel Thibault
 
Bug#656142: ITP: duff -- Duplicate file finder

Roland Mas, on Tue 17 Jan 2012 13:41:23 +0100, wrote:
> Samuel Thibault, 2012-01-17 12:03:41 +0100 :
>
> [...]
>
> > I'm not sure I understand what you mean exactly. If you have even
> > just a hundred files of the same size, you will need ten thousand file
> > comparisons!
>
> I'm sure that can be optimised. Read all 100 files in parallel,
> comparing blocks at the same offset. You need to perform 99 comparisons
> on each block for as long as the blocks are identical;

Ah, right. So you'll start writing yet another tool?

Samuel


 
Old 01-17-2012, 12:17 PM
Lars Wirzenius
 
Bug#656142: ITP: duff -- Duplicate file finder

On Tue, Jan 17, 2012 at 02:05:10PM +0100, Samuel Thibault wrote:
> Roland Mas, on Tue 17 Jan 2012 13:41:23 +0100, wrote:
> > Samuel Thibault, 2012-01-17 12:03:41 +0100 :
> >
> > [...]
> >
> > > I'm not sure I understand what you mean exactly. If you have even
> > > just a hundred files of the same size, you will need ten thousand file
> > > comparisons!
> >
> > I'm sure that can be optimised. Read all 100 files in parallel,
> > comparing blocks at the same offset. You need to perform 99 comparisons
> > on each block for as long as the blocks are identical;
>
> Ah, right. So you'll start writing yet another tool?

I've implemented pretty much that (http://liw.fi/dupfiles), but my
duplicate file finder is not so much better than existing ones in
Debian that I would inflict it on Debian. But the algorithm works
nicely, and works even for people who research hash collisions.

--
Freedom-based blog/wiki/web hosting: http://www.branchable.com/
 
Old 01-17-2012, 12:29 PM
Samuel Thibault
 
Bug#656142: ITP: duff -- Duplicate file finder

Samuel Thibault, on Tue 17 Jan 2012 14:02:45 +0100, wrote:
> On my PhD work directory, with various stuff in it (500MiB, 18000 files,
> big but also small files (svn/git checkouts etc)), everything being in
> cache already (no disk I/O):
>
> hardlink -t --dry-run . > /dev/null 1,06s user 0,46s system 99% cpu 1,538 total
> rdfind . > /dev/null 0,68s user 0,19s system 99% cpu 0,877 total
> fdupes -q -r . > /dev/null 2> /dev/null 0,80s user 0,90s system 99% cpu 1,708 total
> ~/src/duff-0.5/src/duff -r . > /dev/null 1,53s user 0,08s system 99% cpu 1,610 total
dupfiles . > /dev/null 0,82s user 0,21s system 99% cpu 1,032 total

> And with nothing in cache, SSD hard drive:
>
> hardlink -t --dry-run . > /dev/null 1,86s user 1,23s system 12% cpu 24,260 total
> rdfind . > /dev/null 1,18s user 1,31s system 8% cpu 27,837 total
> fdupes -q -r . > /dev/null 2> /dev/null 1,30s user 2,13s system 11% cpu 29,820 total
> ~/src/duff-0.5/src/duff -r . > /dev/null 1,88s user 0,47s system 16% cpu 13,949 total
dupfiles . > /dev/null 1,95s user 0,98s system 9% cpu 29,363 total


 
Old 01-17-2012, 01:30 PM
Johan Henriksson
 
Bug#656142: ITP: duff -- Duplicate file finder

> > Ah, right. So you'll start writing yet another tool?
>
> I've implemented pretty much that (http://liw.fi/dupfiles), but my
> duplicate file finder is not so much better than existing ones in
> Debian that I would inflict it on Debian. But the algorithm works
> nicely, and works even for people who research hash collisions.



Since we're on this topic: a while ago I needed a duplicate finder that could
identify not only identical files but also identical folders, so I started
writing one. It has a bunch of algorithms in there and can do approximate
searches as well (for performance). But does anyone know if there is a tool
like this already, or do I have a reason to continue developing it?


/Johan

--
-----------------------------------------------------------
Johan Henriksson
PhD student, Karolinska Institutet
http://mahogny.areta.org http://www.endrov.net
 
Old 01-17-2012, 09:54 PM
Andy Smith
 
Bug#656142: ITP: duff -- Duplicate file finder

Hello,

On Tue, Jan 17, 2012 at 09:12:58AM +0000, Lars Wirzenius wrote:
> rdfind seems to be quickest one, but duff compares well with hardlink,
> which (see http://liw.fi/dupfiles/) was the fastest one I knew of in
> Debian so far.

Does anyone know of a duplicate file finder that can keep its
database of seen files in an on-disk database instead of RAM? When
looking for duplicates in a tree of hundreds of millions of files
this can otherwise require quite a lot of RAM.

Perhaps it can be worked around using lots of swap, but I would have
thought that could lead to other processes getting swapped out, whereas
I would rather the duplicate finder itself just got slower.

Cheers,
Andy
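
(One way such a tool could keep its index out of RAM, sketched in Python
with an on-disk SQLite table. This is hypothetical, not a description of any
existing duplicate finder; a real tool would also hash only same-size
candidates rather than every file, and the database name is made up.)

    import hashlib
    import os
    import sqlite3

    def index_tree(root, db_path='dupindex.sqlite'):
        # Store (size, digest, path) rows on disk so memory use stays flat
        # even when the tree holds hundreds of millions of files.
        db = sqlite3.connect(db_path)
        db.execute('CREATE TABLE IF NOT EXISTS files (size INTEGER, digest TEXT, path TEXT)')
        db.execute('CREATE INDEX IF NOT EXISTS by_key ON files (size, digest)')
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                if not os.path.isfile(path):
                    continue
                h = hashlib.sha256()
                with open(path, 'rb') as f:
                    for chunk in iter(lambda: f.read(1 << 20), b''):
                        h.update(chunk)
                db.execute('INSERT INTO files VALUES (?, ?, ?)',
                           (os.path.getsize(path), h.hexdigest(), path))
        db.commit()
        # Duplicate groups become a single query instead of an in-memory dict.
        return db.execute('SELECT size, digest, group_concat(path) FROM files '
                          'GROUP BY size, digest HAVING count(*) > 1').fetchall()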
 
Old 01-20-2012, 01:49 AM
Kamal Mostafa
 
Bug#656142: ITP: duff -- Duplicate file finder

> On Mon, Jan 16, 2012 at 12:58:13PM -0800, Kamal Mostafa wrote:
> > * Package name : duff
> > * URL : http://duff.sourceforge.net/

On Tue, 2012-01-17 at 09:56 +0100, Simon Josefsson wrote:
> If there aren't warnings about use of SHA1 in the tool, there should
> be. While I don't recall any published SHA1 collisions, SHA1 is
> considered broken and shouldn't be used if you want to trust your
> comparisons. I'm assuming the tool supports SHA256 and other SHA2
> hashes as well? It might be useful to make sure the defaults are
> non-SHA1.

Duff supports SHA1, SHA256, SHA384 and SHA512 hashes. The default is
SHA1. For comparison, rdfind supports only MD5 and SHA1 hashes. Thanks
for the note, Simon -- I'll bring it to the attention of the upstream
author, Camilla Berglund.

On Tue, 2012-01-17 at 09:12 +0000, Lars Wirzenius wrote:
> rdfind seems to be quickest one, but duff compares well with hardlink,
> which (see http://liw.fi/dupfiles/) was the fastest one I knew of in
> Debian so far.
>
> This was done using my benchmark-cmd utility in my extrautils
> collection (not in Debian): http://liw.fi/extrautils/ for source.

Thanks for the pointer to your benchmark-cmd tool, Lars. Very handy!
My results with it mirrored yours -- of the similar tools, duff appears
to lag only rdfind in performance (for my particular dataset, at least).

I looked into duff's methods a bit and discovered a few easy performance
optimizations that may speed it up a bit more. The author is reviewing
my proposed patch now, and seems very open to collaboration.

> Personally, I would be wary of using checksums for file comparisons,
> since comparing files byte-by-byte isn't slow (you only need to
> do it to files that are identical in size, and you need to read
> all the files anyway).

Byte-by-byte might well be slower than checksums, if you end up faced
with N>2 very large (uncacheable) files of identical size but unique
contents. They all need to be checked against each other, so each of the
N files would need to be read N-1 times. Anyway, duff actually *does*
offer byte-by-byte comparison as an option (rdfind does not).
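
(To put rough numbers on that worst case, a quick back-of-the-envelope
calculation in Python; N and the file size are made-up values.)

    # Assumed worst case: N same-size files, all with unique contents,
    # too large to stay in the page cache, compared pair-at-a-time.
    N = 100      # hypothetical number of same-size files
    gib = 4      # hypothetical size of each file in GiB

    pairwise_gib = N * (N - 1) * gib   # each file re-read N-1 times: 39600 GiB
    checksum_gib = N * gib             # each file read once to hash it: 400 GiB
    print(pairwise_gib, checksum_gib)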

> I also think we've now got enough of duplicate file finders in
> Debian that it's time to consider whether we need so many. It's
> too bad they all have incompatible command line syntaxes, or it
> would be possible to drop some. (We should accept a new one if
> it is better than the existing ones, of course. Evidence required.)

To me, the premise that a new package must be better than existing
similar ones ("evidence required", no less) seems pretty questionable.
It may not be so easy to establish just what "better than" means, and it
puts us in a position of making value judgments for our users that they
should be able to make for themselves.

While I do think it is productive to compare performance of these
similar tools to each other, I don't see much value in pitting them
against each other in benchmark wars as criteria of acceptance into
Debian.

Here we have a good quality DFSG-compliant package with an active
upstream and a willing DD maintainer. While similar tools do exist
already in Debian, they do not offer identical feature sets or user
interfaces, and only one of them has been shown to outperform duff in
quick spot checks. Some users have expressed a preference for duff over
the others. Does that make it "better than the existing ones"? My
answer: Who cares? Nobody is making us choose only one.

In my view, it's not really a problem if we carry multiple duplicate file
detectors in Debian; we will best serve our users by letting them choose
their preferred tool for the job. And by allowing such packages into
Debian we encourage their improvement, to everyone's benefit.

-Kamal
 
