01-16-2012, 07:58 PM
Kamal Mostafa

Bug#656142: ITP: duff -- Duplicate file finder

Package: wnpp
Severity: wishlist
Owner: Kamal Mostafa <kamal@whence.com>


* Package name    : duff
  Version         : 0.5
  Upstream Author : Camilla Berglund <elmindreda@elmindreda.org>
* URL             : http://duff.sourceforge.net/
* License         : Zlib
  Programming Lang: C
  Description     : Duplicate file finder

Duff is a command-line utility for identifying duplicates in a given set of
files. It attempts to be usably fast and uses the SHA family of message
digests as a part of the comparisons.
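
For illustration, a typical invocation might look like this (a sketch:
the directory and file names are placeholders, and the cluster-header
wording follows duff's default output format, so details may differ):

  $ duff -r photos/
  2 files in cluster 1 (43857 bytes, digest 43b9...)
  photos/img_0101.jpg
  photos/backup/img_0101.jpg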



 
01-16-2012, 08:03 PM
Samuel Thibault

Bug#656142: ITP: duff -- Duplicate file finder

Kamal Mostafa, on Mon 16 Jan 2012 12:58:13 -0800, wrote:
> Package: wnpp
> Severity: wishlist
> Owner: Kamal Mostafa <kamal@whence.com>
>
>
> * Package name    : duff
>   Version         : 0.5
>   Upstream Author : Camilla Berglund <elmindreda@elmindreda.org>
> * URL             : http://duff.sourceforge.net/
> * License         : Zlib
>   Programming Lang: C
>   Description     : Duplicate file finder
>
> Duff is a command-line utility for identifying duplicates in a given set of
> files. It attempts to be usably fast and uses the SHA family of message
> digests as a part of the comparisons.

What is the benefit over fdupes, rdfind, ...?

Samuel


 
01-16-2012, 08:32 PM
Axel Beckert

Bug#656142: ITP: duff -- Duplicate file finder

Hi,

Samuel Thibault wrote:
> > * Package name    : duff
> >   Version         : 0.5
> >   Upstream Author : Camilla Berglund <elmindreda@elmindreda.org>
> > * URL             : http://duff.sourceforge.net/
> > * License         : Zlib
> >   Programming Lang: C
> >   Description     : Duplicate file finder
> >
> > Duff is a command-line utility for identifying duplicates in a given set of
> > files. It attempts to be usably fast and uses the SHA family of message
> > digests as a part of the comparisons.
>
> What is the benefit over fdupes, rdfind, ...?

..., hardlink, ...

Some of my coworkers prefer duff over the tools available in Debian,
too. I'm no longer sure why, but speed may have been one argument,
because they ran it over several TB of data. I'll check what exactly
the reason was back then.

I was already thinking about packaging it myself, so I may also sponsor
Kamal's package when it's ready.

Regards, Axel
--
,'`. | Axel Beckert <abe@debian.org>, http://people.debian.org/~abe/
: :' : | Debian Developer, ftp.ch.debian.org Admin
`. `' | 1024D: F067 EA27 26B9 C3FC 1486 202E C09E 1D89 9593 0EDE
`- | 4096R: 2517 B724 C5F6 CA99 5329 6E61 2FF9 CD59 6126 16B5


 
01-16-2012, 09:07 PM
Joerg Jaspert

Bug#656142: ITP: duff -- Duplicate file finder

>> What is the benefit over fdupes, rdfind, ...?
> ..., hardlink, ...

finddup from perforate

> I was already thinking about packaging it myself, so I may also sponsor
> Kamal's package when it's ready.

You just listed the third duplicate (and I listed no. 4), and still you go
right ahead with "ooh, I'll sponsor it". Why? I hope it's conditional on it
being vastly better than any of the others (speed, functionality, ...) and
not just "because".

Contrary to some common belief, Debian is not a dumping ground for NIH, and
even if a little redundancy can't hurt, too much is just a waste: of our
time, of our mirrors (space and bandwidth), ...

--
bye, Joerg
Contrary to common belief, Arch:i386 is *not* the same as Arch: any.


 
01-16-2012, 10:49 PM
Kamal Mostafa

Bug#656142: ITP: duff -- Duplicate file finder

On Mon, 2012-01-16 at 23:07 +0100, Joerg Jaspert wrote:
> >> What is the benefit over fdupes, rdfind, ...?
> > ..., hardlink, ...
> finddup from perforate

After a quick evaluation of the various "find dupe files" tools, I was
attracted to duff because:

1. It looked easier to use than the others.
2. This quote from its website[1] was exactly what I was looking for:
"Note that duff itself never modifies any files, but it's designed to
play nice with tools that do." The other dupe-cleaner utilities left me
worried that they might trash something important if I got my
command-line options wrong or forgot a --dry-run flag.


> > Was thinking about packaging it myself already, so I may also sponsor
> > Kamal's package when it's ready.

Thanks Axel, but I'm a DD myself, so I won't need a sponsor.


> You just listed the third duplicate (and I listed no. 4), and still you go
> right ahead with "ooh, I'll sponsor it". Why? I hope it's conditional on it
> being vastly better than any of the others (speed, functionality, ...)

In my humble opinion, that would be an unreasonable pre-condition for
inclusion in Debian. Our standard for inclusion should not be that a
new package must be "vastly better" than other similar packages. That
would deny a new package the opportunity to build a user base and
possibly someday evolve to become the "vastly better" alternative
itself.

-Kamal

kamal@whence.com
kamal@debian.org

[1] http://duff.sourceforge.net/
 
01-17-2012, 06:42 AM
martin f krafft

Bug#656142: ITP: duff -- Duplicate file finder

thus spake Kamal Mostafa <kamal@debian.org> [2012.01.17.0049 +0100]:
> In my humble opinion, that would be an unreasonable pre-condition for
> inclusion in Debian. Our standard for inclusion should not be that a
> new package must be "vastly better" than other similar packages. That
> would deny a new package the opportunity to build a user base and
> possibly someday evolve to become the "vastly better" alternative
> itself.

Right, but I'd say it needs to be better and the maintainer needs to
be able to argue how it is better.

--
.'`. martin f. krafft <madduck@d.o> Related projects:
: :' : proud Debian developer http://debiansystem.info
`. `'` http://people.debian.org/~madduck http://vcs-pkg.org
`- Debian - when you have better things to do than fixing systems

"die zeit für kleine politik ist vorbei.
schon das nächste jahrhundert
bringt den kampf um die erdherrschaft."
- friedrich nietzsche
 
01-17-2012, 08:12 AM
Lars Wirzenius

Bug#656142: ITP: duff -- Duplicate file finder

On Mon, Jan 16, 2012 at 12:58:13PM -0800, Kamal Mostafa wrote:
> * Package name    : duff
> * URL             : http://duff.sourceforge.net/

A quick speed comparison:

real  user  system  max RSS  elapsed  cmd
 (s)   (s)     (s)    (KiB)      (s)
 3.2   2.4     5.8    62784      5.8   hardlink --dry-run files > /dev/null
 1.1   0.4     1.6    15424      1.6   rdfind files > /dev/null
 1.9   0.2     2.2     9904      2.2   duff-0.5/src/duff -r files > /dev/null

rdfind seems to be the quickest one, but duff compares well with hardlink,
which (see http://liw.fi/dupfiles/) was the fastest one I knew of in
Debian so far.

This was done using my benchmark-cmd utility from my extrautils
collection (not in Debian); see http://liw.fi/extrautils/ for the source.
The exact command to generate the above table:

benchmark-cmd \
    --setup='genbackupdata --create=100m files' \
    --setup='cp -a files/0 files/copy' \
    --cleanup='rm -rf files' \
    --verbose \
    --command='hardlink --dry-run files > /dev/null' \
    --command='rdfind files > /dev/null' \
    --command='duff-0.5/src/duff -r files > /dev/null'

Personally, I would be wary of using checksums for file comparisons,
since comparing files byte-by-byte isn't slow (you only need to
do it to files that are identical in size, and you need to read
all the files anyway).
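
A minimal sketch of that size-first strategy (my illustration, not from
the thread; it assumes GNU find and file names without whitespace):

  # Files can only be identical if their sizes match, so restrict the
  # byte-by-byte work (cmp) to groups of same-size candidates.
  find files -type f -printf '%s\n' | sort -n | uniq -d |
  while read -r size; do
      set -- $(find files -type f -size "${size}c")   # same-size candidates
      while [ $# -gt 1 ]; do                          # compare each pair
          first=$1; shift
          for other in "$@"; do
              cmp -s "$first" "$other" && echo "$first == $other"
          done
      done
  done

Note that the inner loop compares candidates pairwise, which is exactly
the quadratic cost discussed further down the thread.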

I also think we've now got enough duplicate file finders in
Debian that it's time to consider whether we need so many. It's
too bad they all have incompatible command-line syntaxes, or it
would be possible to drop some. (We should accept a new one if
it is better than the existing ones, of course. Evidence required.)

--
Freedom-based blog/wiki/web hosting: http://www.branchable.com/
 
01-17-2012, 08:30 AM
Samuel Thibault

Bug#656142: ITP: duff -- Duplicate file finder

Lars Wirzenius, on Tue 17 Jan 2012 09:12:58 +0000, wrote:
> real  user  system  max RSS  elapsed  cmd
>  (s)   (s)     (s)    (KiB)      (s)
>  3.2   2.4     5.8    62784      5.8   hardlink --dry-run files > /dev/null
>  1.1   0.4     1.6    15424      1.6   rdfind files > /dev/null
>  1.9   0.2     2.2     9904      2.2   duff-0.5/src/duff -r files > /dev/null

And fdupes on the same set of files?

> Personally, I would be wary of using checksums for file comparisons,
> since comparing files byte-by-byte isn't slow (you only need to
> do it to files that are identical in size, and you need to read
> all the files anyway).

In some cases you may have a lot of files of identical size, so at
least a simple, SSE-friendly check like a CRC is useful.
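
One way to picture that (my sketch, not from the thread; POSIX cksum
prints a 32-bit CRC, and file names without whitespace are assumed):

  # A CRC mismatch proves two files differ, so only files agreeing on
  # both CRC and size stay candidates for a full byte-by-byte check.
  find files -type f -exec cksum {} + |    # prints "CRC size pathname"
      awk 'seen[$1 FS $2]++ { print $3 }'  # paths repeating a CRC+size pair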

Samuel


 
01-17-2012, 09:45 AM
Lars Wirzenius

Bug#656142: ITP: duff -- Duplicate file finder

On Tue, Jan 17, 2012 at 10:30:20AM +0100, Samuel Thibault wrote:
> Lars Wirzenius, on Tue 17 Jan 2012 09:12:58 +0000, wrote:
> > real  user  system  max RSS  elapsed  cmd
> >  (s)   (s)     (s)    (KiB)      (s)
> >  3.2   2.4     5.8    62784      5.8   hardlink --dry-run files > /dev/null
> >  1.1   0.4     1.6    15424      1.6   rdfind files > /dev/null
> >  1.9   0.2     2.2     9904      2.2   duff-0.5/src/duff -r files > /dev/null
>
> And fdupes on the same set of files?

real  user  system  max RSS  elapsed  cmd
 (s)   (s)     (s)    (KiB)      (s)
 3.1   2.4     5.5    62784      5.5   hardlink --dry-run files > /dev/null
 1.1   0.4     1.6    15392      1.6   rdfind files > /dev/null
 1.3   0.9     2.2    13936      2.2   fdupes -r -q files > /dev/null
 1.9   0.2     2.1     9904      2.1   duff-0.5/src/duff -r files > /dev/null

Someone should run the benchmark on a large set of data, preferably
on various kinds of real data, rather than my small synthetic data set.
(I have, alas, neither the time nor the hardware to do that.)

> > Personally, I would be wary of using checksums for file comparisons,
> > since comparing files byte-by-byte isn't slow (you only need to
> > do it to files that are identical in size, and you need to read
> > all the files anyway).
>
> In some cases you may have a lot of files of identical size, so at
> least a simple, SSE-friendly check like a CRC is useful.

That's a good point. However, the pathological case would need to
be quite pathological, since you can check around a thousand files
of the same size at the same time (i.e., the number of open files
allowed per process), and having more same-size files than that is
fairly rare for most people. But not all people, of course.

--
Freedom-based blog/wiki/web hosting: http://www.branchable.com/
 
01-17-2012, 10:03 AM
Samuel Thibault

Bug#656142: ITP: duff -- Duplicate file finder

Lars Wirzenius, on Tue 17 Jan 2012 10:45:20 +0000, wrote:
> > > Personally, I would be wary of using checksums for file comparisons,
> > > since comparing files byte-by-byte isn't slow (you only need to
> > > do it to files that are identical in size, and you need to read
> > > all the files anyway).
> >
> > In some cases you may have a lot of files of identical size, so at
> > least a simple, SSE-friendly check like a CRC is useful.
>
> That's a good point. However, the pathological case would need to
> be quite pathological, since you can check around a thousand files
> of the same size at the same time (i.e., the number of open files
> allowed per process), and having more same-size files than that is
> fairly rare for most people. But not all people, of course.

I'm not sure I understand what you mean exactly. If you have even
just a hundred files of the same size, you will need on the order of
ten thousand pairwise file comparisons! Using a hash reduces that to
indexing the hundred file hashes.
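
For concreteness (my arithmetic, not from the thread): with n files of
identical size,

  pairwise comparison:  n(n-1)/2 checks; for n = 100 that is 4950,
                        or 10000 counting ordered pairs
  hashing:              n digest computations plus O(n) hash-table
                        lookups, i.e. 100 digests for the same case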

Samuel


 
