Package: wnpp
Severity: wishlist
Owner: Kamal Mostafa <kamal@whence.com>
* Package name : duff
Version : 0.5
Upstream Author : Camilla Berglund <elmindreda@elmindreda.org>
* URL : http://duff.sourceforge.net/
* License : Zlib
Programming Lang: C
Description : Duplicate file finder
Duff is a command-line utility for identifying duplicates in a given set of
files. It attempts to be usably fast and uses the SHA family of message
digests as a part of the comparisons.
--
To UNSUBSCRIBE, email to debian-devel-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/20120116205813.24274.12515.reportbug@localhost6.localdomain6
01-16-2012, 08:03 PM
Samuel Thibault
Bug#656142: ITP: duff -- Duplicate file finder
Kamal Mostafa wrote on Mon 16 Jan 2012 12:58:13 -0800:
> Package: wnpp
> Severity: wishlist
> Owner: Kamal Mostafa <kamal@whence.com>
>
>
> * Package name : duff
> Version : 0.5
> Upstream Author : Camilla Berglund <elmindreda@elmindreda.org>
> * URL : http://duff.sourceforge.net/
> * License : Zlib
> Programming Lang: C
> Description : Duplicate file finder
>
> Duff is a command-line utility for identifying duplicates in a given set of
> files. It attempts to be usably fast and uses the SHA family of message
> digests as a part of the comparisons.
What is the benefit over fdupes, rdfind, ...?
Samuel
Archive: http://lists.debian.org/20120116210316.GS4158@type.famille.thibault.fr
01-16-2012, 08:32 PM
Axel Beckert
Bug#656142: ITP: duff -- Duplicate file finder
Hi,
Samuel Thibault wrote:
> > * Package name : duff
> > Version : 0.5
> > Upstream Author : Camilla Berglund <elmindreda@elmindreda.org>
> > * URL : http://duff.sourceforge.net/
> > * License : Zlib
> > Programming Lang: C
> > Description : Duplicate file finder
> >
> > Duff is a command-line utility for identifying duplicates in a given set of
> > files. It attempts to be usably fast and uses the SHA family of message
> > digests as a part of the comparisons.
>
> What is the benefit over fdupes, rdfind, ...?
..., hardlink, ...
Some of my coworkers also prefer duff over the tools available in Debian.
I'm no longer sure why, but speed may have been one argument, because
they ran it over several TB of data. I'll check what exactly the reason
was back then.
Was thinking about packaging it myself already, so I may also sponsor
Kamal's package when it's ready.
Archive: http://lists.debian.org/20120116213236.GY2744@sym.noone.org
01-16-2012, 09:07 PM
Joerg Jaspert
Bug#656142: ITP: duff -- Duplicate file finder
>> What is the benefit over fdupes, rdfind, ...?
> ..., hardlink, ...
finddup from perforate
> Was thinking about packaging it myself already, so I may also sponsor
> Kamal's package when it's ready.
You just listed the third duplicate (and me no. 4), and still go blind
right on "ohoh, i sponsor it". Why? I hope it's conditional on it being
vastly better than any of the others (speed, functionality, ...) and not
just "because".
Contrary to a common belief, Debian is not the dump for NIH, and
even if a little redundancy can't hurt, too much is just waste. Of our
time, of our mirrors (space and bandwidth), ...
--
bye, Joerg
Contrary to common belief, Arch:i386 is *not* the same as Arch: any.
Archive: http://lists.debian.org/87hazvb8sg.fsf@gkar.ganneff.de
01-16-2012, 10:49 PM
Kamal Mostafa
Bug#656142: ITP: duff -- Duplicate file finder
On Mon, 2012-01-16 at 23:07 +0100, Joerg Jaspert wrote:
> >> What is the benefit over fdupes, rdfind, ...?
> > ..., hardlink, ...
> finddup from perforate
After a quick evaluation of the various "find dupe files" tools, I was
attracted to try duff because:
1. It looked easier to use than the others.
2. This quote from its website[1] was exactly what I was looking for:
"Note that duff itself never modifies any files, but it's designed to
play nice with tools that do." The other dupe cleaner utilities left me
worried that they might trash something important if I got my command
line options wrong or forgot a --dry-run flag.
> > Was thinking about packaging it myself already, so I may also sponsor
> > Kamal's package when it's ready.
Thanks Axel, but I'm a DD myself, so won't need a sponsor.
> You just listed the third duplicate (and me no. 4), and still go blind
> right on "ohoh, i sponsor it". Why? I hope it's conditional on it being
> vastly better than any of the others (speed, functionality, ...)
In my humble opinion, that would be an unreasonable pre-condition for
inclusion in Debian. Our standard for inclusion should not be that a
new package must be "vastly better" than other similar packages. That
would deny a new package the opportunity to build a user base and
possibly someday evolve to become the "vastly better" alternative
itself.
-Kamal
kamal@whence.com
kamal@debian.org
[1] http://duff.sourceforge.net/
01-17-2012, 06:42 AM
martin f krafft
Bug#656142: ITP: duff -- Duplicate file finder
Kamal Mostafa <kamal@debian.org> wrote [2012.01.17.0049 +0100]:
> In my humble opinion, that would be an unreasonable pre-condition for
> inclusion in Debian. Our standard for inclusion should not be that a
> new package must be "vastly better" than other similar packages. That
> would deny a new package the opportunity to build a user base and
> possibly someday evolve to become the "vastly better" alternative
> itself.
Right, but I'd say it needs to be better and the maintainer needs to
be able to argue how it is better.
--
.'`. martin f. krafft <madduck@d.o> Related projects:
: :' : proud Debian developer http://debiansystem.info
`. `'` http://people.debian.org/~madduck http://vcs-pkg.org
`- Debian - when you have better things to do than fixing systems
"the time for petty politics is over.
the very next century
will bring the struggle for dominion over the earth."
- friedrich nietzsche
01-17-2012, 08:12 AM
Lars Wirzenius
Bug#656142: ITP: duff -- Duplicate file finder
On Mon, Jan 16, 2012 at 12:58:13PM -0800, Kamal Mostafa wrote:
> * Package name : duff
> * URL : http://duff.sourceforge.net/
A quick speed comparison:
real  user  system  max RSS  elapsed  cmd
 (s)   (s)     (s)    (KiB)      (s)
 3.2   2.4     5.8    62784      5.8  hardlink --dry-run files > /dev/null
 1.1   0.4     1.6    15424      1.6  rdfind files > /dev/null
 1.9   0.2     2.2     9904      2.2  duff-0.5/src/duff -r files > /dev/null
rdfind seems to be the quickest one, but duff compares well with hardlink,
which (see http://liw.fi/dupfiles/) was the fastest one I knew of in
Debian so far.
This was done using my benchmark-cmd utility in my extrautils
collection (not in Debian): http://liw.fi/extrautils/ for source.
The exact command to generate the above table:
Personally, I would be wary of using checksums for file comparisons,
since comparing files byte-by-byte isn't slow (you only need to
do it to files that are identical in size, and you need to read
all the files anyway).
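
A minimal sketch of the approach described above, for illustration only: bucket files by size, then compare contents directly within each bucket. It is not taken from duff or any of the other tools in this thread; the function name find_duplicates_bytewise is hypothetical, and the standard library's filecmp module stands in for whatever comparison a real tool would use.

```python
import filecmp
import os
from collections import defaultdict


def find_duplicates_bytewise(paths):
    """Return lists of paths whose contents are byte-for-byte identical."""
    # Step 1: bucket by size; files of different sizes cannot be duplicates.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    duplicate_groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a file with a unique size never needs to be read
        # Step 2: within a bucket, compare each file against one
        # representative of every group found so far.
        groups = []
        for path in same_size:
            for group in groups:
                # shallow=False forces a comparison of the file contents
                if filecmp.cmp(group[0], path, shallow=False):
                    group.append(path)
                    break
            else:
                groups.append([path])
        duplicate_groups.extend(g for g in groups if len(g) > 1)
    return duplicate_groups
```

Note that this simple version may read a file several times when a size bucket contains many distinct contents, which is the case the rest of the thread goes on to discuss.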
I also think we've now got enough duplicate file finders in
Debian that it's time to consider whether we need so many. It's
too bad they all have incompatible command line syntaxes, or it
would be possible to drop some. (We should accept a new one if
it is better than the existing ones, of course. Evidence required.)
Lars Wirzenius wrote on Tue 17 Jan 2012 09:12:58 +0000:
> real  user  system  max RSS  elapsed  cmd
>  (s)   (s)     (s)    (KiB)      (s)
>  3.2   2.4     5.8    62784      5.8  hardlink --dry-run files > /dev/null
>  1.1   0.4     1.6    15424      1.6  rdfind files > /dev/null
>  1.9   0.2     2.2     9904      2.2  duff-0.5/src/duff -r files > /dev/null
And fdupes on the same set of files?
> Personally, I would be wary of using checksums for file comparisons,
> since comparing files byte-by-byte isn't slow (you only need to
> do it to files that are identical in size, and you need to read
> all the files anyway).
In some cases you may have a lot of files with identical size, so at
least a simple SSE-prone thing like crc is useful.
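
As a rough illustration of the cheap-checksum idea, the sketch below splits files of one size into buckets by CRC-32, using zlib.crc32 from the Python standard library. The function names are hypothetical, and whether a given CRC implementation actually uses SSE instructions is not something this sketch shows or depends on.

```python
import zlib
from collections import defaultdict


def file_crc32(path, block_size=1 << 20):
    """Compute the CRC-32 of a file, reading it in fixed-size blocks."""
    crc = 0
    with open(path, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                break
            # Passing the running value continues the checksum across blocks.
            crc = zlib.crc32(block, crc)
    return crc


def candidates_by_crc(same_size_paths):
    """Split files of one size into buckets that share a CRC-32 value."""
    buckets = defaultdict(list)
    for path in same_size_paths:
        buckets[file_crc32(path)].append(path)
    # Only buckets with two or more members can contain duplicates. A CRC
    # match is not proof of equality, so candidates still need a byte
    # comparison before being reported as duplicates.
    return [group for group in buckets.values() if len(group) > 1]
```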
Samuel
Archive: http://lists.debian.org/20120117093020.GA4320@type.bordeaux.inria.fr
01-17-2012, 09:45 AM
Lars Wirzenius
Bug#656142: ITP: duff -- Duplicate file finder
On Tue, Jan 17, 2012 at 10:30:20AM +0100, Samuel Thibault wrote:
> Lars Wirzenius wrote on Tue 17 Jan 2012 09:12:58 +0000:
> > real  user  system  max RSS  elapsed  cmd
> >  (s)   (s)     (s)    (KiB)      (s)
> >  3.2   2.4     5.8    62784      5.8  hardlink --dry-run files > /dev/null
> >  1.1   0.4     1.6    15424      1.6  rdfind files > /dev/null
> >  1.9   0.2     2.2     9904      2.2  duff-0.5/src/duff -r files > /dev/null
>
> And fdupes on the same set of files?
real  user  system  max RSS  elapsed  cmd
 (s)   (s)     (s)    (KiB)      (s)
 3.1   2.4     5.5    62784      5.5  hardlink --dry-run files > /dev/null
 1.1   0.4     1.6    15392      1.6  rdfind files > /dev/null
 1.3   0.9     2.2    13936      2.2  fdupes -r -q files > /dev/null
 1.9   0.2     2.1     9904      2.1  duff-0.5/src/duff -r files > /dev/null
Someone should run the benchmark on a large set of data, preferably
on various kinds of real data, rather than my small synthetic data set.
(I have, alas, neither the time nor the hardware to do that.)
> > Personally, I would be wary of using checksums for file comparisons,
> > since comparing files byte-by-byte isn't slow (you only need to
> > do it to files that are identical in size, and you need to read
> > all the files anyway).
>
> In some cases you may have a lot of files with identical size, so at
> least a simple SSE-prone thing like crc is useful.
That's a good point. However, the pathological case would need to
be quite pathological, since you can check around a thousand files
of the same size at the same time (i.e., the number of open files
per process), which is fairly rare for most people. But not all
people, of course.
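
One way to read the point above is as a lockstep comparison: open every file of a given size at once, read one block from each, and split the set whenever blocks differ. The sketch below is only an illustration of that idea under this reading, not code from any of the tools discussed; group_identical is a hypothetical name, and the number of paths it can handle in one call is bounded by the per-process open-file limit.

```python
from contextlib import ExitStack


def group_identical(paths, block_size=65536):
    """Partition same-size files into groups with identical contents.

    Every file is opened up front and read at most once, so the number of
    paths is limited by how many files the process may have open.
    """
    finished = []
    with ExitStack() as stack:
        opened = [(path, stack.enter_context(open(path, "rb")))
                  for path in paths]
        pending = [opened]  # groups whose members are equal so far
        while pending:
            group = pending.pop()
            if len(group) < 2:
                continue  # a file alone in its group has no duplicate
            # Read the next block from each member and re-partition.
            by_block = {}
            for path, handle in group:
                block = handle.read(block_size)
                by_block.setdefault(block, []).append((path, handle))
            for block, members in by_block.items():
                if block == b"":
                    # End of file reached with every block equal.
                    if len(members) > 1:
                        finished.append([path for path, _ in members])
                else:
                    pending.append(members)
    return finished
```

With this layout no pair of files is ever compared directly; files that diverge simply end up in different groups after the block where they differ.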
Lars Wirzenius wrote on Tue 17 Jan 2012 10:45:20 +0000:
> > > Personally, I would be wary of using checksums for file comparisons,
> > > since comparing files byte-by-byte isn't slow (you only need to
> > > do it to files that are identical in size, and you need to read
> > > all the files anyway).
> >
> > In some cases you may have a lot of files with identical size, so at
> > least a simple SSE-prone thing like crc is useful.
>
> That's a good point. However, the pathological case would need to
> be quite pathological, since you can check around a thousand files
> of the same size at the same time (i.e., the number of open files
> per process), which is fairly rare for most people. But not all
> people, of course.
I'm not sure I understand exactly what you mean. If you have even
just a hundred files of the same size, you will need ten thousand file
comparisons! Using a hash reduces that to indexing the hundred file
hashes.
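
To make the scaling argument concrete, here is a small sketch. Comparing every file against every other grows quadratically: n files form n * (n - 1) / 2 distinct pairs, which is 4950 for a hundred files, the same order of magnitude as the figure quoted above. Indexing by digest reads each file once. SHA-1 is used here only because the package description mentions the SHA family; the exact digest and both function names are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def pairwise_comparisons(n):
    """Number of distinct file pairs among n files: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def index_by_digest(paths, block_size=1 << 20):
    """Group files by SHA-1 digest, reading each file exactly once."""
    index = defaultdict(list)
    for path in paths:
        digest = hashlib.sha1()
        with open(path, "rb") as handle:
            # iter() with a sentinel stops at the empty read that marks EOF.
            for block in iter(lambda: handle.read(block_size), b""):
                digest.update(block)
        index[digest.hexdigest()].append(path)
    # Only digests shared by two or more files indicate duplicates.
    return [group for group in index.values() if len(group) > 1]


print(pairwise_comparisons(100))  # prints 4950
```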
Samuel
Archive: http://lists.debian.org/20120117110341.GL4320@type.bordeaux.inria.fr