/usr/share/doc/ files and gzip/xz/no compression

08-15-2011, 05:33 PM
Lars Wirzenius

On Mon, Aug 15, 2011 at 05:16:55PM +0900, Charles Plessy wrote:
> On Mon, Aug 15, 2011 at 01:48:50AM +0200, Adam Borowski wrote:
> >
> > * A year ago, I repacked CD1, and .xz took 66% of the space needed by .gz.
> > This time, on the whole archive, the gains are somewhat smaller: 72%. I
> > guess that CD1 is code-heavy while packages of lower priorities tend to
> > have more data.
>
> Also, many files in /usr/share/doc are gzipped as per Policy §12.3; that can
> prevent packages from getting the full benefit of xz compression. In some of
> my packages containing mostly such files, the benefit of switching to xz is
> almost null. I wonder if it still makes sense to compress these files by
> default:
>
> - Most systems have enough space to keep them uncompressed,
> - other systems just do not install these files,
> - some filesystems are compressed on the fly,
> - the binary packages themselves are compressed.

On the other hand, many computers now have an SSD, chosen for speed,
which is relatively small. Further, most users will likely need files in
/usr/share/doc rarely, if ever, so not compressing them risks wasting
a fair amount of disk space for no particular benefit.

To get some actual numbers, I wrote the attached script. On my laptop
running squeeze, it reports:

Total size of *.gz files in /usr/share/doc: 170542915
Total size of uncompressed *.gz files in /usr/share/doc: 611945610
Total size of *.gz files in /usr/share/doc converted into *.xz: 140588208

That indicates that compressing documentation with xz instead of gz
saves only a little, whereas not compressing at all wastes a lot.
Putting the numbers into a table for easier comparison:

  raw    gz    xz
  584   163   134   file sizes (MiB)
    0   421   450   savings compared to raw (MiB)
 -421     0    29   savings compared to current gz (MiB)

So I would definitely vote for continuing to compress files in
/usr/share/doc. (Note that these numbers cover only files that are
currently *.gz, not all files in /usr/share/doc. See script for
details.)

I'm OK with allowing use of xz for compressing the files.

--
Freedom-based blog/wiki/web hosting: http://www.branchable.com/
#!/bin/sh

set -e

# Total size of the compressed files as they are on disk.
gzsum=$(find /usr/share/doc -type f -name '*.gz' -printf '%s\n' |
        awk '{ s += $1 } END { print s }')
echo "Total size of *.gz files in /usr/share/doc: $gzsum"

# Total size of the same files after decompression.
rawsum=$(find /usr/share/doc -type f -name '*.gz' -exec zcat '{}' + | wc -c)
echo "Total size of uncompressed *.gz files in /usr/share/doc: $rawsum"

# Recompress each file separately with xz and sum the resulting sizes.
# The file name is passed as "$1" so that names with spaces or quotes work.
xzsum=$(find /usr/share/doc -type f -name '*.gz' -print0 |
        xargs -0 -n1 sh -c 'zcat "$1" | xz | wc -c' sh |
        awk '{ s += $1 } END { print s }')
echo "Total size of *.gz files in /usr/share/doc converted into *.xz: $xzsum"
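
The MiB table in the message above follows from the three byte totals the
script printed. As a rough sketch of that arithmetic (to_mib is a
hypothetical helper name, not part of the attached script; it assumes
1 MiB = 1048576 bytes and rounds to the nearest whole MiB):

```shell
#!/bin/sh
# to_mib BYTES: convert a byte count to whole MiB, rounded to nearest.
# Hypothetical helper, used only to illustrate the table arithmetic.
to_mib() {
    awk -v b="$1" 'BEGIN { printf "%d\n", b / 1048576 + 0.5 }'
}

# Byte totals as reported in the message.
gz=$(to_mib 170542915)
raw=$(to_mib 611945610)
xz=$(to_mib 140588208)

printf '%5s %5s %5s\n' raw gz xz
printf '%5d %5d %5d  file sizes (MiB)\n' "$raw" "$gz" "$xz"
printf '%5d %5d %5d  savings compared to raw (MiB)\n' 0 $((raw - gz)) $((raw - xz))
printf '%5d %5d %5d  savings compared to current gz (MiB)\n' $((gz - raw)) 0 $((gz - xz))
```

The totals differ between machines, so the figures only reproduce the table
for the byte counts quoted in this thread.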
 
08-15-2011, 05:41 PM
Andreas Barth

* Lars Wirzenius (liw@liw.fi) [110815 19:36]:
> 584 163 134 file sizes (MiB)

Thanks for comparing these numbers. That tells me that, at least in the
average case, we can just continue with gz and not care much about the
relatively small difference to xz.


Andi


--
To UNSUBSCRIBE, email to debian-devel-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/20110815174146.GC15003@mails.so.argh.org
 
08-15-2011, 07:15 PM
Lucas Nussbaum

On 15/08/11 at 19:41 +0200, Andreas Barth wrote:
> * Lars Wirzenius (liw@liw.fi) [110815 19:36]:
> > 584 163 134 file sizes (MiB)
>
> Thanks for comparing these numbers. That tells me that, at least in the
> average case, we can just continue with gz and not care much about the
> relatively small difference to xz.

I wouldn't call -20% a relatively small difference.

The question is: are there downsides to switching to xz for
/usr/share/doc?

For example, in the past, some PDF readers could not directly open
.pdf.gz files, so one had to uncompress the file manually first, which
was pretty annoying. I just checked, and evince, okular and xpdf can
open .pdf.xz files. But maybe other tools have similar issues.

- Lucas


--
Archive: http://lists.debian.org/20110815191525.GA1553@xanadu.blop.info
 
08-15-2011, 09:04 PM
Carsten Hey

* Lars Wirzenius [2011-08-15 18:33 +0100]:
>   raw    gz    xz
>   584   163   134   file sizes (MiB)
>     0   421   450   savings compared to raw (MiB)
>  -421     0    29   savings compared to current gz (MiB)

Years ago I compared sizes of compressed files in /usr/share/doc using
different compression methods too, possibly restricting to specific file
types (for example changelog and copyright).

> I'm OK with allowing use of xz for compressing the files.

IIRC bzip2 compressed better. Compressing dpkg's changelog on
stable seems to confirm this:

$ zcat /usr/share/doc/dpkg/changelog.gz | bzip2 | wc -c
145586
$ zcat /usr/share/doc/dpkg/changelog.gz | xz | wc -c
167844


Regards
Carsten


--
Archive: http://lists.debian.org/20110815210451.GA25791@furrball.stateful.de
 
08-15-2011, 09:25 PM
Lars Wirzenius

On Mon, Aug 15, 2011 at 11:04:51PM +0200, Carsten Hey wrote:
> * Lars Wirzenius [2011-08-15 18:33 +0100]:
> >   raw    gz    xz
> >   584   163   134   file sizes (MiB)
> >     0   421   450   savings compared to raw (MiB)
> >  -421     0    29   savings compared to current gz (MiB)
>
> Years ago I compared sizes of compressed files in /usr/share/doc using
> different compression methods too, possibly restricting to specific file
> types (for example changelog and copyright).
>
> > I'm OK with allowing use of xz for compressing the files.
>
> IIRC bzip2 had a better compression. Compressing dpkg's changelog on

Adding bzip2 support to my script and re-running gives me

Total size of *.gz files in /usr/share/doc converted into *.bz2: 135179633

In other words, it's 129 MiB against xz's 134 MiB. I'll leave it to
others to decide if it's a significant difference.

--
Freedom-based blog/wiki/web hosting: http://www.branchable.com/


--
Archive: http://lists.debian.org/20110815212515.GA24838@havelock.liw.fi
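
The message above mentions adding bzip2 support to the script, but the
modified script was not posted. One way this might be generalised is to pass
the compressor as a parameter. The following is only a sketch under that
assumption: recompressed_total is a hypothetical helper name, and the
compressor is assumed to accept -c and to read standard input, as gzip,
bzip2 and xz do.

```shell
#!/bin/sh
# recompressed_total DIR COMPRESSOR
# Print the total size in bytes that the *.gz files under DIR would have
# if each one were decompressed and recompressed with COMPRESSOR.
# Hypothetical helper, not the script used in the thread.
recompressed_total() {
    find "$1" -type f -name '*.gz' -exec sh -c '
        tool=$1; shift
        for f in "$@"; do
            # One size per file; file names are passed as arguments so
            # that spaces and quotes in names are handled safely.
            gzip -dc "$f" | "$tool" -c | wc -c
        done
    ' sh "$2" {} + |
    awk '{ s += $1 } END { print s + 0 }'
}
```

For example, "recompressed_total /usr/share/doc bzip2" would print a total
comparable to the *.bz2 figure above, although the exact number depends on
the installed packages and tool versions.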
 
08-15-2011, 09:59 PM
Andreas Barth

* Lars Wirzenius (liw@liw.fi) [110815 23:27]:
> On Mon, Aug 15, 2011 at 11:04:51PM +0200, Carsten Hey wrote:
> > * Lars Wirzenius [2011-08-15 18:33 +0100]:
> > > >   raw    gz    xz
> > > >   584   163   134   file sizes (MiB)
> > > >     0   421   450   savings compared to raw (MiB)
> > > >  -421     0    29   savings compared to current gz (MiB)

> In other words, it's 129 MiB against xz's 134 MiB. I'll leave it to
> others to decide if it's a significant difference.

bzip2 is definitely a more conservative choice than xz. If it's
smaller, then it's superior to xz.


Andi


--
Archive: http://lists.debian.org/20110815215907.GD15003@mails.so.argh.org
 
08-15-2011, 10:28 PM
Iustin Pop

On Mon, Aug 15, 2011 at 11:59:07PM +0200, Andreas Barth wrote:
> * Lars Wirzenius (liw@liw.fi) [110815 23:27]:
> > On Mon, Aug 15, 2011 at 11:04:51PM +0200, Carsten Hey wrote:
> > > * Lars Wirzenius [2011-08-15 18:33 +0100]:
> > > > >   raw    gz    xz
> > > > >   584   163   134   file sizes (MiB)
> > > > >     0   421   450   savings compared to raw (MiB)
> > > > >  -421     0    29   savings compared to current gz (MiB)
>
> > In other words, it's 129 MiB against xz's 134 MiB. I'll leave it to
> > others to decide if it's a significant difference.
>
> bzip2 is definitely a more conservative choice than xz. If it's
> smaller, then it's superior to xz.

AFAIK, bzip2 has much worse decompression performance than xz: I have
taken dpkg's changelog, concatenated it to itself 10 times (11MB size),
and:

gzip: 0.377s, down to 2.7MB
gunzip: 0.077s

bzip2: 1.45s, down to 1.8MB
bunzip2: 0.420s

xz: 4.4s(!), down to 204KB(!)
xz -d: 0.035s

So here bzip2 is an order of magnitude slower than xz at decompression.

I've repeated the test on incompressible data (10MB from /dev/urandom), and
the numbers are even worse for bzip2 (compression / decompression):

gzip: 0.410s / 0.060s
bzip2: 2.400s / 0.960s
xz: 4.040s / 0.027s

So while xz is costly for compression, it's faster than even gzip for
decompression. bzip2's (huge!) decompression cost is what kept me
personally from using it seriously before xz appeared.

There is also information on Wikipedia about various compression
benchmarks, but IMHO if we want to switch from gzip then bzip2
doesn't make sense for /usr/share/doc.

regards,
iustin
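
The commands behind the timings above were not posted. A comparable
measurement could be sketched as follows, assuming the tools accept -c and
-dc as gzip, bzip2 and xz do. The helper names now and bench are
hypothetical, and sub-second resolution relies on date supporting %N (GNU
date does; otherwise the sketch falls back to whole seconds).

```shell
#!/bin/sh
# now: print seconds since the epoch, with a fractional part where
# "date +%N" is supported, otherwise whole seconds.
now() {
    t=$(date +%s.%N)
    case $t in
        *[!0-9.]*) date +%s ;;   # %N not supported by this date
        *)         echo "$t" ;;
    esac
}

# bench TOOL FILE: print compression time, decompression time and
# compressed size for FILE using TOOL.
bench() {
    tool=$1
    input=$2
    tmp=$(mktemp)
    t0=$(now)
    "$tool" -c "$input" > "$tmp"
    t1=$(now)
    "$tool" -dc "$tmp" > /dev/null
    t2=$(now)
    size=$(wc -c < "$tmp")
    rm -f "$tmp"
    awk -v tool="$tool" -v a="$t0" -v b="$t1" -v c="$t2" -v s="$size" 'BEGIN {
        printf "%-6s compress %.3fs  decompress %.3fs  size %d bytes\n",
               tool, b - a, c - b, s
    }'
}

# Usage, assuming the tools are installed and "sample" is a test file:
#   for t in gzip bzip2 xz; do bench "$t" sample; done
```

A single run like this is noisy; repeating it and comparing the fastest
times would give more stable numbers.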
 
08-15-2011, 11:02 PM
Carsten Hey

* Andreas Barth [2011-08-15 23:59 +0200]:
> * Lars Wirzenius (liw@liw.fi) [110815 23:27]:
> > On Mon, Aug 15, 2011 at 11:04:51PM +0200, Carsten Hey wrote:
> > > * Lars Wirzenius [2011-08-15 18:33 +0100]:
> > > > >   raw    gz    xz
> > > > >   584   163   134   file sizes (MiB)
> > > > >     0   421   450   savings compared to raw (MiB)
> > > > >  -421     0    29   savings compared to current gz (MiB)
>
> > In other words, it's 129 MiB against xz's 134 MiB. I'll leave it to
> > others to decide if it's a significant difference.
>
> bzip2 is definitely a more conservative choice than xz. If it's
> smaller, then it's superior to xz.

bzip2 has a better compression on average for some filetypes, xz[1] has
a better compression on average for others:

                    gzip     bzip2        xz  bzip2+xz[3]
text files[2]   94312922  73496587  77783076     73496587
other files     16577181  14609893  14275484     14275484
sum            110890103  88106480  92058560     87772071

The "other files" also include a lot of text files. If we compressed
Debian packages instead, xz would presumably win.

Anyway, I don't think this difference of 4 MiB on a desktop system is
significant.


I would prefer to avoid bloating the set of pseudo essential packages
without a good reason and I think users should be able to decompress all
files in /u/s/d. There are plans to let dpkg depend on liblzma2 instead
of xz and it already depends on libbz2-1.0. If dpkg's dependency on
libbz2 is planned to be removed in future, I would prefer to let libbz2
vanish from the pseudo essential set and use xz also for /u/s/d,
otherwise I would prefer using bzip2 over xz for /u/s/d.


Carsten


[1] I did not use -e nor -9, but the difference should not be that big
on files in /usr/share/doc.
[2] find ... -regex '.*\(changelog\|copyright\|README\|TODO\|NEWS\).*[.]gz'
[3] bzip2 for text files and xz for other files. This is of course
nothing we should consider doing.


--
Archive: http://lists.debian.org/20110815230257.GA27976@furrball.stateful.de
 
08-16-2011, 12:43 AM
Russell Coker

On Tue, 16 Aug 2011, Iustin Pop <iustin@debian.org> wrote:
> There is also information on Wikipedia about various compression
> benchmarks, but IMHO if we want to switch from gzip then bzip2
> doesn't make sense for /usr/share/doc.

I'd like to see zless work transparently with bzip and xz compressed files.
There's really no need for three different wrapper programs when the zless
program can just consult the magic db to determine which decompression program
to use.

A switch inevitably involves a period of time where we have a mixture of
compression methods in use. Even after that there will be a variety of old
data (I'm sure that I'm not the only person who has been using gzip for most
things because it's usually good enough and it's a matter of habit). As most
Unix commands aren't consciously typed, having multiple variants of zless
would be a real drag.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


--
Archive: http://lists.debian.org/201108161043.04786.russell@coker.com.au
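
The idea in the message above, picking the decompressor from the file's
contents instead of its name, could look roughly like this. It is a sketch
only: anycat is a hypothetical helper name, and rather than consulting the
magic database it compares the leading bytes directly against the
well-known gzip, bzip2 and xz signatures.

```shell
#!/bin/sh
# anycat FILE: write FILE to standard output, decompressing it first if
# its leading bytes match the gzip, bzip2 or xz signature.
# Hypothetical helper; the names of the files are never inspected.
anycat() {
    # First six bytes as lowercase hex digits without separators.
    magic=$(od -An -tx1 -N6 "$1" | tr -d ' \n')
    case $magic in
        1f8b*)        gzip -dc "$1" ;;    # gzip: 1f 8b
        425a68*)      bzip2 -dc "$1" ;;   # bzip2: "BZh"
        fd377a585a00) xz -dc "$1" ;;      # xz: fd "7zXZ" 00
        *)            cat "$1" ;;         # anything else is passed through
    esac
}
```

With such a helper, "anycat /usr/share/doc/dpkg/changelog.gz | less" would
keep working unchanged if the file were later shipped as .bz2 or .xz.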
 
08-16-2011, 01:31 AM
Ben Hutchings

On Tue, 2011-08-16 at 10:43 +1000, Russell Coker wrote:
> On Tue, 16 Aug 2011, Iustin Pop <iustin@debian.org> wrote:
> > There is also information on Wikipedia about various compression
> > benchmarks, but IMHO if we want to switch from gzip then bzip2
> > doesn't make sense for /usr/share/doc.
>
> I'd like to see zless work transparently with bzip and xz compressed files.
> There's really no need for three different wrapper programs when the zless
> program can just consult the magic db to determine which decompression program
> to use.
[...]

+1

After all, it isn't called gzless.

Ben.
 
