Old 06-21-2008, 06:09 AM
Zenaan Harkness
 
Default git-style file storage for .deb

I've read a few comments along the lines of the following:
http://209.85.141.104/search?q=cache:LkSwhS5wzn0J:madism.org/~madcoder/tmp/git-nopause.pdf+dpkg+git+repository&hl=en&ct=clnk&cd=1 2&gl=au&lr=lang_en&client=firefox-a
"GIT storage is very efficient and optimized. Some numbers:
- xorg-xserver.git, goes back to 2000, is 20MB big. The last orig.tar.gz
is 8MB big, more than 84MB unpacked.
- dpkg.git, whole history since April 1996, generates a git pack of
15MB. The last dpkg release is 17MB big unpacked.
- GNU libc version 2.7 weights 115MB unpacked. The full glibc history
(starts in the eighties) generates a GIT pack of 104MB.
Though, this won’t probably be true for packages with a lot of binary
stuff in it, where delta compression is less likely to produce good
results"

I've had the thought a few times that it could make sense to store a
repo's files in a git hierarchy, rather than in a package pool.

As in, raw files, with package description files which look up the SHA
for each file in the package when a package is installed.
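
A minimal sketch of what such a "package description file" could look like, assuming a flat store directory where every file is saved under the hex SHA-1 of its contents. The function names and the store layout are made up for illustration; this is plain SHA-1 over the file bytes, not git's own object format.

```python
import hashlib
import os
import shutil


def file_sha1(path, chunk_size=1 << 16):
    """Return the hex SHA-1 of a file's contents, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_package(package_root, store_dir):
    """Copy each regular file under package_root into store_dir, named by hash.

    Returns a manifest mapping relative path -> hash. A file whose hash is
    already present in the store is not copied a second time.
    """
    os.makedirs(store_dir, exist_ok=True)
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(package_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Symlinks and special files are skipped to keep the sketch short.
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            sha = file_sha1(full)
            target = os.path.join(store_dir, sha)
            if not os.path.exists(target):
                shutil.copyfile(full, target)
            manifest[os.path.relpath(full, package_root)] = sha
    return manifest
```

Installing a package would then mean walking its manifest and copying each referenced object to the listed path. Permissions, ownership and maintainer scripts are ignored here.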

Points for consideration:
- overlap of identical files (benefit)
- this can work inter-release and inter-distro
(debian ubuntu, even * *)
- different low level storage, and transfer protocols (changes)
- package storage - as git patch perhaps?
- package download
- higher level tools may continue to use lower level tools transparently
- similar for package src storage (already underway with git deb stuff
happening)
- with some extra tools, could provide the ultimate gentoo-envy fix

My primary thought is that repository size might be drastically reduced.
Perhaps some md5sum numbers could be run to test this.

Hope this is not too OT.

Zen

--
Homepage: www.SoulSound.net -- Free Australia: www.UPMART.org
Please respect the confidentiality of this email as sensibly warranted.


--
To UNSUBSCRIBE, email to debian-dpkg-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
 
Old 06-23-2008, 09:25 PM
Phillip Susi
 
Default git-style file storage for .deb

Zenaan Harkness wrote:
> I've had the thought a few times that it could make sense to store a
> repo's files in a git hierarchy, rather than in a package pool.
>
> As in, raw files, with package description files which lookup the SHA
> for each file in the package, when a package is installed.


The package files are compressed tar archives, and because of this, they
are binary files which alter radically between versions, and thus, would
have horrible delta compression.



> My primary thought is that repository size might be drastically reduced.
> Perhaps some md5sum numbers could be run to test this.


It would not be reduced by much and would have a tremendous overhead to
access as a result, which the mirrors could never handle, and it would
break backwards compatibility since http or ftp could no longer be used
to fetch packages.




 
Old 06-24-2008, 04:45 AM
Zenaan Harkness
 
Default git-style file storage for .deb

On Mon, Jun 23, 2008 at 05:25:00PM -0400, Phillip Susi wrote:
> Zenaan Harkness wrote:
>> I've had the thought a few times that it could make sense to store a
>> repo's files in a git hierarchy, rather than in a package pool.
>> As in, raw files, with package description files which lookup the SHA
>> for each file in the package, when a package is installed.
>
> The package files are compressed tar archives, and because of this, they
> are binary files which alter radically between versions, and thus, would
> have horrible delta compression.

Sorry, what I mean is, separate files, not tarred and gzipped.

This way, when a package is upgraded, only those files which changed
would have new git-sha1 ref'ed files. Files that are the same as before
share the same sha ref, are therefore identical, and require no extra
storage.
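
To make the upgrade case concrete, here is a small sketch, assuming each package version is described by a manifest mapping path -> content hash. The paths and the short hash strings below are placeholders, not real values.

```python
def objects_to_fetch(old_manifest, new_manifest):
    """Return the hashes the new version references that the old one lacks."""
    return set(new_manifest.values()) - set(old_manifest.values())


# Hypothetical manifests for two versions of a package "foo".
old = {
    "usr/bin/foo": "aaa1",
    "usr/share/doc/foo/README": "bbb2",
}
new = {
    "usr/bin/foo": "ccc3",                # binary changed
    "usr/share/doc/foo/README": "bbb2",   # unchanged, already stored
    "usr/share/foo/data": "ddd4",         # newly added file
}

print(sorted(objects_to_fetch(old, new)))  # ['ccc3', 'ddd4']
```

Only the two new objects would need storing or downloading; the unchanged README costs nothing extra.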

This is where my thought of inter-distribution shared repositories came
in - eg distributions sharing the same XOrg release, would have a lot of
shared files (ie identical files) which would share sha signatures, and
therefore be the one and the same file in the git repo.

But of course, it would rely on files being stored individually, not
inside .deb or .rpm packages.

I'm wondering whether someone has the technical know-how to do a
comparison, e.g. between a couple of Debian or other distribution
versions - as in, how many files are shared, for example, between
Ubuntu 7.10 and 8.04.
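
One rough way to run such a comparison, assuming both releases have already been unpacked into two directory trees (the unpacking step is not shown). The sketch uses SHA-1 rather than md5sum, and the function names are made up for illustration.

```python
import hashlib
import os


def hash_tree(root):
    """Map content hash -> file size for every regular file under root."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as handle:
                data = handle.read()  # fine for a sketch; chunk large files
            sizes[hashlib.sha1(data).hexdigest()] = len(data)
    return sizes


def shared_stats(root_a, root_b):
    """Count identical file contents between two trees and the bytes saved."""
    a, b = hash_tree(root_a), hash_tree(root_b)
    shared = a.keys() & b.keys()
    separate_bytes = sum(a.values()) + sum(b.values())
    shared_bytes = sum(a[h] for h in shared)
    return {
        "unique_a": len(a),
        "unique_b": len(b),
        "shared": len(shared),
        "separate_bytes": separate_bytes,
        # Size if both trees lived in one content-addressed store.
        "combined_bytes": separate_bytes - shared_bytes,
    }
```

This only counts byte-for-byte identical files. Files that differ slightly count as unrelated, so it gives a lower bound on what delta compression might add.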

There might be more savings to be had for sources, rather than binaries.
And HTTP download of binaries as individual files might have more
overhead than downloading as packages. But it might be quicker, since no
unpacking at other end - does anyone know?

>> My primary thought is that repository size might be drastically reduced.
>> Perhaps some md5sum numbers could be run to test this.
>
> It would not be reduced by much and would have a tremendous overhead to
> access as a result, which the mirrors could never handle, and it would
> break backwards compatibility since http or ftp could no longer be used to
> fetch packages.

With an appropriate git config/setup, I'm pretty sure http and ftp
access is just fine, (my knowledge is relatively limited on this subject
though).

--
Homepage: www.SoulSound.net -- Free Australia: www.UPMART.org
Please respect the confidentiality of this email as sensibly warranted.


 
Old 06-24-2008, 09:29 AM
Goswin von Brederlow
 
Default git-style file storage for .deb

Zenaan Harkness <zen@freedbms.net> writes:

> I'm wondering whether someone has the technical know-how to do a
> comparison, eg. between a couple of Debian or other distribution
> versions - as it, how many files are shared for example between say
> Ubuntu 7.10 and 8.04.

There would be no savings for most architectures, and it is doubtful
that there is much for binaries, as they depend on the gcc used to build
them.

So you are left with shared files, which are usually small. Larger
shared data is also in arch:all packages, which are a minor chunk
compared to the archs added together.

Furthermore, files are in different locations in different
distributions, making it hard to delta the right files against each
other.

So percentage-wise, and for Debian, you can't expect much savings.


But look at a different scenario: sid updates.

Some sid packages are built frequently, meaning the same compiler is
used. Libraries also don't change so much. That should give you a much
better delta compared to other distributions. Also, arch:all packages
are rebuilt for every source upload but are probably nearly identical.

So if you want to pursue this then rather look into that than across
distributions.

MfG
Goswin


 
Old 06-25-2008, 08:04 PM
Phillip Susi
 
Default git-style file storage for .deb

Zenaan Harkness wrote:
> Sorry, what I mean is, separate files, not tarred and gzipped.
>
> This way, when a package is upgraded, only those files which changed
> would have new git-sha1 ref'ed files, files that are the same as before,
> share the same sha ref and are therefore identical and require no extra
> storage.
>
> This is where my thought of inter-distribution shared repositories came
> in - eg distributions sharing the same XOrg release, would have a lot of
> shared files (ie identical files) which would share sha signatures, and
> therefore be the one and the same file in the git repo.
>
> But of course, it would rely on files being stored individually, not
> inside .deb or .rpm packages.
>
> I'm wondering whether someone has the technical know-how to do a
> comparison, eg. between a couple of Debian or other distribution
> versions - as it, how many files are shared for example between say
> Ubuntu 7.10 and 8.04.
>
> There might be more savings to be had for sources, rather than binaries.
> And HTTP download of binaries as individual files might have more
> overhead than downloading as packages. But it might be quicker, since no
> unpacking at other end - does anyone know?


This would mean that the server would have to decompress the pack file,
undeltify the file, recompress it, and transmit it to the client, for
every file in every package. That load would be several orders of
magnitude higher than current.


>> It would not be reduced by much and would have a tremendous overhead to
>> access as a result, which the mirrors could never handle, and it would
>> break backwards compatibility since http or ftp could no longer be used to
>> fetch packages.


> With an appropriate git config/setup, I'm pretty sure http and ftp
> access is just fine, (my knowledge is relatively limited on this subject
> though).


For http/ftp access to still work, the repository must not be stored in
packed+pruned form, which negates the space savings you are interested
in, as well as only allowing access to the current version. In fact, it
would use a _lot_ more space than it does now.
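
For context on the unpacked form mentioned above: as far as I know, git keeps an unpacked ("loose") object as one zlib-compressed file per object, named after its SHA-1 and placed under a two-character subdirectory. The sketch below mimics that layout in plain Python. It is an illustration only, not git's code, and the function names are made up.

```python
import hashlib
import os
import zlib


def write_loose_object(objects_dir, content):
    """Store content as a loose blob; return (sha, path)."""
    raw = b"blob %d\0" % len(content) + content
    sha = hashlib.sha1(raw).hexdigest()
    # First two hex digits name the directory, the remaining 38 the file.
    path = os.path.join(objects_dir, sha[:2], sha[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if not os.path.exists(path):
        with open(path, "wb") as handle:
            handle.write(zlib.compress(raw))
    return sha, path


def read_loose_object(objects_dir, sha):
    """Read a loose blob back and return its content without the header."""
    path = os.path.join(objects_dir, sha[:2], sha[2:])
    with open(path, "rb") as handle:
        raw = zlib.decompress(handle.read())
    _header, _sep, content = raw.partition(b"\0")
    return content
```

Each file here is compressed on its own, with no deltas against other versions. A static HTTP or FTP server can hand out such files directly, but the space saving is limited to exact duplicates plus ordinary zlib compression.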




 
Old 06-25-2008, 08:17 PM
Phillip Susi
 
Default git-style file storage for .deb

Goswin von Brederlow wrote:
> Further more file are in different locations in different
> distributions making it hard to delat the right files against each
> other.


FYI: git does actually handle this just fine.



 
Old 06-25-2008, 11:34 PM
Goswin von Brederlow
 
Default git-style file storage for .deb

Phillip Susi <psusi@cfl.rr.com> writes:

> Goswin von Brederlow wrote:
>> Further more file are in different locations in different
>> distributions making it hard to delat the right files against each
>> other.
>
> FYI: git does actually handle this just fine.

If you have 2 similar files, not identical, in different locations?
I highly doubt that.

If the files are identical then git will see the same hash for both
and only store one copy. But not if they differ slightly.
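
The identical-file case can be checked with a few lines. A git blob ID is the SHA-1 of a short header ("blob", the size, a NUL byte) followed by the file contents; the path is not part of it. The sketch below computes that ID by hand. The sample contents are made up, and it says nothing about how pack files may later delta similar objects.

```python
import hashlib


def git_blob_sha1(content):
    """Return the blob ID git would assign to these bytes."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# The same bytes give the same ID, wherever the file lives in the tree.
debian_copy = b"Section \"Device\"\n    Driver \"vesa\"\nEndSection\n"
ubuntu_copy = b"Section \"Device\"\n    Driver \"vesa\"\nEndSection\n"
print(git_blob_sha1(debian_copy) == git_blob_sha1(ubuntu_copy))  # True

# A single changed word yields an unrelated ID, so a second object is stored.
tweaked = debian_copy.replace(b"vesa", b"intel")
print(git_blob_sha1(debian_copy) == git_blob_sha1(tweaked))  # False
```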

MfG
Goswin


 
Old 06-29-2008, 10:26 PM
Zenaan Harkness
 
Default git-style file storage for .deb

On Wed, Jun 25, 2008 at 04:04:08PM -0400, Phillip Susi wrote:
> Zenaan Harkness wrote:
>> Sorry, what I mean is, separate files, not tarred and gzipped.

> This would mean that the server would have to decompress the pack file,
> undelitfy the file, recompress it, and transmit it to the client, for every
> file in every package. That load would be several orders of magnitude
> higher than current.

Are git servers experiencing high loads relative to the volume of data
sent?

I thought they were pretty efficient.

It is entirely possible that the idea is completely brain dead and I
have absolutely no idea.

Apologies if I'm barking up the wrong tree...


>>> It would not be reduced by much and would have a tremendous overhead to
>>> access as a result, which the mirrors could never handle, and it would
>>> break backwards compatibility since http or ftp could no longer be used
>>> to fetch packages.
>> With an appropriate git config/setup, I'm pretty sure http and ftp
>> access is just fine, (my knowledge is relatively limited on this subject
>> though).
>
> For http/ftp access to still work, the repository must not be stored in
> packed+pruned form, which negates the space savings you are interested in,
> as well as only allowing access to the current version. In fact, it would
> use a _lot_ more space than it does now.

I was assuming possibly some modifications to underlying tools/
transports.

But again, I have no idea if what I had in mind is at all possible.
Sounds like it's not possible at all, and that I am clearly
misunderstanding stuff. I'm still not getting it.

Hope I haven't wasted too much time...

zen

--
Homepage: www.SoulSound.net -- Free Australia: www.UPMART.org
Please respect the confidentiality of this email as sensibly warranted.


 
Old 06-30-2008, 09:30 PM
Phillip Susi
 
Default git-style file storage for .deb

Goswin von Brederlow wrote:
> If you have 2 similar files, not identical, in different locations?
> I highly doubt that.
>
> If the files are identical then git will see the same hash for both
> and only store one copy. But not if the differ slightly.


You tell git when you move a file and it records the fact in the change
record. Because of this it knows the predecessor file even though it
has changed location, and can properly diff with it.



--
To UNSUBSCRIBE, email to debian-dpkg-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
 
Old 06-30-2008, 10:48 PM
Joey Hess
 
Default git-style file storage for .deb

Phillip Susi wrote:
> You tell git when you move a file and it records the fact in the change
> record.

No, that's how every VCS *except* git works.
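
To spell that out: git stores no rename records. It guesses renames afterwards, when diffing, by comparing the content of files that disappeared with files that appeared. The sketch below shows the general idea using Python's difflib as a stand-in. It is not git's actual algorithm, and the function name, paths and 0.5 threshold are assumptions for illustration.

```python
import difflib


def detect_renames(deleted, added, threshold=0.5):
    """Pair deleted and added files whose text is similar enough.

    deleted and added map path -> text. Returns (old, new, score) tuples.
    """
    renames = []
    remaining = dict(added)
    for old_path, old_text in deleted.items():
        best = None
        for new_path, new_text in remaining.items():
            score = difflib.SequenceMatcher(None, old_text, new_text).ratio()
            if score >= threshold and (best is None or score > best[1]):
                best = (new_path, score)
        if best is not None:
            renames.append((old_path, best[0], best[1]))
            del remaining[best[0]]  # each added file is matched at most once
    return renames


# Hypothetical example: a config file moved and lightly edited.
gone = {"usr/lib/foo.conf": "a\nb\nc\nd\n"}
new = {"etc/foo.conf": "a\nb\nc\ne\n", "etc/other": "zzzz"}
for old_path, new_path, score in detect_renames(gone, new):
    print(old_path, "->", new_path)  # usr/lib/foo.conf -> etc/foo.conf
```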

--
see shy jo
 
