Old 02-23-2009, 09:02 AM
Xavier
 
delta support in libalpm

On Mon, Feb 23, 2009 at 9:57 AM, Brendan Hide <brendan@swiftspirit.co.za> wrote:
> Xavier wrote:
>>
>> There has never been any real official interests for delta. This seems
>> to make a requirement the ability to make a separate delta server.
>> This seems to require a separate delta database. This implies a new
>> level of complexity and code bloat in pacman. Now maybe it is worth
>> it, I don't know, it still makes me wondering why we put all this
>> delta stuff in pacman to begin with. What was the problem with
>> XferCommand, it seemed like it was a great idea. Now that
>> wget-xdelta.sh script is just a toy, but a much more powerful python
>> script could be written that has basically the same logic as pacman
>> currently has + the ability to fetch and parse a separate delta
>> database.
>
> Unless the server is out of disk space, I'm not too sure exactly why there's
> a requirement for a separate server. If pacman is distributed with the delta
> option turned on by default, the server doing the actual "serving" of the
> updates is probably going to have 60 to 85% less work to do.
>
> I will grant that there would be a new level of complexity involved, for
> example, if I've missed 4 updates, we'd have to "chain link" the tar.gz in
> my cache via 4 delta patches to get the current tar.gz.
>
> I believe that the following would be the simplest implementation both in
> terms of how much implementation work is needed and the probable
> effectiveness:
> Put delta files into a separate folder (thus also avoiding a snapshot from
> containing the deltas):
> http://archlinux.mirror.ac.za/delta/core/os/x86_64/kernel26-2.6.28.4-1-x86_64.kernel26-2.6.28.5-1.pkg.xd3.tar.gz
> Thus, I could do the following (bash pseudocode)
> base=http://archlinux.mirror.ac.za/delta/core/os/x86_64
> curl -s "$base/" > tmpfile
> grep "$pkgname" < tmpfile > listing
> start=false
> while read -r delta
> do
>     # skip deltas until we reach the one built from the cached version
>     case $delta in
>         *"$pkgname-$currentpkgversion-$pkgarch"*) start=true ;;
>     esac
>     [ "$start" = true ] || continue
>     # applydelta stands in for the real patch command
>     wget "$base/$delta" && applydelta "$delta" "$curfile" || break
>     curfile=$(ls -rt | tail -n 1)
>     [ "$curfile" = "$pkgname-$newpkgversion-$pkgarch.tar.gz" ] && break
> done < listing
>
> The above requires no db implementation at all and can work well even using
> the above very simple logic.
> And yes, by my own standards, the above is very bad bash pseudo-code. :P
>
> Of the above, what is already implemented in pacman?
>

Everything is already implemented in pacman, with more complex logic
(which might turn out to be totally useless after all).
For each package in a sync db, there is a deltas file alongside the
depends and desc files, which basically contains the list of deltas for
that package and their sizes. With this information, and the contents
of the filecache, pacman computes the shortest path (in terms of
download size) to the final package.
Here is that logic applied to an example:
if you have file v1 in your cache, you want to upgrade to v3, and
there are three deltas for this package: v1tov2, v2tov3 and v1tov3.
If v1tov2 + v2tov3 is smaller than v1tov3, it will download the first
two deltas and apply them to get v3. Otherwise it will download the
third one.
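
The v1/v2/v3 example can be sketched as a shortest-path search where each delta is an edge weighted by its download size. This is only an illustration in Python, not pacman's actual code: the function name, the tuple layout and the sizes are all made up for the example.

```python
import heapq

def cheapest_delta_path(deltas, cached, target):
    """Return (total_size, [delta names]) for the cheapest chain, or None.

    deltas: list of (name, from_version, to_version, size) tuples.
    The layout is an assumption for this sketch, not pacman's format.
    """
    # Adjacency list: version -> deltas that start from that version.
    graph = {}
    for name, src, dst, size in deltas:
        graph.setdefault(src, []).append((dst, size, name))

    # Dijkstra's algorithm with download size as the edge weight.
    queue = [(0, cached, [])]
    done = set()
    while queue:
        cost, version, path = heapq.heappop(queue)
        if version == target:
            return cost, path
        if version in done:
            continue
        done.add(version)
        for dst, size, name in graph.get(version, []):
            if dst not in done:
                heapq.heappush(queue, (cost + size, dst, path + [name]))
    return None  # no chain of deltas leads from cached to target

# Made-up sizes: the two small deltas together beat the direct one.
deltas = [
    ("v1tov2", "v1", "v2", 300),
    ("v2tov3", "v2", "v3", 200),
    ("v1tov3", "v1", "v3", 700),
]
print(cheapest_delta_path(deltas, "v1", "v3"))  # (500, ['v1tov2', 'v2tov3'])
```

If v1tov3 were smaller than 500 in this example, the same call would return the single direct delta instead.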

The problem with this implementation (besides probably being overkill)
is that it requires information in the sync databases. So either it
takes a big official effort to integrate this stuff and add deltas
to all the official databases, or, I don't know, you have to do it
yourself: fully mirror the repository you want to add deltas to,
generate the deltas (maybe during mirror sync), add them to
your database, and then host everything somewhere (the packages + the
deltas + the database with delta info).
_______________________________________________
pacman-dev mailing list
pacman-dev@archlinux.org
http://www.archlinux.org/mailman/listinfo/pacman-dev
 
Old 02-23-2009, 10:27 AM
Brendan Hide
 

Xavier wrote:
> Everything is already implemented in pacman, with a more complex logic
> (which might be totally useless after all)
> For each package in a sync db, there is a deltas file besides the
> depends and desc one which basically contains the list of deltas for
> that package and their size. With this information, and the contents
> of the filecache, it computes the shortest path (in term of download
> size) to the final package.
> That logic applied to an example :
> if you have file v1 in your cache, you want to upgrade to v3, and
> there are three deltas for this package : v1tov2 , v2tov3 and v1tov3
> If v1tov2 + v2tov3 is smaller than v1tov3, it will download the first
> two deltas and apply them to get v3. Otherwise it will download the
> third one.
>
> The problem of this implementation (besides being probably overkill)
> is that it requires information in the sync databases. So either it
> requires a big official effort to integrate this stuff and add deltas
> to all the official databases. Otherwise, I don't know. You need to
> fully mirror the repository you want to add deltas to, then you need
> to generate deltas (maybe during mirror sync) and to add the deltas to
> your database, and then host everything somewhere (the packages + the
> deltas + the database with delta info).

This makes a lot more sense to me now. Thank you for the clarification,
Xavier. It is the most efficient way, end-user-wise, despite the
possibly-excessive metadata. It isn't necessarily efficient for the
server. :/


Looking at the logistics, the best time to make the delta is after the
new .pkg.tar.(gz|bz2) is uploaded to the repo. I assume this is also
about the time the db is updated. This could be implemented repo-wide as
packages are updated and delta'd without any individual package maker's
direct involvement in the delta process - a "passive" change that won't
need to change anyone's habits.


If you really want to be able to make lots of delta versions, i.e.
v1tov2, v1tov3, v1tov4, v2tov3, v2tov4, v3tov4, then you'd probably have
to keep at least 4 older (full) versions, which will take up a lot of
disk space - or you'll need to regenerate all the other versions, which
will take up a *lot* of IO / RAM / CPU while the new deltas are generated.


If you only take v1tov2, v2tov3, v3tov4, you only need to keep v4 and
the 3 deltas. When v5 gets uploaded, you create v4tov5 and delete v4
from the server thus saving disk space. This is much simpler and more
implementable than the current "brief".
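
The two schemes can be compared by counting the deltas each one needs. A minimal Python sketch follows; the function names are invented for illustration, and the version labels are the ones used above.

```python
from itertools import combinations

def all_pairs_deltas(versions):
    """Every older-to-newer delta: v1tov2, v1tov3, v1tov4, ..."""
    return [f"{a}to{b}" for a, b in combinations(versions, 2)]

def chain_deltas(versions):
    """Only the deltas between consecutive versions."""
    return [f"{a}to{b}" for a, b in zip(versions, versions[1:])]

versions = ["v1", "v2", "v3", "v4"]
print(all_pairs_deltas(versions))  # 6 deltas for 4 versions
print(chain_deltas(versions))      # 3 deltas for 4 versions
```

For n versions the all-pairs scheme needs n*(n-1)/2 deltas while the chain needs n-1, so the gap widens with every release that is kept.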


Mirror servers can keep mirroring the old way - inefficiently - but they
should mirror the deltas too. I guess that the mirror servers use
a lot less bandwidth from the official repository than the end users do.


The net result I believe is a much simpler implementation despite
achieving 99% of the original brief's goal.


Your thoughts?

__________
Brendan Hide

 
Old 02-23-2009, 10:32 AM
Xavier
 

On Mon, Feb 23, 2009 at 12:27 PM, Brendan Hide
<brendan@swiftspirit.co.za> wrote:
>
> This makes a lot more sense to me now. Thank you for the clarification,
> Xavier. It is the most efficient way, end-user-wise, despite the
> possibly-excessive metadata. It isn't necessarily efficient for the server.
> :/
>
> Looking at the logistics, the best time to make the delta is after the new
> .pkg.tar.(gz|bz2) is uploaded to the repo. I assume this is also about the
> time the db is updated. This could be implemented repo-wide as packages are
> updated and delta'd without any individual package maker's direct
> involvement in the delta process - a "passive" change that won't need to
> change anyone's habits.
>
> If you really want to be able to make lots of delta versions, ie, v1tov2,
> v1tov3, v1tov4, v2tov3, v2tov4, v3tov4, then you'd probably have to keep at
> least 4 older (full) versions that will take up a lot of disk space - or
> you'll need to regenerate all the other versions - take up a *lot* of IO /
> RAM / CPU during the generation of the new deltas.
>
> If you only take v1tov2, v2tov3, v3tov4, you only need to keep v4 and the 3
> deltas. When v5 gets uploaded, you create v4tov5 and delete v4 from the
> server thus saving disk space. This is much simpler and more implementable
> than the current "brief".
>
> Mirror servers can mirror the old way - inefficiently - however they should
> mirror the deltas across too. I guess that the mirror servers do a lot less
> bandwidth from the official repository than the end users.
>
> The net result I believe is a much simpler implementation despite achieving
> 99% of the original brief's goal.
>
> Your thoughts?
>

These were my first thoughts, but here is how Garns answered them:
""
In a previous mail Xavier toyed with the idea to put delta creation
into repo-add, I have given this some thought, as it seems nice in
principle, but there are drawbacks. For Arch this would mean creating
deltas on Gerolde, which seems to be fairly strained already,
according to the dev list. Furthermore this introduces some new
variables to repo-add (at least repo location and an output location)
this would be manageable, but doesn't look very nice.
""
http://www.archlinux.org/pipermail/pacman-dev/2008-November/007672.html
 
Old 02-23-2009, 10:43 AM
Brendan Hide
 

Xavier wrote:
> how Garns answered to them:
> ...
> For Arch this would mean creating deltas on Gerolde, which seems to
> be fairly strained already.
> ...
> http://www.archlinux.org/pipermail/pacman-dev/2008-November/007672.htm

Is Gerolde separate from the server that serves the FTP and HTTP
traffic? If it is separate, then I can't argue that deltas would improve
the server's performance.

If it *is* the same server, then Garns' argument is illogical.

What else is Gerolde doing for Arch and can it be moved to another server?

__________
Brendan Hide

 
Old 02-23-2009, 10:58 AM
Xavier
 

On Mon, Feb 23, 2009 at 12:43 PM, Brendan Hide
<brendan@swiftspirit.co.za> wrote:
> Xavier wrote:
>>
>> how Garns answered to them:
>> ...
>> For Arch this would mean creating deltas on Gerolde, which seems to
>> be fairly strained already.
>> ...
>> http://www.archlinux.org/pipermail/pacman-dev/2008-November/007672.htm
>
> Is Gerolde separate from the server that serves the FTP and HTTP traffic? If
> it is
> separate then I can't argue for the delta's improvement on the server's
> performance.
> If it *is* the same server then Garn's argument is illogical.
>
> What else is Gerolde doing for Arch and can it be moved to another server?
>

This is not the only problem. Another big problem is that it would
require real interest and work from the official developers, and this is
clearly nonexistent.
For example, dbscripts would require some work as well:
http://projects.archlinux.org/?p=dbscripts.git;a=tree

Even if these two problems can be fixed, we still have some technical
issues around how adding deltas to a database with repo-add should
work.
 
Old 02-23-2009, 04:48 PM
Aaron Griffin
 

On Mon, Feb 23, 2009 at 5:58 AM, Xavier <shiningxc@gmail.com> wrote:
> On Mon, Feb 23, 2009 at 12:43 PM, Brendan Hide
> <brendan@swiftspirit.co.za> wrote:
>> Xavier wrote:
>>>
>>> how Garns answered to them:
>>> ...
>>> For Arch this would mean creating deltas on Gerolde, which seems to
>>> be fairly strained already.
>>> ...
>>> http://www.archlinux.org/pipermail/pacman-dev/2008-November/007672.htm
>>
>> Is Gerolde separate from the server that serves the FTP and HTTP traffic? If
>> it is
>> separate then I can't argue for the delta's improvement on the server's
>> performance.
>> If it *is* the same server then Garn's argument is illogical.
>>
>> What else is Gerolde doing for Arch and can it be moved to another server?

Gerolde does everything - every service that has an archlinux.org
domain name is hosted on gerolde (except ftp.archlinux.org). It can't
be "moved to another server" because we don't have another and don't
have the finances to get another at this time, nor do we have the
manpower to maintain multiple servers.

> This is not the only problem. Another big problem is that it would
> require real interest and work from official developers, and this is
> clearly inexistent
> For example, dbscripts would require some work as well
> http://projects.archlinux.org/?p=dbscripts.git;a=tree

I wouldn't say the interest is non-existent, it's just that the
implementation is so complex at this point in time, and most of us are
of the opinion that "bandwidth is cheap", so we go the easier route.

Questions which make the implementation complex:
* When do we generate deltas? As part of the db scripts?
* How long do we keep them? 10 previous versions? 5?
* How much additional space is this going to take? How do we set it up
so that space-constrained mirrors can opt-out of the deltas?

I'm sure there's more, but that's just "off the cuff". In my eyes,
this is a complex change that doesn't really seem to benefit too many
people. If you download 3megs instead of 7, it's not that big of a
deal and has so many more points of failure to contend with.
 
Old 02-26-2009, 09:19 PM
Xavier
 

On Mon, Feb 23, 2009 at 6:48 PM, Aaron Griffin <aaronmgriffin@gmail.com> wrote:
>
> Questions which make the implementation complex:
> * When do we generate deltas? As part of the db scripts?

Well I think that would be practical. When a new package is being
added, grab the old one, generate a delta, and add it to the database.
This could be doable.

> * How long do we keep them? 10 previous versions? 5?

I would think 5 is more than enough. Allan suggested more complicated
ways of cleaning deltas, but we could indeed just use a simple limit
like that. There is still the problem of finding which are the 5
newest deltas to be kept.

> * How much additional space is this going to take? How do we set it up
> so that space-constrained mirrors can opt-out of the deltas?
>

That's a very good question I hadn't considered. But well, I didn't
expect to figure out and answer all the problems alone. I know nothing
about mirror setup.
It seems there are quite a few users interested in deltas, though,
so maybe some of them could help by providing numbers on how much space
it would take.

> I'm sure there's more, but that's just "off the cuff". In my eyes,
> this is a complex change that doesn't really seem to benefit too many
> people. If you download 3megs instead of 7, it's not that big of a
> deal and has so many more points of failure to contend with.

The benefit can be much greater than that. I just wrote a quick hack
that generates a delta for each package upgrade on my box and
stores them in a database. The first package that came in:
2,8M openjdk6-1.4-2_to_1.4.1-1-x86_64.delta
67M openjdk6-1.4.1-1-x86_64.pkg.tar.gz

On a decent 1MB/s line, that's a 1 minute difference for a single package.
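
As a quick check of that figure, here is the arithmetic in Python. It assumes the "M" in the listing means megabytes and that the line really sustains 1 MB/s, and it ignores the time needed to apply the delta locally.

```python
def download_seconds(size_mb, speed_mb_per_s):
    """Transfer time for a file of size_mb at a steady speed."""
    return size_mb / speed_mb_per_s

full_pkg_mb = 67.0   # openjdk6-1.4.1-1-x86_64.pkg.tar.gz
delta_mb = 2.8       # openjdk6-1.4-2_to_1.4.1-1-x86_64.delta
speed = 1.0          # MB/s

saved = download_seconds(full_pkg_mb, speed) - download_seconds(delta_mb, speed)
print(f"time saved: about {saved:.0f} seconds")  # about 64 seconds
```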

But yes, it is clearly more complex and there are clearly many more
points of failure.
 
Old 02-26-2009, 09:34 PM
Aaron Griffin
 

On Thu, Feb 26, 2009 at 4:19 PM, Xavier <shiningxc@gmail.com> wrote:
> On Mon, Feb 23, 2009 at 6:48 PM, Aaron Griffin <aaronmgriffin@gmail.com> wrote:
>>
>> Questions which make the implementation complex:
>> * When do we generate deltas? As part of the db scripts?
>
> Well I think that would be practical. When a new package is being
> added, grab the old one, generate a delta, and add it to the database.
> This could be doable.
>
>> * How long do we keep them? 10 previous versions? 5?
>
> I would think 5 is more than enough. Allan suggested more complicated
> ways of cleaning deltas, but we could indeed just use a simple limit
> like that. There is still the problem of finding which are the 5
> newest deltas to be kept.
>
>> * How much additional space is this going to take? How do we set it up
>> so that space-constrained mirrors can opt-out of the deltas?
>>
>
> That's a very good question I didn't consider. But well, I didn't
> expect to figure out and answer all the problems alone. I know nothing
> about mirror setup.
> And it seems there are quite a few users interested by delta though,
> so maybe some could help to provide some results about how much space
> it could take.
>
>> I'm sure there's more, but that's just "off the cuff". In my eyes,
>> this is a complex change that doesn't really seem to benefit too many
>> people. If you download 3megs instead of 7, it's not that big of a
>> deal and has so many more points of failure to contend with.
>
> The benefit can be much greater than that. I just wrote a quick hack
> so that will generate a delta for each package upgrade on my box, and
> stores them in a database. The first package that came in :
> 2,8M openjdk6-1.4-2_to_1.4.1-1-x86_64.delta
> 67M openjdk6-1.4.1-1-x86_64.pkg.tar.gz
>
> On a decent 1MB/s line, that's a 1 minute difference for a single package.
>
> But yes, it is clearly more complex and there is clearly many more
> points of failure.

So, ok, from a db-scripts point of view, we're going to have to do the
following:

when a new package is added:
    copy old package file from ftp to build dir
    generate delta from old file -> new file (in staging)
    add new pkg and delta to DB
    ? add new delta info _somewhere_?
    copy new pkg and delta to ftp

Is this correct? If so, it's not all THAT complex. Less so if repo-add
could simply spit out the deltas on its own - if it can, we can
simply add the logic to copy old packages to the build dir before
calling repo-add; repo-add then realizes there's another package there
and uses it for deltas.
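
The step list above can be written down as a dry-run plan. This Python sketch executes nothing and is not part of dbscripts: the action names (copy, make_delta, db_add), the directory names and the file names are placeholders, and the open "add new delta info somewhere" step is left out because it is still a question.

```python
import posixpath

def plan_package_update(old_name, new_name, delta_name,
                        ftp_dir="ftp", build_dir="build", staging_dir="staging"):
    """Return the ordered steps for adding one package, as plain tuples.

    Each tuple is (action, *paths). Action names are hypothetical labels.
    """
    old_ftp = posixpath.join(ftp_dir, old_name)
    old_build = posixpath.join(build_dir, old_name)
    new_staged = posixpath.join(staging_dir, new_name)
    delta_staged = posixpath.join(staging_dir, delta_name)
    return [
        ("copy", old_ftp, old_build),                         # old pkg: ftp -> build dir
        ("make_delta", old_build, new_staged, delta_staged),  # old file -> new file
        ("db_add", new_staged),                               # add new pkg to DB
        ("db_add", delta_staged),                             # add delta to DB
        ("copy", new_staged, posixpath.join(ftp_dir, new_name)),
        ("copy", delta_staged, posixpath.join(ftp_dir, delta_name)),
    ]

# Hypothetical file names, for illustration only.
for step in plan_package_update("foo-1.0-1.pkg.tar.gz",
                                "foo-1.1-1.pkg.tar.gz",
                                "foo-1.0-1_to_1.1-1.delta"):
    print(step)
```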

Additionally, we run a cleanup script every few hours to remove old
and/or unused packages. This logic would simply need to be changed to
scan deltas and leave $RETAINED_DELTAS for each package.
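
A rough Python sketch of that retention rule, assuming "newest" can be decided by a timestamp per delta. How to order deltas reliably is still an open point in this thread, and the function name and tuple layout are invented for the example; this is not the existing cleanup script.

```python
from collections import defaultdict

def deltas_to_remove(deltas, retained):
    """Return the delta file names that exceed the per-package limit.

    deltas: list of (package, delta_filename, mtime) tuples.
    retained: how many of the newest deltas to keep for each package.
    """
    by_pkg = defaultdict(list)
    for pkg, filename, mtime in deltas:
        by_pkg[pkg].append((mtime, filename))

    stale = []
    for entries in by_pkg.values():
        entries.sort(reverse=True)  # newest first
        stale.extend(filename for _, filename in entries[retained:])
    return sorted(stale)

# Hypothetical data: three deltas for foo, one for bar, keep two per package.
sample = [
    ("foo", "foo-1_to_2.delta", 1),
    ("foo", "foo-2_to_3.delta", 2),
    ("foo", "foo-3_to_4.delta", 3),
    ("bar", "bar-1_to_2.delta", 5),
]
print(deltas_to_remove(sample, 2))  # ['foo-1_to_2.delta']
```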

I haven't been following the delta stuff too much, can we put the
deltas in a totally unrelated directory? Is there delta information
stored in the pacman DB? If so, the cleanup gets far more complicated?
 
Old 02-26-2009, 09:53 PM
Allan McRae
 

Aaron Griffin wrote:
> Is there delta information
> stored in the pacman DB? If so, the cleanup gets far more complicated?

Delta information is stored in the repo, so removing deltas is not a simple
delete. As Xavier pointed out, my proposal for removing deltas was
slightly more complicated, but I am beginning to see the need for a
script to clean the deltas up - and so I can use my more complicated
removal system. I think whether that script is part of repo-add, or
repo-add calls it when adding/removing a delta, depends on how
complicated the script gets.


Anyway, a simple removal system based on number of deltas would be fine
for now and more complicated stuff could be added later.


Allan


 
Old 02-26-2009, 10:04 PM
Aaron Griffin
 

On Thu, Feb 26, 2009 at 4:53 PM, Allan McRae <allan@archlinux.org> wrote:
> Aaron Griffin wrote:
>>
>> Is there delta information
>> stored in the pacman DB? If so, the cleanup gets far more complicated?
>
> Delta information is stored in the repo so removing them is not a simple
> delete. As Xavier pointed out, my proposal for removing deltas was slightly
> more complicated but I am beginning to see the need for a script to clean
> the deltas up - and so I can use my more complicated removal system. I
> think whether that script is part of repo-add, or repo-add calls it when
> adding/removing a delta depends on how complicated the script gets.
>
> Anyway, a simple removal system based on number of deltas would be fine for
> now and more complicated stuff could be added later.

So... if I hack at this, what would be the process to remove a delta?
Delete the file and then remove a line from a db entry that matches
the file?
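
A sketch of that two-step removal in Python, under a loud assumption: the db entry is treated as a text file with one delta per line, the file name being one whitespace-separated field followed by its size. The real on-disk format is not spelled out in this thread, so both the format and the function name are hypothetical.

```python
import os
import tempfile

def remove_delta(delta_path, deltas_entry_path):
    """Delete a delta file and drop its line from the db entry.

    Returns True if a matching line was removed from the entry.
    """
    name = os.path.basename(delta_path)
    with open(deltas_entry_path) as f:
        lines = f.readlines()
    # Keep every line that does not name this delta as one of its fields.
    kept = [line for line in lines if name not in line.split()]
    with open(deltas_entry_path, "w") as f:
        f.writelines(kept)
    if os.path.exists(delta_path):
        os.remove(delta_path)
    return len(kept) != len(lines)

# Demonstration in a throwaway directory with made-up names and sizes.
with tempfile.TemporaryDirectory() as tmp:
    delta = os.path.join(tmp, "foo-1.0-1_to_1.1-1.delta")
    entry = os.path.join(tmp, "deltas")
    open(delta, "w").close()
    with open(entry, "w") as f:
        f.write("foo-0.9-1_to_1.0-1.delta 1200\n"
                "foo-1.0-1_to_1.1-1.delta 3400\n")
    print(remove_delta(delta, entry))
    with open(entry) as f:
        print(f.read(), end="")
```

Matching on a whole field rather than a substring avoids removing a different delta whose name merely contains this one.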