Old 02-10-2012, 11:06 PM
Aron Xu
 
Endianness of data files in MultiArch

On Fri, Feb 10, 2012 at 19:59, Goswin von Brederlow <goswin-v-b@web.de> wrote:
> Aron Xu <happyaron.xu@gmail.com> writes:
>
>> Sorry, the thread was broken and I saw your reply just now.
>>
>> On Thu, Feb 9, 2012 at 16:23, Jan Hauke Rahm <jhr@debian.org> wrote:
>>> On Thu, Feb 09, 2012 at 01:58:28AM +0800, Aron Xu wrote:
>>>>
>>>> This is valid for widely used applications/formats like gettext and images
>>>> that are designed to behave this way, but on the other hand there are
>>>> upstreams that don't want to take this on, especially because of the
>>>> complexity and performance impact.
>>>>
>>>> Currently I am using arch:any for data files which aren't affected
>>>> by multiarch, i.e. not "same" or "foreign". For endianness-critical
>>>> data that is required to make a library work, I have to force them
>>>> to be installed into /usr/lib/<triplet>/$package/data/ and mark them
>>>> as "Multi-Arch: same". This is sufficient to avoid breakage, but again
>>>> it consumes a lot of space on the mirrors.
>>>
>>> Actually, what is "a lot" here? I mean, how many libraries are there
>>> containing endianness-critical data and how big are the actual files?
>>> Not that I'm any kind of expert, but this solution sounds reasonable to
>>> me.
>>>
>>> Hauke
>>>
>>
>> As far as I know, there aren't too many libraries known to have
>> endianness-critical data, but there might be landmines because the
>> maintainers just aren't aware of it.
>>
>> I had the chance to notice this problem because my team maintains
>> several stacks of input methods, which usually need to deal with
>> linguistic data. [1]
>>
>> For me, there is a library named libpinyin at hand to package, which has
>> some data files of ~7.5MiB after gzip -9 (the total size of this
>> library is no more than 9MiB after gzip -9). We have 14 architectures
>> on ftp-master, so the data files eat up 105MiB, while if we find some
>> way to have only one copy each for be/le, they'll only use 15MiB. And
>> consider that once it is released in stable, a new copy of that data
>> makes its way into the archive whenever a new version is uploaded to
>> unstable.
>>
>> The same concern applies to other endianness-critical data that is
>> not affected by Multi-Arch at present: we need to make it arch:any,
>> and in the end it eats more and more space.
>>
>> [1] Performance is critical for these applications. This doesn't mean
>> they consume a lot of CPU, but they must respond very quickly
>> to the user's input: they do some complex calculations to split a
>> sentence into words and find a list of the most relevant suggestions,
>> which requires querying 10^5 ~ 10^6 lines of data several times to
>> complete such an action. One project tried to use something like
>> SQLite3, but the performance was a bit frustrating, so they have now
>> decided not to care about that and just design a data format that
>> fits their requirements.
>> --
>> Regards,
>> Aron Xu
>
> It doesn't sound like the data is too big to fit into RAM, and it sounds
> like the overhead of fetching data from disk on demand would slow you
> down. So there seems to be no reason not to have architecture-independent
> data on disk and convert it to the right endianness on startup. Sure,
> startup time would increase a bit, but running time would remain
> unaffected.
>

Well, bear in mind that this size is for the compressed data. The
decompressed data is considerably larger: it compresses much like
plain text, so decompressing the 7.5MiB of data yields about 22MiB
on disk.

22MiB does not sound too large to fit into RAM, but let me explain
why that does not hold in practice. An input method framework usually
carries many different input methods (it's easiest to think of them as
different algorithms), and users can switch between them on the fly
with a mouse click or keyboard shortcut. Different input methods have
different data, so with three installed (which is below average), more
than 50MiB of data is usually needed.

Hmm, 50MiB still does not seem large enough. Linguistic data
distributed under a free license is rare compared to data provided
under non-free licenses, and it is usually lower in quality and
smaller in amount. Users can download such non-free data (free to
download and use, but not redistributable) and use the tools provided
by the input method to convert the format. This results in 10^6 lines
of data, nearly 100MiB in size. At that point it looks reasonable not
to load it all into RAM.

Apart from the reasons above, switching among input methods also has
to be very quick. It's hard to imagine clicking to switch to another
input method and then waiting a couple of seconds (or even minutes).
The operation must complete in a reasonably short time (<1s) and must
not cost many resources (users don't want to see their CPU usage jump
to 200% simply from switching between input methods).
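
To make the trade-off concrete, the startup conversion suggested in the
quoted text can be sketched roughly as below. This is only an
illustration, not code from libpinyin or any real input method: the
function name load_u16_table, the flat table of 16-bit values, and the
explicit stored_order argument are all assumptions. The swap touches
every element on each load, and a file that needs swapping cannot simply
be mapped read-only and used in place, which is the cost being weighed
here.

```python
import array
import sys


def load_u16_table(raw: bytes, stored_order: str) -> array.array:
    """Load 16-bit unsigned values that were stored in `stored_order`.

    `stored_order` is 'little' or 'big' (hypothetical convention; a real
    format would more likely carry a byte-order marker in its header).
    """
    table = array.array("H")        # unsigned short
    assert table.itemsize == 2      # holds on mainstream platforms
    table.frombytes(raw)            # interpreted in the host's byte order
    if stored_order != sys.byteorder:
        table.byteswap()            # O(n) pass over the whole table
    return table


# Hypothetical on-disk content: the values 1, 2, 258 stored big-endian.
raw = b"\x00\x01\x00\x02\x01\x02"
table = load_u16_table(raw, "big")
print(list(table))  # [1, 2, 258]
```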

> So unless the program is restarted for every input (which would be the
> first thing to eliminate to improve responsiveness) there shouldn't be a
> problem with "fixing" this. It just means extra work you might not be
> willing (or have time) to invest.
>
> MfG
> Goswin
>
> PS: ia32-libs is about 1GB and is going away. So there should be space
> now for 10 more sources like yours.

I am sure they will be eaten up once Wheezy is released. ;-)
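
For reference, the mirror-space figures used in this thread work out as
follows. The 7.5MiB and 14-architecture inputs come from the quoted
message; treating "about 1GB" as 1024MiB is an assumption made only for
this back-of-the-envelope check.

```python
DATA_MIB = 7.5        # libpinyin data after gzip -9, from the thread
ARCHITECTURES = 14    # architectures on ftp-master, from the thread
BYTE_ORDERS = 2       # one big-endian and one little-endian copy

per_arch_total = DATA_MIB * ARCHITECTURES    # one copy per architecture
per_order_total = DATA_MIB * BYTE_ORDERS     # one copy per byte order
saved = per_arch_total - per_order_total

print(per_arch_total, per_order_total, saved)  # 105.0 15.0 90.0

# Assumption: "about 1GB" for ia32-libs is taken as 1024 MiB.
print(round(1024 / per_arch_total, 1))         # 9.8
```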

--
Regards,
Aron Xu


--
To UNSUBSCRIBE, email to debian-devel-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/CAMr=8w5b5eXtjrf6CGpEVDQYmFVmpnekbPJcaT8Qc0YpJpjQLw@mail.gmail.com
 
Old 02-10-2012, 11:40 PM
Aron Xu
 

On Sat, Feb 11, 2012 at 00:14, Osamu Aoki <osamu@debian.org> wrote:
> [...]
>
> Just think of any phrase data with its content size stored as a 16-bit integer.
>
> I have bigger example :-)
>
> ipadic: Uncompressed size: 44.5 M
>
> This one, I made them arch:any to build many binary packages. Similar
> packages use install time conversion trick to keep them "arch: all" but
> this install takes time.
>

This trick is broken. Dpkg doesn't currently have a feature like
`rpm -V`, which verifies whether the files on disk are identical to
what was installed. I believe it's useful and will land in dpkg someday
(but don't ask me for a patch now...). Converting data files at install
time on the user's machine breaks as soon as the package manager
verifies the files' integrity. If I have to choose between the two, I
prefer using more mirror space over doing that, which is the current
status.
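
The integrity problem can be illustrated with a small sketch. It is not
how any particular package manager is implemented; it only assumes that
verification compares a digest recorded at build time with a digest of
the file as found on disk. The payload and the names digest, shipped and
converted are hypothetical.

```python
import hashlib
import struct


def digest(data: bytes) -> str:
    """Return a hex digest of the file contents (SHA-256 chosen arbitrarily)."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical payload shipped in the package: three 16-bit values,
# stored little-endian.
shipped = struct.pack("<3H", 1, 2, 258)
recorded = digest(shipped)      # digest recorded when the package was built

# Install-time conversion rewrites the same file for a big-endian host.
values = struct.unpack("<3H", shipped)
converted = struct.pack(">3H", *values)

print(digest(shipped) == recorded)    # True: untouched file verifies
print(digest(converted) == recorded)  # False: converted file no longer matches
```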

> [...]
>
> I think PO files cases are manageable. They can use one endianness for
> all platforms.
>
> But for any other generic special-purpose natural language processing
> code, it is impossible to force upstream to complicate their code to use
> a particular endianness.
>

Agreed.

>> So unless the program is restarted for every input (which would be the
>> first thing to eliminate to improve responsiveness) there shouldn't be a
>> problem with "fixing" this. It just means extra work you might not be
>> willing (or have time) to invest.
>
> If we are ready to rewrite the core of such code, you are right. But if we
> simply accept upstream code design, we will end up with multiple sets of
> such semi-arch-dependent data in the archive as arch: any.
>
> Osamu

IMHO it's really bad to maintain such a delta between Debian and upstream.

--
Regards,
Aron Xu


--
Archive: http://lists.debian.org/CAMr=8w4a8yY1udLDhZhewDsuZ3rdjqeK208vgmPLy1NtxyyTTQ@mail.gmail.com
 
Old 02-11-2012, 12:36 AM
Russ Allbery
 

Aron Xu <happyaron.xu@gmail.com> writes:

> This trick is broken. Dpkg doesn't have similar features like `rpm -V`
> at present, which verifies if files on disk are identical to what was
> installed.

That's what debsums does.

--
Russ Allbery (rra@debian.org) <http://www.eyrie.org/~eagle/>


--
Archive: http://lists.debian.org/87vcnerw1k.fsf@windlord.stanford.edu
 
Old 02-11-2012, 12:38 AM
Aron Xu
 

On Sat, Feb 11, 2012 at 09:36, Russ Allbery <rra@debian.org> wrote:
> Aron Xu <happyaron.xu@gmail.com> writes:
>
>> This trick is broken. Dpkg doesn't have similar features like `rpm -V`
>> at present, which verifies if files on disk are identical to what was
>> installed.
>
> That's what debsums does.
>

Thanks for updating me!



--
Regards,
Aron Xu


--
Archive: http://lists.debian.org/CAMr=8w4Jj--_Sdcj6ON91PHueqpfgC_+f07L4Sb-HnwS7yZ+iA@mail.gmail.com
 
Old 02-11-2012, 04:48 AM
Osamu Aoki
 

Hi,

On Sat, Feb 11, 2012 at 08:40:35AM +0800, Aron Xu wrote:
> On Sat, Feb 11, 2012 at 00:14, Osamu Aoki <osamu@debian.org> wrote:
> > [...]
> >
> > Just think of any phrase data with its content size stored as a 16-bit integer.
> >
> > I have bigger example :-)
> >
> > ipadic: Uncompressed size: 44.5 M
> >
> > This one, I made them arch:any to build many binary packages. Similar
> > packages use install time conversion trick to keep them "arch: all" but
> > this install takes time.
> >
>
> This trick is broken. Dpkg doesn't have similar features like `rpm -V`
> at present, which verifies if files on disk are identical to what was
> installed.

If it installs into /usr/share ... I agree it is broken.

But if postinst generates the files into /var/lib/<pkg>/... from the
arch-independent text data in /usr/share, it is OK. It is just a bit
too much data duplication, though.
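
A rough sketch of that approach: the packaged text source is only read,
and the native-endian binary is written to a separate location, so the
shipped file stays byte-identical to what the package installed. The
function name build_native_table, the tab-separated "word, frequency"
layout, and the 16-bit record format are assumptions for illustration;
temporary directories stand in for /usr/share/<pkg> and /var/lib/<pkg>.

```python
import struct
import tempfile
from pathlib import Path


def build_native_table(text_source: Path, output: Path) -> int:
    """Compile 'word<TAB>frequency' lines into native-endian 16-bit records."""
    freqs = []
    for line in text_source.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue                      # skip blank lines
        _word, freq = line.split("\t")
        freqs.append(int(freq))
    output.parent.mkdir(parents=True, exist_ok=True)
    # '=' selects the host's native byte order with standard sizes.
    output.write_bytes(struct.pack(f"={len(freqs)}H", *freqs))
    return len(freqs)


with tempfile.TemporaryDirectory() as root:
    share = Path(root) / "share" / "table.txt"   # stands in for /usr/share/<pkg>
    share.parent.mkdir(parents=True)
    share.write_text("ni\t300\nhao\t250\n", encoding="utf-8")
    before = share.read_bytes()

    var = Path(root) / "var" / "table.bin"       # stands in for /var/lib/<pkg>
    count = build_native_table(share, var)
    native_bytes = var.read_bytes()
    source_unchanged = share.read_bytes() == before

print(count, source_unchanged)  # 2 True
```

The price is the one named above: the same data now exists twice on
disk, once as text and once as generated binary.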

> I believe it's useful and will land in dpkg someday (but
> don't ask me for patch now...). By converting data files at user's
> install, it breaks when the package manager verifies the file's
> integrity. I prefer to use more mirror space to doing such thing if I
> have to choose between them, which is the current status.
>
> > [...]
> >
> > I think PO files cases are manageable. They can use one endianness for
> > all platform.
> >
> > But for any other generic special-purpose natural language processing
> > code, it is impossible to force upstream to complicate their code to use
> > a particular endianness.
> >
>
> Agreed.
>
> >> So unless the program is restarted for every input (which would be the
> >> first thing to eliminate to improve responsiveness) there shouldn't be a
> >> problem with "fixing" this. It just means extra work you might not be
> >> willing (or have time) to invest.
> >
> > If we are ready to rewrite the core of such code, you are right. But if we
> > simply accept upstream code design, we will end up with multiple sets of
> > such semi-arch-dependent data in the archive as arch: any.
> >
> > Osamu
>
> IMHO it's really bad to maintain such a delta between Debian and upstream.

I agree and I do not think I want to do that.

Regards,

Osamu


--
Archive: http://lists.debian.org/20120211054808.GB29621@localhost
 
Old 02-11-2012, 03:01 PM
Goswin von Brederlow
 

Aron Xu <happyaron.xu@gmail.com> writes:

> On Sat, Feb 11, 2012 at 00:14, Osamu Aoki <osamu@debian.org> wrote:
>> [...]
>>
>> Just think of any phrase data with its content size stored as a 16-bit integer.
>>
>> I have bigger example :-)
>>
>> ipadic: Uncompressed size: 44.5 M
>>
>> This one, I made them arch:any to build many binary packages. Similar
>> packages use install time conversion trick to keep them "arch: all" but
>> this install takes time.
>>
>
> This trick is broken. Dpkg doesn't have similar features like `rpm -V`
> at present, which verifies if files on disk are identical to what was
> installed. I believe it's useful and will land in dpkg someday (but
> don't ask me for patch now...). By converting data files at user's
> install, it breaks when the package manager verifies the file's
> integrity. I prefer to use more mirror space to doing such thing if I
> have to choose between them, which is the current status.

It also breaks if you export /usr/share to systems of different archs
(it seems nobody actually does that) and will break with multiarch (where
it is far more likely that someone will mix archs).

>>> So unless the program is restarted for every input (which would be the
>>> first thing to eliminate to improve responsiveness) there shouldn't be a
>>> problem with "fixing" this. It just means extra work you might not be
>>> willing (or have time) to invest.
>>
>> If we are ready to rewrite the core of such code, you are right. But if we
>> simply accept upstream code design, we will end up with multiple sets of
>> such semi-arch-dependent data in the archive as arch: any.
>>
>> Osamu
>
> IMHO it's really bad to maintain such a delta between Debian and upstream.

Depends on the size of the patch. Then again if the patch is small
upstream will probably just include it.

It will just be a matter of weighing costs and benefits. How much data
is there? How difficult would it be to convert the data at startup? How
difficult would it be to build a -le and a -be data package (on a single
arch)? And so on. There won't be one solution that fits all, and I think
there will be packages for each solution presented so far.
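
As a rough illustration of the -le/-be option: when the data generator
writes with an explicit byte order instead of the host's native one,
both variants can be produced on whichever single architecture runs the
build. The pack_table helper, the sample values, and the package names
in the comments are hypothetical; whether a given upstream generator
can be told to do this is a separate question.

```python
import struct


def pack_table(values, byte_order):
    """Serialize 16-bit unsigned values.

    byte_order is '<' for little-endian or '>' for big-endian, so the
    output does not depend on the build host.
    """
    return struct.pack(f"{byte_order}{len(values)}H", *values)


values = [1, 2, 258]                  # hypothetical table contents
le_blob = pack_table(values, "<")     # payload for a hypothetical foo-data-le
be_blob = pack_table(values, ">")     # payload for a hypothetical foo-data-be

print(le_blob.hex())  # 010002000201
print(be_blob.hex())  # 000100020102
```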

MfG
Goswin


--
Archive: http://lists.debian.org/8762fde4um.fsf@frosties.localnet
 
