Linux Archive

Goswin von Brederlow 02-09-2012 11:52 AM

Endianness of data files in MultiArch
 
Aron Xu <happyaron.xu@gmail.com> writes:

> On Thu, Feb 9, 2012 at 01:35, Simon McVittie <smcv@debian.org> wrote:
>> On 08/02/12 17:22, Aron Xu wrote:
>>> Some packages come with data files for which endianness matters, and many
>>> of them are large enough to split into a separate arch:all package if
>>> endianness were not something to care about. AFAIK some maintainers
>>> are not aware of endianness issues in their packages and have just
>>> ignored them (not sure how many, but any that are discovered
>>> should lead to an RC bug).
>>
>> Hopefully Jakub Wilk's automatic checks for conflicting files
>> <http://people.debian.org/~jwilk/multi-arch/> will already be picking
>> this up, in cases where the less-used-endianness architectures aren't
>> broken already.
>>
>> If the less-used-endianness architectures are already broken, that's
>> also a bug (potentially an RC one), just like code that compiles but
>> doesn't work on a particular endianness due to other assumptions - and
>> if nobody has noticed it yet, presumably the package doesn't have any
>> users (or regression tests) on those architectures.
>>
>
> Or some of them just gave up because it is a "less-used" architecture.
>
>>> It would be great to have some mechanism to
>>> handle this kind of problem in Debian, to avoid forcing such data to
>>> be placed into an arch:any package.
>>
>> If the right endianness is critical: libfoo:i386 Depends:
>> libfoo-data-le, libfoo:powerpc Depends: libfoo-data-be, both data
>> packages arch:all, data files in /usr/share/foo/le and /usr/share/foo/be
>> respectively?

I would have them conflict and use the same directory. Otherwise you
need to use different paths in the binary, docs and maybe even
conffiles (which would then be architecture dependent too).

> This does not look very nice, because we need to maintain a list of
> architectures in debian/control, and when new architectures are added
> the package is potentially broken.

If endian-dependent data is really a larger issue, then introduce a

dpkg-architecture -qDEB_HOST_ENDIANESS

> Also, arch:all packages are usually generated by the uploading DD on
> one architecture, mostly amd64 and i386 today; how can he manage to
> generate be data files if he doesn't have access to such a machine?
> Adding an option to the data generator/parser and making it able to
> generate be/le data on any architecture does not seem to be a reasonable
> approach.

That is indeed the biggest problem currently. Also a problem for the
idea of building arch:all packages on buildds. They might not build on
all archs.

>> Or just make sure the data has an endianness marker, and enhance the
>> reading package to do the right byteswapping based on the endianness
>> marker - e.g. this has been discussed for gettext, which ended up just
>> writing out the same endianness on all platforms. Many formats
>> (particularly those that originated on Windows) are always
>> little-endian, and big-endian platforms reading them just take the minor
>> performance hit; formats that respect "network byte order" have the
>> opposite situation.
>>
>
> This is valid for widely used applications/formats like gettext and images
> that are designed to behave in this way, but on the other hand there are
> upstreams that don't like to see such an impact, especially due to the
> complexity and performance cost.
>
> Currently I am using arch:any for data files which aren't affected
> by multiarch, i.e. not "same" or "foreign". For endianness-critical
> data that is required to make a library work, I have to force it
> to be installed into /usr/lib/<triplet>/$package/data/ and mark the package
> as "Multi-Arch: same". This is sufficient to avoid breakage, but again
> it consumes a lot of space on the mirrors.
>
> I thought about something like /usr/share/$package/data/{be,le} in
> arch:all, but it does not appear to be a reasonable solution because we need
> to modify the data generator/parser.

It should be possible to build a converter or generator that can output
either endianness. So you could have a single arch:all package with both
/usr/share/$package/data/{be,le} in it, or generate the right
endianness on install. That way the "performance impact" argument is
nonexistent.
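
As a rough illustration of the generator idea, here is a minimal sketch in
Python. The record layout (a 16-bit length plus a 32-bit frequency), the file
name table.bin and the function write_tables are all hypothetical; a real
package would use its upstream format. The point is only that one build
machine can emit both byte orders regardless of its own endianness.

```python
import struct
import tempfile
from pathlib import Path

# Hypothetical record layout: 16-bit length followed by 32-bit frequency.
RECORD = "HI"


def write_tables(records, data_dir):
    """Write the same records once per byte order under data_dir/{le,be}/."""
    data_dir = Path(data_dir)
    # "<" and ">" select little/big endian explicitly and disable padding,
    # so the output does not depend on the machine running the generator.
    for subdir, prefix in (("le", "<"), ("be", ">")):
        out = data_dir / subdir
        out.mkdir(parents=True, exist_ok=True)
        packer = struct.Struct(prefix + RECORD)
        with open(out / "table.bin", "wb") as fh:
            for rec in records:
                fh.write(packer.pack(*rec))


records = [(3, 1000), (5, 70000)]
with tempfile.TemporaryDirectory() as tmp:
    write_tables(records, tmp)
    le_bytes = (Path(tmp) / "le" / "table.bin").read_bytes()
    be_bytes = (Path(tmp) / "be" / "table.bin").read_bytes()
```

Both files carry the same records; only the byte order of each field differs,
so the reader can pick the directory matching its own architecture.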

MfG
Goswin


--
To UNSUBSCRIBE, email to debian-dpkg-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/87pqdoqict.fsf@frosties.localnet

Guillem Jover 02-09-2012 11:58 AM

Endianness of data files in MultiArch
 
On Thu, 2012-02-09 at 13:52:34 +0100, Goswin von Brederlow wrote:
> Aron Xu <happyaron.xu@gmail.com> writes:
> > This does not look very nice, because we need to maintain a list of
> > architectures in debian/control, and when new architectures are added
> > the package is potentially broken.
>
> If endian-dependent data is really a larger issue, then introduce a
>
> dpkg-architecture -qDEB_HOST_ENDIANESS

This already exists: DEB_BUILD_ARCH_ENDIAN and DEB_HOST_ARCH_ENDIAN
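
A minimal sketch of how a build helper might query that variable, written in
Python. The function name host_endianness and the le/be directory mapping are
hypothetical; the fallback to sys.byteorder is an assumption that only holds
for native (non-cross) builds, since it reports the byte order of the machine
running the script rather than of the host architecture.

```python
import subprocess
import sys


def host_endianness():
    """Return 'little' or 'big' for the host architecture."""
    try:
        result = subprocess.run(
            ["dpkg-architecture", "-qDEB_HOST_ARCH_ENDIAN"],
            check=True,
            capture_output=True,
            text=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # Assumption: dpkg-architecture is unavailable, so fall back to the
        # interpreter's own byte order (correct for native builds only).
        return sys.byteorder


endian = host_endianness()
# Hypothetical mapping onto the {be,le} directory names used in this thread.
data_subdir = {"little": "le", "big": "be"}[endian]
```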

regards,
guillem


Archive: http://lists.debian.org/20120209125803.GA12356@gaara.hadrons.org

Goswin von Brederlow 02-09-2012 02:04 PM

Endianness of data files in MultiArch
 
Guillem Jover <guillem@debian.org> writes:

> On Thu, 2012-02-09 at 13:52:34 +0100, Goswin von Brederlow wrote:
>> Aron Xu <happyaron.xu@gmail.com> writes:
>> > This does not look very nice, because we need to maintain a list of
>> > architectures in debian/control, and when new architectures are added
>> > the package is potentially broken.
>>
>> If endian-dependent data is really a larger issue, then introduce a
>>
>> dpkg-architecture -qDEB_HOST_ENDIANESS
>
> This already exists: DEB_BUILD_ARCH_ENDIAN and DEB_HOST_ARCH_ENDIAN
>
> regards,
> guillem

Even better. Should have tested in a sid chroot.

MfG
Goswin


Archive: http://lists.debian.org/874nv0oxo4.fsf@frosties.localnet

Aron Xu 02-10-2012 12:39 AM

Endianness of data files in MultiArch
 
On Thu, Feb 9, 2012 at 20:52, Goswin von Brederlow <goswin-v-b@web.de> wrote:
>
> It should be possible to build a converter or generator that can output
> either endianness. So you could have a single arch:all package with both
> /usr/share/$package/data/{be,le} in it, or generate the right
> endianness on install. That way the "performance impact" argument is
> nonexistent.
>

Yes, it's "possible", but it requires a lot of additional work from both
upstream and the Debian maintainer to take care of this case. IMHO this
idea is not very constructive for finding a better solution than the
current approach.


--
Regards,
Aron Xu


Archive: http://lists.debian.org/CAMr=8w66hu3L_nyr9egMJbTdQ4Kyr8iSsaHc2hUNgtGq0p4Gg g@mail.gmail.com

Goswin von Brederlow 02-10-2012 10:59 AM

Endianness of data files in MultiArch
 
Aron Xu <happyaron.xu@gmail.com> writes:

> Sorry, the thread was broken and I saw your reply just now.
>
> On Thu, Feb 9, 2012 at 16:23, Jan Hauke Rahm <jhr@debian.org> wrote:
>> On Thu, Feb 09, 2012 at 01:58:28AM +0800, Aron Xu wrote:
>>>
>>> This is valid for widely used applications/formats like gettext and images
>>> that are designed to behave in this way, but on the other hand there are
>>> upstreams that don't like to see such an impact, especially due to the
>>> complexity and performance cost.
>>>
>>> Currently I am using arch:any for data files which aren't affected
>>> by multiarch, i.e. not "same" or "foreign". For endianness-critical
>>> data that is required to make a library work, I have to force it
>>> to be installed into /usr/lib/<triplet>/$package/data/ and mark the package
>>> as "Multi-Arch: same". This is sufficient to avoid breakage, but again
>>> it consumes a lot of space on the mirrors.
>>
>> Actually, what is "a lot" here? I mean, how many libraries are there
>> containing endianness-critical data and how big are the actual files?
>> Not that I'm any kind of expert, but this solution sounds reasonable to
>> me.
>>
>> Hauke
>>
>
> As far as I know, there aren't too many libraries known to have
> endianness-critical data, but there might be landmines because the
> maintainers just aren't aware of it.
>
> I have had the chance to notice this problem because my team maintains
> several stacks of input methods, which usually need to deal with
> linguistic data. [1]
>
> I have a library named libpinyin at hand to package, which has
> data files of ~7.5MiB after gzip -9 (the total size of this
> library is no more than 9MiB after gzip -9). We have 14 architectures
> on ftp-master, so the data files eat up 105MiB, while if we find some
> way to have only one copy each for be/le, they will only use 15MiB. And
> consider that once it is released in stable, a new copy of the data
> makes its way into the archive whenever a new version gets uploaded to
> unstable.
>
> This concern also applies to other endianness-critical data that is
> not affected by Multi-Arch at present; we need to make it arch:any,
> and in the end it eats more and more space.
>
> [1] Performance is critical for these applications. This doesn't mean
> they consume a lot of CPU, but they must respond very quickly
> to the user's input - do some complex calculations to split a sentence
> into words and find a list of the most related suggestions, which
> needs to query 10^5 ~ 10^6 lines of data several times to
> complete such an action. There was a project that tried to use something
> like SQLite3, but the performance was a bit frustrating, so they have now
> decided not to care about that and just design a data format that
> fits their requirements.
> --
> Regards,
> Aron Xu

It doesn't sound like the data is too big to fit into RAM, and it sounds
like the overhead of fetching data from disk on demand would slow you
down. So there seems to be no reason not to have architecture-independent
data on disk and convert it to the right endianness on startup. Sure,
startup time would increase a bit, but running time would remain
unaffected.

So unless the program is restarted for every input (which would be the
first thing to eliminate to improve responsiveness) there shouldn't be a
problem with "fixing" this. It just means extra work you might not be
willing (or have time) to invest.
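
A minimal sketch of such a startup conversion, in Python. It assumes a
hypothetical on-disk format that is always little-endian with 32-bit unsigned
values; load_table and ON_DISK_BYTEORDER are made-up names, not anything from
libpinyin. The swap happens once at load time, after which every lookup works
on native-endian values in memory.

```python
import sys
from array import array

# Assumption: the on-disk format is always little-endian, 32-bit unsigned.
ON_DISK_BYTEORDER = "little"


def load_table(raw):
    """Decode on-disk values into an in-memory array in native byte order."""
    values = array("I")
    if values.itemsize != 4:
        raise RuntimeError("expected a 4-byte 'I' type code on this platform")
    values.frombytes(raw)  # interprets the bytes in native order
    if sys.byteorder != ON_DISK_BYTEORDER:
        values.byteswap()  # single pass at startup on big-endian machines
    return values


# Two sample values serialised the way the hypothetical file stores them.
raw = (1).to_bytes(4, "little") + (70000).to_bytes(4, "little")
table = load_table(raw)
```

On little-endian machines the swap is skipped entirely, so only the less
common architectures pay the extra startup cost.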

MfG
Goswin

PS: ia32-libs is about 1GB and is going away. So there should be space
now for 10 more sources like yours. :)
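
For reference, the archive figures quoted above work out as follows. This is
only the arithmetic from the numbers given in this thread (7.5MiB of data, 14
architectures), treating "about 1GB" loosely.

```python
data_mib = 7.5        # compressed data size quoted for libpinyin
architectures = 14    # architectures on ftp-master, as quoted

per_arch_copies = data_mib * architectures  # one copy per architecture
two_copies = data_mib * 2                   # one be copy plus one le copy
saving = per_arch_copies - two_copies       # space saved per source

# Ten sources of this size, each carried once per architecture.
ten_sources = 10 * per_arch_copies

print(per_arch_copies, two_copies, saving, ten_sources)
```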


Archive: http://lists.debian.org/87fweihpbg.fsf@frosties.localnet

Osamu Aoki 02-10-2012 03:14 PM

Endianness of data files in MultiArch
 
Hi,

On Fri, Feb 10, 2012 at 12:59:15PM +0100, Goswin von Brederlow wrote:
> Aron Xu <happyaron.xu@gmail.com> writes:
>
> > Sorry, the thread was broken and I saw your reply just now.
> >
> > On Thu, Feb 9, 2012 at 16:23, Jan Hauke Rahm <jhr@debian.org> wrote:
> >> On Thu, Feb 09, 2012 at 01:58:28AM +0800, Aron Xu wrote:
> >>>
> >>> This is valid for widely used applications/formats like gettext and images
> >>> that are designed to behave in this way, but on the other hand there are
> >>> upstreams that don't like to see such an impact, especially due to the
> >>> complexity and performance cost.
> >>>
> >>> Currently I am using arch:any for data files which aren't affected
> >>> by multiarch, i.e. not "same" or "foreign". For endianness-critical
> >>> data that is required to make a library work, I have to force it
> >>> to be installed into /usr/lib/<triplet>/$package/data/ and mark the package
> >>> as "Multi-Arch: same". This is sufficient to avoid breakage, but again
> >>> it consumes a lot of space on the mirrors.
> >>
> >> Actually, what is "a lot" here? I mean, how many libraries are there
> >> containing endianness-critical data and how big are the actual files?
> >> Not that I'm any kind of expert, but this solution sounds reasonable to
> >> me.
> >>
> >> Hauke
> >>
> >
> > As far as I know, there aren't too many libraries known to have
> > endianness-critical data, but there might be landmines because the
> > maintainers just aren't aware of it.
> >
> > I have had the chance to notice this problem because my team maintains
> > several stacks of input methods, which usually need to deal with
> > linguistic data. [1]
> >
> > I have a library named libpinyin at hand to package, which has
> > data files of ~7.5MiB after gzip -9 (the total size of this
> > library is no more than 9MiB after gzip -9). We have 14 architectures
> > on ftp-master, so the data files eat up 105MiB, while if we find some
> > way to have only one copy each for be/le, they will only use 15MiB. And
> > consider that once it is released in stable, a new copy of the data
> > makes its way into the archive whenever a new version gets uploaded to
> > unstable.

Just think of any phrase data that stores its content size as a 16-bit integer.
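
A small Python illustration of why a 16-bit size field is enough to make a
data file endianness-dependent. The value 300 is an arbitrary example, not
taken from any real dictionary.

```python
import struct

length = 300  # arbitrary example value, 0x012C

le = struct.pack("<H", length)  # bytes 2C 01, as written on little endian
be = struct.pack(">H", length)  # bytes 01 2C, as written on big endian

# A big-endian reader handed the little-endian file sees 0x2C01 instead.
misread = struct.unpack(">H", le)[0]

print(le.hex(), be.hex(), misread)  # prints: 2c01 012c 11265
```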

I have a bigger example :-)

ipadic: Uncompressed size: 44.5 M

For this one, I made the data arch:any, which builds many binary packages.
Similar packages use an install-time conversion trick to stay "arch: all",
but then installation takes time.

naist-jdic: Uncompressed size: 28.5 M (based on my vague memory)

> > This concern also applies to other endianness-critical data that is
> > not affected by Multi-Arch at present; we need to make it arch:any,
> > and in the end it eats more and more space.
> >
> > [1] Performance is critical for these applications. This doesn't mean
> > they consume a lot of CPU, but they must respond very quickly
> > to the user's input - do some complex calculations to split a sentence
> > into words and find a list of the most related suggestions, which
> > needs to query 10^5 ~ 10^6 lines of data several times to
> > complete such an action. There was a project that tried to use something
> > like SQLite3, but the performance was a bit frustrating, so they have now
> > decided not to care about that and just design a data format that
> > fits their requirements.
> > --
> > Regards,
> > Aron Xu
>
> It doesn't sound like the data is too big to fit into RAM, and it sounds
> like the overhead of fetching data from disk on demand would slow you
> down. So there seems to be no reason not to have architecture-independent
> data on disk and convert it to the right endianness on startup. Sure,
> startup time would increase a bit, but running time would remain
> unaffected.

I think the PO file case is manageable. They can use one endianness for
all platforms.

But for any other special-purpose natural language processing
code, it is impossible to force upstream to complicate their code to use
a particular endianness.

> So unless the program is restarted for every input (which would be the
> first thing to eliminate to improve responsiveness) there shouldn't be a
> problem with "fixing" this. It just means extra work you might not be
> willing (or have time) to invest.

If we are ready to rewrite the core of such code, you are right. But if we
simply accept the upstream code design, we will end up putting multiple
such semi-arch-dependent data sets into the archive as arch: any.

Osamu


Archive: http://lists.debian.org/20120210161440.GA5330@localhost


All times are GMT.