11-04-2010, 06:29 PM
Rob Gom

Locales/sort bug

Hi all,
do you think it's a bug in either libc or coreutils (sort)?

$ cat test.csv
aph3,"APP",""
aph3_devel,"TXT",""
aph3,"MiB",""

$ LC_ALL=C sort test.csv # expected
aph3,"APP",""
aph3,"MiB",""
aph3_devel,"TXT",""

$ LC_ALL=pl_PL sort test.csv # why is that?
aph3,"APP",""
aph3_devel,"TXT",""
aph3,"MiB",""

$ LC_ALL=pl_PL.UTF-8 sort test.csv # another unexpected output
aph3,"APP",""
aph3_devel,"TXT",""
aph3,"MiB",""

Could anyone give me a hint? I know that this is LC_COLLATE related
(LC_ALL as shorter version), but don't know whether it is my fault or
upstream bug.

I'd appreciate any comments.

Regards,
Robert


--
To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/AANLkTimSt_3JNPkWaHYV81C4=A07UdQo5unnDm47hywc@mail.gmail.com
 
11-04-2010, 07:16 PM
Camalen

Locales/sort bug

On Thu, 04 Nov 2010 20:29:02 +0100, Rob Gom wrote:

> do you think it's a bug in either libc or coreutils (sort)?
>
> $ cat test.csv
> aph3,"APP",""
> aph3_devel,"TXT",""
> aph3,"MiB",""
>
> $ LC_ALL=C sort test.csv # expected
> aph3,"APP",""
> aph3,"MiB",""
> aph3_devel,"TXT",""
>
> $ LC_ALL=pl_PL sort test.csv # why is that?
> aph3,"APP",""
> aph3_devel,"TXT",""
> aph3,"MiB",""
>
> $ LC_ALL=pl_PL.UTF-8 sort test.csv # another unexpected output
> aph3,"APP",""
> aph3_devel,"TXT",""
> aph3,"MiB",""
>
> Could anyone give me a hint? I know that this is LC_COLLATE related
> (LC_ALL as shorter version), but don't know whether it is my fault or
> upstream bug.

I'm also getting that behaviour (locale set to "es_ES.UTF-8"), so I
understand that my locale setting dictates that the underscore ("_") sorts
before the comma (",").

As per "man sort" page:

*** WARNING *** The locale specified by the environment affects sort
order. Set LC_ALL=C to get the traditional sort order that uses native
byte values.
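
The byte-value ordering that the man page describes can be checked directly. What follows is a minimal sketch, assuming a POSIX shell and the sample rows from the original post; the function name byte_sort is made up for illustration.

```shell
# byte_sort: hypothetical helper, sorts stdin by raw byte values (C locale).
byte_sort() {
    LC_ALL=C sort
}

# Numeric byte values of the two characters under discussion.
# POSIX printf: a leading quote in the argument yields the character's code.
comma_code=$(printf '%d' "',")
underscore_code=$(printf '%d' "'_")
echo "comma=$comma_code underscore=$underscore_code"

# The three sample rows, sorted by byte value.
sorted=$(printf '%s\n' 'aph3,"APP",""' 'aph3_devel,"TXT",""' 'aph3,"MiB",""' | byte_sort)
echo "$sorted"
```

The comma is byte 44 and the underscore is byte 95, so in the C locale both "aph3," rows come before "aph3_devel".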

Do you think that is a bug? :-?

Greetings,

--
Camalen


Archive: http://lists.debian.org/pan.2010.11.04.20.16.48@gmail.com
 
11-04-2010, 07:19 PM
Ron Johnson

Locales/sort bug

On 11/04/2010 02:29 PM, Rob Gom wrote:

> Hi all,
> do you think it's a bug in either libc or coreutils (sort)?
>
> $ cat test.csv
> aph3,"APP",""
> aph3_devel,"TXT",""
> aph3,"MiB",""
>
> $ LC_ALL=C sort test.csv # expected
> aph3,"APP",""
> aph3,"MiB",""
> aph3_devel,"TXT",""
>
> $ LC_ALL=pl_PL sort test.csv # why is that?
> aph3,"APP",""
> aph3_devel,"TXT",""
> aph3,"MiB",""
>
> $ LC_ALL=pl_PL.UTF-8 sort test.csv # another unexpected output
> aph3,"APP",""
> aph3_devel,"TXT",""
> aph3,"MiB",""
>
> Could anyone give me a hint? I know that this is LC_COLLATE related
> (LC_ALL as shorter version), but don't know whether it is my fault or
> upstream bug.
>
> I'd appreciate any comments.



While it *might* be an upstream bug, it's unlikely. (The first
thing I learned in my first CompSci class is that it's not the
compiler's fault that my program doesn't work...)


You just don't know what the Polish "ASCII" collating sequence is.

--
Seek truth from facts.


Archive: http://lists.debian.org/4CD31538.7060804@cox.net
 
11-04-2010, 07:23 PM
Rob Gom

Locales/sort bug

[cut]
>
> I'm also getting that behaviour (locale set to "es_ES.UTF-8"), so I
> understand that my locale setting dictates that the underscore ("_") sorts
> before the comma (",").
>
> As per "man sort" page:
>
> *** WARNING *** The locale specified by the environment affects sort
> order. Set LC_ALL=C to get the traditional sort order that uses native
> byte values.
>
> Do you think that is a bug? :-?
>
> Greetings,
>
> --
> Camalen

If so, why do I get the order comma, underscore, comma? Even better,
comma+quote+A, underscore+d, comma+quote+M. I don't get it...

Regards,
Robert


Archive: http://lists.debian.org/AANLkTi=VM8um=jxkziGySYTyzd97F39YsynofBHHC8d5@mail.gmail.com
 
11-04-2010, 07:25 PM
Sven Joachim

Locales/sort bug

On 2010-11-04 20:29 +0100, Rob Gom wrote:

> Hi all,
> do you think it's a bug in either libc or coreutils (sort)?
>
> $ cat test.csv
> aph3,"APP",""
> aph3_devel,"TXT",""
> aph3,"MiB",""
>
> $ LC_ALL=C sort test.csv # expected
> aph3,"APP",""
> aph3,"MiB",""
> aph3_devel,"TXT",""
>
> $ LC_ALL=pl_PL sort test.csv # why is that?
> aph3,"APP",""
> aph3_devel,"TXT",""
> aph3,"MiB",""
>
> $ LC_ALL=pl_PL.UTF-8 sort test.csv # another unexpected output
> aph3,"APP",""
> aph3_devel,"TXT",""
> aph3,"MiB",""
>
> Could anyone give me a hint? I know that this is LC_COLLATE related
> (LC_ALL as shorter version), but don't know whether it is my fault or
> upstream bug.
>
> I'd appreciate any comments.

This is covered by the coreutils FAQ:
http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021

Sven


Archive: http://lists.debian.org/87lj58n8em.fsf@turtle.gmx.de
 
11-04-2010, 07:43 PM
Rob Gom

Locales/sort bug

[cut]
>
> This is covered by the coreutils FAQ:
> http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021
>
> Sven
>
Thanks for all the answers.

How can I check that the collation is defined correctly? I understand
that LC_COLLATE influences the sort operation, but I am not sure whether
this result is OK.
The simplest example that causes the weird behaviour is:

$ cat test2.csv
,"A
_d
,"M


$ LC_ALL=pl_PL sort test2.csv # and many other LC_COLLATE variants, other than C/POSIX
,"A
_d
,"M

To get such behaviour, ',"' would have to be defined as a single
entity in the collation definition, equal in ordering to '_'. I have no
other explanation for it. Unfortunately, I am not good enough to
understand/verify the collation definitions in /usr/share/i18n.
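
Another possible explanation, offered here as an assumption rather than something confirmed in this thread: the coreutils FAQ linked above describes non-C locale ordering as one that ignores punctuation and folds case. The sketch below only imitates that idea in the C locale by deleting punctuation and lowercasing before sorting. It is not how the C library's collation is implemented, and the helper names first_pass_key and approx_locale_sort are hypothetical.

```shell
# first_pass_key: hypothetical helper that imitates a comparison in which
# punctuation is ignored and upper/lower case are treated alike.
first_pass_key() {
    printf '%s' "$1" | LC_ALL=C tr -d '[:punct:]' | LC_ALL=C tr '[:upper:]' '[:lower:]'
}

# Decorate each line with its key, sort on the key by byte value, then
# strip the key again (decorate-sort-undecorate).
approx_locale_sort() {
    while IFS= read -r line; do
        printf '%s\t%s\n' "$(first_pass_key "$line")" "$line"
    done | LC_ALL=C sort -s -t "$(printf '\t')" -k1,1 | cut -f2-
}

# The three lines from test2.csv above.
result=$(printf '%s\n' ',"A' '_d' ',"M' | approx_locale_sort)
echo "$result"
```

Under this approximation the keys are a, d and m, which reproduces the ,"A / _d / ,"M order without treating ',"' as a single entity. For the original rows the keys would be aph3app, aph3devel and aph3mib, matching the APP, devel, MiB order seen under pl_PL.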

Regards,
Robert


Archive: http://lists.debian.org/AANLkTik3FUbMR0OcLOChJcOsyQwHCAHgSjvOhTFrjH4e@mail.gmail.com
 
11-04-2010, 07:55 PM
Rob Gom

Locales/sort bug

One more thing.
If I set LC_COLLATE to C/POSIX, special characters sort as expected,
but I lose the Polish character ordering.
If I set LC_COLLATE to pl_PL.UTF-8, the Polish character ordering is
fine, but sorting goes crazy with special characters.
Is it possible to have both?

carramba@laptop-rg:/tmp$ cat test2.csv
,"A
_d
,"M
a
ą
b
ż
ć
z
carramba@laptop-rg:/tmp$ LC_ALL=POSIX sort test2.csv
,"A
,"M
_d
a
b
z
ą
ć
ż

# above - correct special characters, Polish in wrong order

carramba@laptop-rg:/tmp$ LC_ALL=pl_PL.UTF-8 sort test2.csv
a
,"A
ą
b
ć
_d
,"M
z
ż

# above - correct Polish characters order, incorrect special characters

Feel free to replace 'correct' with 'expected' in my posts, I'm just
trying to understand what's under the hood.

Regards,
Robert


Archive: http://lists.debian.org/AANLkTi=pu2NP+fiLSqx6Vcjj32sqNYAJQpOPQw33vpC2@mail.gmail.com
 
11-04-2010, 08:06 PM
Rob Gom

Locales/sort bug

I have a workaround of sorts.
When I know the field separator (which was the case in my original
example), I can use it to get around the problem:

$ LC_ALL=pl_PL.UTF-8 sort -k1,1 -t',' test.csv
aph3,"APP",""
aph3,"MiB",""
aph3_devel,"TXT",""
# everything fine

$ LC_ALL=pl_PL.UTF-8 sort test.csv
aph3,"APP",""
aph3_devel,"TXT",""
aph3,"MiB",""
# previous results, unexpected

My conclusion for now would be:
- if you don't know the field separator
-- if there are only ASCII characters - use the POSIX collation
-- if there are other characters (i18n) - I don't have a solution
- if you know the field separator
-- specify it in the sort command
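
The "known field separator" case can be sketched as a small function. This assumes comma-separated rows with the sort key in the first field; the name sort_by_first_field and its locale argument are made up here, and the C locale is used in the run only so that the result does not depend on which locales are installed.

```shell
# sort_by_first_field: hypothetical helper.
#   $1: locale name passed to LC_ALL (defaults to C)
# -t ',' sets the field separator, -k1,1 limits the sort key to field 1.
sort_by_first_field() {
    LC_ALL="${1:-C}" sort -t ',' -k1,1
}

# Rows from the original example, deliberately fed in a different order.
rows=$(printf '%s\n' 'aph3_devel,"TXT",""' 'aph3,"MiB",""' 'aph3,"APP",""')

# The C locale keeps this run reproducible.
by_field=$(printf '%s\n' "$rows" | sort_by_first_field C)
echo "$by_field"
```

Calling sort_by_first_field pl_PL.UTF-8 would apply the same key restriction under the Polish locale, provided that locale is installed. Because the key ends at the first comma, "aph3" is compared with "aph3_devel" as a whole field, and the quoted columns only matter for breaking ties between rows with equal keys.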

Regards,
Robert


Archive: http://lists.debian.org/AANLkTinLoxDzJ9hdAfvbCQ8++J5V0Jecf5WKHkSy46R3@mail.gmail.com
 
