09-01-2012, 11:32 PM
"Dan B."

What does charset in locale setting affect?

In a locale setting such as en_US.UTF-8 (e.g., LANG=en_US.UTF-8),
what exactly does the charset/character encoding part (UTF-8) affect?

Which common programs (e.g., getty, xterm/etc., sed/grep?) do something
different based on the charset portion of the locale setting?


Thanks,
Daniel


--
To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org

Archive: http://lists.debian.org/50429B20.7030305@kempt.net
 
09-02-2012, 09:53 AM
Roger Leigh

On Sat, Sep 01, 2012 at 07:32:48PM -0400, Dan B. wrote:
> In a locale setting such as en_US.UTF-8 (e.g., LANG=en_US.UTF-8),
> what exactly does the charset/character encoding part (UTF-8) affect?

This affects the character encoding that programs use for input
and output. For example, the character ‘á’ (Unicode code point
U+00E1) is written in UTF-8 as the byte sequence
0xc3 0xa1
However, in a Latin 1 (ISO-8859-1) locale, it is written as the
single byte
0xe1
and in other encodings, the byte sequence differs yet again.

> Which common programs (e.g., getty, xterm/etc., sed/grep?) do something
> different based on the charset portion of the locale setting?

All of them, in short.

When you run a terminal emulator such as xterm, it determines the
encoding to use inside the emulator by calling nl_langinfo(3) with
CODESET, which returns the name of the character encoding used in the
locale. This tells it which encoding programs will use for their
output, so that it can display that output correctly, and likewise
which encoding to use for the input it sends to them. If the encoding
were wrong, it would display garbage.

When you run sed/grep, the encoding affects how they process the
text. It's therefore important to use the same encoding in your files
as you have set in your locale. Before we had UTF-8, files in the old
8-bit encodings didn't necessarily match your locale, and there was no
way to tell which encoding a file was supposed to be in, so using
UTF-8 everywhere has been a massive improvement.

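This is generally transparent, provided you use the wide-character
interfaces. For example, if you were to write (in C) the following
code:

#include <stdio.h>
#include <locale.h>

int main(void)
{
    /* Adopt the user's locale from the environment. */
    setlocale(LC_ALL, "");
    /* %ls converts the wide string to the locale's encoding. */
    printf("%ls\n", L"á");
    return 0;
}

This will work in any locale whose encoding can represent ‘á’. GCC
assumes UTF-8 source by default and stores the wide literal as a
Unicode code point (on Linux with glibc), and the C library converts it
to the user's locale encoding on output. A plain narrow literal such as
"á\n" is not converted: printf writes its bytes unchanged (UTF-8 with
GCC's defaults), so it only displays correctly in a UTF-8 locale.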

Nowadays, there's little reason to use any encoding other than UTF-8;
the character repertoires of nearly all the others are subsets of
Unicode, and they are only present for legacy and compatibility reasons.


Regards,
Roger

--
.'`. Roger Leigh
: :' : Debian GNU/Linux http://people.debian.org/~rleigh/
`. `' schroot and sbuild http://alioth.debian.org/projects/buildd-tools
`- GPG Public Key F33D 281D 470A B443 6756 147C 07B3 C8BC 4083 E800


Archive: http://lists.debian.org/20120902095315.GD3198@codelibre.net
 
09-02-2012, 03:08 PM
Camalen

On Sat, 01 Sep 2012 19:32:48 -0400, Dan B. wrote:

> In a locale setting such as en_US.UTF-8 (e.g., LANG=en_US.UTF-8), what
> exactly does the charset/character encoding part (UTF-8) affect?
>
> Which common programs (e.g., getty, xterm/etc., sed/grep?) do something
> different based on the charset portion of the locale setting?

Debian's Reference Manual has a small section about that ("8.3.2.
Rationale for UTF-8 locale"):

http://www.debian.org/doc/manuals/debian-reference/ch08.en.html

Anyway, it seems that today everything and everybody is moving towards
Unicode and UTF-8.

Greetings,

--
Camalen


Archive: http://lists.debian.org/k1vsqb$tli$7@ger.gmane.org
 
09-03-2012, 03:11 AM
"Dan B."

Roger Leigh wrote:
> On Sat, Sep 01, 2012 at 07:32:48PM -0400, Dan B. wrote:
> ...
>
>> Which common programs (e.g., getty, xterm/etc., sed/grep?) do something
>> different based on the charset portion of the locale setting?
>
> All of them, in short.
>
> When you run a terminal emulator such as xterm, it will get the
> encoding to use inside the emulator using nl_langinfo(3). ...


What about the virtual consoles?

Whether I choose a default system locale of UTF-8 or None (in the
dialog for "dpkg-reconfigure locales"), and log out and log in (to
make sure the shell has a chance to get fresh settings), then

echo $'\xC2\xA2'

displays the same thing (the cent sign).

Is the virtual console supposed to follow the locale's character
encoding? If so, does something else (e.g., something in /etc/init.d/)
need to be run to make a difference?


No, I'm not actually trying to turn off using UTF-8. I'm just trying
to find out how things work (what actually is affected by the locale
settings).


Actually, what I really want to know is how to revert the sorting of
file names from ls (and Emacs dired listings) from the order caused
by having "en_US" in LANG=en_US.UTF-8 back to the traditional (old)
Unix order (e.g., what LANG=C would yield) without messing up all the
UTF-8 support that's all over Linux now.


First of all, can UTF-8 be combined with the "C" locale as in
LANG=C.UTF-8?

Do I probably want something closer to LANG=en_US.UTF-8 LC_COLLATE=C
(in order to reduce the amount of locale settings I'm overriding)?




> When you run sed/grep, the encoding will affect how it processes the
> text.


Are you sure about sed?

I tried probing how LANG= vs. LANG=en_US.UTF-8 affected whether
the regular expression "[a-z]" matched "X". Grep seems to be
affected as expected, but sed never matched. (That's on Squeeze.)

Daniel




Archive: http://lists.debian.org/50441FFC.7040507@kempt.net
 
09-03-2012, 11:13 AM
Roger Leigh

On Sun, Sep 02, 2012 at 11:11:56PM -0400, Dan B. wrote:
> Roger Leigh wrote:
> >On Sat, Sep 01, 2012 at 07:32:48PM -0400, Dan B. wrote:
> >...
> >
> >>Which common programs (e.g., getty, xterm/etc., sed/grep?) do something
> >>different based on the charset portion of the locale setting?
> >
> >All of them, in short.
> >
> >When you run a terminal emulator such as xterm, it will get the
> >encoding to use inside the emulator using nl_langinfo(3). ...
>
> What about the virtual consoles?

Virtual consoles are slightly different. Because they start up
/before/ you log in, they switch Unicode mode on or off depending
on the default system locale (/etc/default/locale). See
unicode_start_stop in /etc/init.d/console-screen.kbd.sh. You can
switch them into Unicode mode with unicode_start, which sends an
escape sequence to select the ISO-2022 UTF-8 charset.

> Whether I choose a default system locale of UTF-8 or None (in the
> dialog for "dpkg-reconfigure locales"), and log out and log in (to
> make sure the shell has a chance to get fresh settings), then
>
> echo $'\xC2\xA2'
>
> displays the same thing (the cent sign).

"None" might result in UTF-8 as a default. Try ISO-8859-1 to
explicitly specify a non-Unicode locale. Note that you'll
need to generate a suitable locale, e.g. en_GB.ISO-8859-1, with
locale-gen/localedef.

> Is the virtual console supposed to follow the locale's character
> encoding? If so, does something else (e.g., something in /etc/init.d/)
> need to be run to make a difference?

/etc/init.d/console-screen.kbd.sh as above.

> Actually, what I really want to know is how to revert the sorting of
> file names from ls (and Emacs dired listings) from the order caused
> by having "en_US" in LANG=en_US.UTF-8 back to the traditional (old)
> Unix order (e.g., what LANG=C would yield) without messing up all the
> UTF-8 support that's all over Linux now.

> First of all, can UTF-8 be combined with the "C" locale as in
> LANG=C.UTF-8?

Yes (and no). You can certainly generate such a locale. In fact, I'm
a strong proponent of having a C.UTF-8 locale as the default locale
in glibc. However, right now if you generate it (which is possible),
it's not completely compatible with a real C locale (i.e. conformant
with the C and POSIX standards). Hopefully this will be the case in
the future.

> Do I probably want something closer to LANG=en_US.UTF-8 LC_COLLATE=C
> (in order to reduce the amount of locale settings I'm overriding)?

Just set LC_COLLATE=C. That way you keep the UTF-8 LC_CTYPE, but the
sort order is taken from C. However, C collation compares raw byte
values, so characters outside the ASCII range sort by code point rather
than linguistically (for example, ‘é’ sorts after ‘z’).
[Note: you probably do not want this!] In general, I would advise
using the default collation for your locale, though in code it's
common to switch to C for locale-independent sorting.

> >When you run sed/grep, the encoding will affect how it processes the
> >text.
>
> Are you sure about sed?
>
> I tried probing how LANG= vs. LANG=en_US.UTF-8 affected whether
> the regular expression "[a-z]" matched "X". Grep seems to be
> affected as expected, but sed never matched. (That's on Squeeze.)

It's the same version in wheezy, so I would not expect a change here.
I'm not sure how [a-z] matches; I'd have to check whether it's
locale-independent. In general, I'd use POSIX character classes like
[[:alpha:]], [[:upper:]] and [[:lower:]] to work properly in all locales.


Regards,
Roger

--
.'`. Roger Leigh
: :' : Debian GNU/Linux http://people.debian.org/~rleigh/
`. `' schroot and sbuild http://alioth.debian.org/projects/buildd-tools
`- GPG Public Key F33D 281D 470A B443 6756 147C 07B3 C8BC 4083 E800


Archive: http://lists.debian.org/20120903111323.GI3198@codelibre.net
 
09-03-2012, 04:46 PM
Tom H

On Sun, Sep 2, 2012 at 11:11 PM, Dan B. <danb@kempt.net> wrote:
>
> Are you sure about sed?
>
> I tried probing how LANG= vs. LANG=en_US.UTF-8 affected whether
> the regular expression "[a-z]" matched "X". Grep seems to be
> affected as expected, but sed never matched. (That's on Squeeze.)

What commands did you use?


Archive: http://lists.debian.org/CAOdo=Swrjr-sc2XVdu3DTmHV-r7LTDFRg4HspRcpZDdgbUFAQQ@mail.gmail.com
 
