Linux Archive (http://www.linux-archive.org/), Gentoo Development (http://www.linux-archive.org/gentoo-development/)
Thread: LANG=en_GB.UTF-8 by default (http://www.linux-archive.org/gentoo-development/633227-lang-en_gb-utf-8-default.html)

"Francesco R.(vivo)" 02-15-2012 10:58 AM

LANG=en_GB.UTF-8 by default
 
As the subject says, could Gentoo change its policy and set a UTF-8
environment by default?

http://www.gentoo.org/doc/en/utf-8.xml explains how to do it very well, but
having it already set could have the following two advantages:

1) UTF-8 is everywhere; even the linux weekly newsletter has it in 2012.
2) The user would only need to change an /etc/env.d/XX-lc file, not create
one, which gives every Gentoo install a standard place for these settings.

Contra?

P.S. It would be nice to have a wd_WD.UTF-8, with WD standing for world;
just a country is so 1900.

"Mr. Aaron W. Swenson" 02-15-2012 11:22 AM

On Wed, Feb 15, 2012 at 12:58:52PM +0100, Francesco R.(vivo) wrote:
> As the subject says, could Gentoo change its policy and set a UTF-8
> environment by default?
>
> http://www.gentoo.org/doc/en/utf-8.xml explains how to do it very well, but
> having it already set could have the following two advantages:
>
> 1) UTF-8 is everywhere; even the linux weekly newsletter has it in 2012.
> 2) The user would only need to change an /etc/env.d/XX-lc file, not create
> one, which gives every Gentoo install a standard place for these settings.
>
> Contra?
>
> P.S. It would be nice to have a wd_WD.UTF-8, with WD standing for world;
> just a country is so 1900.
>

wd_WD.UTF-8 is certainly a no-go. WD doesn't match any ISO country
code. To support it, we'd have to create the necessary supporting
files and that would lead to a lot of work and headaches trying to
determine what should be where in what order, et cetera.

All of the files we create (ebuilds, initscripts) are UTF-8 in
accordance with GLEP 31. So, the issue would be with upstream projects
not using UTF-8 for their files.

However, the stage 3, last time I used it, didn't default to a UTF-8
environment, and it didn't default to using and/or including a capable
UTF-8 font. It is something I think we should look at changing.

--
Mr. Aaron W. Swenson
Gentoo Linux
Developer, Proxy Committer
Email : titanofold@gentoo.org
GnuPG FP : 2C00 7719 4F85 FB07 A49C 0E31 5713 AA03 D1BB FDA0
GnuPG ID : D1BBFDA0

Kerin Millar 02-18-2012 01:31 AM

On 15/02/2012 12:22, Mr. Aaron W. Swenson wrote:
> On Wed, Feb 15, 2012 at 12:58:52PM +0100, Francesco R.(vivo) wrote:
>> As the subject says, could Gentoo change its policy and set a UTF-8
>> environment by default?


Perhaps it should define LANG="en_US.UTF-8" as a reasonable default,
which would be in line with other notable distros. Arch also used to
define LC_COLLATE="C" by default, probably to mitigate unpredictable
behaviour in some applications, but have since dropped this additional
variable so they must have deemed it no longer necessary.


I think that having a default configuration file would also raise
awareness of the importance of locale configuration and make it less
likely that users configure their systems inappropriately (defining
LC_ALL, for instance).

>> P.S. It would be nice to have a wd_WD.UTF-8, with WD standing for world;
>> just a country is so 1900.

Different countries/regions have different standards and conventions for
character classification, case conversion, date/numerical/currency
formatting etc. There's no basis on which to formally standardise a
world-wide definition.

> However, the stage 3, last time I used it, didn't default to a UTF-8
> environment, and it didn't default to using and/or including a capable
> UTF-8 font. It is something I think we should look at changing.

Yet "unicode" is a default flag in the standard profiles. Most console
fonts have poor coverage. The best one I've found thus far is
"LatCyrGr-16" from fonty-rg, which provides good Latin and Cyrillic
coverage along with some Greek and esoteric punctuation characters.
Using this font, I've yet to find any developer's name that doesn't
render as expected while perusing the contents of the portage tree.


Because it is a 512-character font, one loses bold support unless using a
framebuffer console. Given that the default console fonts aren't
especially useful, it seems a small price to pay.


--Kerin

James Cloos 02-19-2012 12:00 AM

>>>>> "KM" == Kerin Millar <kerframil@gmail.com> writes:

KM> Arch also used to define LC_COLLATE="C" by default, probably to
KM> mitigate unpredictable behaviour in some applications, but have
KM> since dropped this additional variable so they must have deemed it
KM> no longer necessary.

Without LC_COLLATE="C", things like [a-z]* get a false-positive match
on files like Makefile.

I recently noticed a bug on b.g.o where the ebuild has something like
doc/[A-Z]* expecting that it will not match doc/some_lowercase_subdir.

The bug, of course, is that glibc fraudulently defaults the Latin, Greek
and Cyrillic locales to case-insensitive.

The real fix is to have root use C.UTF-8, which differs from C only in
that the charset is UTF-8.
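As a rough illustration of the ebuild case mentioned above, a pattern such as doc/[A-Z]* can be written with a POSIX named character class so that it no longer depends on how the active locale orders the range. This is only a sketch: the function name starts_with_upper and the sample file names are invented here and do not come from the bug report.

```sh
# starts_with_upper NAME: succeed only if NAME begins with an uppercase letter.
# [[:upper:]] is a POSIX named character class, so the result does not depend
# on where the locale's collation order places A..Z relative to a..z.
starts_with_upper() {
    case $1 in
        [[:upper:]]*) return 0 ;;
        *)            return 1 ;;
    esac
}

# Hypothetical names standing in for the contents of a doc/ directory.
for name in README ChangeLog some_lowercase_subdir; do
    if starts_with_upper "$name"; then
        printf '%s: matched\n' "$name"
    else
        printf '%s: skipped\n' "$name"
    fi
done
```

In a glob the equivalent would be doc/[[:upper:]]*, which should skip doc/some_lowercase_subdir whether or not LC_COLLATE is set to C.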

-JimC
--
James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6

Ben 02-19-2012 01:04 AM

On 19 February 2012 09:00, James Cloos <cloos@jhcloos.com> wrote:
> Without LC_COLLATE="C", things like [a-z]* get a false-positive match
> on files like Makefile. [...]
>
> The real fix is to have root use C.UTF-8, which differs from C only in
> that the charset is UTF-8.

In my opinion we should set a default environment with the following values:

LANG=en_US.UTF-8
LC_ALL=
LC_COLLATE=C

This offers the best default options to the majority of users, and is
easy to customize for those who wish to use another locale.

And yes, LC_ALL needs to be empty, because it would override the other
LC_* values.
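The lookup order behind this remark can be sketched as a small shell function. It follows the POSIX rule that a non-empty LC_ALL wins, then the category's own LC_* variable, then LANG. The function name effective_locale and its three-argument interface are invented for illustration, and the final fallback to C is an assumption, since POSIX leaves that default implementation-defined.

```sh
# effective_locale LC_ALL_VALUE CATEGORY_VALUE LANG_VALUE
# Print the locale name that the POSIX lookup order selects for one category.
# Empty arguments stand for unset or empty variables.
effective_locale() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"      # LC_ALL overrides everything else
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"      # then the category's own LC_* variable
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"      # then LANG
    else
        printf '%s\n' 'C'       # assumed fallback when nothing is set
    fi
}

# The proposed defaults: LC_ALL empty, LC_COLLATE=C, LANG=en_US.UTF-8.
effective_locale '' 'C' 'en_US.UTF-8'             # collation
effective_locale '' '' 'en_US.UTF-8'              # a category with no LC_* of its own
effective_locale 'de_DE.UTF-8' 'C' 'en_US.UTF-8'  # what happens once LC_ALL is set
```

The third call shows why a non-empty LC_ALL would defeat the LC_COLLATE=C setting.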

This should be combined with some good Unicode fonts, such as
LatCyrGr-16 for the console and DejaVu for X.

Cheers,
Ben

Amadeusz Żołnowski 02-19-2012 10:39 AM

Excerpts from Ben's message of 2012-02-19 03:04:19 +0100:
> On 19 February 2012 09:00, James Cloos <cloos@jhcloos.com> wrote:
> > Without LC_COLLATE="C", things like [a-z]* get a false-positive
> > match on files like Makefile. [...]
> >
> > The real fix is to have root use C.UTF-8, which differs from C only
> > in that the charset is UTF-8.
>
> In my opinion we should set a default environment with the following
> values:
>
> LANG=en_US.UTF-8
> LC_ALL=
> LC_COLLATE=C

Is it only on my setups, or is it "xy_XY.utf8" instead of
"xy_XY.UTF-8"?

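On glibc systems both spellings normally name the same locale. As far as I know, glibc normalises the codeset part of the name before looking it up, and locale -a prints the normalised form. The sketch below imitates that normalisation in simplified form; the function name normalize_codeset is invented here, and this is not glibc's actual code.

```sh
# normalize_codeset NAME: drop every character that is not a letter or digit
# and lower-case the rest, e.g. "UTF-8" becomes "utf8".
# Assumption: a simplified imitation of how glibc compares codeset names.
normalize_codeset() {
    printf '%s' "$1" \
        | LC_ALL=C tr -cd '[:alnum:]' \
        | LC_ALL=C tr '[:upper:]' '[:lower:]'
    printf '\n'
}

normalize_codeset UTF-8
normalize_codeset utf8
```

Both calls give the same string, which is why either spelling in LANG should select the same locale on such systems.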

--
Amadeusz Żołnowski

Ulrich Mueller 02-19-2012 02:14 PM

>>>>> On Sun, 19 Feb 2012, Ben wrote:

> In my opinion we should set a default environment with the following
> values:

> LANG=en_US.UTF-8
> LC_ALL=
> LC_COLLATE=C

> This offers the best default options to the majority of users, and
> is easy to customize for those who wish to use another locale.

At least, LC_NUMERIC=C should be added to this, otherwise numbers will
be formatted with commas as thousands separators.

Also en_US.UTF-8 for LC_MEASUREMENT and LC_PAPER means imperial units
and letter paper, which isn't optimal for users outside of the U.S.

Ulrich

Ben 02-19-2012 02:56 PM

On 19 February 2012 23:14, Ulrich Mueller <ulm@gentoo.org> wrote:
>>>>>> On Sun, 19 Feb 2012, Ben wrote:
>
>> In my opinion we should set a default environment with the following
>> values:
>
>> LANG=en_US.UTF-8
>> LC_ALL=
>> LC_COLLATE=C
>
>> This offers the best default options to the majority of users, and
>> is easy to customize for those who wish to use another locale.
>
> At least, LC_NUMERIC=C should be added to this, otherwise numbers will
> be formatted with commas as thousands separators.
>
> Also en_US.UTF-8 for LC_MEASUREMENT and LC_PAPER means imperial units
> and letter paper, which isn't optimal for users outside of the U.S.
>
> Ulrich
>

I think those users (and that includes myself) should then set LANG to
something more appropriate to their use case.

Ben

Kerin Millar 02-19-2012 05:44 PM

On 19/02/2012 15:56, Ben wrote:
> On 19 February 2012 23:14, Ulrich Mueller <ulm@gentoo.org> wrote:
>> On Sun, 19 Feb 2012, Ben wrote:
>>> In my opinion we should set a default environment with the following
>>> values:
>>>
>>> LANG=en_US.UTF-8
>>> LC_ALL=


LC_ALL isn't needed here because, unlike other LC_* settings, it does
not inherit from LANG and, thus, will be undefined anyway. Although the
above would not directly cause any harm, I am entirely certain that its
mere presence would encourage users to explicitly define it where they
most definitely should not. The misinformation that LC_ALL should be
defined was propagated by the localization doc for rather a long time
and it was rather challenging to impress upon its maintainers that
change was required. Let's not repeat old mistakes.



>>> LC_COLLATE=C
>>>
>>> This offers the best default options to the majority of users, and
>>> is easy to customize for those who wish to use another locale.
>>
>> At least, LC_NUMERIC=C should be added to this, otherwise numbers will
>> be formatted with commas as thousands separators.
>>
>> Also en_US.UTF-8 for LC_MEASUREMENT and LC_PAPER means imperial units
>> and letter paper, which isn't optimal for users outside of the U.S.
>>
>> Ulrich
>
> I think those users (and that includes myself) should then set LANG to
> something more appropriate to their use case.



I agree; the defaults should not be over-engineered. For proper
localisation, set LANG appropriately and done. The real issue is that
locale configuration isn't mentioned in the handbook. It does, however,
mention locale.gen so we're half-way there.
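A minimal sketch of what such a default file might contain, taking Ben's proposal with LC_ALL left out as argued above. The file name 02locale is only an example (the thread itself just says /etc/env.d/XX-lc), the function name write_locale_defaults is invented, and a scratch directory is used so that nothing under /etc is touched.

```sh
# write_locale_defaults DIR: write the proposed defaults to DIR/02locale.
# Assumption: LANG and LC_COLLATE only, with no LC_ALL line at all.
write_locale_defaults() {
    cat > "$1/02locale" <<'EOF'
LANG="en_US.UTF-8"
LC_COLLATE="C"
EOF
}

# Use a scratch directory so the sketch can be run anywhere.
scratch=$(mktemp -d)
write_locale_defaults "$scratch"
cat "$scratch/02locale"
rm -r "$scratch"
```

On a real Gentoo install the file would live in /etc/env.d, and env-update would need to be run for the change to take effect. A user who wants proper localisation would then only edit the LANG line.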


--Kerin

Kerin Millar 02-19-2012 06:14 PM

On 19/02/2012 01:00, James Cloos wrote:
>>>>>> "KM" == Kerin Millar <kerframil@gmail.com> writes:
>
> KM> Arch also used to define LC_COLLATE="C" by default, probably to
> KM> mitigate unpredictable behaviour in some applications, but have
> KM> since dropped this additional variable so they must have deemed it
> KM> no longer necessary.
>
> Without LC_COLLATE="C", things like [a-z]* get a false-positive match
> on files like Makefile.


Indeed, character classes are a potential minefield. Incidentally, I
just tested Ubuntu and Arch with only LANG set to a UTF-8 locale:-


$ echo Makefile | sed -re 's/[a-z]//g' # collation rules ignored
M

$ echo Makefile | grep -Eo '[a-z]*' # collation rules ignored
akefile

In neither case are the collation rules being obeyed. In Gentoo, however:-

$ echo Makefile | sed -re 's/[a-z]//g' # collation rules obeyed

$ echo Makefile | grep -Eo '[a-z]*' # collation rules ignored
akefile

Obeying the collation rules is ostensibly the correct thing to do but,
until everyone starts using named character classes (which will never
happen), it's not safe. The thing that worries me here is the
inconsistency in Gentoo. LC_COLLATE="C" is sufficient to work around the
issue but the above makes me wonder why we still need it.
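For scripts that cannot rely on the system default, two defensive forms of the sed test above can be sketched as follows. The function names strip_lower and strip_lower_range are invented for illustration. The second form sets LC_ALL for a single command only, which is a different matter from defining it in the login environment.

```sh
# strip_lower STRING: delete lowercase letters using a named character class,
# which behaves the same whatever collation order the active locale uses.
strip_lower() {
    printf '%s\n' "$1" | sed -e 's/[[:lower:]]//g'
}

# strip_lower_range STRING: keep the range expression, but pin the locale to C
# for this one sed invocation so that [a-z] means the ASCII lowercase letters.
strip_lower_range() {
    printf '%s\n' "$1" | LC_ALL=C sed -e 's/[a-z]//g'
}

strip_lower Makefile
strip_lower_range Makefile
```

Both calls should leave only the capital M, matching the Ubuntu and Arch results shown above.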


--Kerin

