02-11-2011, 10:12 PM
Ron Johnson

Make Unicode bugs release critical?

On 02/11/2011 07:36 AM, Adam Borowski wrote:
[snip]

> UTF-16 is never, ever useful. It is a sad trap for win32 and Java
> developers, due to a bad engineering decision suggested, as I was told, by

[snip]

> No, there is only one encoding left, as long as you don't have to talk to
> Windows.

Never useful except for 90% of the market? (I wonder how SAMBA
deals with it...)


--
"The normal condition of mankind is tyranny and misery."
Milton Friedman


--
To UNSUBSCRIBE, email to debian-devel-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/4D55C263.90307@cox.net
 
02-11-2011, 11:33 PM
Peter Samuelson

Make Unicode bugs release critical?

[Ron Johnson]
> Never useful except for 90% of the market? (I wonder how SAMBA deals
> with it...)

I don't think you really want to know. There's a 'unicode' flag in
much of the CIFS protocol that means filenames and such are in UTF-16
(I think UTF-16LE) instead of some-random-configured-code-page.
Samba's been using that flag for about 10 years. You configure it to
say what encoding your filenames are supposed to be on the server, and
it expresses them in UTF-16 on the wire.
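
The translation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Samba's code: the function name `to_wire_utf16le` and the choice of ISO-8859-1 as the configured server charset are assumptions made for the example, and UTF-16LE is used because that is the byte order guessed at above.

```python
def to_wire_utf16le(raw_name: bytes, server_charset: str) -> bytes:
    """Decode a filename from the server's configured charset and
    re-encode it as UTF-16LE (hypothetical helper, not a Samba API)."""
    text = raw_name.decode(server_charset)
    return text.encode("utf-16-le")


# A filename as it might be stored on a server configured for ISO-8859-1.
on_disk = "caf\u00e9.txt".encode("iso-8859-1")
on_wire = to_wire_utf16le(on_disk, "iso-8859-1")

# Every character here is in the BMP, so each one becomes two bytes.
print(len(on_disk), len(on_wire))
```

If the configured charset does not match what is really on disk, the decode step either raises or silently produces the wrong characters, which is why that setting matters.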

Samba also supports non-Unicode-aware clients like Windows 3.11 - or at
least it used to support these - you'd tell Samba what client code page
to translate your filenames into on the wire. Fun stuff.

Samba doesn't really deal with file _contents_, which is a much more
"interesting" problem than filenames. It just serves contents as-is,
like most file service protocols other than FTP.
--
Peter Samuelson | org-tld!p12n!peter | http://p12n.org/


--
Archive: http://lists.debian.org/20110212003312.GB10272@p12n.org
 
02-12-2011, 01:02 AM
Adam Borowski

Make Unicode bugs release critical?

On Fri, Feb 11, 2011 at 08:16:54PM -0200, Henrique de Moraes Holschuh wrote:
> On Fri, 11 Feb 2011, Lars Wirzenius wrote:
> > However, I'm curious: is there a lot of software that is broken with
> > Unicode, particularly with the UTF-8 encoding? I can't remember anything
> > much in recent times.
>
> 2. Anything that cannot deal with Supplementary planes.
>
> This includes the use of UCS-2 instead of UTF-16, as it cannot represent
> the Supplementary planes. python 3 when not compiled to use UCS-4 memory
> hog mode is an example, I am told.

Using UCS-2 is hardly better than using ISO-8859-1 or any other ancient
charset. Using either UTF-16 or UCS-4 can be a memory hog, which is why you
should pick UTF-8 for regular use. Except for some rare cases (CJK with no
formatting or markup), it uses less memory and can be passed as-is to POSIX
file functions.
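
The memory argument is easy to check by counting encoded bytes. The sketch below is illustrative only: the helper name `encoded_sizes` and the two sample strings are made up for the example, and real documents will fall somewhere between these two extremes.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes for three Unicode encodings."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


# ASCII-heavy text with markup: UTF-8 needs one byte per character.
markup = "<p>Hello, world</p>"

# Plain CJK text with no markup: the rare case where UTF-16 is smaller,
# because each of these characters takes three bytes in UTF-8.
cjk = "日本語のテキスト"

print(encoded_sizes(markup))
print(encoded_sizes(cjk))
```

As soon as CJK text is wrapped in ASCII markup or formatting, the one-byte ASCII characters pull the balance back toward UTF-8.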

Picking a random subset of Unicode is like putting day-of-the-year in one
byte variable since this way you support 70% of uses and it conserves
memory...

--
1KB // Microsoft corollary to Hanlon's razor:
// Never attribute to stupidity what can be
// adequately explained by malice.


--
Archive: http://lists.debian.org/20110212020220.GA26308@angband.pl
 
02-12-2011, 02:55 AM
Henrique de Moraes Holschuh

Make Unicode bugs release critical?

On Sat, 12 Feb 2011, Adam Borowski wrote:
> On Fri, Feb 11, 2011 at 08:16:54PM -0200, Henrique de Moraes Holschuh wrote:
> > 2. Anything that cannot deal with Supplementary planes.
> >
> > This includes the use of UCS-2 instead of UTF-16, as it cannot represent
> > the Supplementary planes. python 3 when not compiled to use UCS-4 memory
> > hog mode is an example, I am told.
>
> Using UCS-2 is hardly better than using ISO-8859-1 or any other ancient
> charset. Using either UTF-16 or UCS-4 can be a memory hog, which is why you
> should pick UTF-8 for regular use. Except for some rare cases (CJK with no

Python 3 uses UCS-2 (or UCS-4) for the internal representation. Likely
they wanted to have something that made it easy to address each
character in a Unicode string in O(1).

That might actually give better performance given how much people like
to do string slicing and splicing in Python. The O(N) scan often required
to index into UTF-8 and UTF-16 might well be more painful than the much
larger data cache footprint of UCS-4... but that is a damn big *maybe*,
and very unlikely to be consistent across very different architectures.
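
The O(1) versus O(N) difference can be illustrated on encoded bytes. This is a sketch over byte strings, not a description of how any Python build stores strings internally; both function names are hypothetical, and the UTF-8 version assumes well-formed input.

```python
def nth_codepoint_utf8(data: bytes, n: int) -> str:
    """Find the n-th code point (0-based) by scanning from the start: O(N)."""
    if n < 0:
        raise IndexError(n)
    index = -1
    start = None
    for pos, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; anything else starts a code point.
        if (byte & 0xC0) != 0x80:
            index += 1
            if index == n:
                start = pos
            elif index == n + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return data[start:].decode("utf-8")


def nth_codepoint_utf32(data: bytes, n: int) -> str:
    """Every code point is exactly four bytes, so indexing is O(1)."""
    if not 0 <= n < len(data) // 4:
        raise IndexError(n)
    return data[4 * n:4 * n + 4].decode("utf-32-le")


text = "a\u00a3\u20ac\U0001F600z"
as_utf8 = text.encode("utf-8")
as_utf32 = text.encode("utf-32-le")
print(nth_codepoint_utf8(as_utf8, 3) == nth_codepoint_utf32(as_utf32, 3))
print(len(as_utf8), len(as_utf32))
```

The fixed-width form pays for its constant-time indexing with a larger footprint, which is exactly the trade-off being weighed here.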

Well, not like I care. I don't even have Python 3 installed, and I will
only do so the day something I need decides to pull it as a dependency.

> Picking a random subset of Unicode is like putting day-of-the-year in one

UCS-2 is deprecated as all heck. As far as I could find out through
Google, it has not been a valid Unicode representation since Unicode 2.0
(i.e. 1996). So it wouldn't even count as a "random subset of Unicode".
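
What UTF-16 adds over UCS-2 is the surrogate pair: a supplementary-plane code point is split across two 16-bit units, which a UCS-2-only consumer cannot interpret as one character. A minimal sketch follows; the helper names are hypothetical and U+1F600 is just a convenient supplementary-plane example.

```python
def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units of text when encoded as UTF-16."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary-plane code point into high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# A BMP character fits in one unit; a supplementary one needs two.
print([hex(u) for u in utf16_code_units("\u00a3")])
print([hex(u) for u in utf16_code_units("\U0001F600")])
print([hex(u) for u in to_surrogate_pair(0x1F600)])
```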

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh


--
Archive: http://lists.debian.org/20110212035533.GA32574@khazad-dum.debian.net
 
02-14-2011, 08:57 PM
Ron Johnson

Make Unicode bugs release critical?

On 02/14/2011 10:39 AM, Ian Jackson wrote:
[snip]

> The fact that naive Python programs work (honouring LC_CTYPE as they
> should) unless you pipe their output to something is clearly a bug.
> The fact that it's a specification bug doesn't mean it's not a bug.

It doesn't seem to work for me.

$ python -V
Python 2.6.6

$ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in
position 0: ordinal not in range(128)


$ LC_CTYPE=en_GB.utf-8 python -c 'print u"\uc2a3"'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\uc2a3'
in position 0: ordinal not in range(128)


$ perl -v

This is perl, v5.10.1 (*) built for x86_64-linux-gnu-thread-multi
(with 51 registered patches, see perl -V for more detail)

$ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LC_CTYPE = "en_GB.utf-8",
    LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").


$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
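
The `UnicodeEncodeError` in the Python transcript can be reproduced without touching the locale at all, by encoding with the `ascii` codec explicitly. The sketch below uses Python 3 syntax rather than the Python 2.6 shown in the transcript, so the message text differs slightly. It also shows that `\uc2a3` is a different character altogether (a Hangul syllable), not the UTF-8 byte sequence C2 A3 for the pound sign.

```python
import unicodedata

pound = "\u00a3"

# Encoding with the ASCII codec fails the same way the transcript does.
try:
    pound.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = str(exc)

# Encoding with UTF-8 yields the two bytes C2 A3.
utf8_bytes = pound.encode("utf-8")

# U+C2A3 is one code point in the Hangul Syllables block.
other = "\uc2a3"

print(ascii_error)
print(utf8_bytes)
print(unicodedata.name(pound), "/", unicodedata.name(other))
```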


--
"The normal condition of mankind is tyranny and misery."
Milton Friedman


--
Archive: http://lists.debian.org/4D59A558.5020209@cox.net
 
02-14-2011, 09:26 PM
The Fungi

Make Unicode bugs release critical?

On Mon, Feb 14, 2011 at 03:57:44PM -0600, Ron Johnson wrote:
> It doesn't seem to work for me.
[...]
> $ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in
> position 0: ordinal not in range(128)
[...]
> $ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> perl: warning: Setting locale failed.
> perl: warning: Please check that your locale settings:
> LANGUAGE = (unset),
> LC_ALL = (unset),
> LC_CTYPE = "en_GB.utf-8",
> LANG = "en_US.UTF-8"
> are supported and installed on your system.
> perl: warning: Falling back to the standard locale ("C").
[...]

You probably don't have an en_GB.utf-8 locale (maybe you have
localepurge installed?). I bet en_US.utf-8 will net you different
results.
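
One way to confirm that diagnosis from a script is to ask the C library whether it accepts the locale name. This is a rough sketch, with `locale_is_available` being a made-up helper name; running `locale -a` from the shell gives the same information.

```python
import locale


def locale_is_available(name: str) -> bool:
    """Return True if setlocale() accepts this locale name for LC_CTYPE."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        return False
    else:
        return True
    finally:
        # Always restore whatever was in effect before the probe.
        locale.setlocale(locale.LC_CTYPE, saved)


for candidate in ("en_GB.utf-8", "en_US.utf-8", "C"):
    print(candidate, locale_is_available(candidate))
```

The result depends entirely on which locales have been generated on the machine, so no expected output is shown.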
--
{ IRL(Jeremy_Stanley); WWW(http://fungi.yuggoth.org/); PGP(43495829);
WHOIS(STANL3-ARIN); SMTP(fungi@yuggoth.org); FINGER(fungi@yuggoth.org);
MUD(kinrui@katarsis.mudpy.org:6669); IRC(fungi@irc.yuggoth.org#ccl);
ICQ(114362511); YAHOO(crawlingchaoslabs); AIM(dreadazathoth); }


--
Archive: http://lists.debian.org/20110214222615.GD1293@yuggoth.org
 
02-14-2011, 11:10 PM
Ron Johnson

Make Unicode bugs release critical?

On 02/14/2011 04:26 PM, The Fungi wrote:
> On Mon, Feb 14, 2011 at 03:57:44PM -0600, Ron Johnson wrote:
> > It doesn't seem to work for me.
> [...]
> > $ LC_CTYPE=en_GB.utf-8 python -c 'print u"\u00a3"'
> > Traceback (most recent call last):
> >   File "<string>", line 1, in <module>
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in
> > position 0: ordinal not in range(128)
> [...]
> > $ LC_CTYPE=en_GB.utf-8 perl -e 'print "\x{00a3}\n";'
> > perl: warning: Setting locale failed.
> > perl: warning: Please check that your locale settings:
> >     LANGUAGE = (unset),
> >     LC_ALL = (unset),
> >     LC_CTYPE = "en_GB.utf-8",
> >     LANG = "en_US.UTF-8"
> >     are supported and installed on your system.
> > perl: warning: Falling back to the standard locale ("C").
> [...]
>
> You probably don't have an en_GB.utf-8 locale (maybe you have
> localepurge installed?). I bet en_US.utf-8 will net you different
> results.


That's it...

$ LC_CTYPE=en_US.utf-8 python -c 'print u"\u00a3"'


$ LC_CTYPE=en_US.utf-8 perl -e 'print "\x{00a3}\n";'


No localepurge, but when initially building the system, I only
installed one or two locales.


--
"The normal condition of mankind is tyranny and misery."
Milton Friedman


--
Archive: http://lists.debian.org/4D59C47D.7060302@cox.net
 
02-14-2011, 11:27 PM
Adam Borowski

Make Unicode bugs release critical?

On Mon, Feb 14, 2011 at 06:10:37PM -0600, Ron Johnson wrote:
> On 02/14/2011 04:26 PM, The Fungi wrote:
> >You probably don't have an en_GB.utf-8 locale (maybe you have
> >localepurge installed?). I bet en_US.utf-8 will net you different
> >results.
>
> That's it...
>
> No localepurge, but when initially building the system, I only
> installed one or two locales.

No one would expect a USian to use a GB locale.

The problem is, there is currently no way to request UTF-8 encoding without
specifying language. It's a remnant of ancient locales where ISO-8859-1
didn't make sense for pl_PL nor ISO-8859-2 for fr_FR.

Also, the iconv() functions are really inconvenient to use; it'd be much
easier to be able to use the regular wide char functions predictably.

In other words: can I has C.UTF-8 guaranteed?

--
1KB // Microsoft corollary to Hanlon's razor:
// Never attribute to stupidity what can be
// adequately explained by malice.


--
Archive: http://lists.debian.org/20110215002700.GA15793@angband.pl
 
