From: phearbear (email@example.com)
Date: Sun Apr 21 2002 - 00:13:01 EDT
Andrew Dunbar wrote:
> --- Tomas Frydrych <firstname.lastname@example.org>
>>>Andrew Dunbar <email@example.com> wrote:
>>>Well pretty soon we're going to need a real
>>>replacement. Dom and I are both in favour of the
>>>replacement being UTF-8 but some here seem to want
>>UTF-8 is an encoding scheme that is intended to
>>communication between separate processes over 8-bit
>>For that it is great, but that's about the only
>>thing it is really good
>>for. UTF-8 processing is cumbersome, and as such it is an
>>unsuitable format to use for the piecetable. We need
>>a fixed-width
>>encoding for that, such as the current UCS-2, i.e.,
>Please back up these comments. A lot of people, until
>they are familiar with Unicode and UTF-8, seem to think
>this. I did too. Then I read reams and reams of
>newsgroups and mailing lists and FAQs. Now I know why
>Qt, GTK, QNX, and others use UTF-8 internally.
>People seem to think that because UTF-8 encodes
>characters as variable length runs of bytes that this
>is somehow computationally expensive to handle. Not
>so. You can use existing 8-bit string functions on it.
>It is backwards compatible with ASCII. You can scan
>forwards and backwards effortlessly. You can always
>tell which character in a sequence a given byte belongs to.
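A minimal sketch of the scanning claim above, assuming well-formed UTF-8 input. The helper names (is_continuation, char_start, next_char) are hypothetical and only for illustration. It relies on the fact that every UTF-8 continuation byte has the bit pattern 10xxxxxx, so the start of a character can be found from any byte without decoding from the beginning of the string.

```python
def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes always look like 10xxxxxx.
    return (byte & 0xC0) == 0x80


def char_start(buf: bytes, i: int) -> int:
    # Walk backwards from byte i to the first byte of the character it is part of.
    while i > 0 and is_continuation(buf[i]):
        i -= 1
    return i


def next_char(buf: bytes, i: int) -> int:
    # Walk forwards from a character start to the start of the following character.
    i += 1
    while i < len(buf) and is_continuation(buf[i]):
        i += 1
    return i


# "a", "e with acute", and the euro sign: 1-, 2- and 3-byte characters.
text = "a\u00e9\u20ac".encode("utf-8")

# Byte 5 is the last byte of the euro sign; its character starts at byte 3.
start = char_start(text, 5)
```

Neither helper ever looks more than three bytes away from its starting point on valid input, which is why stepping in either direction stays cheap.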
>People think random access to these strings using the
>array operator will cost the earth. Guess what - very
>little code accesses strings as arrays - especially in
>a word processor. Of the code which does, very little
>of that needs to. Even when you do perform lots of
>array operations on a UTF-8 string, people have done
>extensive tests showing that the cost is
>negligible - look in the Unicode literature and you
>will find all this information.
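For reference, "array access" on a UTF-8 buffer means something like the sketch below: a linear scan that counts character start bytes. The function name offset_of_char is hypothetical, the input is assumed to be valid UTF-8, and this says nothing about how the piecetable would actually do it. It only shows that the operation is a simple loop.

```python
def offset_of_char(buf: bytes, n: int) -> int:
    """Return the byte offset of the n-th code point (0-based) in valid UTF-8."""
    seen = 0
    for offset, byte in enumerate(buf):
        # Any byte that is not 10xxxxxx begins a new code point.
        if (byte & 0xC0) != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("code point index out of range")


# "naive cafe" with a diaeresis on the i and an acute on the final e.
data = "na\u00efve caf\u00e9".encode("utf-8")

# The character at index 3 is "v"; it sits at byte 4 because the
# preceding i-with-diaeresis takes two bytes.
v_offset = offset_of_char(data, 3)
```

Code that walks a string from one character to the next never needs this lookup at all, which is the usual case in text processing.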
>People think that UCS-2, UTF-16, or UTF-32 mean we can
>have perfect random access to strings because a
>character is always represented as a single word or
>longword. Not so. UCS-2 should, but this term is
>often (by Microsoft) used to refer to UTF-16. UTF-16
>uses a mechanism called "surrogates" whereby a single
>character may need two words to represent it. There
>goes your free array access. Even UTF-32 is not safe
>from this, because Unicode requires "combining
>characters". This means that "á" may be represented
>as "a" followed by a non-spacing "´" acute accent.
>Some people think this is also silly. These people
>need to go read all about Unicode before they embark
>on seriously multilingual software. Vietnamese is
>possible to support without combining characters but
>you won't be able to view the results because no
>Vietnamese fonts exist that work this way - they all
>expect to use combining characters. Thai needs them.
>Hindi needs them. All Indian/Indic languages need them.
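Both points in the paragraph above are easy to check. The sketch below uses Python's standard unicodedata module; the helper name utf16_units and the sample characters are chosen purely for illustration. It shows a character outside the Basic Multilingual Plane taking two 16-bit units in UTF-16, and "á" written as a base letter plus a combining accent taking two code points, which is two units even in UTF-32.

```python
import unicodedata


def utf16_units(s: str) -> int:
    # Number of 16-bit code units needed to store s as UTF-16.
    return len(s.encode("utf-16-le")) // 2


bmp_char = "\u00e1"         # a with acute, inside the Basic Multilingual Plane
astral_char = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

units_bmp = utf16_units(bmp_char)        # one 16-bit word
units_astral = utf16_units(astral_char)  # a surrogate pair: two 16-bit words

composed = "\u00e1"     # precomposed a with acute: one code point
decomposed = "a\u0301"  # "a" followed by COMBINING ACUTE ACCENT: two code points

# Normalization converts between the two spellings of the same character.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
```

So indexing by 16-bit word or by 32-bit code point still does not give one array slot per on-screen character.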
>So to sum up, the two arguments not to use UTF-8 are:
>1) Array access is too slow.
>- This is not true and it is seldom needed.
>2) UTF-8 means you have to handle a series of values
> for a single on-screen character.
>- *All* Unicode encodings need this anyway!
>But look around the internet for better arguments and
>better written arguments.
Excuse my laziness, but scanning through all of unicode.org isn't really
what I'd like to spend my week on ;) Any special articles you recommend us
This archive was generated by hypermail 2.1.4 : Sun Apr 21 2002 - 00:15:21 EDT