Re: commit: abi: UTF8String class

From: phearbear
Date: Sun Apr 21 2002 - 00:13:01 EDT

    Andrew Dunbar wrote:

    > --- Tomas Frydrych wrote:
    >>>Andrew Dunbar wrote:
    >>>Well pretty soon we're going to need a real
    >>>replacement. Dom and I are both in favour of the
    >>>replacement being UTF-8 but some here seem to want
    >>UTF-8 is an encoding scheme that is intended to
    >>allow Unicode
    >>communication between separate processes over 8-bit
    >>For that it is great, but that's about the only
    >>thing it is really good
    >>for. UTF-8 processing is cumbersome, and as such it
    >>is a completely
    >>unsuitable format to use for the piecetable. We need
    >>a fixed-width
    >>encoding for that, such as the current UCS-2, i.e.,
    >Please back up these comments. A lot of people,
    >they are familiar with Unicode and UTF-8 seem to think
    >this. I did too. Then I read reams and reams of
    >newsgroups and mailing lists and FAQs. Now I know why
    >Qt, GTK, QNX, and others use UTF-8 internally.
    >People seem to think that because UTF-8 encodes
    >characters as variable length runs of bytes that this
    >is somehow computationally expensive to handle. Not
    >so. You can use existing 8-bit string functions on
    >It is backwards compatible with ASCII. You can scan
    >forwards and backwards effortlessly. You can always
    >tell which character in a sequence a given byte
    >belongs to.
    >People think random access to these strings using the
    >array operator will cost the earth. Guess what - very
    >little code accesses strings as arrays - especially in
    >a word processor. Of the code which does, very little
    >of that needs to. Even when you do perform lots of
    >array operations on a UTF-8 string, people have done
    >extensive tests showing that the cost is
    >negligible - look in the Unicode literature and you
    >will find all this information.
    >People think that UCS-2, UTF-16, or UTF-32 mean we can
    >have perfect random access to strings because a
    >character is always represented as a single word or
    >longword. Not so. UCS-2 should but this term is
    >often (by Microsoft) used to refer to UTF-16. UTF-16
    >uses a mechanism called "surrogates" whereby a single
    >character may need two words to represent it. There
    >goes your free array access. Even UTF-32 is not safe
    >from this, because Unicode requires "combining
    >characters". This means that "á" may be represented
    >as "a" followed by a non-spacing "´" acute accent.
    >Some people think this is also silly. These people
    >need to go read all about Unicode before they embark
    >on seriously multilingual software. Vietnamese can be
    >supported without combining characters, but
    >you won't be able to view the results because no
    >Vietnamese fonts exist that work this way - they all
    >expect to use combining characters. Thai needs them.
    >Hindi needs them. All Indian/Indic languages need
    >So to sum up, the two arguments not to use UTF-8
    >internally are:
    >1) Array access is too slow.
    >- This is not true and it is seldom needed.
    >2) UTF-8 means you have to handle a series of values
    > for a single on-screen character.
    >- *All* Unicode encodings need this anyway!
    >But look around the internet for better arguments and
    >better written arguments.
    >Andrew Dunbar.
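
    The encoding claims quoted above can be checked with a few lines of
    Python. This is only a rough sketch, not AbiWord code: the helper
    names char_start and utf16_units and the sample strings are made up
    for illustration. It shows that a byte offset into UTF-8 can be
    resynchronised to a character boundary, that byte-oriented search
    still works, that UTF-16 needs two words for some characters, and
    that combining sequences span several code points in any encoding.

```python
import unicodedata

def char_start(buf: bytes, i: int) -> int:
    """Index of the first byte of the UTF-8 character that contains buf[i]."""
    # Continuation bytes always have the form 0b10xxxxxx; ASCII bytes and
    # lead bytes never do, so stepping back over them finds the boundary.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

def utf16_units(s: str) -> int:
    """Number of 16-bit words needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2

# Hypothetical sample: "naïve €" is 7 characters but 10 bytes in UTF-8.
text = "na\u00efve \u20ac"
buf = text.encode("utf-8")

# Byte 9 is the last byte of the euro sign; its character starts at byte 7.
euro_start = char_start(buf, 9)

# A plain 8-bit substring search still finds ASCII text at the right offset.
ve_offset = buf.find(b"ve")

# U+1D11E (musical G clef) lies outside the BMP, so UTF-16 uses a
# surrogate pair: two words for one character.
clef_units = utf16_units("\U0001D11E")

# "a" followed by U+0301 (combining acute accent) is two code points,
# even in UTF-32, yet it displays as the single character "á".
decomposed = "a\u0301"
composed = unicodedata.normalize("NFC", decomposed)
```

    Whether this matters in practice depends on how much of the code
    really indexes strings by character position, which is the point
    under debate in this thread.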

    Excuse my laziness, but scanning through all of that isn't really
    how I'd like to spend my week ;) Are there any particular articles you'd
    recommend we read?


    This archive was generated by hypermail 2.1.4 : Sun Apr 21 2002 - 00:15:21 EDT