Re: commit (HEAD): IMPORTANT - 32-bit UT_UCSChar

From: F J Franklin (
Date: Wed May 08 2002 - 10:34:21 EDT


    > > Support is there but incomplete. Byte sequences
    > > longer than 3 bytes will cause
    > > problems, and there isn't a UTF-8 -> UCS-4
    > > conversion yet.
    > Sorry to keep whining about this but it was all in my lost huge Unicode
    > patch over a year ago. UTF-8 sequences can be up to 6 bytes long. We
    > should probably leave it up to iconv anyway since we have to handle
    > things like overlong sequences, illegal sequences etc. iconv should
    > handle this. I think my implementation used the ByteBuf class so that
    > it could handle UCS-2 and UCS-4 properly without worrying about all
    > those null bytes looking like string terminators and stuff.

    Andrew, Andrew, I know. The reason why only 3-byte sequences are handled
    is that the routine was written to convert Abi's internal UCS-2, which
    never needs more than three bytes in UTF-8. Now that Abi uses UCS-4
    internally I'll add the code to handle 6-byte sequences.

    In general I support the use of iconv for conversion between encodings,
    but conversion between validated UTF-8 and UCS-4 is trivial and the
    [UT_]UTF8String class was designed to handle the conversion without
    resorting to iconv.
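
    To illustrate why the validated case is simple, here is a minimal
    sketch of a UTF-8 -> UCS-4 decoder covering the original 1- to 6-byte
    forms. The function name utf8_to_ucs4 is hypothetical and this is not
    the [UT_]UTF8String code; it assumes the input has already been
    validated, so it does not reject overlong or otherwise illegal
    sequences.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper for illustration only; not part of AbiWord.
// Assumes the input is already-validated UTF-8 in the original 1-6 byte form.
std::vector<std::uint32_t> utf8_to_ucs4(const std::string& utf8)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t len = 0;   // total bytes in this sequence
        std::uint32_t cp = 0;  // payload bits taken from the lead byte

        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else if ((lead & 0xFC) == 0xF8) { len = 5; cp = lead & 0x03; }
        else if ((lead & 0xFE) == 0xFC) { len = 6; cp = lead & 0x01; }
        else break;  // stray continuation byte or 0xFE/0xFF: input was not validated

        if (i + len > utf8.size()) break;  // truncated sequence: stop decoding

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

    The sequence length is read from the lead byte and each continuation
    byte adds six bits, so no tables are needed. The sketch simply stops at
    the first byte it cannot interpret; unvalidated input would need the
    overlong and illegal-sequence checks discussed above. Note that the
    current UTF-8 definition (RFC 3629) allows at most 4 bytes per
    character.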

    Ciao, Frank

    ps. BTW, do you know anything about the overheads of using various iconv
        implementations? or their thread-safety, for that matter? (Genuinely

    Francis James Franklin

    "No, she really likes me. She told me I look like Britney Spears, and why
    would you say that to somebody you don't like?"
                                                               --- Elle Woods
