Unicode, fontsets, international text et al

Subject: Unicode, fontsets, international text et al
From: Petr Tesarik (tesarik@lupa.cz)
Date: Tue May 30 2000 - 08:27:36 CDT

Hi folks!

I noticed the placeholder in the Font selection dialog. So, frankly, I
don't understand why AbiWord is keeping text in Unicode
internally... Why do I care? I want to make AbiWord more usable for
people who use non-Latin-1 encodings under the X Window System. I found a
patch to enable XIM in the archive (why is it not included in CVS??). That
made my keysyms map correctly to Unicode. So I can type those
"exotic" characters like rcaron, and they are mapped correctly and
stored somewhere as e.g. 0x159 (Unicode). But I can't see them
onscreen. :(

I analyzed the problem a bit:

* You use gdk_draw_text_wc() in gr_UnixGraphics.cpp - this would
  correctly output Unicode characters if the font were a fontset.

* You incorrectly use gdk_font_load() in xap_UnixFont.cpp. If you want
  the wide character string output functions to work, you must use
  gdk_fontset_load() instead.

* Even using gdk_fontset_load() won't work, since myxlfd.getXLFD()
  returns the string specification _including_ character encoding.

This is WRONG. We need to construct a fontset that includes all
supported encodings. This means the XLFD should have '*' for both
registry and encoding (or maybe 'iso8859-*' would be sufficient for
a start). That's not really hard (though I haven't found a way to
achieve it yet); the hard part is the lack of appropriate fonts. I do
have some Type-1 fonts for Latin-2 encoding but they seem to be too
different (even though both claim to be 'Times New Roman'). Why does
that matter? Take a Czech word: řeč. The two characters
ř and č map to the Latin-2 font, but the 'e' in the middle is
mapped to the Latin-1 font (since it is in fact a Latin-1 character as
well as a Latin-2 one). If we keep things in Unicode internally, we
never know which charset a character belongs to when there is more than
one choice. If we had a consistent fontset (i.e. one Unicode character
would be the same in all applicable encodings), it wouldn't
matter. But hey folks, we might not be able to get a (reasonably)
complete Unicode fontset!
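
The wildcard idea above can be sketched as plain string handling. This is
only an illustration under stated assumptions: wildcardCharset() is a
hypothetical helper, not an existing AbiWord function, and it assumes a
well-formed 14-field XLFD such as the string myxlfd.getXLFD() is described
as returning.

```cpp
#include <string>

// Hypothetical helper (not part of AbiWord): replace the last two XLFD
// fields, charset registry and encoding, with a wildcard pattern.
// A well-formed XLFD has 14 fields, each introduced by '-', so it
// contains exactly 14 hyphens. Anything else is returned unchanged.
std::string wildcardCharset(const std::string& xlfd,
                            const std::string& charset = "*-*")
{
    int hyphens = 0;
    for (std::string::size_type i = 0; i < xlfd.size(); ++i)
        if (xlfd[i] == '-')
            ++hyphens;
    if (hyphens != 14 || xlfd[0] != '-')
        return xlfd;

    // The registry field starts right after the second-to-last hyphen.
    std::string::size_type last = xlfd.rfind('-');
    std::string::size_type registry = xlfd.rfind('-', last - 1);

    // Keep everything up to and including that hyphen, then append
    // the wildcard pattern in place of "registry-encoding".
    return xlfd.substr(0, registry + 1) + charset;
}
```

With the default pattern, an XLFD ending in "-iso8859-1" comes back ending
in "-*-*"; passing "iso8859-*" as the second argument gives the narrower
variant. The result is the kind of name one would hand to
gdk_fontset_load(); which fonts the X server actually finds for it still
depends on what is installed.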

The question is: do we want to keep the Unicode scheme (and I'd love
it to be that way) or do we rather implement different encodings?
(BTW, Dingbats and Symbols do not insert correct Unicode anyway, so we
already accept some encoding-specific behaviour - the text surely doesn't
say e.g. "=Hln" when I see a cross, star, circle and square :). But it's
being saved that way...

Sorry for writing such a long e-mail, but the international support is
really a pain in the butt and I'm looking for a comprehensive
solution. I am stuck:

 * I can try to create an AbiWord-compatible Latin-2 font - but I am
   an absolute amateur, so I doubt it will be usable for professional

 * I can use existing fonts - but they don't match the ones that
   AbiWord uses. I would have to split AbiWord into two distributions
   - one that would work fine in a Latin-2 environment but not so fine
   with Latin-1, and one that would work precisely the opposite
   way. That sucks. :( Mixing different fonts together
   looks... terrible.

Any ideas welcome,

Petr Tesarik
Tel: +420 602 575294            http://www.lupa.cz/

This archive was generated by hypermail 2b25 : Tue May 30 2000 - 08:27:41 CDT