Word Breaking

From: Jordi Mas (jmas@softcatala.org)
Date: Mon Oct 27 2003 - 16:31:06 EST


    I may not be very accurate here; I just want to draw attention to a problem
    that we have had around for some time.

    We have had an unsolved issue for a long time: proper word breaking. Right
    now, we just assume that words break basically as they do in English.
    However, this is not true for many languages.

    This causes two problems:

    - Word counting is not accurate for these languages
    - Spellcheckers do not work properly (since the words sent to them are not
      really words)

    For example, in Catalan you can write "il·lusion" (for ilLusion). The '·'
    character is currently not considered to be part of the word.
    Unfortunately, this word is counted as two words and is always treated as
    two separate words by the spell checker.
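
    To make the problem concrete, here is a minimal Python sketch (not our
    actual code, just an illustration). It assumes a regex-based splitter; the
    function name split_words and its extra_word_chars parameter are made up
    for this example. The extra characters only count as part of a word when
    they sit between two word characters.

```python
import re

def split_words(text, extra_word_chars=""):
    """Split text into words.

    extra_word_chars is a hypothetical parameter: characters that are
    allowed inside a word when surrounded by letters or digits.
    """
    if extra_word_chars:
        # Character class of the extra joiners, escaped for safety.
        joiners = "[" + re.escape(extra_word_chars) + "]"
        # A word is a run of word characters, optionally continued by
        # (joiner + another run of word characters), any number of times.
        pattern = r"\w+(?:" + joiners + r"\w+)*"
    else:
        # English-style assumption: only letters, digits and '_' form words.
        pattern = r"\w+"
    return re.findall(pattern, text)

# English-style breaking splits the Catalan word at the middle dot.
naive = split_words("il·lusion")
# Allowing '·' inside a word keeps it in one piece.
aware = split_words("il·lusion", extra_word_chars="·")
```

    With the default rules the word comes back as two pieces ("il" and
    "lusion"), which is what the word counter and the spell checker see today.
    With '·' allowed inside a word it comes back as a single word.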

    One simple idea to fix this problem would be to extend the system.profiles
    files. We are already storing some language-related parameters there, such
    as 'DefaultDirectionRtl'. We could add entries to indicate which characters
    can be part of a word.
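
    A rough sketch of how such per-language entries could be used, in Python
    for brevity. Everything here is an assumption for illustration only: the
    LANGUAGE_PROFILES dictionary stands in for the profile files, the
    'ExtraWordCharacters' key is a hypothetical new entry, and the values
    shown for 'DefaultDirectionRtl' are placeholders, not the real format.

```python
import re

# Hypothetical stand-in for per-language profile entries.
# 'ExtraWordCharacters' is an invented key; the values are placeholders.
LANGUAGE_PROFILES = {
    "en-US": {"DefaultDirectionRtl": False, "ExtraWordCharacters": ""},
    "ca-ES": {"DefaultDirectionRtl": False, "ExtraWordCharacters": "·"},
}

def count_words(text, lang):
    """Count words in text using the word characters configured for lang."""
    profile = LANGUAGE_PROFILES.get(lang, {})
    extra = profile.get("ExtraWordCharacters", "")
    if extra:
        # Extra characters join two runs of word characters into one word.
        joiners = "[" + re.escape(extra) + "]"
        pattern = r"\w+(?:" + joiners + r"\w+)*"
    else:
        # No entry for this language: fall back to English-style breaking.
        pattern = r"\w+"
    return len(re.findall(pattern, text))

catalan_count = count_words("il·lusion", "ca-ES")
english_count = count_words("il·lusion", "en-US")
```

    The point of keeping the list in the profile is that languages without
    an entry keep the current behaviour, so nothing changes for them.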

    What do you guys think?



    Jordi Mas i Hernàndez (homepage http://www.softcatala.org/~jmas) http://www.softcatala.org
