Re: Implementing support for barbarisms correction

From: Andrew Dunbar (
Date: Sun Sep 22 2002 - 00:58:38 EDT

     --- Jordi Mas <> wrote:
    > Hello,
    > A common problem in the Catalan language is barbarisms:
    > words that are incorrect but widely used. One common
    > reason is that in the areas where Catalan is spoken
    > people usually also speak French or Spanish, so they
    > easily borrow words from one language into the other.
    > In Catalan, for example, there are glossaries of
    > barbarisms, which usually list the incorrect word
    > (the barbarism) and a proper replacement, for
    > example "tamany, mida". "Tamany" is borrowed from
    > Spanish "tamaño" but is incorrect in Catalan; the
    > proper word is "mida", which means "size". Since
    > "tamany" and "mida" are very different words, spell
    > checking programs cannot make a good suggestion,
    > because this is not a typo, just an incorrect word
    > being used.
    > Well, coming back to AbiWord. I have been thinking
    > of implementing an optional barbarism file for every
    > language: if the file is present it is used, if not
    > nothing happens. It is just a list of incorrect
    > words and their correct replacements.
    > Does anybody have a problem with me implementing
    > this? Are there other languages where this could be
    > useful?

    I think that most languages have a concept like this.
    English is probably largely the exception. Languages
    with active "police" or language academies could
    certainly benefit from this and probably already have
    word lists around. French comes to mind straight away.
    I think Serbian, Croatian, and Bosnian are now
    undergoing a process of separating out their vocabularies
    in a way which would be compatible with this too.

    It ought to be part of the spelling/grammar/style
    infrastructure (not that there's a whole lot of
    infrastructure yet). It should be extensible in a
    similar way to user dictionaries. Files should have
    language tag names such as "ca.barb", and should
    avoid over-specific tags such as "ca-ES" if the same
    rules are applicable to, say, France or Andorra.
    Files should either always be in UTF-8 or specify
    encoding as part of their format. A very simple XML
    format is preferred.
    In fact maybe it should be part of the grammar-checker
    world with a separate option switch. This would make
    it easy to share, say, green squiggle underlines with
    the grammar checker.
    If adding it to the spelling-checker makes more sense
    that would also work.
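
    To make the file idea above concrete, here is a rough sketch in
    Python (not AbiWord code) of a UTF-8 XML barbarism file, the
    fallback from a specific tag such as "ca-ES" to plain "ca", and a
    lookup. The element and attribute names ("barbarisms", "entry",
    "wrong", "right") and the function names are only assumptions for
    illustration; nothing here is an agreed format.

```python
import xml.etree.ElementTree as ET

# Hypothetical content of a "ca.barb" file. The element and attribute
# names are assumptions for illustration, not an agreed AbiWord format.
SAMPLE_CA_BARB = """<?xml version="1.0" encoding="UTF-8"?>
<barbarisms lang="ca">
  <entry wrong="tamany" right="mida"/>
</barbarisms>
"""


def candidate_filenames(lang_tag):
    """Return file names to try, most specific first (ca-ES.barb, ca.barb)."""
    parts = lang_tag.split("-")
    return ["-".join(parts[:i]) + ".barb" for i in range(len(parts), 0, -1)]


def parse_barbarisms(xml_text):
    """Parse the XML text into a dict mapping barbarism -> replacement."""
    # Encode to bytes so the parser honours the declared UTF-8 encoding.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return {e.get("wrong").lower(): e.get("right") for e in root.iter("entry")}


def suggest(word, table):
    """Return the replacement for a known barbarism, or None if not listed."""
    return table.get(word.lower())


if __name__ == "__main__":
    table = parse_barbarisms(SAMPLE_CA_BARB)
    print(candidate_filenames("ca-ES"))
    print(suggest("tamany", table))
```

    A checker would try each candidate file name in order and use the
    first one that exists, so a single "ca.barb" can serve Spain,
    France and Andorra unless a more specific file is supplied.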

    Andrew Dunbar.

    > Thanks!
    > --
    > Jordi Mas


