smart quote algorithm

Subject: smart quote algorithm
From: WJCarpenter (bill-abisource@carpenter.ORG)
Date: Thu Jul 20 2000 - 01:55:54 CDT

[[PREAMBLE: I'm currently hankering to get smart quotes right, but it
may not be worthy of the amount of mental effort the group as a whole
has been putting into it over the last few days. Your advice, suggestions,
and criticisms are welcome, but I hope nobody is significantly
distracted from other features by pondering this. Some of those other
features bug me, too. :-)]]


Here is my suggestion for an algorithm for smart quote substitution.
The stuff below deals only with recognizing the situation in the
context of a static piece of text. I believe it is useful to first
get to an algorithm before worrying about the more subtle problems of
an "as you type" implementation. It also leaves aside for the moment
whether this is just a display/print remapping or a change to document

This algorithm is based on my observation of how people actually use
quotation marks, sometimes in contravention of generally accepted
principles of punctuation. It is certainly also true that my
observations are overwhelmingly of American English text, with a
smattering of various other languages observed from time to time. I
don't believe that any algorithm for this can ever be perfect. There
are too many infrequently-occurring but legitimate cases where a user
might want something else. FWIW, I haven't tested out the specifics
of the smart quote algorithm in ThatOtherWordProcessor.

Some terms for the purpose of this discussion (I'm open to plenty of
advice on what specific items should fit in each of these classes):

BREAK A structural break in a document. For example, a paragraph
  break, a column break, a page break, the beginning or end of a
  document, etc. Does not include font, size, bold/italic/underline
  changes (which are completely ignored for the purposes of this
  algorithm).

PUNCT A subset of layman's "punctuation". I include only things that
  can normally occur after a quote mark with no intervening white
  space. Includes period, exclamation point, question mark,
  semi-colon, colon, comma, parentheses, square and curly brackets.
  There may be a few others that aren't on the kinds of keyboards I
  use, and there are certainly Latin1 and other locale-specific
  variants, but the point is that there are lots of random
  non-alphanumerics which aren't included in PUNCT for this algorithm.

ALPHA Alphabetic characters in the C isalpha() sense, but there are
  certainly some non-ASCII letter characters which belong in this
  bucket, too.

QUOTE Any of ASCII double quote, ASCII quote (which many people call
  the ASCII single quote or the ASCII apostrophe), or ASCII backquote.
  I take it as given that a significant minority of people randomly or
  systematically interchange their use of ASCII quote and ASCII
  backquote, so I treat them the same in the algorithm. The majority
  of people use ASCII quote for both opening and closing single quote.

PARITY Whether a quote is single or double. For ease of description,
  I'll say that single and double quotes have opposite parity.
  When QUOTEs are converted to curly form, the parity
  never changes.
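
As a rough illustration of the classes above, here is a minimal Python
sketch. The names (classify, parity, PUNCT, and so on) are my own
assumptions, not anything from AbiWord. I model the beginning or end of
a document as None and a paragraph break as a newline, and I lean on
Python's Unicode-aware str.isalpha() for the non-ASCII letters.

```python
# Hypothetical helper names; the class contents follow the definitions above.
PUNCT = set(".!?;:,()[]{}")           # things that can directly follow a quote
ASCII_QUOTES = {'"', "'", "`"}        # double quote, quote, backquote
CURLY_DOUBLE = {"\u201c", "\u201d"}   # opening / closing curly double quote
CURLY_SINGLE = {"\u2018", "\u2019"}   # opening / closing curly single quote


def classify(ch):
    """Return the class of one character (None stands for document start/end)."""
    if ch is None or ch == "\n":      # assumption: newline marks a paragraph break
        return "BREAK"
    if ch.isspace():
        return "SPACE"
    if ch in PUNCT:
        return "PUNCT"
    if ch in ASCII_QUOTES:
        return "QUOTE"
    if ch.isalpha():                  # Unicode-aware, so accented letters count
        return "ALPHA"
    return "OTHER"                    # digits and all other non-alphanumerics


def parity(ch):
    """Return 'single' or 'double' for any quote character, else None."""
    if ch == '"' or ch in CURLY_DOUBLE:
        return "double"
    if ch in ASCII_QUOTES or ch in CURLY_SINGLE:
        return "single"               # ASCII quote and backquote are treated alike
    return None
```

Font and style changes never show up here because the sketch only ever
looks at characters, which matches the "completely ignored" intent.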


OK, first an easy exception case: If ASCII (single) quote (but not
ASCII backquote) appears between two ALPHAs, it may be treated as an
apostrophe and converted to its curly form. Otherwise, it is treated
like all other QUOTEs and follows the normal algorithm.
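
A minimal sketch of that exception on its own, in Python. The function
names are hypothetical, and I am assuming the "curly form" of an
apostrophe is U+2019 (the same glyph as the closing single quote).

```python
APOSTROPHE = "\u2019"  # assumption: curly apostrophe == closing single quote


def is_apostrophe(text, i):
    """True if text[i] is an ASCII quote sitting between two ALPHAs."""
    if text[i] != "'":                 # backquote is deliberately excluded
        return False
    before = text[i - 1] if i > 0 else ""
    after = text[i + 1] if i + 1 < len(text) else ""
    return before.isalpha() and after.isalpha()   # "".isalpha() is False


def convert_apostrophes(text):
    """Replace only the between-letters ASCII quotes; leave every other QUOTE."""
    return "".join(
        APOSTROPHE if is_apostrophe(text, i) else ch
        for i, ch in enumerate(text)
    )
```

Note that a leading apostrophe (as in 'tis) does not qualify, so it
falls through to the normal rules.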

Given a QUOTE character, these conditions are logically tested in order:

1. If a QUOTE is immediately preceded by a curly quote of opposite
parity, it is converted to a curly quote in the same direction.

2. If a QUOTE is immediately preceded by a curly quote of the same
parity, it is converted to a curly quote of opposite direction.

3. If a QUOTE is immediately followed by a curly quote of opposite
parity, it is converted to a curly quote in the same direction.

4. If a QUOTE is immediately followed by a curly quote of the same
parity, it is converted to a curly quote of opposite direction.

[[The above cases are intended to handle normal nested quotes or cases
where quotes enclose empty strings. Different cultures use different
parities as start points for nested quotes, but the algorithm doesn't

5. If a QUOTE is in isolation, it is not converted. It is in
isolation if it is immediately preceded and followed by either a BREAK
or white space. The things before and after it don't have to be of
the same type.

6. If a QUOTE is immediately preceded by a BREAK or white space and
is immediately followed by anything other than a BREAK or white space,
it is converted to the opening form of curly quote.

7. If a QUOTE is immediately followed by a BREAK, white space, or
PUNCT and is immediately preceded by anything other than BREAK or
white space, it is converted to the closing form of curly quote.

8. Any other QUOTE is not converted.
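
To make the ordering concrete, here is a sketch of rules 1 through 8
(plus the apostrophe exception) applied to a static string, in Python.
All names are hypothetical. Two things are my own assumptions rather
than part of the description: document start/end is modelled as None,
and the text is rescanned until nothing changes, so that rules 3 and 4
can see a neighbour that was only converted later in an earlier pass.

```python
CURLY = {  # curly quote -> (parity, direction)
    "\u201c": ("double", "open"), "\u201d": ("double", "close"),
    "\u2018": ("single", "open"), "\u2019": ("single", "close"),
}
MAKE = {v: k for k, v in CURLY.items()}   # (parity, direction) -> curly quote
PUNCT = set(".!?;:,()[]{}")
ASCII_QUOTES = {'"', "'", "`"}


def is_gap(ch):
    """BREAK or white space; None stands for the start or end of the text."""
    return ch is None or ch.isspace()


def beside_curly(par, neighbour):
    """Rules 1-4: same direction if parities differ, opposite if they match."""
    n_par, n_dir = CURLY[neighbour]
    if n_par != par:
        return MAKE[(par, n_dir)]
    return MAKE[(par, "close" if n_dir == "open" else "open")]


def convert_at(chars, i):
    """Return the replacement for the ASCII QUOTE at chars[i] (maybe itself)."""
    ch = chars[i]
    prev = chars[i - 1] if i > 0 else None
    nxt = chars[i + 1] if i + 1 < len(chars) else None
    par = "double" if ch == '"' else "single"
    # Exception: ASCII quote (not backquote) between two ALPHAs is an apostrophe.
    if ch == "'" and prev and nxt and prev.isalpha() and nxt.isalpha():
        return MAKE[("single", "close")]
    if prev in CURLY:                       # rules 1 and 2
        return beside_curly(par, prev)
    if nxt in CURLY:                        # rules 3 and 4
        return beside_curly(par, nxt)
    if is_gap(prev) and is_gap(nxt):        # rule 5: isolated, leave alone
        return ch
    if is_gap(prev):                        # rule 6: opening form
        return MAKE[(par, "open")]
    if is_gap(nxt) or nxt in PUNCT:         # rule 7: closing form
        return MAKE[(par, "close")]
    return ch                               # rule 8: not converted


def smart_quotes(text):
    chars = list(text)
    changed = True
    while changed:                          # assumption: rescan to a fixed point
        changed = False
        for i, ch in enumerate(chars):
            if ch in ASCII_QUOTES:
                new = convert_at(chars, i)
                if new != ch:
                    chars[i] = new
                    changed = True
    return "".join(chars)
```

The loop terminates because a pass only ever turns ASCII QUOTEs into
curly ones, never the reverse. With a single left-to-right pass, the
inner closing quote in "He said 'hi'" would fall through to rule 8,
since its right-hand neighbour is still an ASCII double quote at that
moment; the rescan lets rule 3 pick it up.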


The algorithm doesn't make a special case of using ASCII double quote
as an inches indicator (there are other uses, like lat/long minutes;
ditto for the ASCII quote) because it is tough to tell if some numbers
with an ASCII double quote after them are intended to be one of those
"other things" or are just the end of a very long quote. So, the
algorithm will be wrong sometimes in those cases.
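
A small Python sketch of why the inches case is ambiguous. It checks
only the rule 7 condition (the function name is hypothetical, and None
again stands for the start or end of the text). An inch mark and a
genuine closing quote look identical to that test.

```python
PUNCT = set(".!?;:,()[]{}")


def rule7_applies(text, i):
    """True if the QUOTE at text[i] meets the rule 7 (closing quote) condition."""
    prev = text[i - 1] if i > 0 else None
    nxt = text[i + 1] if i + 1 < len(text) else None
    followed_ok = nxt is None or nxt.isspace() or nxt in PUNCT
    preceded_ok = prev is not None and not prev.isspace()
    return followed_ok and preceded_ok


inches = 'The board is 6" wide'
quoted = 'He read "chapter 6" aloud'

# Both quotes sit right after a digit and right before a space.
inch_mark_closes = rule7_applies(inches, inches.index('"'))
real_quote_closes = rule7_applies(quoted, quoted.rindex('"'))
```

Both results come out True, so the inch mark would be curled just like
the real closing quote.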

It is otherwise sort of conservative, preferring to not convert things
it doesn't feel confident about. The reason for that is that there is
a contemplated on-the-fly conversion to smart quotes, but there is no
contemplated on-the-fly conversion to ASCII QUOTEs. So, if the
algorithm makes a mistake by not converting, the user can correct it
by directly entering the appropriate smart quote character or by
heuristically tricking AbiWord into converting it for him/her and then
fixing things up. (That heuristic step shouldn't be necessary, you
know, but I think we all use software for which we have become
accustomed to such things.)

What about the occasions when this algorithm (or any alternative
algorithm) makes a mistake and converts a QUOTE to the curly form when
it really isn't wanted, in a particular case, by the user? Although
the user can change it back, some contemplated implementation details
might run around behind the barn and re-convert it when the user isn't
looking. I think we need a mechanism for dealing with that, but I
want to keep proposals for that separate from the basic algorithm.

bill@carpenter.ORG (WJCarpenter)    PGP 0x91865119
38 95 1B 69 C9 C6 3D 25    73 46 32 04 69 D6 ED F3
