XML (Extensible Markup Language) is a generalized markup language for the exchange of information. It is generalized in that it allows users to define their own tags, so a document type definition (DTD) may be needed to interpret the data. The DTD is referenced by a declaration (DOCTYPE) near the top of the file.
Historically, XML was intended as a simplification of SGML (Standard Generalized Markup Language) and drew on earlier work on HTML. Compared to HTML, it allows custom tag definitions, and the tag structure has to be 'well formed', meaning that there must be a close tag for every open tag unless the tag is self-closing. XML is also case sensitive.
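The well-formedness and case-sensitivity rules can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample documents are illustrative assumptions, not part of any standard.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Every open tag has a matching close tag, or is self-closing.
print(is_well_formed("<book><title>Example</title><br/></book>"))  # True

# The <title> element is never closed.
print(is_well_formed("<book><title>Example</book>"))  # False

# Tag names are case sensitive, so </Title> does not close <title>.
print(is_well_formed("<book><title>Example</Title></book>"))  # False
```

An HTML parser would typically tolerate the second and third documents; an XML parser is expected to reject them.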
All XML documents require that the character encoding be defined. By default, XML documents use UTF-8 encoding. Any other encoding must be specified at the beginning of the file, as in the examples below (it is also legal to declare UTF-8 explicitly):
<?xml version='1.0' encoding='UTF-8'?>
<?xml version='1.0' encoding='EUC-JP'?>
All XML processors MUST be able to read entities in both the UTF-8 and UTF-16 encodings. Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors MUST be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents. If the replacement text of an external entity is to begin with the character U+FEFF, and no text declaration is present, then a Byte Order Mark MUST be present, whether the entity is encoded in UTF-8 or UTF-16.
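As a rough illustration of how a Byte Order Mark can be used to tell UTF-8 from UTF-16, here is a minimal Python sketch. The function names sniff_bom and decode_xml are hypothetical, and the sketch deliberately ignores the encoding declaration itself, which a real XML processor would also consult.

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding family signalled by a leading BOM, or None."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None


def decode_xml(data: bytes) -> str:
    """Decode XML bytes, assuming UTF-8 when no BOM is present."""
    family = sniff_bom(data) or "utf-8"
    # Both codecs consume the BOM, so it does not end up in the text.
    codec = "utf-8-sig" if family == "utf-8" else "utf-16"
    return data.decode(codec)


text = "<?xml version='1.0' encoding='UTF-16'?><a>caf\u00e9</a>"
data = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
print(sniff_bom(data))           # utf-16
print(decode_xml(data) == text)  # True
```

The BOM is treated here as an encoding signature only: it is stripped during decoding and never becomes part of the document's character data.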
XML is the language of choice for defining metadata. In eBooks this is mainly seen in OPF (Open eBook Package Files) and similar documents.
XML Document Formats
In practice, XML has taken the lead in defining the structure of the source document for most modern books and other documents. There is an ongoing debate over XML-based formats for document exchange.
- ODF - The OASIS Open Document Format is an XML-based format being proposed by several companies. The parent ODF committee has recently jumped ship in favor of the CDF format proposed by W3C. ODF is backed by Sun, IBM and others. It encapsulates XML in a zip file to avoid large file sizes. This format uses .ODT as the file name extension. OpenOffice uses this format.
- CDF - The Compound Document Format is proposed as an XML format by the W3C, the body that controls such important standards as HTML and XHTML. See http://www.w3.org/2004/CDF/
- CDFML - The Common Data Format XML exchange format is proposed by NASA (http://cdf.gsfc.nasa.gov/) for the open exchange of documents.
- Microsoft Office Open XML - The format being promoted by Microsoft for document exchange. It is used as a save format in Word 2007. The file is compressed with zip and used in its compressed form to save space. This file format uses a .DOCX extension for the file name.
- OSIS is an XML Schema definition for Bibles and other Biblical research texts. It finds its way into several Bible study tools.
- XPS - The XML Paper Specification, used in Windows Vista as the printer spool format.
- The standardization effort in the international community builds on the XHTML 1.1 specification and is maintained by the International Digital Publishing Forum (IDPF). See http://www.idpf.org/specs.htm. This standard defines the book data and also a container mechanism, called ePUB, to hold all of the various pieces of a book.
- RSS is a standardized format used to distribute many news releases and blogs on the Internet. As eBooks attempt to move into daily news reading, the RSS format will become very important. It is also based on XML.
- The Russian eBook community uses XML to define the Fiction Book standard. For more information see http://www.fictionbook.org/index.php/Eng:FictionBook. The format is called FB2.
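Several of the formats above (ODF and Office Open XML, for example) store XML inside a zip archive. The idea can be sketched in Python with the standard zipfile module. The member name content.xml and the tiny document used here are illustrative assumptions; the result is a toy archive, not a valid .ODT or .DOCX file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical member name and content, for illustration only.
MEMBER = "content.xml"
XML_TEXT = "<document><para>Hello</para></document>"

# Write the XML into a compressed, in-memory zip archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr(MEMBER, XML_TEXT)

# Reopen the archive, extract the member and parse it as XML.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    names = archive.namelist()
    root = ET.fromstring(archive.read(MEMBER))

print(names)                   # ['content.xml']
print(root.find("para").text)  # Hello
```

A reading application works on the compressed file directly, unpacking only the members it needs.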
XML character entity references
Unlike traditional HTML with its large range of character entity references, in XML there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:
- &amp; → & (ampersand, U+0026)
- &lt; → < (less-than sign, U+003C)
- &gt; → > (greater-than sign, U+003E)
- &quot; → " (quotation mark, U+0022)
- &apos; → ' (apostrophe, U+0027)
All other character entity references have to be defined before they can be used. However, use of &apos; in XHTML should generally be avoided for compatibility reasons, as it was not defined for HTML. &#39; or &#x27; may be used instead.
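In practice the five predefined references are usually applied by a library rather than by hand. The sketch below uses escape and unescape from Python's standard xml.sax.saxutils module, which handle the ampersand and the two angle brackets by default. The quotation mark and apostrophe mappings are passed in explicitly, and the wrapper names escape_all and unescape_all are illustrative.

```python
from xml.sax.saxutils import escape, unescape

# escape() covers &, < and > on its own; the two quote characters are extras.
QUOTES_TO_REFS = {'"': "&quot;", "'": "&apos;"}
REFS_TO_QUOTES = {"&quot;": '"', "&apos;": "'"}


def escape_all(text: str) -> str:
    """Replace all five markup-sensitive characters with entity references."""
    return escape(text, QUOTES_TO_REFS)


def unescape_all(text: str) -> str:
    """Reverse escape_all()."""
    return unescape(text, REFS_TO_QUOTES)


raw = "AT&T <'quoted'>"
escaped = escape_all(raw)
print(escaped)                       # AT&amp;T &lt;&apos;quoted&apos;&gt;
print(unescape_all(escaped) == raw)  # True
```

For XHTML output, the apostrophe entry could be mapped to &#39; instead of &apos;.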
Here is the syntax for creating an ENTITY:
<!ENTITY greeting1 "Hello world">
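To show such a declaration in use, the sketch below places it in the internal DTD subset of a small document and lets Python's standard xml.etree.ElementTree parser expand the reference. The surrounding note element is a hypothetical example.

```python
import xml.etree.ElementTree as ET

# The entity is declared inside the DOCTYPE and referenced as &greeting1;
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ENTITY greeting1 "Hello world">
]>
<note>&greeting1;</note>"""

root = ET.fromstring(DOCUMENT)
print(root.text)  # Hello world
```

Referencing an entity that has not been declared, other than the five predefined ones, makes the parser raise an error.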