TXT

From AbiWiki

TXT files are plain ASCII text documents, also called 'plain text'. Outside English-speaking countries, TXT can also mean extended text.

Overview

On current computer systems, the basic unit of information is a byte, which is made up of eight bits. A bit may be on or off. There are 256 possible combinations of on and off in a string of eight bits, so there are 256 possible representations in a byte.
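The arithmetic above can be checked with a short sketch. The helper name `bit_pattern` is made up for illustration and is not part of any AbiWord code.

```python
# Each bit has two states (on or off), so n bits give 2**n combinations.
BITS_PER_BYTE = 8
values_per_byte = 2 ** BITS_PER_BYTE


def bit_pattern(value: int) -> str:
    """Return the eight-bit on/off pattern for a byte value (0-255)."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values 0 through 255")
    return format(value, "08b")


print(values_per_byte)   # 256
print(bit_pattern(65))   # 01000001
```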

The ASCII (American Standard Code for Information Interchange) standard defines what is represented by the first 128 values of a byte (0 through 127, which fit in 7 bits). ASCII 0 through 31, along with 127 (delete), are reserved for "control" characters such as tabs, line feeds, and carriage returns. The remaining values, 32 through 126, represent printable characters: upper and lower case letters, digits, punctuation, a space, and basic mathematical symbols. Values 128 through 255 are not defined by the ASCII standard and vary with the system in use. On PCs, they are used for mathematical symbols, Greek letters, accented "national" characters (extended) and the like.

Plain ASCII text files contain only the values 0 through 127. ASCII is a subset of the ISO-8859-1 and UTF-8 character sets, as well as Windows-1252 and many others.
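As a rough illustration of the subset relationship, the following sketch checks whether some bytes are plain ASCII and confirms that such bytes decode to the same text under several encodings. The function name `is_plain_ascii` and the sample data are assumptions chosen for the example.

```python
def is_plain_ascii(data: bytes) -> bool:
    """Return True if every byte is in the ASCII range 0-127."""
    return all(byte < 0x80 for byte in data)


sample = b"Hello, world!\n"

# Pure ASCII bytes mean the same thing in all of these encodings.
decoded = {
    encoding: sample.decode(encoding)
    for encoding in ("ascii", "iso-8859-1", "cp1252", "utf-8")
}

print(is_plain_ascii(sample))                   # True
print(is_plain_ascii("café".encode("utf-8")))   # False
```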

ASCII is used on computers and is also defined for communication between devices, including long-distance communication over wired or wireless links. Binary data can also be transmitted as ASCII text by using Base64 encoding.
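A minimal sketch of the Base64 idea, using Python's standard `base64` module. The four sample bytes are arbitrary and include values outside the ASCII range.

```python
import base64

# Arbitrary binary data, including bytes that are not valid ASCII.
raw = bytes([0x00, 0xFF, 0x10, 0x80])

# Base64 turns the data into bytes drawn only from printable ASCII characters.
encoded = base64.b64encode(raw)

# Decoding recovers the original binary data exactly.
restored = base64.b64decode(encoded)

print(encoded.decode("ascii"))
print(restored == raw)   # True
```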

Issues

Here are some issues when trying to read or use TXT.

Line end convention

One issue with ASCII files is the line ending convention. The system using the file needs to know what terminates each line of text, and unfortunately this varies. Unix-like systems, including Linux and Mac OS X and later, expect lines to be terminated by a Line Feed (LF) character, ASCII 10. Apple Macintosh machines running the classic Mac OS used the Carriage Return (CR) character, ASCII 13. PCs running DOS or Windows use both, with a "CRLF" combination as the line terminator.

Depending on the origin of the ASCII file, it may be necessary to adjust the line endings for the system you use. On Windows, for example, the default plain text editor is Notepad. Older versions of Notepad (before a 2018 update to Windows 10) do not understand text files that use only LF as the line ending and will not display them properly, so text files brought over to such a PC from a Unix system need to be converted by adding a CR before each LF.
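The conversion described above can be sketched as follows. The function names `detect_line_ending` and `convert_line_endings` are hypothetical, and the detection is a simple majority count rather than anything AbiWord is known to do.

```python
def detect_line_ending(data: bytes) -> str:
    """Guess the dominant line ending: "CRLF", "CR", "LF", or "none"."""
    crlf = data.count(b"\r\n")
    # Subtract CRLF pairs so their CR and LF bytes are not counted twice.
    counts = {
        "CRLF": crlf,
        "CR": data.count(b"\r") - crlf,
        "LF": data.count(b"\n") - crlf,
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "none"


def convert_line_endings(data: bytes, target: bytes = b"\r\n") -> bytes:
    """Rewrite every CRLF, CR, or LF line ending as the target sequence."""
    # Normalize to LF first, so mixed files are handled consistently.
    normalized = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return normalized.replace(b"\n", target)


unix_text = b"first line\nsecond line\n"
print(detect_line_ending(unix_text))     # LF
print(convert_line_endings(unix_text))   # b'first line\r\nsecond line\r\n'
```

Working on bytes rather than decoded text keeps the sketch independent of the file's character encoding, as long as that encoding is ASCII-compatible.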

Reflow

Another issue when importing TXT files into AbiWord is the presumed length of a line. If the source text assumes a particular width, it is likely to have line ending characters at the end of each line. Combined with the wrapping of long lines, this produces extra lines, often very short ones, which makes the text less pleasant to read. AbiWord typically wants line ends only at the ends of paragraphs, so that it can wrap the text to the edges of the page. Some import programs attempt to achieve this by ignoring line end characters unless they see two in a row (an empty line), which marks the end of a paragraph. An additional check is to end the paragraph if the next line begins with a TAB (ASCII 9) character or with multiple spaces. (This last check can be confused if multiple lines are indented by the same amount.) Some conversion programs use the same techniques to determine paragraphs. Rearranging text in this fashion is called reflowing the text.
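The two heuristics above (an empty line, or a line that starts with a TAB or multiple spaces) can be sketched roughly as follows. This is an illustrative approximation, not AbiWord's importer; the function name `reflow` is an assumption, and "multiple spaces" is taken here to mean two or more.

```python
from typing import List


def reflow(text: str) -> List[str]:
    """Join hard-wrapped lines into paragraphs, one string per paragraph."""
    paragraphs: List[str] = []
    current: List[str] = []

    def flush() -> None:
        # Close the paragraph being collected, if there is one.
        if current:
            paragraphs.append(" ".join(current))
            current.clear()

    for line in text.splitlines():
        if not line.strip():
            # An empty line marks the end of a paragraph.
            flush()
            continue
        if line.startswith("\t") or line.startswith("  "):
            # An indented line is treated as the start of a new paragraph.
            flush()
        current.append(line.strip())

    flush()
    return paragraphs


wrapped = (
    "First line of\n"
    "paragraph one.\n"
    "\n"
    "Second paragraph\n"
    "continues here.\n"
    "\tIndented start\n"
    "of third.\n"
)
for paragraph in reflow(wrapped):
    print(paragraph)
```

As the text notes, the indentation check misfires on blocks where every line is indented by the same amount; in this sketch each such line would become its own paragraph.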

Extended Text

Plain ASCII text is really only suitable for English and a few other western languages. Most languages written in the Latin alphabet need accented characters to spell their words properly. This generally requires an 8-bit encoding such as the ISO-8859-1 standard. Windows-1252 is another 8-bit encoding scheme. It is not an ISO standard, but it supports more Latin-alphabet languages while still restricting each character to 8 bits. (Eight bits make up one byte, the standard storage size for a character in these encodings.)
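A small sketch of what these encodings can and cannot represent, using Python's built-in codecs (`cp1252` is Python's name for Windows-1252). The helper `can_encode` and the sample words are chosen for illustration only.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# "é" is outside ASCII but present in both 8-bit encodings, as one byte.
print(can_encode("café", "ascii"))         # False
print(can_encode("café", "iso-8859-1"))    # True
print(len("café".encode("iso-8859-1")))    # 4

# "œ" is in Windows-1252 but not in ISO-8859-1.
print(can_encode("cœur", "iso-8859-1"))    # False
print(can_encode("cœur", "cp1252"))        # True
```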

The newest character set is Unicode (commonly encoded as UTF-8 or UTF-16). It is intended to be universal for all languages. It contains far more than 256 different characters, so most characters must be represented using more than one byte. That said, UTF-8 is a variable-length encoding that maintains compatibility with ASCII: the ASCII characters are included as one-byte characters, and the eighth (high) bit is set on every byte of a multibyte character. Although the standard does not require it, many Windows programs (including Windows Notepad) write the byte sequence EF BB BF at the beginning of a file to indicate that the file is encoded using UTF-8. This is the Byte Order Mark, U+FEFF, encoded in UTF-8. In text editors and Web browsers that do not support UTF-8, the mark will often appear as the ISO-8859-1 characters "ï»¿".
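The following sketch illustrates the variable-length encoding and the Byte Order Mark using Python's standard `codecs` module. The function names are hypothetical; `utf-8-sig` is Python's codec name for UTF-8 with an optional leading mark.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the bytes EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8, dropping a leading Byte Order Mark if one is present."""
    return data.decode("utf-8-sig")


# ASCII characters stay one byte long; other characters take two or more.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in ("A", "é", "€")}
print(byte_lengths)   # {'A': 1, 'é': 2, '€': 3}

# A file as a BOM-writing program might save it.
with_bom = codecs.BOM_UTF8 + "café".encode("utf-8")
print(has_utf8_bom(with_bom))   # True
print(decode_utf8(with_bom))    # café

# Misreading the mark as ISO-8859-1 gives the three stray characters.
print(codecs.BOM_UTF8.decode("iso-8859-1"))   # ï»¿
```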

ASCII Chart

In the chart below the decimal value, the hexadecimal value, and the visible character (or keyboard character) for each ASCII character are shown. The control codes can be generated by holding down the Ctrl (control) key (shown as ^) while striking the key shown. The ASCII codes are:

000  0x00  ^@   032  0x20       064  0x40  @    096  0x60  `
001  0x01  ^A   033  0x21  !    065  0x41  A    097  0x61  a
002  0x02  ^B   034  0x22  "    066  0x42  B    098  0x62  b
003  0x03  ^C   035  0x23  #    067  0x43  C    099  0x63  c
004  0x04  ^D   036  0x24  $    068  0x44  D    100  0x64  d
005  0x05  ^E   037  0x25  %    069  0x45  E    101  0x65  e
006  0x06  ^F   038  0x26  &    070  0x46  F    102  0x66  f
007  0x07  ^G   039  0x27  '    071  0x47  G    103  0x67  g
008  0x08  ^H   040  0x28  (    072  0x48  H    104  0x68  h
009  0x09  ^I   041  0x29  )    073  0x49  I    105  0x69  i
010  0x0a  ^J   042  0x2a  *    074  0x4a  J    106  0x6a  j
011  0x0b  ^K   043  0x2b  +    075  0x4b  K    107  0x6b  k
012  0x0c  ^L   044  0x2c  ,    076  0x4c  L    108  0x6c  l
013  0x0d  ^M   045  0x2d  -    077  0x4d  M    109  0x6d  m
014  0x0e  ^N   046  0x2e  .    078  0x4e  N    110  0x6e  n
015  0x0f  ^O   047  0x2f  /    079  0x4f  O    111  0x6f  o
016  0x10  ^P   048  0x30  0    080  0x50  P    112  0x70  p
017  0x11  ^Q   049  0x31  1    081  0x51  Q    113  0x71  q
018  0x12  ^R   050  0x32  2    082  0x52  R    114  0x72  r
019  0x13  ^S   051  0x33  3    083  0x53  S    115  0x73  s
020  0x14  ^T   052  0x34  4    084  0x54  T    116  0x74  t
021  0x15  ^U   053  0x35  5    085  0x55  U    117  0x75  u
022  0x16  ^V   054  0x36  6    086  0x56  V    118  0x76  v
023  0x17  ^W   055  0x37  7    087  0x57  W    119  0x77  w
024  0x18  ^X   056  0x38  8    088  0x58  X    120  0x78  x
025  0x19  ^Y   057  0x39  9    089  0x59  Y    121  0x79  y
026  0x1a  ^Z   058  0x3a  :    090  0x5a  Z    122  0x7a  z
027  0x1b  ^[   059  0x3b  ;    091  0x5b  [    123  0x7b  {
028  0x1c  ^\   060  0x3c  <    092  0x5c  \    124  0x7c  |
029  0x1d  ^]   061  0x3d  =    093  0x5d  ]    125  0x7d  }
030  0x1e  ^^   062  0x3e  >    094  0x5e  ^    126  0x7e  ~
031  0x1f  ^_   063  0x3f  ?    095  0x5f  _    127  0x7f  ⌂
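The relationship between the first and third columns of the chart (a control code is the value of the key shown, less 64) can be sketched as follows. The function names are made up for the example, and the sketch covers only codes 0 through 31.

```python
def ctrl_code(key: str) -> int:
    """Return the control code produced by Ctrl plus the given key."""
    code = ord(key.upper())
    if not 64 <= code <= 95:
        raise ValueError("key must be one of @, A-Z, [, \\, ], ^, _")
    # Keeping only the low five bits is the same as subtracting 64 here.
    return code & 0x1F


def caret_notation(code: int) -> str:
    """Return the ^X form used in the chart for a control code 0-31."""
    if not 0 <= code <= 31:
        raise ValueError("only codes 0 through 31 are handled")
    return "^" + chr(code + 64)


print(ctrl_code("I"))       # 9
print(ctrl_code("["))       # 27
print(caret_notation(10))   # ^J
print(caret_notation(13))   # ^M
```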

Some of the more important control codes are:

008 is generally backspace
009 is tab
010 is line feed
012 is form feed (new page)
013 is carriage return
027 is the esc key
127 is typically the delete key (rubout)

Most of the control codes are intended for communication use.

For more information

  • ASCII chart - includes control code decode.
  • Text editors.org - contains information on all text editors you might want to use for editing TXT files.