TXT
From AbiWiki
TXT files are plain ASCII text documents, also called 'plain text'. Outside of English-speaking countries, TXT can also mean extended text.
Overview
On current computer systems, the basic unit of information is a byte, which is made up of eight bits. A bit may be on or off. There are 256 possible combinations of on and off in a string of eight bits, so there are 256 possible representations in a byte.
The ASCII (American Standard Code for Information Interchange) standard defines what is represented by the first 128 values of a byte (0 through 127, which fit in 7 bits). ASCII 0 through 31, along with 127 (delete), are reserved for "control" characters, like tabs, line feeds, and carriage returns. The values 32 through 126 represent printable characters, that is, upper and lower case letters, digits, punctuation, a space, and basic mathematical symbols. Values between 128 and 255 are not defined by the ASCII standard, and their meaning varies depending upon the system used. On PCs, they have been used for mathematical symbols, Greek letters, accented "national" characters (extended) and the like.
Plain ASCII text files contain only the values 0 through 127. ASCII is a subset of the ISO-8859-1 and UTF-8 standard character sets, as well as Windows-1252 and many others, so a plain ASCII file reads the same under any of them.
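As a rough illustration of the subset relationship, the following Python sketch checks whether a block of bytes is plain ASCII and confirms that such bytes decode to the same text under several encodings. The function name is_plain_ascii and the sample text are made up for this example.

```python
def is_plain_ascii(data: bytes) -> bool:
    """Return True if every byte is in the ASCII range 0 through 127."""
    return all(byte < 128 for byte in data)


# Hypothetical sample: bytes that use only ASCII values.
sample = b"Plain text, 100% ASCII.\n"

# Decode the same bytes under four encodings and collect the distinct results.
decoded = {
    sample.decode(encoding)
    for encoding in ("ascii", "iso-8859-1", "utf-8", "cp1252")
}

print(is_plain_ascii(sample))                  # ASCII-only bytes
print(is_plain_ascii("café".encode("utf-8")))  # contains bytes above 127
print(len(decoded))                            # number of distinct decodings
```

If the file contains any byte above 127, the encodings stop agreeing and the reader has to know which one was used.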
ASCII is used on computers and is also defined for communication between devices and even long distance communication over wire or wireless. Binary data can also be transmitted as ASCII text by using Base64 encoding, which represents arbitrary bytes using only printable ASCII characters.
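A minimal sketch of the Base64 idea, using Python's standard base64 module. The input bytes are arbitrary values chosen for illustration.

```python
import base64

# Arbitrary non-text bytes, including values outside the ASCII range.
binary = bytes([0, 255, 128, 10, 13])

encoded = base64.b64encode(binary)    # bytes made up of ASCII characters only
restored = base64.b64decode(encoded)  # the original binary data again

print(encoded.decode("ascii"))
print(restored == binary)
```

The encoded form is longer than the input (every 3 bytes become 4 characters), which is the price of staying within plain text.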
Issues
Here are some issues when trying to read or use TXT.
Line end convention
One issue with ASCII files is the line ending convention. The system using the file needs to know what terminates any particular line of text. Unfortunately, this can vary. Systems running a flavor of the Unix operating system (including Linux and the current Unix-based macOS) expect lines to be terminated by a Line Feed character, ASCII 10. Apple Macintosh machines running the classic Mac OS used the Carriage Return character, ASCII 13. PCs running DOS or Windows use both, with a "CRLF" combination as the line terminator.
Depending on the origin of the ASCII file, it may be necessary to adjust the line endings for the system you use. On Windows, for example, the default plain text file editor is Notepad. Versions of Notepad released before the Windows 10 update of October 2018 (version 1809) do not understand text files using only LF as the line ending, and will not display them properly, so text files brought over to such a PC from a Unix system need to be massaged to have a CR added before the LF at every line end.
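The conventions described above can be detected and converted with a few lines of code. The following Python sketch is one possible approach, not a definitive tool; the function names detect_line_ending and convert_line_endings are invented for this example. It works on raw bytes so that Python's own newline translation does not interfere.

```python
def detect_line_ending(data: bytes) -> str:
    """Guess the dominant line ending: 'CRLF', 'LF', 'CR', or 'none'."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # line feeds not preceded by CR
    cr = data.count(b"\r") - crlf   # carriage returns not followed by LF
    counts = {"CRLF": crlf, "LF": lf, "CR": cr}
    if not any(counts.values()):
        return "none"
    return max(counts, key=counts.get)


def convert_line_endings(data: bytes, newline: bytes = b"\r\n") -> bytes:
    """Rewrite every CRLF, lone CR, or lone LF as the requested terminator."""
    # Normalize to LF first so that existing CRLF pairs are not doubled.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", newline)


unix_text = b"first line\nsecond line\n"
print(detect_line_ending(unix_text))
print(convert_line_endings(unix_text))
```

Normalizing to a single convention before converting avoids the common mistake of turning CRLF into CRCRLF.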
Reflow
Another issue when trying to import TXT files into AbiWord is the presumed length of a line. If the source file text assumes a particular width, it is likely to have line ending characters at the end of each line. This, combined with the wrapping of long lines, produces extra lines and often very short lines, making the reading experience less pleasant. Typically AbiWord wants line ends only at the ends of paragraphs so that it can wrap the text to the edges of the page. Some import programs attempt to achieve this by ignoring line end characters unless they see two in a row (an empty line), which marks the end of the paragraph. An additional check is to mark the end of a paragraph if the next line begins with a TAB (ASCII 9) character or with multiple spaces. (This last check can be confused if multiple lines are indented by the same amount.) These techniques are also used by some conversion programs to determine paragraphs. When text is rearranged in this fashion it is called reflowing the text.
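The two paragraph checks described above (an empty line, or a following line that starts with a TAB or multiple spaces) can be sketched as follows. This is an illustrative Python example only and does not reflect how AbiWord's importer is actually written; the function name reflow is made up here.

```python
from typing import List


def reflow(text: str) -> List[str]:
    """Join hard-wrapped lines into paragraphs, one string per paragraph."""
    paragraphs: List[str] = []
    current: List[str] = []

    def flush() -> None:
        # Emit the lines collected so far as one paragraph.
        if current:
            paragraphs.append(" ".join(current))
            current.clear()

    for line in text.splitlines():
        if not line.strip():
            flush()            # an empty line ends the paragraph
            continue
        if line.startswith("\t") or line.startswith("  "):
            flush()            # an indented line starts a new paragraph
        current.append(line.strip())

    flush()
    return paragraphs


wrapped = (
    "First line of\nparagraph one.\n\n"
    "Second paragraph\nwraps here.\n"
    "\tIndented third\nparagraph.\n"
)
for paragraph in reflow(wrapped):
    print(paragraph)
```

As the text notes, the indentation check is a heuristic: a block in which every line is indented would be split into one paragraph per line.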
Extended Text
Plain ASCII text is really only suitable for English and a few other western languages. Most Latin derived languages need accented characters to properly write the words of the language. This generally requires an 8 bit encoding such as the ISO-8859-1 standard character set. Windows-1252 is another 8 bit encoding scheme. It is not an ISO standard but does permit more Latin derived languages to be supported while still restricting characters to 8 total bits. (8 bits is the standard computer character size called a byte.)
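To make the difference concrete, the following Python sketch encodes an accented word in both 8 bit encodings and then tries a character that only one of them contains. The choice of the word "café" and of the euro sign is purely illustrative; "cp1252" is Python's codec name for Windows-1252.

```python
word = "café"

# The accented letter fits in a single byte in both 8 bit encodings.
latin1_bytes = word.encode("iso-8859-1")
cp1252_bytes = word.encode("cp1252")
print(latin1_bytes.hex(), cp1252_bytes.hex())

# Plain ASCII has no code for the accented letter at all.
try:
    word.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

# The euro sign is one of the characters Windows-1252 has and ISO-8859-1 lacks.
euro_cp1252 = "€".encode("cp1252")
try:
    "€".encode("iso-8859-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False

print(fits_in_ascii, euro_cp1252.hex(), euro_in_latin1)
```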
The modern universal character set is Unicode (commonly encoded as UTF-8 or UTF-16). It is intended to cover all languages. There are many more than 256 different characters, so most characters must be represented using more than one byte. That said, UTF-8 is a variable length character code that maintains compatibility with ASCII in that the ASCII characters are encoded as single bytes with their usual values. The 8th (high) bit is set in every byte of a multibyte character. Although the Unicode standard neither requires nor recommends it, many Windows programs (including Windows Notepad) write the byte sequence EF BB BF at the beginning of a file to indicate that the file is encoded using UTF-8. This is the Byte Order Mark U+FEFF encoded in UTF-8. In text editors and Web browsers which do not support UTF-8, the mark will often appear as the ISO-8859-1 characters "ï»¿".
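The byte values mentioned above can be checked directly. This Python sketch shows the Byte Order Mark as UTF-8 bytes, what those bytes look like when misread as ISO-8859-1, and one way to strip the mark; the function name strip_utf8_bom is invented for the example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 Byte Order Mark (EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


bom = "\ufeff".encode("utf-8")         # U+FEFF encoded in UTF-8
misread = bom.decode("iso-8859-1")     # what a non-UTF-8 viewer would show

ascii_letter = "A".encode("utf-8")     # one byte, high bit clear
accented = "é".encode("utf-8")         # two bytes, high bit set in both

print(bom.hex(), misread)
print(ascii_letter.hex(), accented.hex())
print(strip_utf8_bom(bom + b"hello"))
```

When reading a file as text, Python's "utf-8-sig" codec performs the same stripping automatically.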
ASCII Chart
In the chart below the decimal value, the hexadecimal value, and the visible character (or keyboard character) for each ASCII character are shown. The control codes can be generated by holding down the Ctrl (control) key (shown as ^) while striking the key shown. The ASCII codes are:
Dec Hex  Chr   Dec Hex  Chr       Dec Hex  Chr   Dec Hex  Chr
000 0x00 ^@    032 0x20 (space)   064 0x40 @     096 0x60 `
001 0x01 ^A    033 0x21 !         065 0x41 A     097 0x61 a
002 0x02 ^B    034 0x22 "         066 0x42 B     098 0x62 b
003 0x03 ^C    035 0x23 #         067 0x43 C     099 0x63 c
004 0x04 ^D    036 0x24 $         068 0x44 D     100 0x64 d
005 0x05 ^E    037 0x25 %         069 0x45 E     101 0x65 e
006 0x06 ^F    038 0x26 &         070 0x46 F     102 0x66 f
007 0x07 ^G    039 0x27 '         071 0x47 G     103 0x67 g
008 0x08 ^H    040 0x28 (         072 0x48 H     104 0x68 h
009 0x09 ^I    041 0x29 )         073 0x49 I     105 0x69 i
010 0x0a ^J    042 0x2a *         074 0x4a J     106 0x6a j
011 0x0b ^K    043 0x2b +         075 0x4b K     107 0x6b k
012 0x0c ^L    044 0x2c ,         076 0x4c L     108 0x6c l
013 0x0d ^M    045 0x2d -         077 0x4d M     109 0x6d m
014 0x0e ^N    046 0x2e .         078 0x4e N     110 0x6e n
015 0x0f ^O    047 0x2f /         079 0x4f O     111 0x6f o
016 0x10 ^P    048 0x30 0         080 0x50 P     112 0x70 p
017 0x11 ^Q    049 0x31 1         081 0x51 Q     113 0x71 q
018 0x12 ^R    050 0x32 2         082 0x52 R     114 0x72 r
019 0x13 ^S    051 0x33 3         083 0x53 S     115 0x73 s
020 0x14 ^T    052 0x34 4         084 0x54 T     116 0x74 t
021 0x15 ^U    053 0x35 5         085 0x55 U     117 0x75 u
022 0x16 ^V    054 0x36 6         086 0x56 V     118 0x76 v
023 0x17 ^W    055 0x37 7         087 0x57 W     119 0x77 w
024 0x18 ^X    056 0x38 8         088 0x58 X     120 0x78 x
025 0x19 ^Y    057 0x39 9         089 0x59 Y     121 0x79 y
026 0x1a ^Z    058 0x3a :         090 0x5a Z     122 0x7a z
027 0x1b ^[    059 0x3b ;         091 0x5b [     123 0x7b {
028 0x1c ^\    060 0x3c <         092 0x5c \     124 0x7c |
029 0x1d ^]    061 0x3d =         093 0x5d ]     125 0x7d }
030 0x1e ^^    062 0x3e >         094 0x5e ^     126 0x7e ~
031 0x1f ^_    063 0x3f ?         095 0x5f _     127 0x7f ⌂
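The relationship between the control codes and the Ctrl key can be read off the chart: each control code is the value of the key shown minus 64 (0x40). A small Python sketch of that arithmetic follows; the function name control_code is made up for this example.

```python
def control_code(key: str) -> int:
    """Return the ASCII value produced by Ctrl plus the given key.

    Valid keys are the characters 64 through 95 in the chart:
    @, A-Z, [, backslash, ], ^ and _ (letters may be given in either case).
    """
    value = ord(key.upper())
    if not 0x40 <= value <= 0x5F:
        raise ValueError(f"no control code for key {key!r}")
    return value - 0x40


for key in ("@", "I", "J", "M", "["):
    print(f"^{key} = {control_code(key):03d}")
```

For instance, Ctrl+I gives 73 - 64 = 9, the tab character, and Ctrl+[ gives 91 - 64 = 27, the escape code.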
Some of the more important control codes are:
- 008 is generally backspace
- 009 is tab
- 010 is line feed
- 012 is form feed (new page)
- 013 is carriage return
- 027 is the esc key
- 127 is typically the delete key (rubout)
Most of the control codes are intended for communication use.
For more information
- ASCII chart - includes control code decode.
- Text editors.org - contains information on all text editors you might want to use for editing TXT files.