UTF-8 (idea) by StrawberryFrog

The 8-bit Unicode Transformation Format is a computer format for text data. It is a way to store or transmit text that combines the representational range of Unicode with the compactness of plain old ASCII.

UTF-8 is a way to encode Unicode text so that the most usual characters take up one byte each, and other characters take 2 to 4 bytes each.

The number of bytes used by UTF-8 to represent a character depends on the Unicode character number – characters with higher Unicode numbers take up more bytes. If the Unicode character number is in the range 0-127, i.e. characters identical to the US-ASCII character set, then this number, padded out to 8 bits with a leading 0, is the UTF-8 encoding.

All other characters are represented by bit strings longer than one byte, with a leading 1 in each byte. The first byte starts with a number of 1s equal to the number of bytes, then a zero. Subsequent bytes start with 10. E.g. a two-byte character is represented by bits of the form 110xxxxx 10xxxxxx and a four-byte character is represented by bits of the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.

The xxx bits contain the Unicode number of the character. Note that this leaves only 3 to 6 bits per byte to store the character number, so a two-byte sequence carries just 11 bits. Characters whose numbers need 12 to 16 bits must therefore be stored in three bytes instead of two.
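
The bit layout described above can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the function name utf8_encode is made up for this example, and the surrogate check reflects a rule of the Unicode standard that is not covered in the text above.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode character number as UTF-8 bytes (illustrative sketch)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode character number")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption beyond the text above: surrogates are not valid in UTF-8.
        raise ValueError("surrogates are not encoded in UTF-8")
    if code_point < 0x80:        # up to 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

For instance, utf8_encode(0x20AC) (the euro sign) gives the three bytes E2 82 AC, which matches Python's own "€".encode("utf-8").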

UTF-8 is defined to have only one right way to encode each character. Where more than one encoding would be possible, the shortest one is the only valid one. This is done to prevent the security problem of having strings that look identical to the user but differ to the machine, as is exploited in URL and domain name spoofing.
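
As a small illustration of the shortest-form rule, the sketch below uses Python's built-in UTF-8 decoder, which rejects overlong sequences. The names SLASH, OVERLONG_SLASH and is_valid_utf8 are hypothetical and chosen only for this example.

```python
SLASH = b"/"                   # the one valid encoding of "/" (character number 0x2F)
OVERLONG_SLASH = b"\xc0\xaf"   # the same bits padded into 110xxxxx 10xxxxxx

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Here is_valid_utf8(SLASH) is True, while is_valid_utf8(OVERLONG_SLASH) is False, because the two-byte form is not the shortest encoding of that character.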

UTF-8 is the default encoding for XML documents.

Advantages

An advantage of UTF-8 over other Unicode encodings is that, assuming you are writing text in the Latin alphabet, most byte values in UTF-8 data will be the same as in the equivalent ASCII file. In fact, if you stick to standard Roman letters, control characters and punctuation, then the UTF-8 encoding of the entire data will be bit-for-bit identical to the ASCII encoding.
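
A quick way to see this for yourself, sketched in Python (the sample strings and the helper name same_as_ascii are arbitrary choices for this example):

```python
def same_as_ascii(text: str) -> bool:
    """True if text is pure ASCII and its UTF-8 bytes equal its ASCII bytes."""
    try:
        ascii_bytes = text.encode("ascii")
    except UnicodeEncodeError:
        return False               # text contains non-ASCII characters
    return ascii_bytes == text.encode("utf-8")

plain = "Hello, world!\n"          # letters, punctuation and a control character
accented = "café"                  # the é lies outside ASCII
```

same_as_ascii(plain) is True, and in "café" only the final character differs: the first three bytes of its UTF-8 encoding are still the ASCII bytes for "caf".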

UTF-8 saves space over UTF-16 or UTF-32 encodings of Unicode text, for the common case where 7-bit characters predominate.

UTF-8 has the advantage over ASCII that the full range of Unicode characters can be stored therein.

A byte sequence that represents an entire UTF-8 character can never occur as a substring of a longer character. This makes parsing UTF-8 simpler.

Null bytes (all zero) never occur in UTF-8 text except to encode the null character. This contrasts with UTF-16, where every character in the range 0-255 (i.e. all normal ASCII characters) carries a zero byte as one of its two bytes. This is important because much old program code written with plain ASCII text in mind, especially in the C programming language, interprets a null byte as the end of the text.
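
The effect on C-style string handling can be imitated in Python. This is a rough sketch: c_string_length is a hypothetical stand-in for C's strlen, and big-endian UTF-16 without a byte order mark is chosen arbitrarily for the comparison.

```python
def c_string_length(data: bytes) -> int:
    """Mimic C's strlen: count the bytes before the first null byte."""
    end = data.find(b"\x00")
    return len(data) if end == -1 else end

text = "Hi"
as_utf8 = text.encode("utf-8")        # bytes 48 69
as_utf16 = text.encode("utf-16-be")   # bytes 00 48 00 69
```

c_string_length(as_utf8) is 2, but c_string_length(as_utf16) is 0: code that treats a null byte as the end of the text sees the UTF-16 version as an empty string.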

The length of the character can be determined by looking at its first byte. If the first bit of that byte is zero, then the character is one byte long. Otherwise, count the leading ones.
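
A minimal Python sketch of that rule follows. The function name sequence_length is made up for this example, and the function only counts bits: it does not check the remaining bytes or every validity rule of the standard.

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes the character starting with first_byte occupies."""
    if (first_byte & 0x80) == 0:
        return 1                      # 0xxxxxxx: a one-byte (ASCII) character
    ones = 0
    mask = 0x80
    while first_byte & mask:          # count the leading 1 bits
        ones += 1
        mask >>= 1
    if ones == 1 or ones > 4:
        # 10xxxxxx is a continuation byte; more than four 1s is never used.
        raise ValueError("not a valid start byte")
    return ones
```

For example, the first byte of "€" in UTF-8 is E2 (binary 11100010), so sequence_length(0xE2) returns 3.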

A reader can synchronise with a UTF-8 stream that it intercepts in mid-stream. The next character start byte will always start with the bit 0 or with the bits 11 (or to put it differently, the next character start byte will be the first byte that doesn't start with 10).
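
Resynchronising can be sketched as skipping every byte that starts with 10. The helper name next_character_start is hypothetical, and the example assumes the intercepted data is otherwise well-formed UTF-8.

```python
def next_character_start(data: bytes, position: int = 0) -> int:
    """Return the index of the first byte at or after position that starts a character."""
    while position < len(data) and (data[position] & 0xC0) == 0x80:
        position += 1                 # skip continuation bytes (10xxxxxx)
    return position

stream = "a€b".encode("utf-8")        # bytes 61 E2 82 AC 62
tail = stream[2:]                     # intercepted in the middle of the "€"
start = next_character_start(tail)    # skips 82 and AC
```

After skipping the two leftover continuation bytes, tail[start:] decodes cleanly to "b".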

Disadvantages

UTF-8 has the disadvantage that characters in many East Asian scripts are represented by 3 bytes each, whereas in UTF-16 every character with a number under 2¹⁶, Latin or East Asian, takes up 2 bytes. For a document containing mostly Japanese, Chinese or Korean text, UTF-16 may be a more efficient encoding.
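
The size trade-off is easy to measure with Python's built-in codecs. In this sketch the helper name encoded_sizes and the sample strings are arbitrary, and the little-endian codec variants are used only so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no byte order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

latin = encoded_sizes("hello")        # five ASCII letters
japanese = encoded_sizes("日本語")     # three characters numbered below 2**16
```

"hello" takes 5 bytes in UTF-8 against 10 in UTF-16 and 20 in UTF-32, while "日本語" takes 9 bytes in UTF-8 against 6 in UTF-16 and 12 in UTF-32.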

Variable-width characters are more complex to process than fixed width characters.

Data compression is sometimes performed on UTF-8 data to remove the redundancy imposed by the UTF-8 encoding scheme. This is seen as a separate issue to encoding.

Many UTF-8 parsers do not reject overlong sequences, i.e. encodings of a character for which a shorter encoding exists, and could thus be exploited with strings that differ at the byte level but decode to the same text.

History

UTF-8 was invented by Ken Thompson in 1992 and implemented by Rob Pike and Ken Thompson in the Plan 9 operating system immediately thereafter. It was initially supported by IBM.
UTF-8 is described in RFC 3629, and mandated by RFC 2277.

For more details see Wikipedia: http://en.wikipedia.org/wiki/Utf-8
