This Unicode Transformation Format serializes each Unicode value as one 16-bit code unit (two bytes) or, for values above U+FFFF, as two code units (four bytes, a surrogate pair). UTF-16 data can be stored in either little-endian or big-endian byte order. An initial byte sequence called the byte order mark (BOM) may be placed at the start of the data to signal which order is in use. The BOM is U+FEFF ZERO WIDTH NO-BREAK SPACE, which has no visible effect if it is treated as text, and its byte sequence depends on the byte order: FE FF in big-endian UTF-16 and FF FE in little-endian UTF-16.
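To prevent ambiguity, U+FFFE is not a character: a decoder that finds FFFE as the first code unit can tell that it is reading the bytes in the wrong order.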
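The BOM check described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete encoding detector: the function name `detect_utf16_byte_order` is hypothetical, and the sketch assumes the input is already known to be UTF-16 (it does not try to rule out other encodings whose BOM begins with the same bytes).

```python
import codecs

def detect_utf16_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unknown' based on a leading BOM.

    Assumes the data is UTF-16; other encodings are not considered.
    """
    if data[:2] == codecs.BOM_UTF16_BE:  # bytes FE FF
        return "big"
    if data[:2] == codecs.BOM_UTF16_LE:  # bytes FF FE
        return "little"
    return "unknown"  # no BOM present

# The same character "A" serialized in both byte orders, each with its BOM.
sample_be = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
sample_le = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
print(detect_utf16_byte_order(sample_be))  # big
print(detect_utf16_byte_order(sample_le))  # little
```

Because U+FFFE is not a character, the two byte sequences cannot be confused with each other: FF FE can only be a little-endian BOM, never big-endian text.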
The Unicode codespace is allocated into several areas, one being the Surrogate Area, which consists of 1,024 high surrogates (U+D800 - U+DBFF) and 1,024 low surrogates (U+DC00 - U+DFFF).
A high surrogate followed by a low surrogate forms a surrogate pair that represents a single Unicode scalar value. Exactly 1,048,576 (1,024 x 1,024) surrogate pairs are possible, covering U+10000 through U+10FFFF, and their values can be derived from this formula:
65536 + ((highSurrogate & 1023) << 10) + (lowSurrogate & 1023)
In plain English, it takes the last ten binary digits from both surrogates, concatenates them, and adds 2^16 (65,536) to that number.
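As a rough sketch, the formula can be wrapped in a Python function together with its inverse. The function names `decode_surrogate_pair` and `encode_surrogate_pair` are made up for this illustration, and the range checks are an addition that the formula itself does not include.

```python
from typing import Tuple

def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate into one Unicode scalar value."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#06x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#06x" % low)
    # Same expression as the formula in the text.
    return 65536 + ((high & 1023) << 10) + (low & 1023)

def encode_surrogate_pair(value: int) -> Tuple[int, int]:
    """Split a value in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= value <= 0x10FFFF:
        raise ValueError("value is outside the surrogate pair range")
    offset = value - 65536  # a 20-bit number
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 1023)

print(hex(decode_surrogate_pair(0xD800, 0xDC00)))  # 0x10000
print(hex(decode_surrogate_pair(0xDBFF, 0xDFFF)))  # 0x10ffff
print([hex(u) for u in encode_surrogate_pair(0x1D11E)])  # ['0xd834', '0xdd1e']
```

The first and last possible pairs map to U+10000 and U+10FFFF, which is where the count of 1,024 x 1,024 values comes from.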
As of Version 3.0, none of the surrogate pairs have been assigned.
UTF-16 on average can save about a byte per character over UTF-8 when encoding East Asian text, because most East Asian characters take three bytes in UTF-8 but only two in UTF-16.
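The size difference is easy to check with Python's built-in codecs. This is only an illustrative measurement on one short sample string chosen for this example; the helper name `encoded_sizes` is hypothetical, and the UTF-16 size is taken without a BOM.

```python
from typing import Tuple

def encoded_sizes(text: str) -> Tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count without a BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))

# Eight Japanese characters, all below U+FFFF.
sample = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8"
utf8_size, utf16_size = encoded_sizes(sample)
print(utf8_size, utf16_size)  # 24 16
print((utf8_size - utf16_size) / len(sample))  # 1.0 byte saved per character
```

Text that mixes in ASCII (spaces, digits, markup) narrows the gap, since those characters take one byte in UTF-8 and two in UTF-16.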
Sources (PDF and PowerPoint files):
- "The Unicode Standard, Version 3.0" Section 2.3, Encoding Forms.
http://www.Unicode.org/book/ch02.pdf
- "The Unicode Standard, Version 3.0" Section 3.7, Surrogates.
http://www.Unicode.org/book/ch03.pdf
- "The Unicode Standard, Version 3.0" Section 5.4, Handling Surrogate Pairs.
http://www.Unicode.org/book/ch05.pdf
- "Surrogate Support in Microsoft Products."
http://www.Unicode.org/iuc/iuc18/papers/a8.ppt