Encoding
Unicode in the obvious way as two bytes per character. The standard puts the high byte first (so that sorting strings byte by byte matches sorting them by character code), but due to the prevalence of
little-endian Intel processors and
lazy programmers in Seattle, this data is often stored low byte first.
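To make the byte-order point concrete, here is a minimal Python sketch. The helper name `encode_two_byte` is made up for illustration; it simply packs each character as one 16-bit unit and shows why only the high-byte-first layout sorts the same way as the character codes.

```python
import struct

def encode_two_byte(text, big_endian=True):
    """Pack each character as one 16-bit unit (hypothetical helper).

    struct.pack raises struct.error for code points above U+FFFF,
    which do not fit in two bytes.
    """
    fmt = ">H" if big_endian else "<H"
    return b"".join(struct.pack(fmt, ord(ch)) for ch in text)

high_first = encode_two_byte("A")          # b"\x00A"  (00 41)
low_first = encode_two_byte("A", False)    # b"A\x00"  (41 00)

# Sorting by raw bytes matches character-code order only when the
# high byte comes first.
words = ["\u0100", "\u00ff"]
by_high_first = sorted(words, key=encode_two_byte)
by_low_first = sorted(words, key=lambda s: encode_two_byte(s, False))
# by_high_first == ["\u00ff", "\u0100"]   (character-code order)
# by_low_first  == ["\u0100", "\u00ff"]   (order is broken)
```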
In order to allow automatic detection of the byte order, it has become customary on some platforms (notably Win32) to start every
Unicode file with the character U+FEFF (ZERO WIDTH NO-BREAK SPACE), also known as the Byte-Order Mark (BOM). Its
byte-swapped equivalent U+FFFE is not a valid Unicode character, so seeing it at the start of a file unambiguously distinguishes the big-endian and
little-endian variants of UTF-16 and UTF-32.
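A reader of such a file can check the first two bytes to decide the byte order. The following is a sketch only: `detect_byte_order` is a hypothetical name, and it assumes the input is two-byte data, so it does not try to tell a UTF-32 BOM (which also begins with FF FE in its little-endian form) apart from a UTF-16 one.

```python
import codecs

def detect_byte_order(data):
    """Return (byte_order, payload) for two-byte-per-character data.

    byte_order is "big", "little", or None when no BOM is present;
    payload is the data with the BOM (if any) stripped.
    """
    if data.startswith(codecs.BOM_UTF16_BE):    # bytes FE FF
        return "big", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE
        return "little", data[2:]
    return None, data

order, payload = detect_byte_order(b"\xff\xfeA\x00")
# order == "little", payload == b"A\x00"
```

When no BOM is found the caller still has to guess or fall back on a convention; the sketch deliberately returns None rather than picking a default.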
This is not exactly the same as UTF-16, but it is pretty close. UTF-16 contains bogus enhancements (surrogate pairs, which spend two 16-bit units on one character) to make it encode more than 65536 possible characters.
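The difference only shows up for characters above U+FFFF. As a rough illustration, the sketch below computes the two 16-bit units UTF-16 uses for such a character; `to_surrogate_pair` is a made-up helper name, and the arithmetic follows the usual UTF-16 definition.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point fits in one 16-bit unit or is out of range")
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)
# high == 0xD83D, low == 0xDE00

# Python's own UTF-16 encoder produces the same two units.
encoded = "\U0001F600".encode("utf-16-be")
# encoded == b"\xd8\x3d\xde\x00"
```

A plain two-byte encoding has no way to represent this character at all, which is the whole of the gap between it and UTF-16.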
I strongly recommend the use of UTF-8 for all text processing.