A Byte Order Mark (BOM) is a signature at the beginning of a Unicode data stream that may be used by a higher-level protocol. The signature can indicate whether a data stream is Unicode encoded and, if so, which Unicode Transformation Format (UTF) and byte order are used.
The BOM is the character U+FEFF ZERO WIDTH NO-BREAK SPACE (ZWNBSP), which is represented by a different byte sequence in each UTF:
Byte Sequence    Encoding Form
00 00 FE FF      UTF-32, big-endian
FF FE 00 00      UTF-32, little-endian
FE FF            UTF-16, big-endian
FF FE            UTF-16, little-endian
EF BB BF         UTF-8
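The table above can be turned into a simple signature check. The following is a minimal sketch in Python, not a definitive implementation: `detect_bom` and `BOM_TABLE` are hypothetical names chosen for illustration, while the `codecs.BOM_*` constants come from the Python standard library.

```python
import codecs

# Signatures from the table above. The UTF-32 little-endian signature
# (FF FE 00 00) begins with the UTF-16 little-endian one (FF FE), so the
# UTF-32 entries must be tested before the UTF-16 entries.
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "UTF-32, big-endian"),
    (codecs.BOM_UTF32_LE, "UTF-32, little-endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16, big-endian"),
    (codecs.BOM_UTF16_LE, "UTF-16, little-endian"),
]


def detect_bom(data: bytes):
    """Return (encoding form, BOM length in bytes), or (None, 0) if no BOM."""
    for signature, name in BOM_TABLE:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0
```

Note that the check is a heuristic: the bytes FF FE 00 00 could also be a UTF-16 little-endian BOM followed by U+0000, and this sketch assumes the UTF-32 reading is the intended one.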
If an application does not expect a BOM, the signature may be misinterpreted in various ways. Below are some examples:
Value     Description
U+BBEF    a Hangul character (믯)
U+EFBB    a private-use character
U+FEFF    zero width no-break space (ZWNBSP)
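The values listed above can be reproduced with a short experiment. This sketch assumes they arise from a UTF-8 signature (EF BB BF) being read by a decoder that does not look for it; the variable names are illustrative only.

```python
utf8_bom = b"\xef\xbb\xbf"

# The first two signature bytes read as a single 16-bit unit in each byte order.
as_utf16_le = utf8_bom[:2].decode("utf-16-le")
as_utf16_be = utf8_bom[:2].decode("utf-16-be")

# Decoded as plain UTF-8, the signature stays in the text as U+FEFF.
as_utf8 = (utf8_bom + b"abc").decode("utf-8")

print(f"U+{ord(as_utf16_le):04X}")  # U+BBEF
print(f"U+{ord(as_utf16_be):04X}")  # U+EFBB
print(f"U+{ord(as_utf8[0]):04X}")   # U+FEFF
```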
To encode a ZWNBSP as the first character in a data stream that also uses a BOM, simply start the stream with U+FEFF U+FEFF: the first is consumed as the signature and the second remains as content.
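As an illustration of the doubled U+FEFF, the sketch below builds such a stream by hand in big-endian UTF-16 and decodes it with Python's `utf-16` codec, which consumes one leading BOM to select the byte order. The sample text is an assumption made for the example.

```python
import codecs

# Content that really starts with a ZWNBSP.
text = "\ufeff" + "abc"

# Prepend the signature by hand, then append the encoded content.
stream = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# The stream now begins with FE FF FE FF: signature, then ZWNBSP.
# Decoding with "utf-16" removes only the first of the two.
decoded = stream.decode("utf-16")
```

After decoding, `decoded` still begins with U+FEFF, so the leading ZWNBSP survives the round trip.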
While most (if not all) modern Microsoft applications use a BOM, not all software does. For example, the Java 1.4 SE API treats ZWNBSP like an ordinary character. Its Reader classes do not attempt to determine an InputStream's encoding by looking for a BOM signature.
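When a reader treats the signature as an ordinary character, a common workaround is to drop a leading U+FEFF after decoding. The sketch below shows the idea in Python rather than Java; `read_text_skipping_bom` is a hypothetical helper, and `io.TextIOWrapper` with the plain `utf-8` codec stands in for a reader that does not look for a BOM.

```python
import io


def read_text_skipping_bom(raw, encoding="utf-8"):
    """Decode a binary stream and drop one leading U+FEFF, if present."""
    # The plain "utf-8" codec leaves the signature in the text as U+FEFF.
    reader = io.TextIOWrapper(raw, encoding=encoding)
    text = reader.read()
    # Remove at most one signature character, so that a stream starting
    # with U+FEFF U+FEFF keeps its genuine leading ZWNBSP.
    if text.startswith("\ufeff"):
        text = text[1:]
    return text
```

For example, `read_text_skipping_bom(io.BytesIO(b"\xef\xbb\xbfhello"))` returns `"hello"`. The trade-off is that a stream whose content legitimately begins with a single ZWNBSP and carries no BOM would lose that character.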