Unicode encoded as two bytes per character. The obvious way to do this is to put the bottom 16 bits into the two bytes (high byte first so sorting order is preserved), and this is called UCS-2. When people realized (due to Chinese, mostly) that more than 65,536 characters were needed, they came up with this bastard encoding rather than using UTF-8, which is a sensible encoding. Microsoft uses this encoding in their stuff, sigh.

UTF-16 can encode Unicode code points up to 0x10ffff. All code points up to 0xffff, except those in the range 0xd800-0xdfff, are encoded directly as a single 16-bit value, high byte first, low byte second.
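As an illustration, here is a minimal C sketch of writing one of these code points out high byte first; the function name and error check are just for this example:

    #include <stdio.h>

    /* Write code point c (must be <= 0xffff and outside 0xd800-0xdfff)
       into out[0..1], high byte first.  Returns 0 on bad input. */
    static int utf16be_encode_bmp(unsigned int c, unsigned char out[2]) {
        if (c > 0xffff || (c >= 0xd800 && c <= 0xdfff)) return 0;
        out[0] = (unsigned char)(c >> 8);   /* high byte first... */
        out[1] = (unsigned char)(c & 0xff); /* ...then low byte */
        return 1;
    }

    int main(void) {
        unsigned char buf[2];
        utf16be_encode_bmp(0x4e2d, buf);       /* a CJK character */
        printf("%02x %02x\n", buf[0], buf[1]); /* prints "4e 2d" */
        return 0;
    }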

The "characters" 0xd800-0xdfff are called "surrogate characters" and must appear in pairs. These are combined in a complex way to produce the characters in the range 0x10000 through 0x10ffff. They also defeat the only plausible advantage of UTF-16, which is that the characters are the same size!

Don't use this, it is just proof that the standards people have their heads up their asses. Use UTF-8 instead.