Using Unicode on E2

by gn0sis

Sun Apr 07 2002 at 10:13:56

Rationale

Since Everything2 pages do not contain explicit encoding tags (and the user cannot specify them), the default character set on Everything2 is ISO 8859-1 (aka Latin-1). This is great for English and also sufficient for most Western European languages, since their accented characters (é, ô, ñ, ä...) will show up just fine, but anything outside Latin-1's 256 code points will run into problems. There is exactly one acceptable solution: using Unicode as HTML character entities.

As you may know, Unicode is a character set that aims to cover every single script on the planet (and beyond). Characters on the main plane of Unicode (the Basic Multilingual Plane, U+0000 to U+FFFF), which almost certainly includes everything you will ever need, can be accessed in HTML with the escape sequence &#xcode;, where code is the character's hexadecimal code point. There are several advantages to this approach:

No character set switching. Characters encoded this way are instantly visible, without the user tweaking his encodings, fonts, etc. This is by far the most important single reason to use Unicode on E2.
Multiple languages in one page. Unicode characters are distinct and unique, so they can be mixed and matched freely. There is no other way to use both, say, Hebrew and Arabic in the same writeup.
Guaranteed E2 support. Character entities are interpreted and stored as ordinary text, so they will never be mangled by EDB.
Graceful failover. If the user's browser does not support Unicode (or the subset in question), the user will see question marks or little squares, instead of random 8-bit garbage, which may include control codes that wreak havoc on formatting. (Unfortunately, some very old and/or broken browsers may refuse to recognize the existence of two-byte entities and print the entity string in full, which will look horrible.)

There are, of course, a few downsides:

Inefficiency. Each coded character entity takes up seven or eight bytes (&#x6771; is eight), whereas a national character set encoding may squeeze down to one or two. For small quantities of text, this is not really an issue.
Difficulty of entering. Only a few programs can generate HTML encoded characters automatically -- but some tips on fixing this in the next section.
Lack of support. Older browsers typically do not support extended character entities at all, or require painful manual configuration (esp. fonts) for them. Both Mozilla and later versions of Internet Explorer support them quite well though right out of the box, and this problem will gradually solve itself. (Also bear in mind that most older systems that do not support Unicode without tinkering will also not support any other encoding without tinkering.)
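
To put rough numbers on the inefficiency point, here is a minimal sketch that counts the bytes for one short word written as hex entities and as UTF-8. The variable names are purely illustrative, and it assumes a JavaScript runtime that provides TextEncoder (current browsers and Node.js do).

```javascript
// Sketch: byte cost of one two-character word written two ways.
const word = '東京';                       // U+6771 U+4EAC
const asEntities = '&#x6771;&#x4EAC;';     // the same word as hex entities

// Entities are plain ASCII, so one character is one byte.
const entityBytes = asEntities.length;
// TextEncoder always produces UTF-8.
const utf8Bytes = new TextEncoder().encode(word).length;

console.log(entityBytes, utf8Bytes); // 16 6
```

For a name or two in parentheses the difference is negligible; it only starts to matter for whole writeups in a non-Latin script.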

When to Use Unicode

Unicode character entities are at their best when you have to refer to small bits of other languages in writeups written mostly in English. For example, a writeup on Chinese astrology may want to mention the original characters (天干) for what are in English dubbed the Heavenly Stems. Speakers of Hebrew may want to trace how בית לחם became Bethlehem, while those of Arabic may wonder how غزة became Gaza. A writeup on Budapest's metro system can't spell Kőbánya-Kispest properly without using a character entity for ő. Students of Japanese can find out what Tokyo (東京) really means. And the list goes on! I recommend putting the Unicode in parentheses after the transliteration or translation, so people who do not speak the language or whose browsers do not support Unicode will still have some idea of what you are talking about.

When Not to Use Unicode

Material written entirely in non-Latin1 languages, on the other hand, is probably best written with some other encoding; Unicode's own UTF-8 might not be a bad choice. As an experiment, I did node the Three Gates of Tosotsu (a Zen text dating back to 600 AD or so) in the original using character entities, but I got a few complaints about screwy formatting -- Chinese doesn't use spaces between words, so even a short line written as an unbroken string of entities will stretch into hundreds of characters on systems that do not fully support Unicode.

Using Unicode characters in node titles is also a bit of an iffy business, since they're usually pretty tough to enter and also because EDB doesn't realize that &#xhex;, &#x0hex;, &#dec; and &#0dec; are all the same character. Then again, for "non-transcribable" languages like Hebrew and Arabic, entering the words in Unicode is pretty much the only way to get a unique and identifiable name. But until the search code gets tweaked for better support for non-Latin1 characters, I would have to recommend keeping Unicode out of titles.
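
To make the "all the same character" problem concrete, here is a minimal sketch of the kind of normalization that would make those four spellings compare equal. It is not how EDB or the E2 search code works; the function name canonicalizeEntities and the choice of zero-padded uppercase hex as the canonical form are assumptions made for illustration.

```javascript
// Sketch: rewrite every numeric character reference to one canonical form
// so that &#x151;, &#x0151;, &#337; and &#0337; all compare equal.
// `canonicalizeEntities` is a hypothetical helper, not part of E2.
function canonicalizeEntities(text) {
  return text.replace(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g, (match, hex, dec) => {
    // Exactly one of the two capture groups is defined for each match.
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    if (codePoint > 0x10FFFF) return match; // outside Unicode's range: leave untouched
    return '&#x' + codePoint.toString(16).toUpperCase().padStart(4, '0') + ';';
  });
}

const variants = ['&#x151;', '&#x0151;', '&#337;', '&#0337;'];
console.log(variants.map(canonicalizeEntities)); // four copies of '&#x0151;'
```

Until something like this happens on the server side, two titles that look identical in the browser can still be different strings as far as the database is concerned.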

Notes on Composed, Right-To-Left and Other Odd Scripts

Some scripts, like Devanagari and Hangul, compose words from individual letters. Some scripts, like Hebrew, are written from right to left. A few scripts, like Arabic, do both. Fortunately, Unicode hides all the hellishly complex details of implementation, so غزة (Gaza) is written in Unicode as ghain-zain-teh marbuta, &#x063A;&#x0632;&#x0629;, and your browser's rendering engine will automatically reverse the order and join them as script so that ghain is initial, zain final and teh isolated.

As these computations are left to the user's display engine, it is possible that the browser does not know the proper rendering method and that there are bugs in the rendering code -- for example, Mozilla (at time of writing) still has some difficulties with bidirectional scripts. There is nothing you can do about this, but again, browsers that dig Unicode will usually get these right and the issue is irrelevant for systems that don't support Unicode at all.
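
The point that the stored order is the reading order, not the display order, can be checked with a few lines of JavaScript. This is only an illustrative sketch: it lists the code points of the Gaza example in the order they sit in the string.

```javascript
// The word is stored in logical (reading) order, first letter first,
// even though it is displayed right to left.
const gaza = '\u063A\u0632\u0629'; // ghain, zain, teh marbuta

// Array.from walks the string one code point at a time.
const codePoints = Array.from(gaza, ch =>
  'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));

console.log(codePoints.join(' ')); // U+063A U+0632 U+0629
```

So when typing entities by hand for a right-to-left script, enter them in the order the word is read and let the browser do the reversing and joining.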

Manual Entry

Unicode character entities can be written by hand by looking up the code in a character table and entering them as &#xcode;. Tables of codes can be found at www.unicode.org, the authoritative source, and www.hclrss.demon.co.uk/unicode, which gives the characters packaged more conveniently as HTML tables.

This method is, however, intensely painful for anything more complex than a single name. Also, while OK for alphabetic or syllabic scripts, converting Japanese kanji or Chinese hanzi (漢字) by browsing through 5000 characters is not fun.
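
Since hand-typed codes are easy to get wrong, it can help to decode them back into characters before submitting. The following is a minimal sketch under stated assumptions: decodeNumericEntities is a made-up helper name, it handles only numeric entities (not named ones like &eacute;), and it needs a JavaScript engine with String.fromCodePoint.

```javascript
// Sketch: turn numeric entities back into characters to proofread them.
// `decodeNumericEntities` is a hypothetical helper name.
function decodeNumericEntities(text) {
  return text.replace(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g, (match, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave out-of-range references alone rather than throwing.
    return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : match;
  });
}

console.log(decodeNumericEntities('Tokyo (&#x6771;&#x4EAC;)')); // Tokyo (東京)
```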

Automated Conversion

Some tools can generate character entities on the fly, most notably perhaps Microsoft Word, which converts any script into entities if you Save As... HTML. Alas, this is accompanied by lots of other HTML mangling, so for E2 you'll have to pick out the entities by hand from the generated junk and paste them back into the original. This is OK for one-off operations, but soon becomes painful.

A better option is Java, which includes a remarkable set of tools that can convert almost any encoding into Unicode and back. Once the text is Unicode, it's a simple matter to extract the hex code and pad it, and that's what my little utility J2U does. You'll need a working Java environment to run J2U; writing an applet interface to the tool is on my TODO list.
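
J2U itself is not reproduced here. As a rough illustration of the last step described above (take text that is already Unicode, extract each hex code and pad it), here is a minimal JavaScript sketch. The function name toHexEntities, the four-digit padding and the choice to leave ASCII untouched are assumptions, not a description of how J2U is actually written.

```javascript
// Sketch: convert a Unicode string to zero-padded hex entities.
// `toHexEntities` is a hypothetical name; this is not the J2U source.
function toHexEntities(text) {
  let out = '';
  for (const ch of text) {                 // for...of walks whole code points
    const codePoint = ch.codePointAt(0);
    out += codePoint < 128
      ? ch                                  // ASCII passes through unchanged
      : '&#x' + codePoint.toString(16).toUpperCase().padStart(4, '0') + ';';
  }
  return out;
}

console.log(toHexEntities('Heavenly Stems (天干)'));
```

Converting from a national encoding into Unicode in the first place is the part Java's libraries do for J2U; this sketch assumes that step has already happened.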

For Japanese, you can cut and paste strings in any encoding into XJDIC or WWWJDIC (at http://www.csse.monash.edu.au/~jwb/wwwjdic.html), after which performing an "Examine Kanji" on the word gives the Unicode as Uxxxx. unicode.org's Unihan database search provides similar facilities for all languages that use 漢字.

A few more tools and tips sent in by kind noders:

GNU Recode, for converting anything to anything else
Mozilla's Composer, for realtime conversion of native IME input into HTML entities

Cheers to Gorgonzola, lj, Oolong, tres equis and WWWWolf for corrections and additions.

Page category:

« Everything2 on Everything2 »

10 C!s

(thing)

by tongpoo

Sun Apr 21 2002 at 3:54:21

I wrote a little JavaScript to convert Unicode text into character entities. To set up the cross-browser script, create a bookmark or an internet shortcut button with the JavaScript pasted into the "location" or the "address" field of its properties. To use the script, select a string in your browser window that you want to convert, and press the button. If no text is selected when the button is clicked, it will prompt you for input. Press cancel to exit the script. The following script is for copying and pasting:

javascript:p=(document.all)?document.selection.createRange().text:((window.getSelection)?window:document).getSelection().toString();if(!p)void(p=prompt('Text...',''));while(p){q='';for(i=0;i<p.length;i++){j=p.charCodeAt(i);q+=(j==38)?'&amp;':(j<128)?p.charAt(i):'&#'+j+';';}void(p=prompt(p,q));}

The pretty version looks like this:

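javascript:
/* Grab the selected text: old IE exposes document.selection, other browsers getSelection(). */
p=(document.all)? document.selection.createRange().text:
  ((window.getSelection)? window:document).getSelection().toString();
/* Nothing selected: ask for input instead. */
if(!p)
  void(p=prompt('Text...',''));
/* Keep going until the user cancels or submits an empty string. */
while(p){
  q='';
  for(i=0; i<p.length; i++) {
    j=p.charCodeAt(i);
    /* Escape ampersands, pass other ASCII through, turn the rest into decimal entities. */
    q+=(j==38)?'&amp;':(j<128)?p.charAt(i):'&#'+j+';';
  }
  /* Show the result for copying; whatever is typed back gets converted next. */
  void(p=prompt(p,q));
}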

Update 4/17/2003: Better cross-browser support. ASCII characters are now handled differently.
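
One limitation worth knowing: charCodeAt returns UTF-16 code units, so a character outside the main plane (above U+FFFF) comes out as two surrogate entities such as &#55357;&#56832; instead of one. Below is a minimal sketch of the same conversion loop working on whole code points. It assumes an ES2015 or later JavaScript engine, which did not exist when the script was written, and the function name toDecimalEntities is made up for illustration.

```javascript
// Sketch: the bookmarklet's conversion rule, applied per code point
// instead of per UTF-16 code unit. `toDecimalEntities` is a hypothetical name.
function toDecimalEntities(text) {
  let out = '';
  for (const ch of text) {              // for...of keeps surrogate pairs together
    const codePoint = ch.codePointAt(0);
    if (codePoint === 38) out += '&amp;';        // escape the ampersand itself
    else if (codePoint < 128) out += ch;         // other ASCII passes through
    else out += '&#' + codePoint + ';';          // everything else: decimal entity
  }
  return out;
}

console.log(toDecimalEntities('R&D in 東京')); // R&amp;D in &#26481;&#20140;
```

For text that stays within U+0000 to U+FFFF, which covers everything in this node, the two approaches give identical output.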

(idea)

by Gritchka

Tue May 21 2002 at 15:11:26

Here is a quick guide to Unicode characters used with some non-Western-European languages. It is organized by language.

For Western languages, see HTML symbol reference. They have HTML entity codes beginning with ampersand and ending with semicolon, around a name, for example &eacute; for é. Most of these should also be creatable on your keyboard using a combination with Alt, Ctrl, or Option keys: see Special Alt key characters & accents. The Western European character set covers English, French, Spanish, Italian, Portuguese, German, Danish, Swedish, Norwegian, Finnish, and in theory Icelandic though in practice the letters thorn and edh often come out wrong. Blame your browser. Greek letters can also be represented by HTML entities such as &alpha; for α.

For brevity I am not repeating those letters that are found in the Western set, with acute, grave, circumflex, umlaut, and so on. See Accent marks used with the Latin alphabet for a list of Western and Eastern accented letters arranged by accent.

In general, do not use accented letters in node titles or in hard links. Even if you think they're better that way. They're not. What's better is if other noders can find them. The E2 Search facility is limited in what it can find: it cannot find ü if you search for u, nor vice versa. Acutes and graves are okay, but umlauts won't work. It is better to leave other accents off. E2 is written in English, not Hungarian, and in English we usually leave all accents off. Please do not put in title edit requests asking for them to be added. If you want the accents to appear in your text, pipelink them, e.g. [Lowenbrau|Löwenbräu]. See E2 FAQ: Using Special HTML Characters for more detail on this.
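
If you have many accented names to link, the pipelink pattern can be generated mechanically. This is a minimal sketch, assuming a JavaScript engine with String.prototype.normalize; the function name pipelink is hypothetical. It only strips accents that Unicode treats as separate combining marks, so letters such as ł, đ or ø, which have no such decomposition, are left as they are and still need fixing by hand.

```javascript
// Sketch: build an E2 pipelink whose target is the accent-free title
// and whose visible text keeps the accents. `pipelink` is a hypothetical name.
function pipelink(accented) {
  const plain = accented
    .normalize('NFD')                   // split letters from their combining accents
    .replace(/[\u0300-\u036f]/g, '');   // drop the combining marks
  return '[' + plain + '|' + accented + ']';
}

console.log(pipelink('Löwenbräu')); // [Lowenbrau|Löwenbräu]
```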

Never use HTML entities or Unicode in names in node titles. Don't be pedantic about names. Pedantry is bad. Usefulness is good.

In the following tables capital letters come before lowercase. If you can't see them properly, this won't be of use to you. That's a limitation of your browser. A lot of browsers won't be able to show them, and they'll just appear as rectangles or question marks. And I use proper human numbers, not hexadecimal, which means there's no "x" in the code, just &#nnn;.

Large scripts like Chinese and Devanagari are beyond the scope of this write-up, as are extras like the vowel pointing of Hebrew and Arabic. Go to www.unicode.org/charts for all the rest, like Mongolian, Tamil, Ogham -- the lot.

Albanian

No non-Western letters. Has Ç and Ë.

Arabic

&#1575;  ا    alif
&#1576;  ب    ba
&#1577;  ة    ta marbuta
&#1578;  ت    ta
&#1579;  ث    tha
&#1580;  ج    jim
&#1581;  ح    ha emphatic
&#1582;  خ    kha
&#1583;  د    dal
&#1584;  ذ    dhal
&#1585;  ر    ra
&#1586;  ز    za
&#1587;  س    sin
&#1588;  ش    shin
&#1589;  ص    sad
&#1590;  ض    dad
&#1591;  ط    ta emphatic
&#1592;  ظ    za emphatic
&#1593;  ع    ain
&#1594;  غ    ghain
a gap in numbers
&#1601;  ف    fa
&#1602;  ق    qaf
&#1603;  ك    kaf
&#1604;  ل    lam
&#1605;  م    mim
&#1606;  ن    nun
&#1607;  ه    ha
&#1608;  و    waw
&#1609;  ى    ya undotted
&#1610;  ي    ya dotted

Letters with hamza:
&#1569;  ء    no bearer
&#1571;  أ    alif hamza above
&#1572;  ؤ    waw hamza
&#1573;  إ    alif hamza below
&#1574;  ئ    ya hamza

Other diacritics:
&#1570;  آ    alif maddah
&#1611;  ً    fathah with nunation
&#1612;  ٌ    dammah with nunation
&#1613;  ٍ    kasrah with nunation
&#1614;  َ    fathah     
&#1615;  ُ    dammah     
&#1616;  ِ    kasrah     
&#1617;  ّ    shaddah  
&#1618;  ْ    sukun

Numerals:
&#1632;  ٠    0
&#1633;  ١    1
&#1634;  ٢    2
&#1635;  ٣    3
&#1636;  ٤    4
&#1637;  ٥    5
&#1638;  ٦    6
&#1639;  ٧    7
&#1640;  ٨    8
&#1641;  ٩    9

Arabic transliteration

&#256;  Ā   &#257;  ā   A-macron
&#7692; Ḍ   &#7693; ḍ   D-dot-below
&#7716; Ḥ   &#7717; ḥ   H-dot-below
&#298;  Ī   &#299;  ī   I-macron
&#7778; Ṣ   &#7779; ṣ   S-dot-below
&#7788; Ṭ   &#7789; ṭ   T-dot-below
&#362;  Ū   &#363;  ū   U-macron

Azeri

&#399;  Ə   &#601;  ə   schwa
&#286;  Ğ   &#287;  ğ   G-breve (yumuşak-G)
&#304;  İ               I dotted capital
            &#305;  ı   I undotted lowercase
&#350;  Ş   &#351;  ş   S-cedilla

Also uses Ç, Ö, Ü. Formerly used Ä for Ə and this is still used when symbol Ə is unavailable.

Belarusian

Belarusian uses (part of) the Cyrillic alphabet (see under Russian below) with the following additional letters:

&#1168;  Ґ   &#1169;  ґ   G-hook
&#1030;  І   &#1110;  і   I
&#1038;  Ў   &#1118;  ў   U-breve

Bulgarian

Bulgarian uses (part of) the Cyrillic alphabet (see under Russian below) but with no additional letters.

Catalan

&#319;  Ŀ   &#320;  ŀ   L-mid-dot

Chechen

Has a new Roman alphabet which however has numerous letters not yet representable in Unicode.

Croatian

&#262;  Ć   &#263;  ć   C-acute
&#268;  Č   &#269;  č   C-hacek
&#272;  Đ   &#273;  đ   D-bar
&#352;  Š   &#353;  š   S-hacek
&#381;  Ž   &#382;  ž   Z-hacek

Czech

&#268;  Č   &#269;  č   C-hacek
&#270;  Ď   &#271;  ď   D-hook
&#282;  Ě   &#283;  ě   E-hacek
&#327;  Ň   &#328;  ň   N-hacek
&#344;  Ř   &#345;  ř   R-hacek
&#352;  Š   &#353;  š   S-hacek
&#356;  Ť   &#357;  ť   T-hook
&#366;  Ů   &#367;  ů   U-circle
&#381;  Ž   &#382;  ž   Z-hacek

Also uses Á, É, Í, Ó, Ú, Ý.

Esperanto

&#264;  Ĉ   &#265;  ĉ   C-circumflex
&#284;  Ĝ   &#285;  ĝ   G-circumflex
&#292;  Ĥ   &#293;  ĥ   H-circumflex
&#308;  Ĵ   &#309;  ĵ   J-circumflex
&#348;  Ŝ   &#349;  ŝ   S-circumflex
&#364;  Ŭ   &#365;  ŭ   U-breve

Estonian

No non-Western letters. Has Õ, Ö, Ü.

Hawaiian

&#699;  ʻ               'okina
&#256;  Ā   &#257;  ā   A-macron
&#274;  Ē   &#275;  ē   E-macron
&#298;  Ī   &#299;  ī   I-macron
&#332;  Ō   &#333;  ō   O-macron
&#362;  Ū   &#363;  ū   U-macron

Hebrew

(These letter names are Biblical Hebrew because I know more about that.)

&#1488;  א    aleph
&#1489;  ב    beth
&#1490;  ג    gimel
&#1491;  ד    daleth
&#1492;  ה    he
&#1493;  ו    waw
&#1494;  ז    zayin
&#1495;  ח    heth
&#1496;  ט    teth
&#1497;  י    yod
&#1498;  ך    kaph final
&#1499;  כ    kaph
&#1500;  ל    lamedh
&#1501;  ם    mem final
&#1502;  מ    mem
&#1503;  ן    nun final
&#1504;  נ    nun
&#1505;  ס    samekh
&#1506;  ע    ayin
&#1507;  ף    pe final
&#1508;  פ    pe
&#1509;  ץ    sadhe final
&#1510;  צ    sadhe
&#1511;  ק    qoph
&#1512;  ר    resh
&#1513;  ש    shin/sin
&#1514;  ת    taw

Hungarian

&#336;  Ő   &#337;  ő   O-double-acute
&#368;  Ű   &#369;  ű   U-double-acute

Also has Ö, Ü, and Á, É, Í, Ó, Ú.

Japanese

See the nodes hiragana and katakana.

Japanese transliteration

&#256;  Ā   &#257;  ā   A-macron
&#274;  Ē   &#275;  ē   E-macron
&#332;  Ō   &#333;  ō   O-macron
&#362;  Ū   &#363;  ū   U-macron

Korean transliteration

In one common romanization (no longer officially used) of Hangul these two are used:

&#334;  Ŏ   &#335;  ŏ   O-breve
&#364;  Ŭ   &#365;  ŭ   U-breve

Latin

&#256;  Ā   &#257;  ā   A-macron
&#258;  Ă   &#259;  ă   A-breve
&#274;  Ē   &#275;  ē   E-macron
&#276;  Ĕ   &#277;  ĕ   E-breve
&#298;  Ī   &#299;  ī   I-macron
&#300;  Ĭ   &#301;  ĭ   I-breve
&#332;  Ō   &#333;  ō   O-macron
&#334;  Ŏ   &#335;  ŏ   O-breve
&#362;  Ū   &#363;  ū   U-macron
&#364;  Ŭ   &#365;  ŭ   U-breve

Latvian

&#256;  Ā   &#257;  ā   A-macron
&#268;  Č   &#269;  č   C-hacek
&#274;  Ē   &#275;  ē   E-macron
&#290;  Ģ   &#291;  ģ   G-cedilla
&#298;  Ī   &#299;  ī   I-macron
&#310;  Ķ   &#311;  ķ   K-cedilla
&#315;  Ļ   &#316;  ļ   L-cedilla
&#325;  Ņ   &#326;  ņ   N-cedilla
&#332;  Ō   &#333;  ō   O-macron
&#342;  Ŗ   &#343;  ŗ   R-cedilla
&#352;  Š   &#353;  š   S-hacek
&#362;  Ū   &#363;  ū   U-macron
&#381;  Ž   &#382;  ž   Z-hacek

Lithuanian

&#260;  Ą   &#261;  ą   A-ogonek
&#268;  Č   &#269;  č   C-hacek
&#280;  Ę   &#281;  ę   E-ogonek
&#278;  Ė   &#279;  ė   E-dot-above
&#302;  Į   &#303;  į   I-ogonek
&#352;  Š   &#353;  š   S-hacek
&#362;  Ū   &#363;  ū   U-macron
&#370;  Ų   &#371;  ų   U-ogonek
&#381;  Ž   &#382;  ž   Z-hacek

Macedonian

Macedonian uses (part of) the Cyrillic alphabet (see under Russian below) with the following additional letters:

&#1027;  Ѓ   &#1107;  ѓ   GJ (G-acute)
&#1029;  Ѕ   &#1109;  ѕ   DZ
&#1032;  Ј   &#1112;  ј   J
&#1033;  Љ   &#1113;  љ   LJ
&#1034;  Њ   &#1114;  њ   NJ
&#1036;  Ќ   &#1116;  ќ   KJ (K-acute)
&#1039;  Џ   &#1119;  џ   DZ-hacek

Maltese

&#266;  Ċ   &#267;  ċ   C-dot-above
&#288;  Ġ   &#289;  ġ   G-dot-above
&#294;  Ħ   &#295;  ħ   H-bar
&#379;  Ż   &#380;  ż   Z-dot-above

Māori

&#256;  Ā   &#257;  ā   A-macron
&#274;  Ē   &#275;  ē   E-macron
&#298;  Ī   &#299;  ī   I-macron
&#332;  Ō   &#333;  ō   O-macron
&#362;  Ū   &#363;  ū   U-macron

Persian

The following are additions to the Arabic alphabet used in Persian.

&#1662;  پ    p
&#1670;  چ    ch
&#1688;  ژ    zh
&#1711;  گ    g

Polish

&#260;  Ą   &#261;  ą   A-ogonek
&#262;  Ć   &#263;  ć   C-acute
&#280;  Ę   &#281;  ę   E-ogonek
&#321;  Ł   &#322;  ł   L-slash
&#323;  Ń   &#324;  ń   N-acute
&#346;  Ś   &#347;  ś   S-acute
&#377;  Ź   &#378;  ź   Z-acute
&#379;  Ż   &#380;  ż   Z-dot-above

Also has Ó.

Romanian

&#258;  Ă   &#259;  ă   A-breve
&#350;  Ş   &#351;  ş   S-cedilla
&#354;  Ţ   &#355;  ţ   T-cedilla

The Romanians actually prefer underposed commas instead of cedillas, and there are symbols defined for these too, but they are less likely to show up:

&#536;  Ș   &#537;  ș   S-comma
&#538;  Ț   &#539;  ț   T-comma

Also has Â, Î.

Russian

&#1040;  А  &#1072;  а      a
&#1041;  Б  &#1073;  б      b
&#1042;  В  &#1074;  в      v
&#1043;  Г  &#1075;  г      g
&#1044;  Д  &#1076;  д      d
&#1045;  Е  &#1077;  е      ye
&#1025;  Ё  &#1105;  ё      yo (N.B. out of order!)
&#1046;  Ж  &#1078;  ж      zh
&#1047;  З  &#1079;  з      z
&#1048;  И  &#1080;  и      i
&#1049;  Й  &#1081;  й      y
&#1050;  К  &#1082;  к      k
&#1051;  Л  &#1083;  л      l
&#1052;  М  &#1084;  м      m
&#1053;  Н  &#1085;  н      n
&#1054;  О  &#1086;  о      o
&#1055;  П  &#1087;  п      p
&#1056;  Р  &#1088;  р      r
&#1057;  С  &#1089;  с      s
&#1058;  Т  &#1090;  т      t
&#1059;  У  &#1091;  у      u
&#1060;  Ф  &#1092;  ф      f
&#1061;  Х  &#1093;  х      kh
&#1062;  Ц  &#1094;  ц      ts
&#1063;  Ч  &#1095;  ч      ch
&#1064;  Ш  &#1096;  ш      sh
&#1065;  Щ  &#1097;  щ      shch
&#1066;  Ъ  &#1098;  ъ      hard sign
&#1067;  Ы  &#1099;  ы      y
&#1068;  Ь  &#1100;  ь      soft sign
&#1069;  Э  &#1101;  э      e
&#1070;  Ю  &#1102;  ю      yu
&#1071;  Я  &#1103;  я      ya

Sanskrit transliteration

&#256;  Ā   &#257;  ā   A-macron
&#7692; Ḍ   &#7693; ḍ   D-dot-below
&#7716; Ḥ   &#7717; ḥ   H-dot-below
&#298;  Ī   &#299;  ī   I-macron
&#7734; Ḷ   &#7735; ḷ   L-dot-below
&#7746; Ṃ   &#7747; ṃ   M-dot-below
&#7748; Ṅ   &#7749; ṅ   N-dot-above
&#7750; Ṇ   &#7751; ṇ   N-dot-below
&#7770; Ṛ   &#7771; ṛ   R-dot-below
&#7772; Ṝ   &#7773; ṝ   R-dot-and-macron
&#346;  Ś   &#347;  ś   S-acute
&#7778; Ṣ   &#7779; ṣ   S-dot-below
&#7788; Ṭ   &#7789; ṭ   T-dot-below
&#362;  Ū   &#363;  ū   U-macron

Also uses Ñ.

Serbian

Serbian uses (part of) the Cyrillic alphabet (see under Russian above) with the following additional letters:

&#1026;  Ђ   &#1106;  ђ   D-bar
&#1032;  Ј   &#1112;  ј   J
&#1033;  Љ   &#1113;  љ   LJ
&#1034;  Њ   &#1114;  њ   NJ
&#1035;  Ћ   &#1115;  ћ   C-acute
&#1039;  Џ   &#1119;  џ   DZ-hacek

Slovak

&#268;  Č   &#269;  č   C-hacek
&#270;  Ď   &#271;  ď   D-hook
&#313;  Ĺ   &#314;  ĺ   L-acute
&#317;  Ľ   &#318;  ľ   L-apostrophe
&#327;  Ň   &#328;  ň   N-hacek
&#340;  Ŕ   &#341;  ŕ   R-acute
&#352;  Š   &#353;  š   S-hacek
&#356;  Ť   &#357;  ť   T-hook
&#381;  Ž   &#382;  ž   Z-hacek

Also has Á, É, Í, Ó, Ú, Ý, and also Ô.

Turkish

&#286;  Ğ   &#287;  ğ   G-breve (yumuşak-G)
&#304;  İ               I dotted capital
            &#305;  ı   I undotted lowercase
&#350;  Ş   &#351;  ş   S-cedilla

Also has Ç, Ö, Ü.

Turkmen

&#327;  Ň   &#328;  ň   N-hacek
&#350;  Ş   &#351;  ş   S-cedilla
&#381;  Ž   &#382;  ž   Z-hacek

Also uses Ä, Ç, Ö, Ü, Ý. Originally reported as using currency symbols $, ¢, ¥, but it seems these have now been replaced.

Ukrainian

Ukrainian uses the Cyrillic alphabet (see under Russian above) with the following additional letters:

&#1028;  Є   &#1108;  є   curved-E
&#1030;  І   &#1110;  і   I
&#1031;  Ї   &#1111;  ї   I-umlaut
&#1168;  Ґ   &#1169;  ґ   G-hook

Vietnamese

&#258;  Ă   &#259;  ă   A-breve
&#272;  Đ   &#273;  đ   D-bar
&#416;  Ơ   &#417;  ơ   O-hook
&#431;  Ư   &#432;  ư   U-hook

These and Â, Ê are letters of the Vietnamese alphabet; there are also numerous other accents for tone marks, which may be combined with any of the vowels.

Welsh

&#372;  Ŵ   &#373;  ŵ   W-circumflex
&#374;  Ŷ   &#375;  ŷ   Y-circumflex

Also has Â, Ê, Î, Ô, Û, and occasionally some others such as Ï.

Yoruba

&#7864; Ẹ   &#7865; ẹ   E-dot-below
&#7884; Ọ   &#7885; ọ   O-dot-below
&#7778; Ṣ   &#7779; ṣ   S-dot-below

4 C!s

(idea)

by liveforever

Tue Jun 11 2002 at 21:53:54

Chess symbols in Unicode:

&#9812; - ♔
&#9813; - ♕
&#9814; - ♖
&#9815; - ♗
&#9816; - ♘
&#9817; - ♙
&#9818; - ♚
&#9819; - ♛
&#9820; - ♜
&#9821; - ♝
&#9822; - ♞
&#9823; - ♟

By now, you will have noticed, O astute reader, that these all come on simple, white backgrounds. Apparently, the good people defining the Unicode standard don't play chess enough to understand how clueless this is.

Bottom line: if you want to node a chessboard with black squares, you're in trouble. Still, I'm sure you're creative enough to find a workaround.

If your browser doesn't display the characters, you'll need to get yourself a font that contains the upper range of Unicode characters - say, Arial Unicode MS, or some such.

(idea)

by tres equis

Fri Feb 14 2003 at 1:08:17

I love gn0sis's idea above for adding the native name, in Unicode, to your writeups about foreign terms, concepts, people, etc.

Here are many important terms I have gathered in various languages, all of them already noded or certainly nodeable.
They are listed first in alphabetical order of their native language, thence in English alphabetical order of the English word, English spelling, or English transliteration.
There are no doubt errors here since I don't speak any of these languages, so please /msg me if you find an error, have an addition, or if you use one of these in a writeup of yours.

To use these, you can try just cutting and pasting into your writeup - this works for some browsers depending on the configuration. Otherwise, use your browser's "View Source" menu and cut and paste the HTML entities.

Amharic (አማርኛ):

Addis Ababa (አዲስ አበባ)
Ethiopia (ኤትዮጵያ)

Arabic (العربية):

→ See Using Arabic on E2

Armenian (Հայ, Հայերեն):

Ararat (Արարատ)
Armenia (Հայաստան)
Artem Ivanovich Mikoyan (?)
Saint Mesrop Mashtots, Mesrob Mashtots (Մեսրոպ Մաշտոց)
Yerevan (Երևան)

Assamese (অসমিয়া):

Assam (आसाम)

Azerbaijani, Azeri (Азәрбайжан):

Baku (Баку, Bakı)

Belarusian (Беларуская):

Mensk, Minsk (Мінск)

Bengali (বাঙালী):

Bangladesh (বাংলাদেশ)
Bengal (বাঙালী)
Calcutta, Kolkata (কলকাতা)
Dhaka (ধাক)
Ganges (?)
Gayatri Chakravorty Spivak (?)
Sri Chinmoy (?)
West Bengal (পশ্চিম বঙ্গ)

Bulgarian (Български):

Sofia (София)

Cherokee (ᏣᎳᎩ):

Sequoyah, Sequoya (ᏍᏏᏉᏯ)

Chinese (汉语, 中文):

→ See Using Chinese on E2

Czech (Česky):

Antonín Dvorák (Antonín Dvořák)
Bohuslav Martinu (Bohuslav Martinů)

Dhivehi (ދިވެހިބަސް):

Male (މާލެ)
Maldives (ދިވެހި ރާއްޖެ)
Thaana (ަނާތ)

Dzongkha (༄༅ཇོ༹ང་ཁ):

Bhutan (འབྲུག་ཡུལ)

Farsi, Persian (فارسی):

algorithm (الگوریتم)
Ayatollah Khomeini (امام خمینی, آیة الله خمینی)
Caspian Sea (?)
chador (چادر)
Darius (داریوش)
Esfahan, Isfahan (اصفهان)
Farsi (فارسى)
Iran (ایران)
kismet (?)
Omar Khayyam (عمر خیام)
The Rubaiyat, The Rubáiyát (?)
Persia (پارس)
Persian Gulf (خلیج فارس)
rial (﷼, ریال)
Shah (شاه)
Tehran (تهران)
Xerxes (خشایارشا)

Georgian (ქართულად):

Eduard Shevardnadze (ედუარდ შევარდნაძე)
Georgia (საქართველო)
King Farnavaz (ფარნავაზი)
Khutsuri (ხუცური)
Saint Mesrop Mashtots, Mesrob Mashtots (?)
Mkhedruli (მხედრული)
Tbilisi (თბილისი)

Greek (Ελληνικά):

Athens (Αθήναι kath., Αθήνα dim.)
Cyclades (Κυκλάδες)
Cyprus (Κύπρος)
Deus ex Machina (από μηχανής θεός)
Dhimotiki (δημοτική)
Greece (Ελλάς kath., Ελλάδα dim.)
katharevousa (καθαρεύουσα)
koine (κοινή)
Lerna (Λερνα)
Lesbos (Λέσβος)
Nicosia (Λευκωσία)
ouzo (ούζο)
Panhellenic Socialist Movement (Πανελλήνιο Σοσιαλιστικο Κινήμα)
Parisoula Lampsos (?)
PASOK (ΠΑΣΟΚ)

Gujarati (ગુજરાતી):

Gujarat (ગુજરાત)
India (ભારત)

Hebrew (עברית):

→ See Using Hebrew on E2

Hindi (हिन्दी):

Bhopal (भोपाल)
bungalow (?)
crore (?)
dacoit (?)
Delhi (दिल्ली)
deodar (?)
Devanagari (देवनागरी)
dinghy (?)
dungaree (?)
Ganges (?)
ghee (?)
Goa (गोआ)
gymkhana (?)
Hindu (हिन्दू)
India (भारत)
jodhpurs (?)
Kaante (कांटे)
lakh (?)
loot (?)
Mumbai, Bombay (मु॑बई)
paisa (?)
pakora (?)
Pondicherry (पॉंडिचेरी)
Raj (?)
Rajasthan (राजसंथान)
samosa, samoosa (?)
shampoo (?)
tandoori (?)
tom-tom (?)
walla, wallah (?)

Inuktitut (ᐃᓄᒃᑎᑐᑦ):

Nunavut (ᓄᓇᕗᑦ)

Japanese (日本語):

→ See Using Japanese on E2

Kannada, Kanarese (ಕನ್ನಡ):

India (ಭಾರತ)
Karnataka (ಕನಾೕಟಕ)

Kashmiri (कश्मीरी, كشميري):

Kashmir (كشمير)
Srinagar (سرينگر)

Kazakh (Қазақ):

Altynai Asylmuratova (Алтынай Асылмуратова?)
Astana (Астана)
Baikonur ()
Kazakhstan (Қазақстан)

Khmer, Cambodian (ខ្មែរ):

Angkor Wat (អង្គរវត្ដ)
Hun Sen (?)
Khmer Rouge (?)
Killing Fields of Choeung Ek (...)
King Sihanouk, Norodom Sihanouk, Prince Sihanouk (?)
Phnom Penh (ភ្លពេញ)
Pol Pot, Saloth Sar (?)

Klingon, tlhIngan Hol ( ):

Konkani (कोंकणी):

India (भारत)

Korean (한국어):

bulgogi (불고기)
cha, ch'a (차)
Daewoo (대우)
hangul (한글)
hanja (한자)
Hyundai (현대)
jamo (자모)
jindo dog (진돗개, 진도개)
Jjajang myeon (자장면, 炸醬麵)
Kia (기아)
Kim Dae Jung (김대중)
Kim Il Sung (김일성, 金日成)
Kim Jong Il (김정일)
kimchi, kimchee (김치)
King Sejong the Great (세종대왕, 世宗大王)
Kuk Sool Won (국술원)
Korea (한국, 대한민국)
North Korea (북한)
Pyongyang (평양, 平壤市)
Samsung (삼성)
Seoul (서울)
soju (소주, 燒酒)
South Korea (남한)
Tae Kwon Do (태권도)
Tang Soo Do (당수도)
won (원)

Kyrgyz (кыргыз, кыргызча):

Bishkek (Бишкек)
Kyrgyzstan (Кыргызстан)

Lao (ລາວ):

Beer Lao (ເບຍລາວ)
Lao Aviation (ການບິນລາວ)
Laos (ລາວ)
Luang Phabang (?)
Savannakhet (?)
Vientiane (?)

Latvian (latviešu):

Riga (Rīga)

Macedonian (Македонски):

Macedonia (Македонија)
Skopje (Скопје)

Malayalam (മലയാളം):

betel (?)
coir (?)
copra (?)
ginger (?)
Kerala (േകരളം)
teak (?)

Manipuri (?):

Manipur (मिनपुर)

Maori (Māori):

Auckland (Tāmaki makau rau, Ākarana)
Christchurch (Otautahi)
New Zealand (Aotearoa, Niu Tireni)
North Island (Te Ika a Māui, Aotearoa)
South Island (Te Waipounamu)
Wellington (Pōneke, Upoko o te Ika)

Marathi (मराठी):

India (भारत)
mongoose (?)

Mongolian (монгол хэл):

Mongolia (Монгол улс)
Ulaanbaatar, Ulaan Baatar, Ulan Bator (Улаанбаатар)

Myanmar, Burmese (မ္ရန္မာ):

Daw Aung San Suu Kyi (ဒော္‌ အောင္ ဆန္း စု က္ရည္)
Bagan, Pagan (ဟောင္း)
Bago, Pegu (ပဲခူး)
Mandalay (ဓန္တလေး)
Mingun Paya (?)
Mrauk U (?)
Shwedagon Paya (?)
Shwemawdaw Paya (?)
Yangon, Rangoon (ရန္ကုန္)

Nepali (नेपाली):

Gangtok (गंगटोक)
Kathmandu (काठ्माड़ौं)
Mount Everest, Sagarmatha (सगरमाथा)
Nepal (नेपाल)
Sikkim (सिक्किम)
Tenzing Norgay (?)

Oriya (ଓଡ଼ିଆ):

Orissa (ଓରୀଶା)

Pashto (پښتو):

Afghanistan (افغانستان)
Hamid Karzai (?)
Kabul (كابل)
Kandahar (کندهار)
Loya Jirga (?)

Pitjantjatjara:

Anangu (Aṉangu)
Kata Tjuta, The Olgas (Kata Tjuṯa)
Uluru, Ayers Rock (Uluṟu)

Polish (Polski):

Krzysztof Kieslowski (Krzysztof Kieślowski)
Lech Walesa (Lech Wałęsa)

Punjabi (ਪੰਜਾਬੀ, پنجابي, पंजाबी):

Chandigarh (ਛੰਦੀਗਰਹ)
Gurmukhi (ਗੁਰ੍ਮੁਖੀ)
India (ਭਾਰਤ)
Kalpana Chawla (ਕਲ੍ਪਨਾ ਚਾਵਲਾ, कल्पना चावला) (?? - Devanagari/Hindi version found on BBC, Punjabi version derived from it but can't find on the net.)
Punjab (ਪਂਜਾਬ)

Romanian (Română):

Bucharest (București, Bucureşti)
Elena Ceausescu (Elena Ceaușescu, Elena Ceauşescu)
Nicolae Ceausescu (Nicolae Ceaușescu, Nicolae Ceauşescu)
Romania (România)

Russian (Русский):

→ See Using Russian on E2

Sanskrit (संस्कृत):

ahimsa (?)
asana (?)
ashrama (?)
atman (?)
avatar (?)
bodhisattva (?)
brahmin (?)
Buddha (बुद्ध)
chakra (?)
devanagari (देवनगरी)
dharma (धर्म)
guru (?)
hatha yoga (?)
India (भारतम्)
Indra (इन्द्र)
karma (?)
lingam (?)
maharaja, maharajah (?)
mahatma (?)
mantra (?)
Maya (?)
nirvana (?)
raja, rajah (?)
rani, ranee (?)
satyagraha (?)
sutra (सूत्र)
swastika (?)
yantra (?)
veda (वेद)
yoga (योग)
yogasana (?)

Serbian (Српски, Srpski):

Belgrade (Београд, Beograd)
Serbia (Србија, Srbija)
Serbocroatian (Српскохрватски, Srpskohrvatski)
Slobodan Milosevic (Слободан Милошевић, Slobodan Milošević)
Yugoslavia (Југославија, Jugoslavija)
Zoran Djindjic (Зоран Ђинђић, Zoran Đinđić)

Sinhala, Sinhalese (සිංහල):

anaconda (?)
Colombo (කොළඹ)
Sri Lanka (ශෘිලංකා ?)
tourmaline (?)

Slovenian (Slovenščina):

France Preseren (France Prešeren)

Tajik (тоҷикӣ):

Dushanbe (Душанбе)
Tajikistan (Тоҷикистон)

Tamil (தமிழ்):

catamaran (?)
cheroot (?)
Colombo (கொழம்ப)
curry (?)
India (இந்தியா)
mango (?)
mulligatawny (?)
pariah (?)
Sri Lanka (?)
Tamil Nadu (தமிழ் நாடூ)

Telugu (తెలుగు):

bandicoot (?)
India (భారత దేశం)

Thai (ไทย):

Ananda Mahidol (?)
Ayuthaya, Ayutthaya, Ayodhya (อยุธยา)
Baht (฿, บาท)
Bangkok (กรุงเทพฯ)
Bhumibol Adulyadej, Rama IX (?)
Chakri (?)
Chang Beer (?)
Chao Phraya (?)
chedi, pagoda (?)
Chiang Mai (เชียงใหม่)
Chula Chakrabongse (?)
Chulalongkorn (?)
Ekatrina Desnitskaya (?)
farang (ฝรั่ง)
Koh Phangan (เกาะพงัน)
Koh Samui (เกาะสมุย)
Koh Tao (เกาะเต่า)
Loy Krathong (?)
Mekong (?)
Mongkut, Rama IV (?)
muay thai (มวยไทย)
Phuket (ภูเก็ต)
Rama I Chakri (?)
Rama II, Phra Phutthaloetla (?)
Rama III (?)
Rama IV, Mongkut (?)
Rama IX, Bhumibol Adulyadej (?)
Ramkhamhaeng (?)
Siam (?)
Sirindhorn (?)
Somdet Phra Nyanasamvara (?)
Songkran (สงกรานต์)
Sukhothai (?)
Suratthani (สุราษฎร์ธานี)
Taksin (?)
Thailand (ประเทศ ไทย)
Thammasat University (?)
Thonburi (?)
uparat (?)
Vajiralongkorn (?)
wat (วัด)
Wat Arun (?)
Wat Chai Watthanaram (?)
Wat Chiang Man (?)
Wat Pho (วัดโพธิ์)
Wat Phra Kaew (วัดพระแก้ว)
Wat Phra Singh (วัดสิงห์)

Tibetan (བོད་ཡིག):

Buddha (?)
Dalai Lama (ད་ལའི་བླ་མ་)
Gelugpa (?)
Karmapa Lama (? བླ་མ་)
lama (བླ་མ་)
Lhamo Dhondrup (?)
Lhasa (ལྷ་ས་)
Mount Everest, Qomolangma, Chomolungma (ཇོ་མོ་གླང་རི་)
Panchen Lama (? བླ་མ་)
Sherpa (ཤར་བ་)
Shigatse (?)
Tenzin Gyatso (?)
Tibet (བོད་)
Ü (དབུས་)
yeti (?)

Tigrinya (ትግርኛ):

Eritrea (ኤርትራ)

Turkish (Türkçe):

Cyprus (Kıbrıs)
Istanbul (İstanbul)
Nicosia (Lefkoşa)
Turkey (Türkiye)

Turkmen (Туркмен):

Turkmenistan (Туркменистан)
Ashgabat (Ашхабад)

Ukrainian (Український):

Chernobyl (Чернобыль)
Ivan Turgenev (Иван Тургенев)
Kyyiv, Kiev (Київ)
Leonid Ilyich Brezhnev (Леонид Ильич Брежнев)
Nikolai Gogol (Николай Гоголь)

Urdu (اردو):

Amritsar (امرتسر)
bungalow (?)
crore (?)
dacoit (?)
deodar (?)
dinghy (?)
dungaree (?)
ghee (?)
gymkhana (?)
Islamabad (اسلاماباد)
jodhpurs (?)
Karachi (کراچي)
Khyber Pass (خىءبر پاس or خيبر پاس ?)
Lahore (لاهور)
lakh (?)
loot (?)
Nusrat Fateh Ali Khan (نصرت فتح علی خان)
paisa (?)
Pakistan (پاکستان)
pakora (?)
Peshawar (پشاور)
Raj (?)
samosa, samoosa (?)
shampoo (?)
tandoori (?)
tom-tom (?)
wallah (?)

Uzbek (O'zbekcha, Ўзбек):

Bukhara (Buxoro, Бұхара?)
Samarkand (Samarqand, Самарқанд)
Tashkent (Toshkent, Тошкент)
Uzbekistan (O'zbekiston, Ўзбекистан)

Vietnamese (tiếng Việt):

ao dai (áo dài)
cha gio (chả giò)
Cao Dai (Cao Đài)
Da Nang (Đà Nẵng)
dong (đồng)
Hanoi (Hà Nội)
Ho Chi Minh, Ho Chi Minh City (Hồ Chí Minh)
Hue (Huế)
Le Duc Tho (Lê Ðức Thọ)
Mekong (Mê Kông)
Nguyen Hue (Nguyễn Huệ)
Saigon (Sài Gòn)
Tet (Tết, Tết Nguyên Ðán)

Yiddish (ייִדיש):

Hebrew (העברעיִש)
Israel (ישׂראל)
Jew (ייִד)
Russia (רוסלאַנד)

Sources:
Google
Foreign language dictionaries at various Sydney libraries
http://www.geonames.de/
http://www.zhongwen.com/

1 C!
