UTF-8 Unicode Links

Test your browser with many alphabets in UTF-8.


UTF-8 encoding is a good choice for viewing Cuneiform fonts, and it is not clear when UTF-8 would be a poor choice. This Cuneiform Club uses decimal code points near 73728 (hex 12000), and they are interpreted correctly when the browser's "encoding" menu specifies UTF-8. This page explores why UTF-8 works better than UTF-32 here. Decimal 73728 is greater than hex FFFF and less than hex 10FFFF, so a single 8-bit or 16-bit number is not enough to hold it. UTF-16 uses a pair of 16-bit numbers to represent code points above hexadecimal FFFF. UTF-32 uses one 32-bit number, whose value never exceeds hex 10FFFF.
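The sizes described above can be checked with a short Python sketch. It is only an illustration and was not part of the original test page. It relies on Python's built-in codecs, and the big-endian variants are chosen so that no byte order mark is added to the output.

```python
# Encode the Cuneiform code point decimal 73728 (hex 12000) three ways.
cp = 73728
ch = chr(cp)

utf8 = ch.encode("utf-8")       # four 8-bit bytes
utf16 = ch.encode("utf-16-be")  # two 16-bit numbers (a surrogate pair)
utf32 = ch.encode("utf-32-be")  # one 32-bit number

for name, data in (("UTF-8", utf8), ("UTF-16", utf16), ("UTF-32", utf32)):
    # bytes.hex(" ") needs Python 3.8 or later
    print(name, data.hex(" ").upper(), "-", len(data), "bytes")
```

For this code point all three encodings take four bytes, so the difference between them is in how the bits are laid out, not in the size.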



The blogspots where I post Cuneiform code points near decimal 73728 sometimes automatically replace what I post with UTF-16 pairs of 16-bit numbers! Unicode calls the two numbers in such a pair the high surrogate and the low surrogate. The replacement causes corruption when the page is displayed by some browsers. Example pair: & # 55304; & # 56481;
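The arithmetic behind a surrogate pair can be sketched in Python. The function names combine_surrogates and split_surrogates are made up for this illustration; the constants are the standard UTF-16 surrogate ranges.

```python
def combine_surrogates(high, low):
    """Return the code point encoded by a UTF-16 surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_surrogates(cp):
    """Return the (high, low) surrogate pair for a code point above hex FFFF."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(combine_surrogates(55304, 56481))  # 73889
print(split_surrogates(73728))           # (55304, 56320)
```

So the example pair 55304, 56481 stands for the single code point decimal 73889 (hex 120A1), which is in the Cuneiform range near 73728.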


Using the browser menu setting for UTF-32 failed for this test page, but UTF-8 works well as the Firefox character encoding. The menu choice that failed was UTF-32BE (BE for big endian). HTML pages commonly use UTF-8, but that does not mean I will start posting Cuneiform as UTF-8 bytes instead of code points near decimal & # 73728. Apparently it is correct for people to post code points near 73728 on blogspot, and the software then invisibly converts each long integer into several bytes to form conformant UTF-8. The UTF-8 bytes are then in RAM, but that encoding is not displayed as bytes on your monitor; your monitor shows glyphs, or long integers that look like UTF-32 values.
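That conversion from a posted decimal code point to UTF-8 bytes can be imitated in Python. This is a sketch only: it uses html.unescape from the Python standard library, which follows the HTML5 rules for numeric character references, and a particular browser or blogging service may behave differently.

```python
import html

# One numeric reference for the whole code point becomes one character.
text = html.unescape("&#73728;")
print(len(text), hex(ord(text)))      # 1 0x12000
print(text.encode("utf-8").hex(" "))  # f0 92 80 80

# A surrogate pair written as two references is not recombined here;
# each half is replaced by U+FFFD, the replacement character.
broken = html.unescape("&#55304;&#56481;")
print([hex(ord(c)) for c in broken])  # ['0xfffd', '0xfffd']
```

This is consistent with the corruption seen when a code point is rewritten as a pair of surrogate references.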


Use this link to convert a codepoint near hex 012000 to UTF-8.

The hex code point 12000 is converted to the UTF-8 bytes F0 92 80 80. That calculation is explained in this linked document, on page 94, Table 3-6: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
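The same calculation can be written out by hand in Python. This is a minimal sketch of the usual UTF-8 bit distribution, with a made-up function name encode_utf8; in practice chr(cp).encode("utf-8") does the same job.

```python
def encode_utf8(cp):
    """Encode one code point into UTF-8 bytes by distributing its bits."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0x7F:       # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(encode_utf8(0x12000).hex(" ").upper())  # F0 92 80 80
```

Hex 12000 falls in the four-byte case: its top bits are all zero, giving F0, the next six bits are 010010, giving 92, and the last twelve bits are zero, giving 80 80.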