Comment by gertop 8 hours ago
UTF-16 is both simpler to parse and more compact than utf-8 when writing non-english characters.
UTF-8 didn't win on technical merits, it won because it was mostly backwards compatible with all American software that previously used ASCII only.
When you leave the anglosphere you'll find that some languages still default to other encodings due to how large utf-8 ends up for them (Chinese and Japanese, to name two).
> UTF-16 is both simpler to parse and more compact than utf-8 when writing non-english characters.
UTF-8 and UTF-16 take the same number of bytes to encode a non-BMP character (4 bytes) or a character in the range U+0080-U+07FF (2 bytes), which includes most of the Latin supplements, Greek, Cyrillic, Arabic, Hebrew, Aramaic, Syriac, and Thaana. For ASCII characters--which include most whitespace and punctuation--UTF-8 takes half as much space as UTF-16, while for characters in the range U+0800-U+FFFF, UTF-8 takes 50% more space than UTF-16 (3 bytes versus 2). Thus, for most European languages, and even Arabic (which ain't European), UTF-8 is going to be more compact than UTF-16.
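If you want to check those per-range numbers yourself, here is a minimal Python sketch. The helper name `encoded_sizes` and the sample characters are my own picks for illustration; I encode as `utf-16-le` so that no byte-order mark gets counted.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no BOM counted."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# One sample character per range discussed above (illustrative choices).
samples = {
    "ASCII, U+0041": "A",
    "Cyrillic, U+0416": "\u0416",
    "CJK, U+4E2D": "\u4e2d",
    "non-BMP emoji, U+1F600": "\U0001F600",
}

for label, ch in samples.items():
    utf8, utf16 = encoded_sizes(ch)
    print(f"{label}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes")
```

The ASCII sample is 1 byte versus 2, the Cyrillic one is 2 versus 2, the CJK one is 3 versus 2, and the non-BMP one is 4 versus 4.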
The Asian languages (CJK-based languages, Indic languages, and South-East Asian, largely) are the ones that are more compact in UTF-16 than UTF-8, but if you embed those languages in a context likely to have significant ASCII content--such as an HTML file--well, it turns out that UTF-8 still wins out!
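A rough way to see the markup effect, as a sketch only: the HTML fragment below is a made-up example, and how much ASCII surrounds the text on a real page varies a lot, so treat the ratio here as an assumption rather than a measurement.

```python
def sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no BOM counted."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Seven Japanese characters, all in the BMP above U+0800.
bare_text = "こんにちは世界"
# The same text wrapped in a small, hypothetical bit of HTML markup.
html_fragment = '<p class="note">' + bare_text + "</p>"

bare_utf8, bare_utf16 = sizes(bare_text)
html_utf8, html_utf16 = sizes(html_fragment)

print(f"bare text: UTF-8 = {bare_utf8}, UTF-16 = {bare_utf16}")
print(f"with tags: UTF-8 = {html_utf8}, UTF-16 = {html_utf16}")
```

On its own the Japanese text is smaller in UTF-16, but once the ASCII tags are counted the UTF-8 total comes out smaller, since every tag character costs 1 byte instead of 2.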
> When you leave the anglosphere you'll find that some languages still default to other encodings due to how large utf-8 ends up for them (Chinese and Japanese, to name two).
You'll notice that the encodings that are used are not UTF-16 either. Also, my understanding is that China generally defaults to UTF-8 nowadays despite a government mandate to use GB18030 instead, so it's largely Japan that is the last redoubt of the anti-Unicode club.