Comment by wrs 9 hours ago
UTF-16 made lots of sense at the time because Unicode thought "65,536 characters will be enough for anybody" and it retains the 1:1 relationship between string elements and characters that everyone had assumed for decades. I.e., you can treat a string as an array of characters and just index into it with an O(1) operation.
As Unicode (quickly) evolved, it turned out that not only are there WAY more than 65,536 characters, there's not even a 1:1 relationship between code points and characters, or even a single defined transformation between characters and code points, or even a simple relationship between glyphs and what's on the screen. So even UTF-32 isn't enough to let you act like it's 1980 and str[3] is the 4th "character" of a string.
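A quick sketch of what that looks like in a UTF-16 language (JavaScript/TypeScript here; the string literals are just illustrative):

```typescript
// JS/TS strings are sequences of UTF-16 code units, so [] gives you code
// units, not characters.
const s = "na\u00EFve \u{1F600}!";   // "naïve 😀!" — 😀 lives outside the BMP

console.log(s.length);       // 9 code units, though a reader sees 8 "characters"
console.log(s[6]);           // "\uD83D" — half of 😀's surrogate pair
console.log([...s].length);  // 8 — spreading iterates by code point instead

// And even code points aren't "characters": é can be one code point (U+00E9)
// or two (e + U+0301 combining acute). Both render identically.
console.log("\u00E9".length, "e\u0301".length);  // 1 2
console.log("\u00E9" === "e\u0301");             // false
```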
So now we have very complex string APIs that reflect the actual complexity of how human language works...though lots of people (mostly English-speaking) still act like str[3] is the 4th "character" of a string.
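For example, getting user-perceived "characters" (grapheme clusters) out of a string takes a real segmentation API rather than array indexing. A small sketch using Intl.Segmenter, the standard grapheme segmenter in current JavaScript runtimes:

```typescript
// Grapheme clusters vs. code units: one visible "character" can span many
// code points joined by zero-width joiners or regional indicators.
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });

const flag = "\u{1F1FA}\u{1F1F8}";                          // 🇺🇸 — two regional indicators
const family = "\u{1F469}\u200D\u{1F469}\u200D\u{1F467}";   // 👩‍👩‍👧 — three emoji + two ZWJs

for (const str of [flag, family]) {
  const graphemes = [...seg.segment(str)].map(x => x.segment);
  console.log(str, "code units:", str.length, "graphemes:", graphemes.length);
}
// 🇺🇸  code units: 4 graphemes: 1
// 👩‍👩‍👧 code units: 8 graphemes: 1
```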
UTF-8 was designed with the knowledge that there's no point in pretending that string indexing will work. Windows, MacOS, Java, JavaScript, etc. just missed the boat by a few years and went the wrong way.
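You can see the tradeoff UTF-8 makes directly (sketch using the standard TextEncoder): a code point takes 1 to 4 bytes, so there is no O(1) "give me the Nth character" by design.

```typescript
// UTF-8 byte lengths for single code points in different ranges.
const bytes = (s: string) => new TextEncoder().encode(s).length;

console.log(bytes("e"));   // 1 byte  (ASCII)
console.log(bytes("é"));   // 2 bytes (Latin-1 range)
console.log(bytes("漢"));  // 3 bytes (BMP, CJK)
console.log(bytes("😀"));  // 4 bytes (outside the BMP)
```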
I think more effort should have been made to live with 65,536 characters. My understanding is that code points beyond 65,536 are only used for languages that are no longer in use, and emojis. I think that adding emojis to Unicode is going to be seen as a big mistake. We already have enough network bandwidth to just send raster graphics for images in most cases. Cluttering the Unicode codespace with emojis is pointless.