Comment by cyberax

Comment by cyberax 9 hours ago

UTF-8 is simply genius. It entirely obviated the need for clunky 2-byte encodings (and all the associated nonsense about byte order marks).
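
To make the byte-order point concrete, here is a minimal Python sketch (the sample character is an arbitrary choice, not something from the thread). It shows that UTF-8 has one byte layout, while UTF-16 has two and needs a byte order mark to tell them apart:

```python
import codecs

text = "\u00e9"  # "é", chosen only as an example

# UTF-8 is a sequence of single bytes, so there is no byte order to get wrong.
utf8 = text.encode("utf-8")

# UTF-16 uses 2-byte code units, so the same code point has two possible layouts.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# Python's generic "utf-16" codec prepends a byte order mark in native byte order.
utf16_bom = text.encode("utf-16")
has_bom = utf16_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

print(utf8.hex(), utf16_le.hex(), utf16_be.hex(), has_bom)
# c3a9 e900 00e9 True
```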

The only problem with UTF-8 is that Windows and Java were developed without knowledge about UTF-8 and ended up with 16-bit characters.

Oh yes, and Python 3 should have known better when it went through the string-bytes split.

wrs 9 hours ago

UTF-16 made lots of sense at the time because Unicode thought "65,536 characters will be enough for anybody" and it retains the 1:1 relationship between string elements and characters that everyone had assumed for decades. I.e., you can treat a string as an array of characters and just index into it with an O(1) operation.

As Unicode (quickly) evolved, it turned out that not only are there WAY more than 65,000 characters, there's not even a 1:1 relationship between code points and characters, or even a single defined transformation between glyphs and code points, or even a simple relationship between glyphs and what's on the screen. So even UTF-32 isn't enough to let you act like it's 1980 and str[3] is the 4th "character" of a string.
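
A small Python sketch of that mismatch, using a hypothetical helper `sizes` and sample strings picked for illustration. Each sample renders as one user-perceived character, yet the counts differ depending on what you count:

```python
import unicodedata

def sizes(s: str) -> tuple:
    """Return (code points, UTF-16 code units, UTF-8 bytes) for s."""
    return (len(s), len(s.encode("utf-16-le")) // 2, len(s.encode("utf-8")))

decomposed = "e\u0301"                               # "e" + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9
grinning = "\U0001F600"                              # emoji outside the BMP
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # three emoji joined by ZWJ

print(sizes(decomposed))  # (2, 2, 3)
print(sizes(composed))    # (1, 1, 2)
print(sizes(grinning))    # (1, 2, 4)
print(sizes(family))      # (5, 8, 18)
```

Python's `str` indexes by code point, so even here `family[3]` is a code point in the middle of what a reader sees as one symbol. Counting user-perceived characters needs grapheme segmentation, which the standard library does not provide.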

So now we have very complex string APIs that reflect the actual complexity of how human language works...though lots of people (mostly English-speaking) still act like str[3] is the 4th "character" of a string.

UTF-8 was designed with the knowledge that there's no point in pretending that string indexing will work. Windows, MacOS, Java, JavaScript, etc. just missed the boat by a few years and went the wrong way.
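
As a rough sketch of what giving up O(1) indexing looks like in practice, the Python below finds code point boundaries in UTF-8 by scanning for non-continuation bytes. The function names are made up for this example, and the code assumes the input is already valid UTF-8:

```python
def codepoint_offsets(data: bytes) -> list:
    """Byte offsets at which each code point starts (assumes valid UTF-8)."""
    # Continuation bytes have the form 0b10xxxxxx; any other byte starts a code point.
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]

def nth_codepoint(data: bytes, n: int) -> str:
    """Return the n-th code point (0-based). This is an O(len(data)) scan, not O(1)."""
    starts = codepoint_offsets(data)
    end = starts[n + 1] if n + 1 < len(starts) else len(data)
    return data[starts[n]:end].decode("utf-8")

# "a", EURO SIGN, GRINNING FACE, "z": 1 + 3 + 4 + 1 bytes
data = "a\u20ac\U0001F600z".encode("utf-8")

print(codepoint_offsets(data))  # [0, 1, 4, 8]
print(nth_codepoint(data, 2) == "\U0001F600")  # True
```

Because lead bytes and continuation bytes are distinguishable, a reader dropped at any byte offset can resynchronize by skipping at most three continuation bytes.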

rowls66 9 hours ago

I think more effort should have been made to live with 65,536 characters. My understanding is that codepoints beyond 65,536 are only used for languages that are no longer in use, and emojis. I think that adding emojis to Unicode is going to be seen as a big mistake. We already have enough network bandwidth to just send raster graphics for images in most cases. Cluttering the Unicode codespace with emojis is pointless.

- jasonwatkinspdx 7 hours ago
  
  You are mistaken. Chinese Hanzi and the languages that derive from or incorporate them require way more than 65,536 code points. In particular a lot of these characters are formal family or place names. UCS-2 failed because it couldn't represent these, and people using these languages justifiably objected to having to change how their family name is written to suit computers, vs computers handling it properly.
  This "two bytes should be enough" mistake was one of the biggest blind spots in Unicode's original design, and is cited as an example of how standards groups can have cultural blind spots.
  
  duskwuff 5 hours ago
  
  UTF-16 also had a bunch of unfortunate ramifications for the overall design of Unicode, e.g. requiring a substantial chunk of the BMP to be reserved for surrogates and forcing Unicode codepoints to be limited to U+10FFFF.
  
- gred 8 hours ago
  
  > My understanding is that codepoints beyond 65,536 are only used for languages that are no longer in use, and emojis
  This week's Unicode 17 announcement [1] mentions that of the ~160k existing codepoints, over 100k are CJK codepoints, so I don't think this can be true...
  [1] https://blog.unicode.org/2025/09/unicode-170-release-announc...
  
- duskwuff 8 hours ago
  
  Your understanding is incorrect; a substantial number of the ranges allocated outside BMP (i.e. above U+FFFF) are used for CJK ideographs which are uncommon, but still in use, particularly in names and/or historical texts.
  
- mort96 8 hours ago
  
  The silly thing is, lots of emoji these days aren't even a single code point. So many emoji these days are two other code points combined with a zero width joiner. Surely we could've introduced one code point which says "the next code point represents an emoji from a separate emoji set"?
  
- dudeinjapan 8 hours ago
  
  CJK unification (https://en.wikipedia.org/wiki/CJK_Unified_Ideographs) i.e. combining "almost same" Chinese/Japanese/Korean characters into the same codepoint, was done for this reason, and we are now living with the consequence that we need to load separate Traditional/Simplified Chinese, Japanese, and Korean fonts to render each language. Total PITA for apps that are multi-lingual.
  
  mort96 8 hours ago
  
  This feels like it should be solvable by introducing a few more marker characters, like one code point representing "the following text is traditional Chinese", "the following text is Japanese", etc? It would add even more statefulness to Unicode, but I feel like that ship has already sailed with the U+202D LEFT-TO-RIGHT OVERRIDE and U+202E RIGHT-TO-LEFT OVERRIDE characters...
  
- daneel_w 8 hours ago
  
  I entirely agree that we could've cared better for the leading 16-bit space. But protocol-wise adding a second component (images) to the concept of textual strings would've been a terrible choice.
  The grande crime was that we squandered the space we were given by placing emojis outside the UTF-8 specification, where we already had a whopping 1.1 million code points at our disposal.
  
  duskwuff 5 hours ago
  
  > The grande crime was that we squandered the space we were given by placing emojis outside the UTF-8 specification
  I'm not sure what you mean by this. The UTF-8 specification was written long before emoji were included in Unicode, and generally has no bearing on what characters it's used to encode.
  
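
The surrogate mechanism raised in the replies above (the reserved BMP range and the U+10FFFF ceiling) comes down to a little arithmetic. The Python below is an illustrative sketch with hypothetical function names, not code from any particular library:

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000            # fits in 20 bits
    high = 0xD800 + (v >> 10)   # top 10 bits go into U+D800..U+DBFF
    low = 0xDC00 + (v & 0x3FF)  # bottom 10 bits go into U+DC00..U+DFFF
    return high, low

def from_surrogates(high: int, low: int) -> int:
    """Inverse of to_surrogates."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))                   # 0xd83d 0xde00

# The largest possible pair is where the U+10FFFF limit comes from.
print(hex(from_surrogates(0xDBFF, 0xDFFF)))  # 0x10ffff
```

The 2,048 code points in U+D800..U+DFFF are the reserved chunk of the BMP: they can never be assigned to characters, because UTF-16 needs them as pair halves.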

wongarsu 9 hours ago

Yeah, Java and Windows NT 3.1 had really bad timing. Both managed to include Unicode despite starting development before the Unicode 1.0 release, but both added Unicode back when Unicode was 16-bit and the need for something like UTF-8 was less clear.

KerrAvon 5 hours ago

NeXTstep was also UTF-16 through OpenStep 4.0, IIRC. Apple was later able to fix this because the string abstraction in the standard library was complete enough that no one actually needed to care about the internal representation, but the API still retains some of the UTF-16-specific weirdnesses.
