Comment by MyOutfitIsVague 10 hours ago
You're somewhat mistaken about "UTF-32, or UTF-8 for that matter, is perfectly capable of representing [U+D83D U+DE00] as a sequence distinct from [U+1F600]." You're right that the raw encoding is mechanically capable of this, but Unicode forbids it: U+D83D and U+DE00 are surrogate code points, which may not appear in any well-formed UTF-8, UTF-16, or UTF-32 sequence.
Using those code points makes for invalid Unicode, not just invalid UTF-16. Rust, which uses UTF-8 for its String type, likewise rejects surrogate code points: `let illegal: char = 0xDEADu32.try_into().unwrap();` panics.
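A minimal sketch of the same checks, using nothing beyond the standard library; the rejection happens both at the `char` level and at the raw-bytes level:

```rust
fn main() {
    // Surrogates are rejected at the `char` level: U+DEAD is in the
    // low-surrogate range (U+DC00..=U+DFFF), so the conversion fails.
    assert!(char::try_from(0xDEADu32).is_err());
    assert!(char::from_u32(0xDEAD).is_none());

    // And at the byte level: [0xED, 0xBA, 0xAD] is the UTF-8-style
    // three-byte pattern for U+DEAD, but it is not well-formed UTF-8,
    // so String construction fails too.
    assert!(String::from_utf8(vec![0xED, 0xBA, 0xAD]).is_err());
}
```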
It's not that these languages emulate the UTF-16 worldview; it's that UTF-16 has infected and shaped all of Unicode. No code point is allowed that can't be unambiguously represented in UTF-16: the surrogate range is carved out of the scalar values, and the code space ends at U+10FFFF precisely because that's the highest value a surrogate pair can address (see the sketch below).
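To illustrate, here's the surrogate-pair arithmetic from the Unicode standard as a small sketch (the function name `to_surrogate_pair` is my own):

```rust
/// Encode a supplementary-plane scalar value (U+10000..=U+10FFFF)
/// as a UTF-16 surrogate pair.
fn to_surrogate_pair(cp: u32) -> (u16, u16) {
    assert!((0x10000..=0x10FFFF).contains(&cp));
    let v = cp - 0x10000;                  // 20 bits of payload
    let high = 0xD800 + (v >> 10) as u16;  // top 10 bits
    let low = 0xDC00 + (v & 0x3FF) as u16; // bottom 10 bits
    (high, low)
}

fn main() {
    // U+1F600 (the emoji above) becomes the pair 0xD83D, 0xDE00.
    assert_eq!(to_surrogate_pair(0x1F600), (0xD83D, 0xDE00));
    // 20 bits of payload is exactly why the code space tops out:
    // 0x10000 + 0xFFFFF == 0x10FFFF.
}
```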
edit: This cousin comment has some really good detail on Python in particular: https://news.ycombinator.com/item?id=44997146
The Unicode Consortium has indeed published documents recommending that people adopt the UTF-16 worldview when working with strings, but it is not always a good idea to follow their recommendations.