Comment by alright2565

Comment by alright2565 8 hours ago

> "Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type"

Can you elaborate more on this? I understood the Python string to be UTF-32, with optimizations where possible to reduce memory use.

csande17 8 hours ago

I could be mistaken, but I think Python cares about making sure strings don't include any surrogate code points that can't be represented in UTF-16 -- even if you're encoding/decoding the string using some other encoding. (Possibly it still lets you construct such a string in memory, though? So there might be a philosophical dispute there.)

Like, the basic code points -> bytes in memory logic that underlies UTF-32, or UTF-8 for that matter, is perfectly capable of representing [U+D83D U+DE00] as a sequence distinct from [U+1F600]. But UTF-16 can't because the first sequence is a surrogate pair. So if your language applies the restriction that strings can't contain surrogate code points, it's basically emulating the UTF-16 worldview on top of whatever encoding it uses internally. The set of strings it supports is the same as the set of strings a language that does use well-formed UTF-16 supports, for the purposes of deciding what's allowed to be represented in a wire protocol.

Reply View 3 replies

MyOutfitIsVague 5 hours ago

You're somewhat mistaken, in that "UTF-32, or UTF-8 for that matter, is perfectly capable of representing [U+D83D U+DE00] as a sequence distinct from [U+1F600]." You're right that the encoding on a raw level is technically capable of this, but it is actually forbidden in Unicode. Those are invalid codepoints.
Using those codepoints makes for invalid Unicode, not just invalid UTF-16. Rust, which uses utf-8 for its String type, also forbids unpaired surrogates. `let illegal: char = 0xDEADu32.try_into().unwrap();` panics.
It's not that these languages emulate the UTF-16 worldview, it's that UTF-16 has infected and shaped all of Unicode. No code points are allowed that can't be unambiguously represented in UTF-16.
edit: This cousin comment has some really good detail on Python in particular: https://news.ycombinator.com/item?id=44997146

Reply View | 1 reply
- csande17 4 hours ago
  
  The Unicode Consortium has indeed published documents recommending that people adopt the UTF-16 worldview when working with strings, but it is not always a good idea to follow their recommendations.
  
  Reply View | 0 replies
[removed] 7 hours ago

[deleted]

Reply View | 0 replies

zahlman 7 hours ago

You're not wrong; I gave more detail in a direct reply https://news.ycombinator.com/item?id=44997146 .

Reply View 0 replies