Comment by mananaysiempre

Comment by mananaysiempre 8 hours ago

4 replies

WTF-8 is more or less the obvious thing to use when NT/Java/JavaScript-style WTF-16 needs to fit into a UTF-8-shaped hole. And yes, it’s UTF-8 except you can encode surrogates except those surrogates can’t form a valid pair (use the normal UTF-8 encoding of the codepoint designated by that pair in that case).

(Some people instead encode each WTF-16 surrogate independently regardless of whether it participates in a valid pair or not, yielding an UTF-8-like but UTF-8-incompatible-beyond-U+FFFF thing usually called CESU-8. We don’t talk about those people.)

layer8 7 hours ago

The parent’s point was that “potentially ill-formed UTF-16" and "WTF-8" are inherently different encodings (16-bit word sequence vs. byte sequence), and thus not “aka”.

  • csande17 7 hours ago

    Although they're different encodings, the thing that they are encoding is exactly the same. I kinda wish I could edit "string representation" to "modeling valid strings" or something in my original comment for clarity...

    • layer8 5 hours ago

      By that logic, you could say ‘“UTF-8” aka “UTF-32”’, since they are encoding the same value space. But that’s just wrong.

      • deathanatos an hour ago

        The type is the same, i.e., if you look at a type as an infinite set of values, they are the same infinite set. Yes, their in-memory representations might differ, but it means all values in one exist in the other, and only those, so conversion between them are infallible.

        So in your last example, UTF-8 & UTF-32 are the same type, containing the same infinite set of values, and ­— of course — one can convert between them infallibly.

        But you can't encode arbitrary Go strings in WTF-8 (some are not representable), you can't encode arbitrary Python strings in UTF-8 or WTF-8 (n.b. that upthread is wrong about Python being equivalent to Unicode scalars/well-formed UTF-*.) and attempts to do so might error. (E.g., `.encode('utf-8')` in Python on a `str` can raise.)