csande17 14 hours ago

Yeah, I feel like the only really defensible choices you can make for string representation in a low-level wire protocol in 2025 are:

- "Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type"

- "Potentially ill-formed UTF-16", aka "WTF-8", aka "the JavaScript string type"

- "Potentially ill-formed UTF-8", aka "an array of bytes", aka "the Go string type"

- Any of the above, plus "no U+0000", if you have to interface with a language/library that was designed before people knew what buffer overflow exploits were

mort96 12 hours ago

> - "Potentially ill-formed UTF-16", aka "WTF-8", aka "the JavaScript string type"

I thought WTF-8 was just "UTF-8, but without the restriction against encoding unpaired surrogates"? Windows and Java and JavaScript all use "possibly ill-formed UTF-16" as their string type, not WTF-8.

  • layer8 11 hours ago

    Also known as UCS-2: https://www.unicode.org/faq/utf_bom.html#utf16-11

    Surrogate pairs were only added with Unicode 2.0 in 1996, at which point Windows NT and Java already existed. The fact that those continue to allow unpaired surrogate characters is in part due to backwards compatibility.

    • da_chicken 5 hours ago

      Yeah, people forget that Windows and Java appear to be less compliant, but the reality is that they moved on i18n before anybody else did, so their standard is older.

      Linux got to adopt UTF-8 because they just stuck their heads in the sand and stayed on ASCII well past the time they needed to change. Even now, a lot of programs only support ASCII character streams.

  • mananaysiempre 11 hours ago

    WTF-8 is more or less the obvious thing to use when NT/Java/JavaScript-style WTF-16 needs to fit into a UTF-8-shaped hole. And yes, it’s UTF-8 except that you can encode surrogates, as long as those surrogates don’t form a valid pair (if they do form a pair, you use the normal UTF-8 encoding of the code point designated by that pair instead).

    (Some people instead encode each WTF-16 surrogate independently regardless of whether it participates in a valid pair or not, yielding a UTF-8-like but UTF-8-incompatible-beyond-U+FFFF thing usually called CESU-8. We don’t talk about those people.)
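
    As a rough Python illustration of the difference (the 'surrogatepass' error handler exposes the raw generalized-UTF-8 byte sequences; note it encodes every surrogate as three bytes, so it matches WTF-8 only in the unpaired case and produces the CESU-8-style output for the paired one):

      >>> '\ud800'.encode('utf-8', 'surrogatepass')        # lone surrogate: its WTF-8 encoding
      b'\xed\xa0\x80'
      >>> '\ud83d\ude00'.encode('utf-8', 'surrogatepass')  # valid pair: CESU-8-style, six bytes
      b'\xed\xa0\xbd\xed\xb8\x80'
      >>> '\U0001F600'.encode('utf-8')                     # what WTF-8 requires for that pair
      b'\xf0\x9f\x98\x80'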

    • layer8 11 hours ago

      The parent’s point was that “potentially ill-formed UTF-16” and “WTF-8” are inherently different encodings (16-bit word sequence vs. byte sequence), and thus not “aka”.

      • csande17 11 hours ago

        Although they're different encodings, the thing that they are encoding is exactly the same. I kinda wish I could edit "string representation" to "modeling valid strings" or something in my original comment for clarity...

  • zahlman 11 hours ago

    I've always taken "WTF-8" to mean that someone had mistakenly interpreted UTF-8 data as being in Latin-1 (or some other code page) and UTF-8 encoded it again.

    • deathanatos 11 hours ago

      No, WTF-8[1] is a precisely defined format (that isn't that).

      If you imagine a format that can encode JavaScript strings containing unpaired surrogates, that's WTF-8. (Well-formed WTF-8 is the same type as a JS string, though with a different representation.)

      (Though that would have been a cute name for the UTF-8/latin1/UTF-8 fail.)

      [1]: https://simonsapin.github.io/wtf-8/

    • chrismorgan 10 hours ago

      That thing was occasionally called WTF-8, but not often—it was normally called “double UTF-8” (if given a name at all).

      In the last few years, the name has become firmly attached to Simon Sapin’s definition.
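
      For anyone who hasn’t run into it, the “double UTF-8” mistake looks roughly like this (sketched in a Python REPL):

        >>> 'é'.encode('utf-8').decode('latin-1')                  # UTF-8 bytes misread as Latin-1
        'Ã©'
        >>> 'é'.encode('utf-8').decode('latin-1').encode('utf-8')  # ...and then UTF-8-encoded again
        b'\xc3\x83\xc2\xa9'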

alright2565 12 hours ago

> "Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type"

Can you elaborate more on this? I understood the Python string to be UTF-32, with optimizations where possible to reduce memory use.

  • csande17 11 hours ago

    I could be mistaken, but I think Python cares about making sure strings don't include any surrogate code points, which can't be represented in well-formed UTF-16 -- even if you're encoding/decoding the string using some other encoding. (Possibly it still lets you construct such a string in memory, though? So there might be a philosophical dispute there.)

    Like, the basic code points -> bytes in memory logic that underlies UTF-32, or UTF-8 for that matter, is perfectly capable of representing [U+D83D U+DE00] as a sequence distinct from [U+1F600]. But UTF-16 can't because the first sequence is a surrogate pair. So if your language applies the restriction that strings can't contain surrogate code points, it's basically emulating the UTF-16 worldview on top of whatever encoding it uses internally. The set of strings it supports is the same as the set of strings a language that does use well-formed UTF-16 supports, for the purposes of deciding what's allowed to be represented in a wire protocol.
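
    (For what it's worth, CPython will hold the two-surrogate sequence in memory as a value distinct from the single emoji code point; it just refuses to encode it, as another comment below shows in detail:)

      >>> [hex(ord(c)) for c in '\ud83d\ude00']
      ['0xd83d', '0xde00']
      >>> [hex(ord(c)) for c in '\U0001F600']
      ['0x1f600']
      >>> '\ud83d\ude00' == '\U0001F600'
      False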

    • MyOutfitIsVague 9 hours ago

      You're somewhat mistaken about the claim that "UTF-32, or UTF-8 for that matter, is perfectly capable of representing [U+D83D U+DE00] as a sequence distinct from [U+1F600]": the encoding on a raw level is technically capable of this, but it is actually forbidden in Unicode. Those are invalid code points.

      Using those code points makes for invalid Unicode, not just invalid UTF-16. Rust, which uses UTF-8 for its String type, also forbids unpaired surrogates. `let illegal: char = 0xDEADu32.try_into().unwrap();` panics.

      It's not that these languages emulate the UTF-16 worldview, it's that UTF-16 has infected and shaped all of Unicode. No code points are allowed that can't be unambiguously represented in UTF-16.

      edit: This cousin comment has some really good detail on Python in particular: https://news.ycombinator.com/item?id=44997146

      • csande17 8 hours ago

        The Unicode Consortium has indeed published documents recommending that people adopt the UTF-16 worldview when working with strings, but it is not always a good idea to follow their recommendations.

dcrazy 12 hours ago

Why didn’t you include “Unicode Scalars”, aka “well-formed UTF-8”, aka “the Swift string type?”

Either way, I think the bitter lesson is that a parser really can’t rely on the well-formedness of a Unicode string over the wire. Practically speaking, all wire formats are potentially ill-formed until parsed into a non-wire format (or rejected by that same parser).

  • csande17 12 hours ago

    IMO if you care about surrogate code points being invalid, you're in "designing the system around UTF-16" territory conceptually -- even if you then send the bytes over the wire as UTF-8, or some more exotic/compressed format. Same as how "potentially ill-formed UTF-16" and WTF-8 have the same underlying model for what a string is.

    • dcrazy 11 hours ago

      The Unicode spec itself is designed around UTF-16: the block of code points that surrogate pairs would map to is reserved for that purpose and explicitly given “no interpretation” by the spec. [1] An implementation has to choose how to behave if it encounters one of these reserved code points in e.g. a UTF-8 string: Throw an encoding error? Silently drop the character? Convert it to a replacement character?

      [1] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

      • duckerude 8 hours ago

        RFC 3629 says surrogate codepoints are not valid in UTF-8. So if you're decoding/validating UTF-8 it's just another kind of invalid byte sequence like a 0xFF byte or an overlong encoding. AFAIK implementations tend to follow this. (You have to make a choice but you'd have to make that choice regardless for the other kinds of error.)

        If you run into this when encoding to UTF-8, then your source data isn't valid Unicode, and how to handle it depends on what that data really is if not proper Unicode. If you can validate at other boundaries then you won't have to deal with it there.
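
        For example, CPython's decoder rejects the three-byte encoding of a lone surrogate the same way it rejects other malformed sequences:

          >>> b'\xed\xa0\x80'.decode('utf-8')
          Traceback (most recent call last):
            File "<stdin>", line 1, in <module>
          UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte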

  • layer8 11 hours ago

    There is no disagreement that what you can receive over the wire can be ill-formed. There is disagreement about what to reject when it is first parsed at a point where it is known that it should be representing a Unicode string.

zahlman 11 hours ago

>"Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type"

"the Python string type" is neither "UTF-16" nor "well-formed", and there are very deliberate design decisions behind this.

Since Python 3.3 with the introduction of https://peps.python.org/pep-0393/ , Python does not use anything that can be called "UTF-16" regardless of compilation options. (Before that, in Python 2.2 and up the behaviour was as in https://peps.python.org/pep-0261/ ; you could compile either a "narrow" version using proper UTF-16 with surrogate pairs, or a "wide" version using UTF-32.)

Instead, now every code point is represented as a separate storage element (as they would be in UTF-32) except that the allocated memory is dynamically chosen from 1/2/4 bytes per element as needed. (It furthermore sets a flag for 1-byte-per-element strings according to whether they are pure ASCII or if they have code points in the 128..255 range.)
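
You can see the flexible storage indirectly with sys.getsizeof: the fixed per-object overhead differs between layouts and Python versions, but the growth per extra code point tracks the 1/2/4-byte choice (a rough sketch):

  >>> import sys
  >>> [sys.getsizeof(c * 1001) - sys.getsizeof(c) for c in 'aé€😀']
  [1000, 1000, 2000, 4000]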

Meanwhile, `str` can store surrogates even though Python doesn't use them normally; errors will occur at encoding time:

  >>> x = '\ud800\udc00'
  >>> x
  '\ud800\udc00'
  >>> print(x)
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
They're even disallowed for an explicit encode to utf-16:

  >>> x.encode('utf-16')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-16' codec can't encode character '\ud800' in position 0: surrogates not allowed
But this can be overridden:

  >>> x.encode('utf-16-le', 'surrogatepass')
  b'\x00\xd8\x00\xdc'
Which subsequently allows for decoding that automatically interprets surrogate pairs:

  >>> y = x.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
  >>> y
  '𐀀'
  >>> len(y)
  1
  >>> ord(y)
  65536
Storing surrogates in `str` is used for smuggling in binary data. For example, the runtime does it so that it can try to interpret command line arguments as UTF-8 by default, but still allow arbitrary (non-null) bytes to be passed (since that's a thing on Linux):

  $ cat cmdline.py 
  #!/usr/bin/python
  
  import binascii, sys
  for arg in sys.argv[1:]:
      abytes = arg.encode(sys.stdin.encoding, 'surrogateescape')
      ahex = binascii.hexlify(abytes)
      print(ahex.decode('ascii'))
  $ ./cmdline.py foo
  666f6f
  $ ./cmdline.py 日本語
  e697a5e69cace8aa9e
  $ ./cmdline.py $'\x01\x00\x02'
  01
  $ ./cmdline.py $'\xff'
  ff
  $ ./cmdline.py ÿ
  c3bf
It does this by decoding with the same 'surrogateescape' error handler that the above diagnostic needs when re-encoding:

  >>> b'\xff'.decode('utf-8')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
  >>> b'\xff'.decode('utf-8', 'surrogateescape')
  '\udcff'