Comment by integralid 10 hours ago
I'm not certain... On one hand, I agree that some characters are problematic (or outright invalid) - like unpaired surrogates. But the worst-case scenario, imo, is when people designing data structures and protocols start to feel the need to disallow arbitrary classes of characters, even properly escaped ones.
In the example, username validation is the job of another layer. For example, I want to make sure a username is shorter than 60 characters, has no emojis or zalgo text, and yes, no null bytes, and return a proper error from the API. I don't want my JSON parsing to fail at a completely different layer, before that validation even runs.
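To make the layering concrete, here is a minimal sketch (hypothetical Python; the rules and the `validate_username` helper are made up for illustration) of JSON parsing succeeding while the API layer does the rejecting:

```python
import json
import unicodedata

def validate_username(name: str) -> str | None:
    """Application-layer checks; returns an error message or None.
    The rules are illustrative: a length limit, no control characters
    (including NUL), no emoji-like symbols or combining marks ("zalgo")."""
    if len(name) >= 60:
        return "username must be shorter than 60 characters"
    for ch in name:
        cat = unicodedata.category(ch)
        if cat == "Cc":                  # control characters, incl. U+0000
            return "username contains control characters"
        if cat in ("So", "Mn", "Me"):    # crude emoji/zalgo filter
            return "username contains disallowed characters"
    return None

# The JSON layer parses the escaped NUL without complaint;
# the API layer is where the proper error comes from.
payload = json.loads('{"username": "bob\\u0000"}')
print(validate_username(payload["username"]))  # -> username contains control characters
```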
And for a username, some classes are obviously bad, as explained. But what if I'm sending text files that actually use those weird tabs? I expect anything that works in my language's UTF-8 "string" type to be encodable. Even more importantly, I see plenty of use cases for the null byte, and it is in fact often seen in JSON in the wild.
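As a quick data point on the null byte (Python's standard json module here, but any standards-compliant parser behaves the same way): a properly escaped U+0000 round-trips through JSON with no special handling.

```python
import json

# U+0000, properly escaped, is legal JSON and survives a round trip.
doc = json.dumps({"blob": "header\u0000payload"})
print(doc)  # {"blob": "header\u0000payload"}
assert json.loads(doc)["blob"] == "header\u0000payload"
```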
On the other hand, if we do have to use a restricted set of "normal" Unicode characters, having a standard feels useful - better than everyone inventing their own mini-standard. So I think I like the idea; I just don't buy the argumentation or examples in the blog post.
Yeah, I feel like the only really defensible choices you can make for string representation in a low-level wire protocol in 2025 are the following (see the sketch after the list for what separates them):
- "Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type"
- "Potentially ill-formed UTF-16", aka "WTF-8", aka "the JavaScript string type"
- "Potentially ill-formed UTF-8", aka "an array of bytes", aka "the Go string type"
- Any of the above, plus "no U+0000", if you have to interface with a language/library that was designed before people knew what buffer overflow exploits were
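A rough sketch of what distinguishes the three buckets (Python is used here purely as a demonstration vehicle; the surrogatepass trick and the byte literal are just one way to produce each shape of data):

```python
# 1. "Unicode scalar values only": a lone surrogate is rejected by a
#    strict UTF-8 encoder, because it is not a scalar value.
lone = "\ud800"                      # unpaired high surrogate
try:
    lone.encode("utf-8")
except UnicodeEncodeError as err:
    print("rejected by the scalar-values rule:", err)

# 2. "Potentially ill-formed UTF-16": the same lone surrogate is
#    representable; WTF-8 gives it a byte encoding (surrogatepass
#    produces those bytes here).
wtf8ish = lone.encode("utf-8", "surrogatepass")
print(wtf8ish)                       # b'\xed\xa0\x80'

# 3. "Potentially ill-formed UTF-8": any byte sequence at all counts as
#    a string, whether or not it decodes cleanly.
raw = b"\xff\xfe not valid UTF-8"
print(raw.decode("utf-8", "replace"))  # U+FFFD replacement chars appear
```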