layer8 13 hours ago

No. As the RFC notes: “Silently deleting an ill-formed part of a string is a known security risk. Responding to that risk, Section 3.2 of [UNICODE] recommends dealing with ill-formed byte sequences by signaling an error or replacing problematic code points, ideally with "�" (U+FFFD, REPLACEMENT CHARACTER).”

I would almost always go for “signaling an error”.
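
For illustration, here's a minimal sketch of both options using only the Go standard library (nothing from Tim's reference library; the sanitize helper and its strict flag are just names I made up):

    package main

    import (
        "errors"
        "fmt"
        "strings"
        "unicode/utf8"
    )

    // sanitize shows the two strategies the RFC describes for input that
    // may contain ill-formed UTF-8: signal an error, or substitute U+FFFD
    // for each ill-formed sequence.
    func sanitize(s string, strict bool) (string, error) {
        if utf8.ValidString(s) {
            return s, nil
        }
        if strict {
            return "", errors.New("input contains ill-formed UTF-8")
        }
        return strings.ToValidUTF8(s, "\uFFFD"), nil
    }

    func main() {
        bad := "ok\xffbad"                // 0xFF can never appear in UTF-8
        fmt.Println(sanitize(bad, true))  // "", error
        fmt.Println(sanitize(bad, false)) // "ok�bad", <nil>
    }

Erroring keeps bad data out entirely; replacement at least keeps the damage visible instead of silently dropping bytes.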

Manfred 14 hours ago

My experience writing Unicode-related libraries is that people don't use features when you have to explain why and when to use them. I assume that's why Tim puts the emphasis on "working on something new".

CharlesW 13 hours ago

This RFC and its Go-language reference library are designed to be used by existing libraries that do serialization/sanitization/validation. This is hot off the press, so I'm sure Tim would appreciate it if you'd let your favorite library know it exists.

xdennis 13 hours ago

How is Unicode in any way related to JSON? JSON should just encode whatever dumb data someone wants to transport.

Unicode validation/cleanup should be done separately because it's needed in multiple places, not just JSON.

  • layer8 13 hours ago

    The contents of JSON strings don't admit arbitrary binary data. You need to use an encoding like Base64 for that purpose.
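
    As a quick sketch (standard library only, nothing specific to this RFC): Go's encoding/json already base64-encodes []byte values for exactly this reason.

        package main

        import (
            "encoding/json"
            "fmt"
        )

        func main() {
            // Arbitrary bytes that are not valid UTF-8.
            blob := []byte{0xFF, 0xFE, 0x00, 0x89}
            out, _ := json.Marshal(map[string][]byte{"data": blob})
            fmt.Println(string(out)) // {"data":"//4AiQ=="}
        }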

  • zzo38computer 5 hours ago

    JSON (unfortunately) requires strings to be Unicode. (JSON has other problems too, but Unicode is one of them.)

  • recursive 13 hours ago

    JSON is text. If you're not going to use Unicode to represent your text, you'll need some other way.

    • dcrazy 12 hours ago

      The current JSON spec mandates UTF-8, but practically speaking, encoding is a higher-level concern. I suspect many server implementations will respect the charset parameter of the Content-Type header on a POST request containing JSON.

    • ninkendo 11 hours ago

      So?

      All the letters in this string are “just text”:

          "\u0000\u0089\uDEAD\uD9BF\uDFFF"
      
      JSON itself allows escape sequences in a string that don't unescape to valid Unicode. That's fine, because the strings aren't required to represent any particular encoding: it's up to a layer higher than JSON to be opinionated about that.
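
      Parsers differ on what they do with that, for what it's worth. As one data point (a quick sketch, not a claim about every decoder), Go's encoding/json accepts the literal above and quietly substitutes U+FFFD for the escapes that don't decode to valid Unicode, at least in the releases I've checked:

          package main

          import (
              "encoding/json"
              "fmt"
          )

          func main() {
              var s string
              err := json.Unmarshal([]byte(`"\u0000\u0089\uDEAD\uD9BF\uDFFF"`), &s)
              // The lone surrogate \uDEAD comes back as U+FFFD, while
              // \uD9BF\uDFFF decodes as the pair U+7FFFF; no error is reported.
              fmt.Printf("%v %+q\n", err, s)
          }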

      I wouldn’t want my shell’s pipeline buffers to reject data they don’t like, so why should a JSON serializer?

      • recursive 11 hours ago

        I actually agree, now that I understand what you're talking about.