Comment by csande17
IMO if you care about surrogate code points being invalid, you're conceptually in "designing the system around UTF-16" territory -- even if you then send the bytes over the wire as UTF-8, or some more exotic/compressed format. Same as how "potentially ill-formed UTF-16" and WTF-8 share the same underlying model of what a string is.
The Unicode spec itself is designed around UTF-16: the block of code points reserved for surrogates exists only to support UTF-16's surrogate-pair mechanism, and the spec explicitly gives those code points “no interpretation”. [1] An implementation has to choose how to behave if it encounters one of these reserved code points in e.g. a UTF-8 string: Throw an encoding error? Silently drop the character? Substitute the replacement character (U+FFFD)?
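For a concrete illustration, Python's own UTF-8 codec exposes all of those choices via its error handlers (just a quick sketch; ED A0 80 is what U+D800, a lone surrogate, looks like if you naively encode it as UTF-8):

    lone_surrogate = b"\xed\xa0\x80"  # U+D800 naively encoded as UTF-8

    try:
        lone_surrogate.decode("utf-8")  # throw an encoding error (the strict default)
    except UnicodeDecodeError as e:
        print("strict decode failed:", e)

    print(lone_surrogate.decode("utf-8", errors="ignore"))   # silently drop the offending bytes
    print(lone_surrogate.decode("utf-8", errors="replace"))  # substitute U+FFFD replacement characters

    # Or preserve the surrogate code point, which is essentially the
    # WTF-8 / "potentially ill-formed" model:
    print(hex(ord(lone_surrogate.decode("utf-8", errors="surrogatepass"))))  # 0xd800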
[1] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...