
Comment by modeless 5 hours ago

UTF-8 is great and I wish everything used it (looking at you, JavaScript). But it does have a wart: some byte sequences are invalid UTF-8, and how to interpret them is undefined. I think a perfect design would define exactly how to interpret every possible byte sequence, even if nominally "invalid". This is how the HTML5 spec works, and it's been phenomenally successful.

ekidd 5 hours ago

For security reasons, the correct answer on how to process invalid UTF-8 is (and needs to be) "throw away the data like it's radioactive, and return an error." Otherwise you leave yourself wide open to validation bypass attacks at many layers of your stack.
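
A minimal sketch of the "reject it and return an error" approach, using Python's built-in strict UTF-8 decoder. The helper name `decode_or_reject` and the sample bytes are hypothetical, chosen only for illustration. The sample uses `b"\xc0\xaf"`, an overlong (and therefore invalid) encoding of `/`, which is the kind of input a byte-level check for `/` could miss if a later layer decoded it leniently.

```python
from typing import Optional


def decode_or_reject(data: bytes) -> Optional[str]:
    """Return the decoded text, or None if data is not valid UTF-8.

    Hypothetical helper: nothing is salvaged from invalid input.
    """
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


# b"\xc0\xaf" is an overlong encoding of "/", which strict decoding refuses.
suspicious = b"..\xc0\xafetc"
valid = "caf\u00e9".encode("utf-8")

print(decode_or_reject(suspicious))
print(decode_or_reject(valid))
```

Returning `None` here stands in for whatever error path the caller uses; the point is only that no partially decoded text escapes.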


modeless 5 hours ago

This is only true because the interpretation is not defined, so different implementations do different things.
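
As a rough illustration of what a defined interpretation can look like, here is a sketch using Python's `errors="replace"` handler, which substitutes U+FFFD for bytes it cannot decode. This shows one deterministic policy within a single implementation; it is an assumption for illustration, not a claim that every decoder or specification makes the same substitutions. The helper name `decode_lossy` is hypothetical.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_lossy(data: bytes) -> str:
    """Decode UTF-8, replacing undecodable bytes with U+FFFD.

    Hypothetical helper: never raises on malformed input.
    """
    return data.decode("utf-8", errors="replace")


# The invalid byte 0xFF becomes U+FFFD; the surrounding text is kept.
print(repr(decode_lossy(b"a\xffb")))

# The overlong "/" (b"\xc0\xaf") is not turned into a real slash.
print("/" in decode_lossy(b"..\xc0\xafetc"))
```

If every layer of a stack applied the same rule, a validator and the code behind it would see the same string, which is the property the comment above is arguing for.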
