Comment by torstenvl 7 hours ago

Ehhh I view things slightly differently. Overlong encodings are per se illegal, so they cannot encode code points, even if a naive algorithm would consistently interpret them as such.

I get what you mean in terms of Postel's Law: software that is liberal in what it accepts should view 01001000 01100101 01101010 01101010 01101111 as equivalent to 11000001 10001000 11000001 10100101 11000001 10101010 11000001 10101010 11000001 10101111, despite the two sequences not being byte-for-byte identical. I'm just not convinced Postel's Law should be applied to UTF-8 code units.
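
A minimal Python sketch of the two byte sequences above. `naive_decode` and `strict_accepts` are hypothetical helpers written for illustration: the first handles only 1- and 2-byte sequences and deliberately skips the overlong check, the second just asks Python's built-in UTF-8 codec, which rejects overlong forms.

```python
def naive_decode(data: bytes) -> str:
    """Hypothetical lenient decoder: 1- and 2-byte sequences only,
    with no check that the shortest possible form was used."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Single-byte (ASCII) form.
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data) and (data[i + 1] & 0xC0) == 0x80:
            # Two-byte form 110xxxxx 10yyyyyy, combined without a minimum-value check.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError(f"unsupported byte at offset {i}")
    return "".join(out)


def strict_accepts(data: bytes) -> bool:
    """True if Python's built-in UTF-8 codec accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The shortest-form sequence from the comment.
plain = bytes([0b01001000, 0b01100101, 0b01101010, 0b01101010, 0b01101111])

# The overlong two-byte sequence from the comment.
overlong = bytes([
    0b11000001, 0b10001000,
    0b11000001, 0b10100101,
    0b11000001, 0b10101010,
    0b11000001, 0b10101010,
    0b11000001, 0b10101111,
])

print(naive_decode(plain) == naive_decode(overlong))
print(strict_accepts(plain), strict_accepts(overlong))
```

The lenient decoder maps both inputs to the same string, while the built-in codec accepts only the first one.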

layer8 6 hours ago

The context of my comment was (emphasis mine): “lots of fun to be had there if something accepts overlong encodings but is scanning for things with only shortest encodings”.

Yes, software shouldn’t accept overlong encodings, and I was pointing out another bad thing that can happen with software that does accept overlong encodings, thereby reinforcing the advice to not accept them.
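
A minimal sketch of that failure mode, under stated assumptions: `looks_safe` is a hypothetical filter that scans raw bytes for the shortest-form `../`, and `lenient_decode` is a hypothetical decoder that accepts overlong two-byte sequences. Neither is taken from real software; the path is just an illustrative payload.

```python
def lenient_decode(data: bytes) -> str:
    """Hypothetical decoder that accepts overlong 2-byte sequences."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Single-byte (ASCII) form.
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data) and (data[i + 1] & 0xC0) == 0x80:
            # Two-byte form, combined without rejecting values below U+0080.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError(f"unsupported byte at offset {i}")
    return "".join(out)


def looks_safe(raw: bytes) -> bool:
    """Hypothetical filter: only looks for the shortest encoding of '../'."""
    return b"../" not in raw


# 0xC0 0xAF is an overlong encoding of '/' (U+002F).
payload = b"..\xc0\xafetc\xc0\xafpasswd"

print(looks_safe(payload))       # the scanner sees no '../' in the raw bytes
print(lenient_decode(payload))   # the lenient decoder still produces slashes
```

The scanner and the decoder disagree about what the bytes mean, which is the gap an attacker can use. A decoder that rejects overlong forms, such as Python's `bytes.decode("utf-8")`, raises an error on this payload instead.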