Comment by hyperman1 9 hours ago
One thing I always wonder: it is possible to encode a Unicode codepoint with too many bytes. UTF-8 forbids these; only the shortest encoding is valid. E.g. 00000001 is the same as 11000000 10000001.

So why not make the alternatives impossible by adding the first codepoint not covered by the shorter sequences as an offset? Then 11000000 10000001 would give codepoint 128 + 1 = 129, as values 0 to 127 are already covered by a 1-byte sequence (sketched in the code below).

The advantages seem clear: no illegal codes, and a slightly shorter string in some edge cases. I presume the designers thought about this, so what were the disadvantages? Was the required addition an unacceptable hardware cost at the time?
UPDATE: The last bit sequence should of course be 10000001 and not 00000001. Sorry for that. Fixed it.
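A minimal sketch of the idea in C (the function names are mine, not from the thread, and only the 2-byte range is shown): standard UTF-8 puts the codepoint bits straight into the payload, while the proposed variant stores codepoint minus 0x80, so c0 80 becomes the lowest 2-byte sequence and means U+0080 instead of being an overlong U+0000.

```c
#include <stdio.h>
#include <stdint.h>

/* Standard UTF-8, 2-byte range (U+0080..U+07FF): the payload bits are the
 * codepoint itself, so lead bytes C0/C1 are only reachable as (forbidden)
 * overlong encodings of U+0000..U+007F. */
static void encode2_utf8(uint32_t cp, uint8_t out[2]) {
    out[0] = 0xC0 | (cp >> 6);        /* 110xxxxx */
    out[1] = 0x80 | (cp & 0x3F);      /* 10xxxxxx */
}

/* Hypothetical offset encoding from the comment above: the payload carries
 * (codepoint - 0x80), so no 2-byte sequence can duplicate a 1-byte one. */
static void encode2_offset(uint32_t cp, uint8_t out[2]) {
    uint32_t v = cp - 0x80;           /* skip the values 1-byte sequences cover */
    out[0] = 0xC0 | (v >> 6);
    out[1] = 0x80 | (v & 0x3F);
}

int main(void) {
    uint8_t a[2], b[2];
    encode2_utf8(0x80, a);            /* gives C2 80 */
    encode2_offset(0x80, b);          /* gives C0 80 */
    printf("U+0080  standard: %02X %02X   offset: %02X %02X\n",
           a[0], a[1], b[0], b[1]);
    return 0;
}
```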
The sibling comments so far talk about the self-synchronizing nature of the lead and continuation bytes, but that's not relevant to your question. Your question is more of:

> Why is U+0080 encoded as c2 80, instead of c0 80, which is the lowest sequence after 7f?
I suspect the answer is:

a) the security impact of overlong encodings was not contemplated; there is lots of fun to be had if one component accepts overlong encodings while another scans for dangerous patterns using only the shortest encodings

b) UTF-8 as standardized allows encoding and decoding with bitmask and bitshift only. Your proposed encoding requires addition and subtraction on top of the bitmask and bitshift (see the sketch below)
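To make b) concrete, here is a rough comparison of decoding a 2-byte sequence both ways (my own code, not from the standard or the thread); the offset variant needs one extra addition here and a matching subtraction when encoding:

```c
#include <stdint.h>

/* Standard UTF-8: a 2-byte sequence decodes with masks, a shift and an OR. */
static uint32_t decode2_utf8(uint8_t b0, uint8_t b1) {
    return ((uint32_t)(b0 & 0x1F) << 6) | (b1 & 0x3F);
}

/* The offset variant: identical bit work, plus one extra addition here
 * (and a matching subtraction in the encoder). */
static uint32_t decode2_offset(uint8_t b0, uint8_t b1) {
    return (((uint32_t)(b0 & 0x1F) << 6) | (b1 & 0x3F)) + 0x80;
}
```

And the offset is different for every length: 3-byte sequences would need 0x80 + 0x800 = 0x880, and so on, which is the "pile of magic additive constants" the note below objects to.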
You can find a bit of email discussion from 1992 here [1]; at the very bottom there are some notes about what became UTF-8:
> 1. The 2 byte sequence has 2^11 codes, yet only 2^11-2^7 are allowed. The codes in the range 0-7f are illegal. I think this is preferable to a pile of magic additive constants for no real benefit. Similar comment applies to all of the longer sequences.
The FSS-UTF proposal included right before that note does use additive constants.
[1] https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt