Comment by 0x000xca0xfe
Comment by 0x000xca0xfe a day ago
I'm not advocating for ASCII-everywhere, I'm for bytes-everywhere.
And bytes can conveniently fit both ASCII and UTF-8.
If you want to restrict your programming language to ASCII for whatever reason, fine by me. I don't need "let wohnt_bei_Böckler_STRAẞE = ..." that much.
But if you allow full 8-bit bytes, please don't restrict them to UTF-8. If you need to gracefully handle non-UTF-8 sequences graphically show the appropriate character "�", otherwise let it pass through unmodified. Just don't crash, show useless error messages or in the worst case try to "fix" it by mangling the data even more.
> "let wohnt_bei_Böckler_STRAẞE"
This string cannot be encoded as ASCII in the first place.
> But if you allow full 8-bit bytes, please don't restrict them to UTF-8
UTF-8 has no 8-bit restrictions... You can encode any 21-bit UNICODE codepoint with UTF-8.
It sound's like you're confusing ASCII, Extended ASCII and UTF-8:
- ASCII: 7-bits per "character" (e.g. not able to encode international characters like äöü) but maps to the lower 7-bits of the 21-bits of UNICODE codepoints (e.g. all ASCII character codes are also valid UNICODE code points)
- Extended ASCII: 8-bits per "character" but the interpretation of the upper 128 values depends on a country-specific codepage (e.g. the intepretation of a byte value in the range between 128 and 255 is different between countries and this is what causes all the mess that's usually associated with "ASCII". But ASCII did nothing wrong - the problem is Extended ASCII - this allows to 'encode' äöü with the German codepage but then shows different characters when displayed with a non-German codepage)
- UTF-8: a variable-width encoding for the full range of UNICODE codepoints, uses 1..4 bytes to encode one 21-bit UNICODE codepoint, and the 1-byte encodings are identical with 7-bit ASCII (e.g. when the MSB of a byte in an UTF-8 string is not set, you can be sure that it is a character/codepoint in the ASCII range).
Out of those three, only Extended ASCII with codepages are 'deprecated' and should no longer be used, while ASCII and UTF-8 are both fine since any valid ASCII encoded string is indistinguishable from that same string encoded as UTF-8, e.g. ASCII has been 'retconned' into UTF-8.