Comment by const_cast
Comment by const_cast a day ago
IMO utf8 isn't a highly specific format, it's universal for text. Every ascii string you'd write in C or C++ or whatever is already utf8.
So that means that for 99% of scenarios, the difference between char[] and a proper utf8 string is none. They have the same data representation and memory layout.
The problem comes in when people start using string like they use string in PHP. They just use it to store random bytes or other binary data.
This makes no sense with the string type. String is text, but now we don't have text. That's a problem.
We should use byte[] or something for this instead of string. That's an abuse of string. I don't think allowing strings to not be text is too constraining - that's what a string is!
The approach you are advocating is the approach that was abandoned, for good reasons, in the Unix filesystem in the 70s and in Perl in the 80s.
One of the great advances of Unix was that you don't need separate handling for binary data and text data; they are stored in the same kind of file and can be contained in the same kinds of strings (except, sadly, in C). Occasionally you need to do some kind of text-specific processing where you care, but the rest of the time you can keep all your code 8-bit clean so that it can handle any data safely.
Languages that have adopted the approach you advocate, such as Python, frequently have bugs like exception tracebacks they can't print (because stdout is set to ASCII) or filenames they can't open when they're passed in on the command line (because they aren't valid UTF-8).