Comment by eru a day ago
Python 3 deals with this reasonably sensibly, too, I think. They use UTF-8 by default, but allow you to specify other encodings.
Apparently Python uses a variety of internal representations depending on the string itself. I looked it up because I saw UTF-32 and thought there's no way that's what they do -- it's pretty much always the wrong answer.
It uses Latin-1 for strings whose code points all fit in one byte (so not just ASCII), UCS-2 for strings that stay within the BMP, and UCS-4 only for strings that contain code points outside the BMP.
It would be pretty silly for them to explode all strings to 4-byte characters.
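You can see the flexible representation (PEP 393) directly with sys.getsizeof; the exact totals vary by Python build, but the per-character width is visible:

    import sys

    # Storage grows with the widest code point in the string.
    one_byte  = "a" * 100           # ASCII: 1 byte/char
    latin1    = "\u00e9" * 100      # é < U+0100: still 1 byte/char
    two_byte  = "\u20ac" * 100      # € is in the BMP: 2 bytes/char
    four_byte = "\U0001F600" * 100  # emoji outside the BMP: 4 bytes/char

    for s in (one_byte, latin1, two_byte, four_byte):
        print(sys.getsizeof(s))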
You are correct. Discussions of this topic tend to be full of unvalidated but confidently stated assertions, like "Python 3 internally uses UTF-32." Also unjustified assertions, like the OP's claim that len("🤦🏼‍♂️") == 5 is "rather useless" and that "Python 3’s approach is unambiguously the worst one". Unlike in many other languages, the code points in Python's strings are always directly O(1) indexable--which can be useful--and the subject string has 5 indexable code points. That may not be the semantics that someone is looking for in a particular application, but it certainly isn't useless. And given the Python implementation of strings, the only other number that would be useful would be the number of grapheme clusters, which in this case is 1, and that count can be obtained via the grapheme or regex modules.
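Both counts are easy to demonstrate; here is a sketch using the third-party regex module's \X (grapheme cluster) pattern:

    import regex  # third-party: pip install regex

    s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"  # the facepalm emoji, spelled out
    print(len(s))                        # 5 -- code points, each O(1) indexable
    print(len(regex.findall(r"\X", s)))  # 1 -- grapheme clusters (with a
                                         # reasonably recent regex/Unicode)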
I prefer languages where strings are simply sequences of bytes and you get to decide how to interpret them.
Such languages do not have strings. Definitionally a string is a sequence of characters, and more than 256 characters exist. A byte sequence is just an encoding; if you are working with that encoding directly and have to do the interpretation yourself, you are not using a string.
But if you do want a sequence of bytes for whatever reason, you can trivially obtain that in any version of Python.
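For example, it's a single method call each way:

    s = "naïve"
    b = s.encode("utf-8")      # bytes: b'na\xc3\xafve'
    print(list(b))             # raw byte values, interpret them however you like
    print(b.decode("utf-8"))   # and back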
My experience personally with python3 (and repeated interactions with about a dozen python programmers, including core contributors) is that python3 does not let you trivially work with streams of bytes, especially if you need to do character-set conversions: a tiny python2 script that I have used for decades to convert character streams in terminals has proved repeatedly unportable to python3. The last attempt was much larger, still failed, and they thought they could probably get it working, but it would require far more code and was not worth their effort.
I'll probably just use rust for that script if python2 ever gets dropped by my distro. Reminds me of https://gregoryszorc.com/blog/2020/01/13/mercurial%27s-journ...
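(For reference, the usual Python 3 approach is to stay on the byte layer, sys.stdin.buffer / sys.stdout.buffer, and convert with incremental codecs. A minimal sketch, which may well miss the terminal edge cases the original script handled:)

    import codecs
    import sys

    def convert(src_enc: str, dst_enc: str, bufsize: int = 4096) -> None:
        """Re-encode a byte stream from src_enc to dst_enc, bytes in, bytes out."""
        dec = codecs.getincrementaldecoder(src_enc)(errors="replace")
        enc = codecs.getincrementalencoder(dst_enc)(errors="replace")
        while chunk := sys.stdin.buffer.read(bufsize):
            sys.stdout.buffer.write(enc.encode(dec.decode(chunk)))
        # Flush anything buffered mid-sequence at EOF.
        sys.stdout.buffer.write(enc.encode(dec.decode(b"", final=True), final=True))

    convert("latin-1", "utf-8")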
I would like a utf-8 optimized bag of bytes where arbitrary byte operations are possible but the buffer keeps track of whether it is valid utf-8 or not (for every edit of n bytes it should be enough to check about n+8 bytes to revalidate). Then utf-8 encoding/decoding becomes a noop and utf-8 specific apis can quickly check whether the string is malformed.
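A rough sketch of that idea (all names hypothetical; a real implementation would track validity per chunk rather than as a single flag):

    class Utf8Bag:
        """Bytes buffer that keeps an is-valid-UTF-8 flag up to date by
        revalidating only a small window around each edit. UTF-8 sequences
        are at most 4 bytes, so a few bytes of context on each side suffice."""

        def __init__(self, data: bytes = b""):
            self.buf = bytearray(data)
            self.valid = self._ok(0, len(self.buf))

        def splice(self, start: int, end: int, repl: bytes) -> None:
            self.buf[start:end] = repl
            lo = max(0, start - 4)
            hi = min(len(self.buf), start + len(repl) + 4)
            # Widen the window to sequence boundaries (skip 0b10xxxxxx bytes).
            while lo > 0 and self.buf[lo] & 0xC0 == 0x80:
                lo -= 1
            while hi < len(self.buf) and self.buf[hi] & 0xC0 == 0x80:
                hi += 1
            if self.valid:
                self.valid = self._ok(lo, hi)
            elif self._ok(lo, hi):
                # The damage may have been elsewhere; a single flag forces a
                # full rescan here (a per-chunk validity map would avoid it).
                self.valid = self._ok(0, len(self.buf))

        def _ok(self, lo: int, hi: int) -> bool:
            try:
                bytes(self.buf[lo:hi]).decode("utf-8")
                return True
            except UnicodeDecodeError:
                return False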
But why care if it's malformed UTF-8? And specifically, what do you want to happen when you get a malformed UTF-8 string? Keep in mind that UTF-8 is self-synchronizing, so even if you embed strings in a larger text-based format without verifying them, it will still be possible to decode the document. As a user I normally want my programs to pass on the string without mangling it further. Some tool throwing fatal errors because some string I don't actually care about contains an invalid UTF-8 byte sequence is the last thing I want. With strings being an arbitrary bag of bytes, many programs can support arbitrary encodings, or at least arbitrary ASCII supersets, without any additional effort.
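Python's own answer to "pass it on without mangling" is the surrogateescape error handler (PEP 383), which round-trips invalid bytes instead of raising or replacing them:

    raw = b"mostly text \xff\xfe with stray bytes"   # not valid UTF-8
    text = raw.decode("utf-8", errors="surrogateescape")
    back = text.encode("utf-8", errors="surrogateescape")
    assert back == raw  # the invalid bytes survive the round trip untouched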
The main issue I can see is not garbage bytes in text but the mixing of incompatible encodings, e.g. splicing latin-1 bytes into a utf-8 string.
My understanding of the current "always and only utf-8/unicode" zeitgeist is that it comes mostly from encoding issues, chief among them the complexity of detecting encodings.
I think that the current status quo is better than what came before, but I also think it could be improved.
Yes, I always roll my eyes when people complain that C strings or C++'s std::string/string_view don't have Unicode support. They are bags of bytes with support for concatenation. Any other transformation isn't going to have a "correct" way to do it so you need to be aware of what you want anyway.
Python 3 internally uses UTF-32. When exchanging data with the outside world, it uses the "default encoding" which it derives from various system settings. This usually ends up being UTF-8 on non-Windows systems, but on weird enough systems (and almost always on Windows), you can end up with a default encoding other than UTF-8. "UTF-8 mode" (https://peps.python.org/pep-0540/) fixes this but it's not yet enabled by default (this is planned for Python 3.15).
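You can check what a given interpreter actually ended up with:

    import locale
    import sys

    print(sys.getdefaultencoding())       # str<->bytes default: 'utf-8'
    print(locale.getpreferredencoding())  # the locale-derived "default encoding"
    print(sys.flags.utf8_mode)            # 1 if PEP 540 UTF-8 mode is active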