eru a day ago

Python 3 deals with this reasonably sensibly, too, I think. They use UTF-8 by default, but allow you to specify other encodings.

ynik a day ago

Python 3 internally uses UTF-32. When exchanging data with the outside world, it uses the "default encoding" which it derives from various system settings. This usually ends up being UTF-8 on non-Windows systems, but on weird enough systems (and almost always on Windows), you can end up with a default encoding other than UTF-8. "UTF-8 mode" (https://peps.python.org/pep-0540/) fixes this but it's not yet enabled by default (this is planned for Python 3.15).
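
A quick way to see what a given interpreter settled on (output varies by platform; UTF-8 mode can be forced with python -X utf8 or the PYTHONUTF8=1 environment variable):

    import locale
    import sys

    # The "default encoding" Python derives from system settings:
    print(locale.getpreferredencoding(False))
    # 1 if UTF-8 mode (PEP 540) is active, 0 otherwise:
    print(sys.flags.utf8_mode)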

  • arcticbull a day ago

    Apparently Python uses a variety of internal representations depending on the string itself. I looked it up because I saw UTF-32 and thought there's no way that's what they do -- it's pretty much always the wrong answer.

    It uses Latin-1 for strings whose code points all fit in one byte, UCS-2 for strings whose code points all fit in the BMP, and UCS-4 only for strings that contain code points outside the BMP.

    It would be pretty silly for them to explode all strings to 4-byte characters.
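
    You can see the flexible representation (PEP 393) in object sizes -- the per-character cost jumps from 1 to 2 to 4 bytes (exact totals vary by build):

        import sys

        ascii_s  = "a" * 1000           # all code points < 256: 1 byte each
        bmp_s    = "\u0394" * 1000      # BMP-only (Greek Delta): 2 bytes each
        astral_s = "\U0001F600" * 1000  # outside the BMP: 4 bytes each

        for s in (ascii_s, bmp_s, astral_s):
            print(sys.getsizeof(s))  # grows by ~1000, ~2000, ~4000 bytes plus a fixed header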

    • jibal a day ago

      You are correct. Discussions of this topic tend to be full of unvalidated but confidently stated assertions, like "Python 3 internally uses UTF-32." Also unjustified assertions, like the OP's claim that len("🤦🏼‍♂️") == 5 is "rather useless" and that "Python 3’s approach is unambiguously the worst one". Unlike in many other languages, the code points in Python's strings are always directly O(1)-indexable--which can be useful--and the subject string has 5 indexable code points. That may not be the semantics that someone is looking for in a particular application, but it certainly isn't useless. And given the Python implementation of strings, the only other number that would be useful would be the number of grapheme clusters, which in this case is 1, and that count can be obtained via the grapheme or regex modules.
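
      For example (a sketch; the grapheme count uses the third-party regex module, whose \X pattern matches one extended grapheme cluster):

          import regex  # third-party: pip install regex

          s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"  # the facepalm emoji above
          print(len(s))                        # 5 code points, each O(1) indexable
          print(len(regex.findall(r"\X", s)))  # 1 extended grapheme cluster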

    • account42 a day ago

      It conceptually uses arrays of code points, which need up to 24 bits. Optimizing the storage to use smaller integers when possible is an implementation detail.

      • jibal a day ago

        CPython (per PEP 393) is specified to use arrays of 8-, 16-, or 32-bit units, depending on the largest code point in the string. As a result, all code points in all strings are O(1) indexable. The claim that "Python 3 internally uses UTF-32" is simply false.

      • zahlman a day ago

        > code points, which need up to 24 bits

        They need at most 21 bits. The bits may only be available in multiples of 8, but the implementation also doesn't byte-pack them into 24-bit units, so that's moot.

xigoi a day ago

I prefer languages where strings are simply sequences of bytes and you get to decide how to interpret them.

  • zahlman a day ago

    Such languages do not have strings. Definitionally, a string is a sequence of characters, and more than 256 characters exist. A byte sequence is just an encoding; if you are working with that encoding directly and have to do the interpretation yourself, you are not using a string.

    But if you do want a sequence of bytes for whatever reason, you can trivially obtain that in any version of Python.
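
    For instance:

        s = "héllo"
        b = s.encode("utf-8")       # str -> bytes, with an explicit encoding
        print(list(b))              # raw byte values, yours to interpret
        print(b.decode("latin-1"))  # reinterpret the same bytes another way: 'hÃ©llo'
        raw = bytes([0xFF, 0xFE])   # byte sequences need no text interpretation at all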

    • capitainenemo 21 hours ago

      My experience personally with python3 (and repeated interactions with about a dozen python programmers, including core contributors) is that python3 does not let you trivially work with streams of bytes, especially if you need to do character set conversions, since a tiny python2 script that I have used for decades for conversion of character streams in terminals has proved to be repeatedly unportable to python3. The last attempt was much larger, still failed, and they thought they could probably do it, but it would require far more code and was not worth their effort.

      I'll probably just use rust for that script if python2 ever gets dropped by my distro. Reminds me of https://gregoryszorc.com/blog/2020/01/13/mercurial%27s-journ...

      • zahlman 21 hours ago

        > a tiny python2 script that I have used for decades for conversion of character streams in terminals has proved to be repeatedly unportable to python3.

        Show me.
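
        For what it's worth, here's the shape I'd expect such a recoder to take in Python 3 -- a sketch assuming the job is stdin-to-stdout charset conversion (the real script's requirements may well differ):

            import codecs
            import sys

            # Incremental codecs cope with multi-byte sequences split across reads.
            decode = codecs.getincrementaldecoder("latin-1")(errors="replace").decode
            encode = codecs.getincrementalencoder("utf-8")().encode

            while True:
                chunk = sys.stdin.buffer.read(4096)  # raw bytes, no text layer
                if not chunk:
                    break
                sys.stdout.buffer.write(encode(decode(chunk)))
            # Flush any partial sequence held back by the decoder/encoder.
            sys.stdout.buffer.write(encode(decode(b"", final=True), final=True))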

  • afiori a day ago

    I would like a UTF-8-optimized bag of bytes where arbitrary byte operations are possible but the buffer keeps track of whether it is valid UTF-8 or not (for every edit of n bytes it should be enough to check about n+8 bytes to validate). Then UTF-8 encoding/decoding becomes a no-op, and UTF-8-specific APIs can quickly check whether the string is malformed or not.
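
    A rough sketch of the idea in Python (names are illustrative; a real implementation would revalidate only a small window around each edit rather than the whole buffer):

        class Utf8TrackedBuffer:
            """Bag of bytes that remembers whether it is currently valid UTF-8."""

            def __init__(self, data: bytes = b""):
                self._buf = bytearray(data)
                self._revalidate()

            def _revalidate(self) -> None:
                # Sketch: recheck everything. UTF-8 sequences are at most 4
                # bytes, so an edit of n bytes only needs a ~n+8 byte window.
                try:
                    self._buf.decode("utf-8")
                    self.is_valid_utf8 = True
                except UnicodeDecodeError:
                    self.is_valid_utf8 = False

            def splice(self, start: int, end: int, replacement: bytes) -> None:
                self._buf[start:end] = replacement  # arbitrary byte operation
                self._revalidate()

            def as_str(self) -> str:
                if not self.is_valid_utf8:
                    raise ValueError("buffer is not valid UTF-8")
                return self._buf.decode("utf-8")  # validity already known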

    • account42 a day ago

      But why care if it's malformed UTF-8? And specifically, what do you want to happen when you get a malformed UTF-8 string? Keep in mind that UTF-8 is self-synchronizing, so even if you encode strings into a larger text-based format without verifying them, it will still be possible to decode the document. As a user I normally want my programs to pass on the string without mangling it further. Some tool throwing fatal errors because some string I don't actually care about contains an invalid UTF-8 byte sequence is the last thing I want. With strings being an arbitrary bag of bytes, many programs can support arbitrary encodings, or at least arbitrary ASCII supersets, without any additional effort.

      • afiori a day ago

        The main issue I can see is not garbage bytes in text but the mixing of incompatible encodings, e.g. splicing Latin-1 bytes into a UTF-8 string.
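
        A tiny illustration:

            utf8 = "café".encode("utf-8")      # b'caf\xc3\xa9'
            latin1 = "café".encode("latin-1")  # b'caf\xe9'
            mixed = utf8 + b" / " + latin1     # splice incompatible encodings
            # The Latin-1 byte is not valid UTF-8 and decodes as U+FFFD:
            print(mixed.decode("utf-8", errors="replace"))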

        My understanding of the current "always and only UTF-8/Unicode" zeitgeist is that it comes mostly from encoding issues, chief among them the complexity of detecting encodings.

        I think that the current status quo is better than what came before, but I also think it could be improved.

  • bawolff a day ago

    Me too.

    The languages that I really don't get are those that force valid UTF-8 everywhere but don't enforce NFC. Which is most of them, but it seems like the worst of both worlds.

    Non-normalized Unicode is just as problematic as non-validated Unicode, IMO.
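
    For example, with the stdlib's unicodedata:

        import unicodedata

        nfc = "\u00e9"   # é as one precomposed code point
        nfd = "e\u0301"  # e + combining acute accent
        print(nfc == nfd)                                # False: same text, unequal strings
        print(unicodedata.normalize("NFC", nfd) == nfc)  # True once normalized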

  • jibal a day ago

    Python has byte arrays that allow for that, in addition to strings consisting of arrays of Unicode code points.

  • account42 a day ago

    Yes, I always roll my eyes when people complain that C strings or C++'s std::string/string_view don't have Unicode support. They are bags of bytes with support for concatenation. Any other transformation isn't going to have a "correct" way to do it, so you need to be aware of what you want anyway.

    • astrange a day ago

      C strings are not bags of bytes because they can't contain 0x00.