Comment by zahlman

No, I'm not standing for that.

Python does it correctly, and the results in that gist are expected. Characters (code points) are not grapheme clusters, and not every sequence of characters is valid. The ability to store unpaired surrogates is a feature: validating this would take extra time, and it only really matters at encoding time. It also enables the "surrogateescape" error handler, which in turn makes it possible to supply arbitrary bytes in command line arguments while still providing your program with strings that make sense in the common case. (Not all sequences of bytes are valid UTF-8; the error handler maps each invalid byte to an unpaired surrogate.) The same character counts are (correctly) observed in many other programming languages; there's nothing at all "exceptional" about Python's treatment.
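To make that concrete, here is a minimal sketch of the behaviour described above. The sample strings and variable names are my own illustrations, not taken from the gist.

```python
# Sketch only: the sample strings below are made up for illustration.

# len() counts code points, not grapheme clusters.
accented = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT, usually rendered as one glyph
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # three emoji joined by two ZWJs
print(len(accented), len(family))  # 2 5

# A lone surrogate can be stored in a str; it is only rejected at encoding time.
lone = "\ud800"
try:
    lone.encode("utf-8")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# surrogateescape maps each undecodable byte to a lone surrogate (0xDC00 + byte),
# so arbitrary bytes survive a decode/encode round trip.
raw = b"caf\xe9"  # Latin-1 encoded; not valid UTF-8
text = raw.decode("utf-8", errors="surrogateescape")
round_trip = text.encode("utf-8", errors="surrogateescape")
print(encode_failed, round_trip == raw)  # True True
```

This is the same mechanism behind `os.fsdecode` and `os.fsencode` on POSIX systems, which is how byte-oriented command line arguments and filenames can be passed through as strings.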

It's not actually possible to "treat strings as raw bytes", because they contain more than 256 possible distinct symbols. They must be encoded; even if you assume an ecosystem-wide encoding, you are still using that encoding. But if you wish to work with raw sequences of bytes in Python, the `bytes` type is built in and trivially created using a `b'...'` literal or various other constructors. (There is also a mutable `bytearray` type.) These types now correctly behave as sequences of byte values (i.e., integers in the range 0..255 inclusive); when you index them, you get an integer. I have personal experience of these properties simplifying and clarifying my code.
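A short sketch of what that looks like in practice (the byte values here are just an example I picked):

```python
# Sketch only: an arbitrary UTF-8 encoded sample.
data = b"caf\xc3\xa9"  # UTF-8 encoding of a 4-character string, 5 bytes long

first = data[0]   # indexing yields an int in 0..255
chunk = data[:3]  # slicing yields another bytes object

buf = bytearray(data)  # mutable counterpart of bytes
buf[0] = ord("C")      # assign an integer byte value in place

text = data.decode("utf-8")  # explicit step from bytes to str
print(first, chunk, bytes(buf), len(data), len(text))
```

Note that the encoded form and the decoded form have different lengths, which is exactly why the two cannot be treated as the same thing.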

Unicode was fixed (no quotation marks), with the result that you now have clearly distinct types that honour the Zen of Python principle that "explicit is better than implicit", and you no longer get `UnicodeDecodeError` from attempting an encoding operation, or vice versa. (This problem spawned an entire family of very popular and very confused Stack Overflow Q&As, each with probably countless unrecognized duplicates.) As an added bonus, the default encoding for source code files changed to UTF-8, which means in practical terms that you can now use non-English characters in your code comments (and even in identifier names, with restrictions) and have it just work without declaring an encoding (since in 2025 your text editor almost certainly assumes that encoding). The change also made it possible to easily read text files as text in any declared encoding and get strings as a result, with universal newline mode still working, all without needing to reach for the `io` or `codecs` standard library modules.
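For illustration, a minimal sketch of both points: mixing the two types fails loudly, and the built-in `open` handles a declared encoding and newline translation. The file name and contents are hypothetical, and the temporary directory exists only to keep the example self-contained.

```python
import os
import tempfile

# Mixing str and bytes is an immediate TypeError instead of an implicit conversion.
try:
    "abc" + b"def"
    mixing_rejected = False
except TypeError:
    mixing_rejected = True

# Sketch only: a hypothetical Latin-1 file with Windows line endings.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write("naïve\r\ncafé\r\n".encode("latin-1"))

    # Text mode with a declared encoding; universal newlines are on by default,
    # so "\r\n" is translated to "\n" while reading.
    with open(path, encoding="latin-1") as f:
        content = f.read()

lines = content.splitlines()
print(mixing_rejected, len(lines), type(content).__name__)  # True 2 str
```

The non-ASCII literals in the source also work as written because the source file itself is assumed to be UTF-8, with no encoding declaration needed.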

The community was not so much "dragged through a 15-year transition"; rather, some members of the community spent as long as 15 (really 13.5, unless you count people continuing to try to use 2.7 past the extended EOL) years refusing to adapt to what was a clear bugfix of the clearly broken prior behaviour.