Comment by deathanatos
Comment by deathanatos 20 hours ago
> JavaScript is compelled to count UTF-16 code units because it actually does use UTF-16. Python's flexible string representation is a space optimization; it still fundamentally represents strings as a sequence of characters, without using the surrogate-pair system.
Python's flexible string system has nothing to do with this. Python could easily have had len() return the byte count, even the USV count, or other vastly more meaningful metrics than "5", whose unit is so disastrous I can't put a name to it. It's not bytes, it's not UTF-16 code units, it's not anything meaningful, and that's the problem. In particular, the USV count would have been made easy (O(1) easy!) by Python's flexible string representation.
You're handwaving it away in your writing by calling it a "character in the implementation", but what is a character? It's not a character in any sense a normal human would recognize — like a grapheme cluster — as I think if I asked a human "how many characters is <imagine this is man with skin tone face palming>?", they'd probably say "well, … IDK if it's really a character, but 1, I suppose?" …but "5" or "7"? Where do those even come from? An astute person might like "Oh, perhaps that takes more than one byte, is that it's size in memory?" Nope. Again: "character in the implementation" is a meaningless concept. We've assigned words to a thing to make it sound meaningful, but that is like definitionally begging the question here.
> or other vastly more meaningful metrics than "5", whose unit is so disastrous I can't put a name to it. It's not bytes, it's not UTF-16 code units, it's not anything meaningful, and that's the problem.
The unit is perfectly meaningful.
It's "characters". (Pedantically, "code points" — https://www.unicode.org/glossary/#code_point — because values that haven't been assigned to characters may be stored. This is good for interop, because it allows you to receive data from a platform that implements a newer version of the Unicode standard, and decide what to do with the parts that your local terminal, font rendering engine, etc. don't recognize.)
Since UTF-32 allows storing every code point in a single code unit, you can also describe it that way, despite the fact that Python doesn't use a full 4 bytes per code point when it doesn't have to.
The only real problem is that "character" doesn't mean what you think it does, and hasn't since 1991.
I don't understand what you mean by "USV count".
> but what is a character?
It's what the Unicode standard says a character is. https://www.unicode.org/glossary/#character , definition 3. Python didn't come up with the concept; Unicode did.
> …but "5" or "7"? Where do those even come from?
From the way that the Unicode standard dictates that this text shall be represented. This is not Python's fault.
> Again: "character in the implementation" is a meaningless concept.
"Character" is completely meaningful, as demonstrated by the fact the Unicode Consortium defines it, and by the fact that huge amounts of software has been written based on that definition, and referring to it in documentation.