Comment by deathanatos 20 hours ago

> JavaScript is compelled to count UTF-16 code units because it actually does use UTF-16. Python's flexible string representation is a space optimization; it still fundamentally represents strings as a sequence of characters, without using the surrogate-pair system.

Python's flexible string system has nothing to do with this. Python could easily have had len() return the byte count, or even the USV count, or other vastly more meaningful metrics than "5", whose unit is so disastrous I can't put a name to it. It's not bytes, it's not UTF-16 code units, it's not anything meaningful, and that's the problem. In particular, the USV count would have been made easy (O(1) easy!) by Python's flexible string representation.

You're handwaving it away in your writing by calling it a "character in the implementation", but what is a character? It's not a character in any sense a normal human would recognize — like a grapheme cluster — as I think if I asked a human "how many characters is <imagine this is man with skin tone face palming>?", they'd probably say "well, … IDK if it's really a character, but 1, I suppose?" …but "5" or "7"? Where do those even come from? An astute person might say "Oh, perhaps that takes more than one byte, is that its size in memory?" Nope. Again: "character in the implementation" is a meaningless concept. We've assigned words to a thing to make it sound meaningful, but that is like definitionally begging the question here.
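For concreteness: assuming the elided emoji is the "man facepalming, medium-light skin tone" ZWJ sequence the article discusses (U+1F926 U+1F3FC U+200D U+2642 U+FE0F; which exact sequence is meant is my assumption here), the 5 and the 7 fall out like this:

```python
# Assumed: the "man facepalming, medium-light skin tone" ZWJ sequence.
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

print(len(s))                           # 5: code points (what Python counts)
print(len(s.encode("utf-16-le")) // 2)  # 7: UTF-16 code units (what JS counts)
print(len(s.encode("utf-8")))           # 17: its size in memory as UTF-8 bytes
```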

zahlman 20 hours ago

> or other vastly more meaningful metrics than "5", whose unit is so disastrous I can't put a name to it. It's not bytes, it's not UTF-16 code units, it's not anything meaningful, and that's the problem.

The unit is perfectly meaningful.

It's "characters". (Pedantically, "code points" — https://www.unicode.org/glossary/#code_point — because values that haven't been assigned to characters may be stored. This is good for interop, because it allows you to receive data from a platform that implements a newer version of the Unicode standard, and decide what to do with the parts that your local terminal, font rendering engine, etc. don't recognize.)

Since UTF-32 allows storing every code point in a single code unit, you can also describe it that way, despite the fact that Python doesn't use a full 4 bytes per code point when it doesn't have to.

The only real problem is that "character" doesn't mean what you think it does, and hasn't since 1991.

I don't understand what you mean by "USV count".

> but what is a character?

It's what the Unicode standard says a character is. https://www.unicode.org/glossary/#character , definition 3. Python didn't come up with the concept; Unicode did.

> …but "5" or "7"? Where do those even come from?

From the way that the Unicode standard dictates that this text shall be represented. This is not Python's fault.

> Again: "character in the implementation" is a meaningless concept.

"Character" is completely meaningful, as demonstrated by the fact the Unicode Consortium defines it, and by the fact that huge amounts of software has been written based on that definition, and referring to it in documentation.

  • deathanatos 19 hours ago

    > Since UTF-32 allows storing every code point in a single code unit, you can also describe it that way, despite the fact that Python doesn't use a full 4 bytes per code point when it doesn't have to.

    Python does not use UTF-32, even notionally. Yes, I know it uses a compact representation in memory when the value is ASCII, etc. That's not what I'm talking about here. |str| != |all UTF-32 strings|; `str` and "UTF-32" are different things, as there are values in the former that are absent in the latter, and again, this is why encoding to UTF-8 or any UTF encoding is fallible in Python.
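A quick sketch of that fallibility: a lone surrogate is a perfectly storable code point in a `str`, but it is not a Unicode scalar value, so no UTF codec will accept it.

```python
s = "\ud800"           # a lone high surrogate; legal to store in a str
print(len(s))          # 1: it counts as one code point

try:
    s.encode("utf-8")  # every UTF codec rejects lone surrogates
except UnicodeEncodeError as e:
    print("unencodable:", e.reason)
```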

    A code point count is not a meaningful metric, though I suppose strictly speaking, yes, len() counts code points.

    > I don't understand what you mean by "USV count".

    The number of Unicode scalar values in the string. (If the string were encoded in UTF-32, the length of that array.) It's the basic building block of Unicode. It's only marginally useful, and there's a host of other more meaningful metrics, like memory size, terminal width, graphemes, etc. But it's more meaningful than code points, and if you want to do anything at any higher level of representation, USVs are going to be what you want to build off. Anything else is going to be more fraught with error, needlessly.
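One way to spell "USV count" in Python terms (a sketch; `usv_count` is my own name for it): count every code point outside the surrogate range U+D800..U+DFFF. For any `str` that is a valid Unicode string, this equals len().

```python
def usv_count(s: str) -> int:
    # Unicode scalar values are all code points except the
    # surrogate range U+D800..U+DFFF.
    return sum(1 for c in s if not 0xD800 <= ord(c) <= 0xDFFF)

print(usv_count("abc"))           # 3: same as len() for valid text
print(usv_count("ab\ud800"))      # 2: the lone surrogate is not a USV
```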

    > It's what the Unicode standard says a character is.

    The Unicode definition of "character" is not a technical definition, it's just there to help humans. Again, if I fed that definition to a human, and asked the same question above, <facepalm…> is 1 "character", according to that definition in Unicode as evaluated by a reasonable person. That's not the definition Python uses, since it returns 5. No reasonable person is looking at the linked definition, and then at the example string, and answering "5".

    "How many smallest components of written language that has semantic value does <facepalm emoji …> have?" Nobody is answering "5".

    (And if you're going to quibble with my use of definition (1.), the same applies to (2.). (3.) doesn't apply here as Python strings are not Unicode strings (again, |str| != |all Unicode strings|), (4.) is specific to Chinese.)

    > "Character" is completely meaningful, as demonstrated by the fact the Unicode Consortium defines it, and by the fact that huge amounts of software has been written based on that definition, and referring to it in documentation.

    That a lot of people write bad code does not make bad code good. Ambiguous technical documentation is likewise not made good by being ambiguous. Any use of "character" in technical writing would be made more clear by replacing it with one of the actual technical terms defined by Unicode, whether that's "UTF-16 code unit", "USV", "byte", etc. "Character" leaves far too much up to the imagination of the reader.

    • zahlman 18 hours ago

      > there are values in the former that are absent in the latter, and again, this is why encoding to utf8 or any utf encoding is fallible in Python.

      Yes, yes, the `str` type may contain data that doesn't represent a valid string. I've already explained elsewhere ITT that this is a feature.

      And sure, pedantically it should be "UCS-4" rather than UTF-32 in my post, since a str object can be created which contains surrogates. But Python does not use surrogate pairs in representing text. It only stores surrogates, which it considers invalid at encoding time.

      Whenever a `str` represents a valid string without surrogates, it will reliably encode. And when bytes are decoded, surrogates are not produced except where explicitly requested for error handling.
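The "explicitly requested" case can be sketched like this: the `surrogateescape` error handler smuggles undecodable bytes through as lone surrogates, which then refuse to encode strictly but round-trip losslessly with the same handler.

```python
raw = b"ok\xff"  # \xff is not valid UTF-8
s = raw.decode("utf-8", "surrogateescape")
print(repr(s[-1]))  # '\udcff': a lone surrogate standing in for the byte

# Strict encoding refuses the surrogate...
try:
    s.encode("utf-8")
except UnicodeEncodeError:
    print("strict encode fails, as expected")

# ...but the same handler round-trips the original bytes.
print(s.encode("utf-8", "surrogateescape") == raw)  # True
```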

      > The number of Unicode scalar values in the string. (If the string were encoded in UTF-32, the length of that array.)

      Ah.

      Good news: since Python doesn't use surrogate pairs to represent valid text, these are the same whenever the `str` contents represent a valid text string in Python. And the cases where they don't are rare, and more or less must be deliberately crafted. You don't even get them from malicious user input, if you process input in obvious ways.

      > The Unicode definition of "character" is not a technical definition, it's just there to help humans.

      You're missing the point. The facepalm emoji has 5 characters in it. The Unicode Consortium says so. And they are, indisputably, the ones who get to decide what a "character" is in the context of Unicode.

      I linked to the glossary on unicode.org. I don't understand how it could get any more official than that.

      Or do you know another word for "the thing that an assigned Unicode code point has been assigned to"? cf. also the definition of https://www.unicode.org/glossary/#encoded_character , and note that definition 2 for "character" is "synonym of abstract character".

perching_aix 19 hours ago

As the other comment says, Python considers strings to be a sequence of codepoints, hence the length of a string will be the number of codepoints in that string.

I just relied on this fact yesterday, so it's kind of a funny timing. I wrote a little script that looks out for shenanigans in source files. One thing I wanted to explore was what Unicode blocks a given file references characters from. This is meaningless on the byte level, and meaningless on the grapheme cluster level. It is only meaningful on the codepoint level. So all I needed to do was to iterate through all the codepoints in the file, tally it all up by Unicode block, and print the results. Something this design was perfectly suited for.
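A sketch of that kind of tally (the stdlib doesn't expose Unicode block names, so the `BLOCKS` table below is a tiny hand-picked subset of the real Blocks.txt data, and the function names are my own):

```python
from collections import Counter

# Illustrative subset of Unicode block ranges; the real Blocks.txt
# data file is far longer.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x1F600, 0x1F64F, "Emoticons"),
]

def block_of(cp: int) -> str:
    for lo, hi, name in BLOCKS:
        if lo <= cp <= hi:
            return name
    return "(unlisted)"

def tally_blocks(text: str) -> Counter:
    # Iterating a str yields exactly one code point per step,
    # which is precisely the granularity a block tally needs.
    return Counter(block_of(ord(c)) for c in text)

# 'naïve;' plus a Greek question mark (U+037E) that renders like ';'
print(tally_blocks("na\u00efve;\u037e"))
```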

Now of course:

- it coming in handy once for my specific random workload doesn't mean it's good design

- my specific workload may not be rational (am a dingus sometimes)

- at some point I did consider iterating by grapheme clusters, which the language didn't seem to love a whole lot, so more flexibility would likely indeed be welcome

- I am well and fully aware that iterating through data a few bytes at a time is abjectly terrible and possibly a sin. Too bad I don't really do coding in any proper native language, and I have basically no experience in SIMD, so tough shit.

But yeah, I really don't see why people find this so crazy. The whole article is in good part about how relying on grapheme cluster semantics makes you Unicode version dependent and that being a bit hairy, so it's probably not a good idea to default to it. At which point, codepoints it is. Counting scalars only is what would be weird in my view; you'd potentially be "randomly" skipping over parts of the data.

  • bobsmooth 18 hours ago

    I'm curious what you mean by "shenanigans". Is that like emojis and zalgo text?

    • perching_aix 18 hours ago

      I'm currently working with some local legacy code, so I primarily wanted to scan for incorrectly transcoded accented characters (central-european to utf-8 mishaps) - did find them.

      Also good against data fingerprinting, homoglyph attacks in links (e.g. in comments), pranks (Greek question mark vs. semicolon), or if it's a strictly international codebase, checking for anything outside ASCII. So when you don't really trust a codebase and want to establish a baseline, basically.
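The Greek question mark prank works because U+037E is a distinct code point that renders identically to ';', so a code-point-level scan catches what eyeballing won't. A minimal sketch (the `SUSPECTS` table is just one illustrative entry):

```python
import unicodedata

# U+037E GREEK QUESTION MARK is visually identical to ';' (U+003B).
SUSPECTS = {"\u037e": ";"}

line = "x = 1\u037e"  # looks like an ordinary 'x = 1;'
for col, ch in enumerate(line):
    if ch in SUSPECTS:
        print(f"col {col}: U+{ord(ch):04X} ({unicodedata.name(ch)}) "
              f"masquerading as {SUSPECTS[ch]!r}")
```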

      But I also included other features, like checking line ending consistency, line indentation consistency, line lengths, POSIX compliance, and encoding validity. Line lengths were of particular interest to me, having seen some malicious PRs recently to FOSS projects where the attacker would just move the payload out of sight to the side, expecting most people to have word wrap off and just not even notice (pretty funny tbf).