Comment by perching_aix 15 hours ago
As the other comment says, Python considers strings to be a sequence of codepoints, hence the length of a string will be the number of codepoints in that string.
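To make that concrete, here is a small sketch of what `len` counts on a Python 3 `str`. The sample strings are my own picks for illustration, not anything from the article.

```python
import unicodedata

composed = "\u00e9"       # 'é' as one precomposed codepoint
decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT
emoji = "\U0001F600"      # one codepoint outside the Basic Multilingual Plane

print(len(composed))                # 1 codepoint
print(len(decomposed))              # 2 codepoints, though it renders as one character
print(len(emoji))                   # 1 codepoint
print(len(emoji.encode("utf-8")))   # 4 bytes once encoded as UTF-8

# NFC normalization folds the decomposed form into the precomposed one.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

So the count tracks codepoints, not bytes and not what a reader would perceive as characters.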
I just relied on this fact yesterday, so it's kind of funny timing. I wrote a little script that looks out for shenanigans in source files. One thing I wanted to explore was which Unicode blocks a given file uses characters from. This is meaningless on the byte level, and meaningless on the grapheme cluster level. It is only meaningful on the codepoint level. So all I needed to do was iterate through all the codepoints in the file, tally them up by Unicode block, and print the results. That is something this design was perfectly suited for.
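A rough sketch of that kind of tally, not the actual script. Python's standard library doesn't expose Unicode block names as far as I know, so the `BLOCKS` table here is a tiny hand-copied subset, and `block_of` and `tally_blocks` are names made up for this example. A real version would load the full `Blocks.txt` from the Unicode Character Database.

```python
from bisect import bisect_right
from collections import Counter

# Assumption: partial block table (start, end, name), sorted by start.
# Only a handful of blocks are listed; everything else is reported as unknown.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x2000, 0x206F, "General Punctuation"),
    (0x1F600, 0x1F64F, "Emoticons"),
]
STARTS = [start for start, _, _ in BLOCKS]


def block_of(ch):
    """Return the block name for a single codepoint, per the partial table."""
    cp = ord(ch)
    i = bisect_right(STARTS, cp) - 1  # last block starting at or before cp
    if i >= 0 and cp <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return "(not in table)"


def tally_blocks(text):
    """Count codepoints per block; iterating a str yields codepoints."""
    return Counter(block_of(ch) for ch in text)


# "paypal" with a Cyrillic 'а' (U+0430) swapped in for the first Latin 'a'.
for name, count in tally_blocks("p\u0430ypal").most_common():
    print(f"{count:6d}  {name}")
```

For a file you'd read it with an explicit encoding, e.g. `open(path, encoding="utf-8").read()`, and pass the result to `tally_blocks`.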
Now of course:
- it coming in handy once for my specific random workload doesn't mean it's good design
- my specific workload may not be rational (am a dingus sometimes)
- at some point I did consider iterating by grapheme clusters, which the language didn't seem to love a whole lot, so more flexibility would likely indeed be welcome
- I am well and fully aware that iterating through data a few bytes at a time is abjectly terrible and possibly a sin. Too bad I don't really do coding in any proper native language, and I have basically no experience in SIMD, so tough shit.
But yeah, I really don't see why people find this so crazy. The whole article is in good part about how relying on grapheme cluster semantics makes you Unicode version dependent, and how that is a bit hairy, so it's probably not a good idea to default to it. At which point, codepoints it is. Counting scalars only is what would be weird in my view: you'd potentially be "randomly" skipping over parts of the data.
I'm curious what you mean by "shenanigans". Is that like emojis and zalgo text?