Comment by DavidPiper
Comment by DavidPiper a day ago
I think that string length is one of those things that people (including me) don't realise they never actually want. In a production system, I have never actually wanted string length. I have wanted:
- Number of bytes this will be stored as in the DB
- Number of monospaced font character blocks this string will take up on the screen
- Number of bytes that are actually being stored in memory
"String length" is just a proxy for something else, and whenever I'm thinking shallowly enough to want it (small scripts, mostly-ASCII, mostly-English, mostly-obvious failure modes, etc) I like grapheme cluster being the sensible default thing that people probably expect, on average.
Taking this one step further -- there's no such thing as the context-free length of a string.
Strings should be thought of more like opaque blobs, and you should derive their length exclusively in the context in which you intend to use it. It's an API anti-pattern to have a context-free length property associated with a string because it implies something about the receiver that just isn't true for all relevant usages and leads you to make incorrect assumptions about the result.
Refining your list, the things you usually want are:
- Number of bytes in a given encoding when saving or transmitting (edit: or more generally, when serializing).
- Number of code points when parsing.
- Number of grapheme clusters for advancing the cursor back and forth when editing.
- Bounding box in pixels or points for display with a given font.
Context-free length is something we inherited from ASCII where almost all of these happened to be the same, but that's not the case anymore. Unicode is better thought of as compiled bytecode than something you can or should intuit anything about.
It's like asking "what's the size of this JPEG." Answer is it depends, what are you trying to do?