Comment by josephg
It’s a bit of a niche use case, but I use codepoint counts in CRDTs for collaborative text editing.
Grapheme cluster counts can’t be used because they’re unstable across Unicode versions. Some algorithms use UTF-8 byte offsets, but I think that’s a mistake because it makes input validation much more complicated. With byte offsets, there are a whole lot of invalid states you can easily represent. E.g. maybe inserting “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint. Then inserting at position 2 is valid again. If you send me an operation which happened at some earlier point in time, I don’t necessarily have the text document you were inserting into handy. So figuring out whether your insertion (and deletion!) positions are valid at all is a very complex and expensive operation.
Codepoints are way easier. I can just accept any integer up to the length of the document (in codepoints) at that point in time.
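A rough sketch of the difference (not from any particular CRDT implementation), assuming the document is stored as a Rust String, i.e. UTF-8 bytes:

```rust
fn main() {
    let doc = String::from("héllo"); // 'é' is 2 UTF-8 bytes but 1 codepoint

    // Byte offsets: some integers point into the middle of a codepoint.
    // Offset 2 is invalid here; 0, 1, 3, 4, 5 and 6 are fine.
    for pos in 0..=doc.len() {
        println!("byte offset {pos}: valid = {}", doc.is_char_boundary(pos));
    }

    // Codepoint offsets: validating a remote insert position is just a
    // bounds check against the codepoint length of the document at that
    // point in time.
    let codepoint_len = doc.chars().count(); // 5
    let remote_insert_pos = 3;
    assert!(remote_insert_pos <= codepoint_len);
}
```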
> E.g. maybe inserting “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint.
You have the same problem with code points, it's just hidden better. Inserting "a" between U+0065 and U+0308 (a decomposed "ë": the letter 'e' followed by a combining diaeresis) may result in a "valid" string but is still as nonsensical as inserting "a" between UTF-8 bytes 0xC3 and 0xAB (the encoding of a precomposed "ë").
This arguably makes code points less suitable than UTF-8 bytes, since such mistakes are less likely to be caught during development.
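Concretely, a small Rust sketch of that counterexample, assuming the decomposed form "e" + U+0308:

```rust
fn main() {
    let doc = "e\u{0308}"; // renders as "ë", but it's two codepoints

    // Insert "a" at codepoint index 1, i.e. between the base letter and its
    // combining mark. The result is valid Unicode...
    let mut chars: Vec<char> = doc.chars().collect();
    chars.insert(1, 'a');
    let result: String = chars.into_iter().collect();
    println!("{result}"); // "ea\u{0308}" - the diaeresis now sits on the 'a'

    // ...but it's just as nonsensical as splitting the UTF-8 bytes of a
    // precomposed "ë" (0xC3 0xAB). The difference is that the byte-level
    // mistake at least fails UTF-8 validation:
    assert!(String::from_utf8(vec![0xC3, b'a', 0xAB]).is_err());
}
```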