Comment by xg15
The article argues both that the "real" length from a user's perspective is the number of extended grapheme clusters, and against actually using that measure, because computing it requires storing the entire character database and the result may change from one Unicode version to the next.
Therefore, people should use codepoints for things like length limits or database indexes.
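For concreteness, here is roughly what the different length measures look like in practice. This is only an illustrative sketch; it assumes Python and the third-party `regex` package, whose `\X` pattern matches extended grapheme clusters per UAX #29:

```python
# Illustrative sketch: requires the third-party "regex" package (pip install regex),
# whose \X pattern matches extended grapheme clusters per UAX #29.
import regex

# U+1F926 FACE PALM, U+1F3FC skin-tone modifier, U+200D ZWJ,
# U+2642 MALE SIGN, U+FE0F VARIATION SELECTOR-16
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

print(len(s))                           # 5  code points (Python strings are code-point sequences)
print(len(s.encode("utf-16-le")) // 2)  # 7  UTF-16 code units
print(len(s.encode("utf-8")))           # 17 UTF-8 bytes
print(len(regex.findall(r"\X", s)))     # 1  extended grapheme cluster (with current Unicode data)
```

Only the grapheme-cluster count depends on the Unicode data the library happens to ship; the other counts are fixed by the string's code points and encoding.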
But wouldn't this just move the "breaks with a new Unicode version" problem to a different layer?
If a newer Unicode version defines some sequence to be a single grapheme cluster where there used to be several, and my database index now points into the middle of that cluster, what am I supposed to do?
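As a rough sketch of the kind of check that would then be needed (again assuming Python, the third-party `regex` package, and that the stored index counts code points):

```python
import regex  # third-party; \X matches extended grapheme clusters per UAX #29

def is_grapheme_boundary(s: str, codepoint_offset: int) -> bool:
    """True if the code-point offset falls on a grapheme-cluster boundary
    under whatever Unicode rules this regex build implements."""
    pos = 0
    for cluster in regex.findall(r"\X", s):
        if pos == codepoint_offset:
            return True
        pos += len(cluster)  # len() counts the code points within the cluster
    return codepoint_offset == pos  # the end of the string is also a boundary

# WOMAN + ZWJ + PERSONAL COMPUTER: three code points, one cluster under current rules.
s = "\U0001F469\u200D\U0001F4BB"
# Offset 2 (just before the laptop) would have been a boundary under older
# segmentation rules that did not keep ZWJ emoji sequences together.
print(is_grapheme_boundary(s, 2))  # False with current Unicode rules
```

A stored code-point offset stays stable as data, but whether it still lands on a grapheme-cluster boundary depends on which version of the segmentation rules it is checked against.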
It seems to me the bigger problem is with Unicode's backwards-compatibility guarantees. If the standard is continuously updated and the consortium feels free to make arbitrary changes to how grapheme clusters work at any time, how is any software that isn't "evergreen" (i.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?
Reply:

What do you mean by "use codepoints for ... database indexes"? I feel like you are drawing conclusions that the essay neither proposes nor supports. (It doesn't say that you should use codepoints for length limits.)
> If the standard is continuously updated and the consortium feels free to make arbitrary changes to how grapheme clusters work at any time, how is any software that isn't "evergreen" (i.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?
Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?