Comment by xg15 a day ago

The article argues both that the "real" length from a user's perspective is the number of extended grapheme clusters and that you shouldn't rely on it, because computing it requires the entire character database and the result may change from one Unicode version to the next.

Therefore, people should use codepoints for things like length limits or database indexes.

But wouldn't this just move the "cause breakage with new Unicode version" problem to a different layer?

If a newer Unicode version suddenly defines some sequence to be a single grapheme cluster where there were several before, and my database index now points to the middle of that cluster, what would I do?

Seems to me the bigger problem is with backwards compatibility guarantees in Unicode. If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (i.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?

re a day ago

What do you mean by "use codepoints for ... database indexes"? I feel like you are drawing conclusions that the essay does not propose or support. (It doesn't say that you should use codepoints for length limits.)

> If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (i.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?

Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?

  • xg15 a day ago

    I was referring to this part, in "Shouldn’t the Nudge Go All the Way to Extended Grapheme Clusters?":

    "For example, the Unicode version dependency of extended grapheme clusters means that you should never persist indices into a Swift strings and load them back in a future execution of your app, because an intervening Unicode data update may change the meaning of the persisted indices! The Swift string documentation does not warn against this.

    You might think that this kind of thing is a theoretical issue that will never bite anyone, but even experts in data persistence, the developers of PostgreSQL, managed to make backup restorability dependent on collation order, which may change with glibc updates."

    You're right, it doesn't say "codepoints" as an alternative solution. That was just my assumption, as it would be the closest representation that does not depend on the character database.

    But you could also use code units, bytes, whatever. The problem will be the same if you have to reconstruct the grapheme clusters eventually.

    > Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?

    Because splitting a grapheme cluster in half can change its semantics. You don't want that if you e.g. have an index for fulltext search.
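To make the "splitting a cluster in half" concern concrete, here is a minimal Python sketch. It assumes the text stores "é" in decomposed form (a base letter followed by a combining accent), and the cut position is picked by hand purely for illustration; it is not taken from the article.

```python
import unicodedata

# "é" written as a base letter plus a combining acute accent,
# so the word is five code points but four user-perceived characters.
word = "cafe\u0301"

# Cutting at code point index 4 lands between the "e" and its accent.
head, tail = word[:4], word[4:]

print(head)                    # the accent has been dropped from the word
print(unicodedata.name(tail))  # the orphaned combining mark
```

The left half now reads as a different word, and the right half is a combining mark with nothing to attach to.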

    • chrismorgan a day ago

      > it doesn't say "codepoints" as an alternative solution. That was just my assumption …

      On the contrary, the article calls code point indexing “rather useless” in the subtitle. Code unit indexing is the appropriate technique. (“Byte indexing” generally implies the use of UTF-8 and in that context is more meaningfully called code unit indexing. But I just bet there are systems out there that use UTF-16 or UTF-32 and yet use byte indexing.)
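As a rough illustration of how these units differ, the sketch below measures one emoji sequence (facepalm, skin tone modifier, zero-width joiner, male sign, variation selector) in several ways. The variable names are only for illustration, and Python's `len` on a `str` counts code points.

```python
# One user-perceived character built from five code points.
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

code_points = len(s)                           # Python strings index by code point
utf8_units = len(s.encode("utf-8"))            # UTF-8 code units are bytes
utf16_bytes = len(s.encode("utf-16-le"))       # byte length of the UTF-16 form
utf16_units = utf16_bytes // 2                 # UTF-16 code units are 2 bytes each

print(code_points, utf8_units, utf16_units, utf16_bytes)
```

Only in UTF-8 do byte offsets and code unit offsets coincide; in UTF-16 a byte index is twice the code unit index. None of these numbers depends on the Unicode character database.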

      > The problem will be the same if you have to reconstruct the grapheme clusters eventually.

      In practice, you basically never do. Only your GUI framework ever does, for rendering the text and for handling selection and editing. Because that’s pretty much the only place EGCs are ever actually relevant.

      > You don't want that if you e.g. have an index for fulltext search.

      Your text search won’t be splitting by grapheme clusters, it’ll be doing word segmentation instead.
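A very rough sketch of the idea that a search index works on words rather than grapheme clusters. This is not the Unicode word segmentation algorithm, just a regex approximation after normalization, and `tokenize` is a hypothetical helper name.

```python
import re
import unicodedata

def tokenize(text: str) -> list[str]:
    # Normalize first so "e" + combining accent becomes a single "é";
    # otherwise \w+ would stop at the combining mark.
    normalized = unicodedata.normalize("NFC", text).lower()
    # Crude stand-in for real word segmentation: runs of word characters.
    return re.findall(r"\w+", normalized)

print(tokenize("Cafe\u0301 owners, unite!"))
```

The index entries are whole words, so grapheme cluster boundaries never enter into it; a production system would use a proper segmenter instead of `\w+`.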