Comment by xg15 a day ago

I was referring to this part, in "Shouldn’t the Nudge Go All the Way to Extended Grapheme Clusters?":

"For example, the Unicode version dependency of extended grapheme clusters means that you should never persist indices into Swift strings and load them back in a future execution of your app, because an intervening Unicode data update may change the meaning of the persisted indices! The Swift string documentation does not warn against this.

You might think that this kind of thing is a theoretical issue that will never bite anyone, but even experts in data persistence, the developers of PostgreSQL, managed to make backup restorability dependent on collation order, which may change with glibc updates."

You're right, it doesn't say "codepoints" as an alternative solution. That was just my assumption, as it would be the closest representation that does not depend on the Unicode character database.

But you could also use code units, bytes, whatever. The problem will be the same if you have to reconstruct the grapheme clusters eventually.

> Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?

Because splitting a grapheme cluster in half can change its semantics. You don't want that if you e.g. have an index for fulltext search.
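To make the splitting concern concrete, here is a minimal sketch in Python, whose `str` indexes by code point. The example string and the cut position are my own choices for illustration, not anything from the article:

```python
import unicodedata

# "é" written as a base letter plus a combining accent:
# one grapheme cluster, but two code points.
s = "cafe\u0301"

# Python's str indexes by code point, so this cut lands inside the cluster.
head, tail = s[:4], s[4:]

print(head)                       # the word has lost its accent
print(unicodedata.name(tail[0]))  # the accent is now a stray combining mark
```

Both halves are still valid Unicode strings, but neither means what the original did, which is the kind of damage I'd want a stored index to avoid.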

chrismorgan a day ago

> it doesn't say "codepoints" as an alternative solution. That was just my assumption …

On the contrary, the article calls code point indexing “rather useless” in the subtitle. Code unit indexing is the appropriate technique. (“Byte indexing” generally implies the use of UTF-8 and in that context is more meaningfully called code unit indexing. But I just bet there are systems out there that use UTF-16 or UTF-32 and yet use byte indexing.)
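A rough Python sketch of how the units diverge for a single user-perceived character. The helper `code_units` is a hypothetical name I made up for this illustration, and the sample string is a thumbs-up emoji followed by a skin tone modifier:

```python
def code_units(text: str, encoding: str, unit_size: int) -> int:
    """Count code units by encoding and dividing by the unit width in bytes."""
    return len(text.encode(encoding)) // unit_size

# THUMBS UP SIGN + EMOJI MODIFIER FITZPATRICK TYPE-4: displayed as one emoji.
s = "\U0001F44D\U0001F3FD"

utf8_units = code_units(s, "utf-8", 1)       # UTF-8 code units are bytes
utf16_units = code_units(s, "utf-16-le", 2)  # "-le" avoids a BOM in the count
utf32_units = code_units(s, "utf-32-le", 4)
code_points = len(s)                         # Python str length is in code points
utf16_bytes = len(s.encode("utf-16-le"))     # byte length of the UTF-16 form

print(utf8_units, utf16_units, utf32_units, code_points, utf16_bytes)
# prints: 8 4 2 2 8
```

None of these numbers depends on the Unicode data version, only on the encoding. The last two values show the trap mentioned above: in UTF-16 a byte index and a code unit index are different things.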

> The problem will be the same if you have to reconstruct the grapheme clusters eventually.

In practice, you basically never do. Only your GUI framework ever does, for rendering the text and for handling selection and editing. Because that’s pretty much the only place EGCs are ever actually relevant.

> You don't want that if you e.g. have an index for fulltext search.

Your text search won’t be splitting by grapheme clusters, it’ll be doing word segmentation instead.
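As a toy illustration only: the sketch below uses the regex `\w+` as a crude stand-in for real word segmentation (a real engine would follow something like the UAX #29 word boundary rules and language-specific tokenizers), and stores UTF-8 code unit offsets. The function name `build_index` and the sample text are assumptions of mine:

```python
import re
from collections import defaultdict

def build_index(text: str) -> dict[str, list[int]]:
    """Map each case-folded word to the UTF-8 byte offsets where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text):
        # Convert the code point offset to a UTF-8 byte offset.
        # Re-encoding the prefix is quadratic; fine for a sketch.
        byte_offset = len(text[:match.start()].encode("utf-8"))
        index[match.group().casefold()].append(byte_offset)
    return dict(index)

doc = "Café au lait, café noir"  # precomposed é (U+00E9)
index = build_index(doc)

print(index["café"])  # prints: [0, 15]
```

The stored offsets point at word starts, so they never land inside a grapheme cluster to begin with, and they stay valid regardless of which Unicode version later segments the text for display.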
