Comment by chrismorgan a day ago

> it doesn't say "codepoints" as an alternative solution. That was just my assumption …

On the contrary, the article calls code point indexing “rather useless” in the subtitle. Code unit indexing is the appropriate technique. (“Byte indexing” generally implies the use of UTF-8 and in that context is more meaningfully called code unit indexing. But I just bet there are systems out there that use UTF-16 or UTF-32 and yet use byte indexing.)
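To make the distinction concrete, here is a rough sketch in Rust, where strings are UTF-8 and indexed by code unit (byte). The language choice and the function names `code_unit_offset` and `code_point_offset` are my own, purely for illustration; nothing here comes from the article.

```rust
/// Code unit (byte) offset of the first match of `needle` in `haystack`.
/// This is what `str::find` returns natively for UTF-8 strings.
fn code_unit_offset(haystack: &str, needle: &str) -> Option<usize> {
    haystack.find(needle)
}

/// Code point offset of the same match. There is no direct way to get it:
/// the code points before the match have to be counted, a linear scan.
fn code_point_offset(haystack: &str, needle: &str) -> Option<usize> {
    haystack
        .find(needle)
        .map(|byte_idx| haystack[..byte_idx].chars().count())
}

fn main() {
    // "naïve café", written with escapes so the encoding is unambiguous.
    let text = "na\u{ef}ve caf\u{e9}";
    let needle = "caf\u{e9}";

    let unit = code_unit_offset(text, needle).unwrap();
    let point = code_point_offset(text, needle).unwrap();
    println!("code unit offset: {}, code point offset: {}", unit, point);

    // A code unit offset can be used to slice directly, without rescanning.
    println!("{}", &text[unit..]);
}
```

The two offsets differ because `ï` occupies two UTF-8 code units but is one code point. The code unit offset is the one you can hand straight back to the string for slicing; the code point offset has to be converted back by walking the string again.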

> The problem will be the same if you have to reconstruct the grapheme clusters eventually.

In practice, you basically never do. Only your GUI framework ever does, for rendering the text and for handling selection and editing, because that’s pretty much the only place extended grapheme clusters (EGCs) are ever actually relevant.

> You don't want that if you e.g. have an index for fulltext search.

Your text search won’t be splitting by grapheme clusters; it’ll be doing word segmentation instead.
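As a rough sketch of what that looks like, the following Rust builds a tiny inverted index keyed by word, storing code unit (byte) offsets. The `words` and `build_index` helpers are hypothetical names of mine, and the segmentation rule (a word is a maximal run of alphanumeric characters) is a deliberate simplification. A real search engine would more likely use Unicode word boundary rules (UAX #29) or a language-specific tokenizer.

```rust
use std::collections::BTreeMap;

/// Splits `text` into (byte_offset, word) pairs.
/// Assumption for illustration: a word is a maximal run of alphanumeric
/// characters. This is NOT full Unicode word segmentation.
fn words(text: &str) -> Vec<(usize, &str)> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in text.char_indices() {
        if c.is_alphanumeric() {
            if start.is_none() {
                start = Some(i);
            }
        } else if let Some(s) = start.take() {
            // A non-word character ends the current word.
            out.push((s, &text[s..i]));
        }
    }
    // Flush a word that runs to the end of the text.
    if let Some(s) = start {
        out.push((s, &text[s..]));
    }
    out
}

/// Maps each lowercased word to the byte offsets where it occurs.
fn build_index(text: &str) -> BTreeMap<String, Vec<usize>> {
    let mut index = BTreeMap::new();
    for (offset, word) in words(text) {
        index
            .entry(word.to_lowercase())
            .or_insert_with(Vec::new)
            .push(offset);
    }
    index
}

fn main() {
    let index = build_index("Caf\u{e9} au lait, caf\u{e9} noir");
    for (word, offsets) in &index {
        println!("{}: {:?}", word, offsets);
    }
}
```

Grapheme clusters never come up here: the index stores words and code unit offsets, and a hit can be sliced out of the original text directly from its offset.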