Comment by bjourne

String comparison is a difficult problem. Consider:

Å (ANGSTROM SIGN)

Å (LATIN CAPITAL LETTER A WITH RING ABOVE)

Å (LATIN CAPITAL LETTER A) + (◌̊ COMBINING RING ABOVE)

А̊ (CYRILLIC CAPITAL LETTER A) + (◌̊ COMBINING RING ABOVE)

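Of these, the Angstrom Sign is considered deprecated and won't show up in any normal form: it has a singleton canonical decomposition, so normalization always replaces it, with the second character under NFC and with the third sequence under NFD. The second is the NFC (composed) form, and the third is the NFD (decomposed) form. The Cyrillic one looks the same, but is not the same abstract character, so no normalization form maps it to the Latin ones.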
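
If you want to check this yourself, here is a minimal sketch using Python's standard `unicodedata` module. The dictionary keys and the `code_points` helper are just names I made up for illustration; the strings are written as escapes because the glyphs are indistinguishable on screen.

```python
import unicodedata

# The four look-alike strings from above, written as escapes.
candidates = {
    "angstrom": "\u212B",         # ANGSTROM SIGN
    "precomposed": "\u00C5",      # LATIN CAPITAL LETTER A WITH RING ABOVE
    "decomposed": "A\u030A",      # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
    "cyrillic": "\u0410\u030A",   # CYRILLIC CAPITAL LETTER A + COMBINING RING ABOVE
}

def code_points(s):
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

# Normalize every candidate to both canonical forms.
nfc = {name: unicodedata.normalize("NFC", s) for name, s in candidates.items()}
nfd = {name: unicodedata.normalize("NFD", s) for name, s in candidates.items()}

for name in candidates:
    print(f"{name:12} NFC: {code_points(nfc[name]):14} NFD: {code_points(nfd[name])}")
```

The three Latin variants collapse to one sequence per form, while the Cyrillic one stays separate in both.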

Normal forms also put multiple diacritics into a canonical order (sorted by combining class; marks of the same class keep their relative order). The strings could be compared through their normalized encoded forms (like UTF-8), which I think is what you meant, or through their normalized code points directly. I agree it can be messy, but I'm curious what you meant by dubious. Do you think there's a better way?
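
To make the reordering and the two comparison routes concrete, here is a rough sketch, again with Python's `unicodedata`. The function names `same_text` and `same_text_utf8` are hypothetical helpers of mine, not anything standard, and this only covers canonical equivalence, not collation or confusable detection.

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings by their normalized code points."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

def same_text_utf8(a, b, form="NFC"):
    """Same comparison, done on the UTF-8 bytes of the normalized strings."""
    na = unicodedata.normalize(form, a).encode("utf-8")
    nb = unicodedata.normalize(form, b).encode("utf-8")
    return na == nb

# q with a dot above (U+0307) and a dot below (U+0323), typed in both orders.
above_then_below = "q\u0307\u0323"
below_then_above = "q\u0323\u0307"

print(above_then_below == below_then_above)                # raw comparison
print(same_text(above_then_below, below_then_above))       # normalized code points
print(same_text_utf8(above_then_below, below_then_above))  # normalized UTF-8 bytes

# The Latin and Cyrillic look-alikes stay different after normalization.
print(same_text("A\u030A", "\u0410\u030A"))
```

The raw comparison fails because the marks are stored in different orders. Both normalized comparisons succeed, since normalization sorts the dot below (combining class 220) before the dot above (class 230). Comparing bytes or code points gives the same equality answer, so the choice between them is mostly about what your storage layer already holds.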