Comment by NoahZuniga 2 days ago
I guess it's kind of annoying that letters with diacritics can be represented in multiple different ways.
String comparison is a difficult problem. Consider:
Å (ANGSTROM SIGN)
Å (LATIN CAPITAL LETTER A WITH RING ABOVE)
Å (LATIN CAPITAL LETTER A) + (◌̊ COMBINING RING ABOVE)
А̊ (CYRILLIC CAPITAL LETTER A) + (◌̊ COMBINING RING ABOVE)
Of these, the Angstrom Sign is discouraged in favor of the second: it canonically decomposes to it, so it won't show up in any normalized form. The second is the NFC (composed) form, and the third is the NFD (decomposed) form. The Cyrillic one looks the same but is a different abstract character, so no normalization form maps it to the others.
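For anyone who wants to poke at this, here is a rough sketch using Python's standard `unicodedata` module. The variable names and the `nfc`/`nfd` helpers are mine, just for illustration:

```python
import unicodedata

angstrom_sign = "\u212B"        # ANGSTROM SIGN
precomposed   = "\u00C5"        # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed    = "A\u030A"       # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
cyrillic      = "\u0410\u030A"  # CYRILLIC CAPITAL LETTER A + COMBINING RING ABOVE

def nfc(s: str) -> str:
    """Canonical composition (illustrative helper)."""
    return unicodedata.normalize("NFC", s)

def nfd(s: str) -> str:
    """Canonical decomposition (illustrative helper)."""
    return unicodedata.normalize("NFD", s)

# The three Latin variants collapse to one string under either form.
assert nfc(angstrom_sign) == nfc(decomposed) == precomposed
assert nfd(angstrom_sign) == nfd(precomposed) == decomposed

# The Cyrillic look-alike is left alone and never matches the Latin one.
assert nfc(cyrillic) == cyrillic
assert nfc(cyrillic) != precomposed
```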
Normal forms also put the diacritics into a canonical order if there are multiple. The strings could be compared through their normalized encoded forms (like UTF-8), which I think is what you meant, or through their normalized code points directly. I agree it can be messy, but I'm curious what you meant by dubious. Do you think there's a better way?
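A minimal sketch of the reordering and of both comparison styles, again with Python's `unicodedata`. The function names are made up for this example, and I picked "q" because, as far as I know, it has no precomposed dotted forms to complicate things:

```python
import unicodedata

# Same letter, same two marks, typed in different orders.
dot_above_first = "q\u0307\u0323"  # q + COMBINING DOT ABOVE + COMBINING DOT BELOW
dot_below_first = "q\u0323\u0307"  # q + COMBINING DOT BELOW + COMBINING DOT ABOVE

def canonically_equal(a: str, b: str) -> bool:
    """Compare normalized code points directly (hypothetical helper)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def normalized_utf8(s: str) -> bytes:
    """Compare through a normalized encoded form instead (hypothetical helper)."""
    return unicodedata.normalize("NFC", s).encode("utf-8")

# The raw strings differ, but normalization sorts the marks by combining class
# (dot below is 220, dot above is 230), so both orders end up identical.
assert dot_above_first != dot_below_first
assert unicodedata.combining("\u0323") < unicodedata.combining("\u0307")
assert canonically_equal(dot_above_first, dot_below_first)
assert normalized_utf8(dot_above_first) == normalized_utf8(dot_below_first)
```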
That's true, and even with normalization, there are four normalization forms for strings. The K forms (NFKC and NFKD) are mostly for searching, but that still leaves NFC and NFD.
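To make the difference concrete, here is a small sketch of all four forms in Python. The sample characters are my own picks, not anything from the thread:

```python
import unicodedata

ligature = "\uFB01"     # LATIN SMALL LIGATURE FI
superscript = "\u00B2"  # SUPERSCRIPT TWO
a_ring = "\u00C5"       # LATIN CAPITAL LETTER A WITH RING ABOVE

# Canonical forms (NFC/NFD) keep compatibility characters as they are.
assert unicodedata.normalize("NFC", ligature) == ligature
assert unicodedata.normalize("NFD", superscript) == superscript

# Compatibility forms (NFKC/NFKD) fold them into plain equivalents,
# which is lossy but handy for search and matching.
assert unicodedata.normalize("NFKC", ligature) == "fi"
assert unicodedata.normalize("NFKD", superscript) == "2"

# Composed vs. decomposed still differs within each pair of forms.
assert unicodedata.normalize("NFKC", a_ring) == "\u00C5"
assert unicodedata.normalize("NFKD", a_ring) == "A\u030A"
```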
The normalization forms are explained, in order of approachability (imo), in this random YouTube video, Unicode Standard Annex #15, and the Unicode Core Spec:
https://www.youtube.com/watch?v=ttLD4DiMpiQ
https://unicode.org/reports/tr15/
https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...