Comment by Rendello 2 days ago

2 replies

That's true, and even with normalization, there are four normalization forms for strings. The K (compatibility) forms, NFKC and NFKD, are mostly for searching, but that still leaves NFC and NFD.
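As a rough illustration of how the four forms differ, here is a small Python sketch using the standard library's `unicodedata` module. The sample string and the `forms` name are just picked for the example.

```python
import unicodedata

# Sample input chosen for illustration:
# U+FB01 LATIN SMALL LIGATURE FI, then U+00E9 (e with acute, precomposed).
s = "\ufb01\u00e9"

# Apply each of the four normalization forms to the same string.
forms = {
    form: unicodedata.normalize(form, s)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, value in forms.items():
    # Print the code points so the differences are visible.
    print(form, [f"U+{ord(c):04X}" for c in value])
```

NFC and NFD keep the ligature and only differ in whether the accent is precomposed. The K forms additionally fold the ligature into plain "f" + "i", which is why they suit searching better than storage.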

The normalization forms are explained, in order of approachability (imo), in this random YouTube video, Unicode Standard Annex #15, and the Unicode Core Spec:

https://www.youtube.com/watch?v=ttLD4DiMpiQ

https://unicode.org/reports/tr15/

https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

bjourne 2 days ago

Comparing strings by byte equality is kinda dubious anyway.

  • Rendello a day ago

    String comparison is a difficult problem. Consider:

    Å (ANGSTROM SIGN)

    Å (LATIN CAPITAL LETTER A WITH RING ABOVE)

    Å (LATIN CAPITAL LETTER A) + (◌̊ COMBINING RING ABOVE)

    А̊ (CYRILLIC CAPITAL LETTER A) + (◌̊ COMBINING RING ABOVE)

    Of these, the Angstrom Sign has a singleton canonical decomposition to the second character, so its use is discouraged and it never appears in any normalization form. The second is the NFC (composed) form, and the third is the NFD (decomposed) form. The Cyrillic one looks the same but is a different abstract character, so no normalization form maps it to the others.
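    A quick way to check this, sketched in Python with the standard `unicodedata` module (the variable and helper names are made up for the example):

```python
import unicodedata

angstrom = "\u212b"             # ANGSTROM SIGN
a_ring = "\u00c5"               # LATIN CAPITAL LETTER A WITH RING ABOVE
a_plus_ring = "A\u030a"         # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
cyr_plus_ring = "\u0410\u030a"  # CYRILLIC CAPITAL LETTER A + COMBINING RING ABOVE


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


# The first three all collapse to the same string under NFC and under NFD.
latin_nfc = {nfc(s) for s in (angstrom, a_ring, a_plus_ring)}
latin_nfd = {nfd(s) for s in (angstrom, a_ring, a_plus_ring)}
print(latin_nfc == {a_ring})       # composed form is U+00C5
print(latin_nfd == {a_plus_ring})  # decomposed form is A + U+030A

# The Cyrillic sequence is left alone and never matches the Latin ones.
print(nfc(cyr_plus_ring) == cyr_plus_ring)
print(nfc(cyr_plus_ring) == a_ring)
```

    Catching the Cyrillic lookalike needs a different tool (confusable detection), since normalization only deals with canonical and compatibility equivalence.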

    Normalization also puts combining marks into a canonical order when there is more than one. The strings could be compared through their normalized encoded forms (like UTF-8), which I think is what you meant, or through their normalized code points directly. I agree it can be messy, but I'm curious what you meant by dubious. Do you think there's a better way?
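    To make the reordering and the two comparison styles concrete, here is a minimal Python sketch. `canonical_equal` is a hypothetical helper, not a standard function, and NFC is an arbitrary choice (NFD would work the same way as long as both sides use it).

```python
import unicodedata

# Same base letter, same two marks, typed in different orders.
a = "q\u0307\u0323"  # q + COMBINING DOT ABOVE + COMBINING DOT BELOW
b = "q\u0323\u0307"  # q + COMBINING DOT BELOW + COMBINING DOT ABOVE


def canonical_equal(x: str, y: str) -> bool:
    """Compare two strings by their NFC-normalized code points."""
    return unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y)


def canonical_equal_bytes(x: str, y: str) -> bool:
    """Same comparison, done on the UTF-8 encoding of the normalized strings."""
    nx = unicodedata.normalize("NFC", x).encode("utf-8")
    ny = unicodedata.normalize("NFC", y).encode("utf-8")
    return nx == ny


# Canonical combining classes decide the order: lower class comes first.
print(unicodedata.combining("\u0323"), unicodedata.combining("\u0307"))

print(a == b)                       # raw code points differ
print(canonical_equal(a, b))        # equal after normalization
print(canonical_equal_bytes(a, b))  # equal as normalized UTF-8 too
```

    Comparing normalized code points and comparing normalized UTF-8 bytes give the same answer, since UTF-8 encodes each code point sequence to exactly one byte sequence.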