Comment by Rendello

Comment by Rendello 10 months ago

The other day I posted similar tables/scripts for related character properties and there was some good discussion: https://news.ycombinator.com/item?id=42014045

- Unicode codepoints that expand or contract when case is changed in UTF-8: https://gist.github.com/rendello/d37552507a389656e248f3255a6...

- Unicode roundtrip-unsafe characters: https://gist.github.com/rendello/4d8266b7c52bf0e98eab2073b38...
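A rough sketch of how the first list could be regenerated, assuming Python's built-in `str.lower()`/`str.upper()` (their results depend on the Unicode version bundled with the interpreter). `utf8_delta` and `find_resizing` are illustrative names, not taken from the gists:

```python
import sys

def utf8_delta(ch: str) -> tuple[int, int]:
    """Return (lower_delta, upper_delta): change in UTF-8 byte length after each case change."""
    size = len(ch.encode("utf-8"))
    return (len(ch.lower().encode("utf-8")) - size,
            len(ch.upper().encode("utf-8")) - size)

def find_resizing(limit: int = sys.maxunicode + 1) -> dict[str, tuple[int, int]]:
    """Scan code points below `limit` and keep those whose byte length changes."""
    found = {}
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:  # surrogates cannot be encoded as UTF-8
            continue
        delta = utf8_delta(chr(cp))
        if delta != (0, 0):
            found[chr(cp)] = delta
    return found

print(utf8_delta("\u212A"))  # KELVIN SIGN: 3 bytes, lowercases to 1-byte "k"
print(utf8_delta("\u0130"))  # I WITH DOT ABOVE: 2 bytes, lowercases to "i" + U+0307
```

Anything that preallocates an output buffer the same size as the input before changing case is the kind of code this is meant to catch.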

For example, if we do uppercase→lower→upper, some characters don't survive the roundtrip:

Ω ω Ω

İ i̇ İ

K k K

Å å Å

ẞ ß SS

ϴ θ Θ
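The rows above can be reproduced with a short check. This is a minimal sketch assuming Python's default (locale-independent) case mappings; `roundtrip` and `codepoints` are hypothetical helper names:

```python
def roundtrip(ch: str) -> str:
    """Lowercase, then uppercase again, using Python's built-in mappings."""
    return ch.lower().upper()

def codepoints(s: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# The six starting characters from the list above, written as escapes
# because several of them look identical to ordinary letters.
SAMPLES = [
    "\u2126",  # OHM SIGN
    "\u0130",  # LATIN CAPITAL LETTER I WITH DOT ABOVE
    "\u212A",  # KELVIN SIGN
    "\u212B",  # ANGSTROM SIGN
    "\u1E9E",  # LATIN CAPITAL LETTER SHARP S
    "\u03F4",  # GREEK CAPITAL THETA SYMBOL
]

for ch in SAMPLES:
    back = roundtrip(ch)
    status = "ok" if back == ch else "changed"
    print(f"{codepoints(ch)} -> {codepoints(back)} ({status})")
```

Printing code points rather than glyphs matters here, since Ω (U+2126) and Ω (U+03A9) render the same in most fonts.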

I'm using the scripts to build out a little automated-testing generator library, something like "Tricky Unicode/UTF-8 case-change characters". Any other weird case quirks anyone can think of to put in the generators?

int_19h 10 months ago

Note that the semantic meaning is preserved in the second case: whether you use the precomposed capital I with dot above, or a plain capital I followed by a combining dot above, it's supposed to be the same thing.

The others are much worse in this regard, since they actually lose meaningful information.

zokier 10 months ago

Seems like a lot of these would be taken care of by normalization, though? Precomposed characters are a bit of a mess.

I do feel it is an error that unit/math symbols get changed; imho they should stay as-is through case conversions.
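One way to check the normalization point is to compare the original and the roundtripped string after normalizing both. A minimal sketch, assuming Python's `unicodedata.normalize`; `survives` is a hypothetical helper name:

```python
import unicodedata

def survives(ch: str, form: str) -> bool:
    """True if ch and its lower->upper roundtrip are equal after normalizing both."""
    back = ch.lower().upper()
    return unicodedata.normalize(form, ch) == unicodedata.normalize(form, back)

SAMPLES = {
    "\u2126": "OHM SIGN",
    "\u0130": "LATIN CAPITAL LETTER I WITH DOT ABOVE",
    "\u212A": "KELVIN SIGN",
    "\u212B": "ANGSTROM SIGN",
    "\u1E9E": "LATIN CAPITAL LETTER SHARP S",
    "\u03F4": "GREEK CAPITAL THETA SYMBOL",
}

for ch, name in SAMPLES.items():
    print(f"{name}: NFC={survives(ch, 'NFC')} NFKC={survives(ch, 'NFKC')}")
```

With these six samples, NFC reconciles the Ohm, Kelvin, and Angstrom signs and the dotted I; the theta symbol needs NFKC; and ẞ → SS is not reconciled by either form.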

  • Rendello 10 months ago

    These lists (and the future library) were made to test normalization and break software that made bad assumptions. I initially generated the list because I knew that some of the assumptions in the parser I was writing were not solid, and sure enough the tests broke it.

    Someone pointed out the canonical source, which I'll have to look at more closely:

    https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt
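For reference, that file is a list of `code; status; mapping; # name` lines, where status C is common, F is full, S is simple, and T is the Turkic special case. A minimal parsing sketch, assuming that layout; the sample lines are written out in the file's format and should be checked against the real file, and `parse_case_folding` is a hypothetical name:

```python
SAMPLE = """\
# A few entries in CaseFolding.txt format (verify against the real file)
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
2126; C; 03C9; # OHM SIGN
212A; C; 006B; # KELVIN SIGN
"""

def parse_case_folding(text: str, statuses: tuple[str, ...] = ("C", "F")) -> dict[str, str]:
    """Build a {character: folded string} table from the selected status classes."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, status, mapping = [field.strip() for field in line.split(";")[:3]]
        if status in statuses:
            folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
            table[chr(int(code, 16))] = folded
    return table

table = parse_case_folding(SAMPLE)
for ch, folded in table.items():
    # str.casefold() is Python's built-in full case folding; compare against it.
    print(f"U+{ord(ch):04X} -> {folded!r} (matches casefold: {ch.casefold() == folded})")
```

Selecting C + F gives full case folding (strings may grow), while C + S gives simple, length-preserving folding in code points; the T lines only apply to Turkic-language handling.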