Comment by Rendello

Comment by Rendello 8 days ago

3 replies

The other day I posted similar tables/scripts for related character properties and there was some good discussion: https://news.ycombinator.com/item?id=42014045

- Unicode codepoints that expand or contract when case is changed in UTF-8: https://gist.github.com/rendello/d37552507a389656e248f3255a6...

- Unicode roundtrip-unsafe characters: https://gist.github.com/rendello/4d8266b7c52bf0e98eab2073b38...

For example, if we do uppercase→lower→upper, some characters don't survive the roundtrip:

Ω ω Ω

İ i̇ İ

K k K

Å å Å

ẞ ß SS

ϴ θ Θ

I'm using the scripts to build out a little automated-testing generator library, something like "Tricky Unicode/UTF-8 case-change characters". Any other weird case quirks anyone can think of to put in the generators?

int_19h 7 days ago

Note that semantic meaning for the second case is preserved - whether you use a precomposed symbol for capital I with overdot, or a combining character for the latter, it's supposed to be the same thing.

The others are much worse in this regard, since they actually lose meaningful information.

zokier 8 days ago

Seems like lot of these would be taken care by normalization though? Pre-composed characters are bit of a mess.

I do feel it is a error that unit/math symbols get changed, imho they should stay as-is through case conversions.

  • Rendello 8 days ago

    These lists (and the future library) were made to test normalization and break software that made bad assumptions. I initially generated the list because I knew that some of the assumptions the parser I was writing were not solid, and sure enough the tests broke it.

    Someone pointed out the canonical source, which I'll have to look at more closely:

    https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt