Comment by Rendello
The other day I posted similar tables/scripts for related character properties and there was some good discussion: https://news.ycombinator.com/item?id=42014045
- Unicode codepoints that expand or contract when case is changed in UTF-8: https://gist.github.com/rendello/d37552507a389656e248f3255a6...
- Unicode roundtrip-unsafe characters: https://gist.github.com/rendello/4d8266b7c52bf0e98eab2073b38...
For example, if we do uppercase→lower→upper, some characters don't survive the roundtrip:
Ω ω Ω
İ i̇ İ
K k K
Å å Å
ẞ ß SS
ϴ θ Θ
I'm using the scripts to build out a little automated-testing generator library, something like "Tricky Unicode/UTF-8 case-change characters". Any other weird case quirks anyone can think of to put in the generators?
Note that semantic meaning for the second case is preserved - whether you use a precomposed symbol for capital I with overdot, or a combining character for the latter, it's supposed to be the same thing.
The others are much worse in this regard, since they actually lose meaningful information.