Ask HN: What are your Unicode woes?
5 points by Rendello 2 days ago
I've always worked with text, but I only started digging deep into understanding Unicode this year.
What do HN people have to say about Unicode and UTF-{8,16,32}? Are there parts you've never really understood? Have you had unexpected bugs due to misunderstood properties of text?
I (OP) have been working on some Unicode visualization tooling for a while now. The idea started when I had some buggy string-matching code. I was matching case-insensitively, then using those ranges to highlight the original text.
Turns out, sometimes changing case changes not only the number of bytes (in UTF-8), but the number of encoded characters! This led to my post "UTF-8 characters that behave oddly when the case is changed" [1], which inspired a lot of conversation that taught me a lot. After that, I started reading Unicode documentation in earnest, and building up an idea of what a new tool should show. I'm trying to make clear things I didn't (and sometimes still don't) understand, so I'd love to know what causes pains in the wild / gaps in people's understanding.
1. https://news.ycombinator.com/item?id=42014045