thaumasiotes 17 hours ago

> If a language ships a toUpper or toLower function without a mandatory language field, it is badly designed too. The only slightly better option is making toUpper and toLower ASCII-only and throwing an error for any other character set.

There is a deeper bug within Unicode.

The Turkish letter TURKISH CAPITAL LETTER DOTLESS I is represented as the code point U+0049, which is named LATIN CAPITAL LETTER I.

The Greek letter GREEK CAPITAL LETTER IOTA is represented as the code point U+0399, named... GREEK CAPITAL LETTER IOTA.

The relationship between the Greek letter I and the Roman letter I is identical in every way to the relationship between the Turkish letter dotless I and the Roman letter I. (Heck, the lowercase form is also dotless.) But lowercasing works on GREEK CAPITAL LETTER IOTA because it has a code point to call its own.

Should iota have its own code point? The answer to that question is "no": it is, by definition, drawn identically to the ASCII I. But Unicode has never followed its principles. This crops up again and again and again, everywhere you look. (And, in "defense" of Unicode, it has several principles that directly contradict each other.)

Then people come to rely on behavior that only applies to certain buggy parts of Unicode, and get messed up by parts that don't share those particular bugs.
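
A minimal Java sketch of the asymmetry described above (the class name CaseDemo and the sample strings are illustrative; it assumes a standard JDK with its usual locale data):

    import java.util.Locale;

    public class CaseDemo {
        public static void main(String[] args) {
            // GREEK CAPITAL LETTER IOTA (U+0399) lowercases correctly with no language
            // information, because it has its own code point.
            System.out.println("\u0399".toLowerCase(Locale.ROOT)); // ι (U+03B9)

            // LATIN CAPITAL LETTER I (U+0049) lowercases to "i" by default...
            System.out.println("TITLE".toLowerCase(Locale.ROOT)); // title

            // ...and only yields the Turkish dotless ı (U+0131) if the caller already
            // knows the text is Turkish and passes the locale explicitly.
            System.out.println("TITLE".toLowerCase(Locale.forLanguageTag("tr"))); // tıtle
        }
    }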

layer8 16 hours ago

It’s not a bug, it’s a feature. The reason is that ISO 8859-7 [0], used for Greek, has a separate character code for Iota (for all Greek letters, really), while ISO 8859-3 [1] and -9 [2], used for Turkish, do not have one for the ordinary dotless uppercase I.

One important goal of Unicode is to be able to convert from existing character sets to Unicode (and back) without having to know the language of the text that is being converted. If they had invented a separate code point for I in Turkish, then when converting text from those existing ISO character encodings, you’d have to know whether the text is Turkish or English or something else, to know which Unicode code point to map the source “I” into. That’s exactly what Unicode was designed to avoid.

[0] https://en.wikipedia.org/wiki/ISO/IEC_8859-7

[1] https://en.wikipedia.org/wiki/ISO/IEC_8859-3

[2] https://en.wikipedia.org/wiki/ISO/IEC_8859-9
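
A rough sketch of this round-trip argument (the class name RoundTripDemo is illustrative; it assumes a JDK whose charset provider ships the legacy ISO 8859 encodings):

    import java.nio.charset.Charset;

    public class RoundTripDemo {
        public static void main(String[] args) {
            // The byte 0x49 decodes to U+0049 under both the Turkish (8859-9) and the
            // Western (8859-1) encodings, so no knowledge of the text's language is
            // needed to pick the target code point.
            byte[] i = { 0x49 };
            System.out.println(new String(i, Charset.forName("ISO-8859-9"))); // I (U+0049)
            System.out.println(new String(i, Charset.forName("ISO-8859-1"))); // I (U+0049)

            // Greek Iota has its own byte (0xC9) in ISO 8859-7, so it also got its own
            // Unicode code point.
            byte[] iota = { (byte) 0xC9 };
            System.out.println(new String(iota, Charset.forName("ISO-8859-7"))); // Ι (U+0399)
        }
    }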

  • thaumasiotes 14 hours ago

    I know that. That's why I mentioned

    > in "defense" of Unicode, it has several principles that directly contradict each other

    Unicode wants to do several things, and they aren't mutually compatible. It is premised on the idea that you can be all things to all people.

    > It’s not a bug, it’s a feature.

    It is a bug. It directly violates Unicode's stated principles. It's also a feature, but that won't make it not a bug.

  • newpavlov 4 hours ago

    > If they had invented a separate code point for I in Turkish, then when converting text from those existing ISO character encodings, you’d have to know whether the text is Turkish or English or something else, to know which Unicode code point to map the source “I” into. That’s exactly what Unicode was designed to avoid.

    Great. So now we have to know the locale to handle case conversion for probably centuries to come, but it was totally worth it to save a bit of effort during the relatively short transition phase. /s

    • JuniperMesos 14 minutes ago

      You always have to know the locale to handle case conversion; it is not actually defined the same way in different human languages, and it is a mistake to pretend it is.

      • newpavlov 8 minutes ago

        In most cases the locale is encoded in the character itself, e.g. Latin "a" and Cyrillic "а" are two different characters, despite being visually indistinguishable in most fonts.

        The "language-sensitive" section of the special casing document [0] is extremely small and contains only the cases of stupid reuse of Latin I.

        [0]: https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing....
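
        A tiny Java sketch of that point (the class name CodePointDemo is illustrative; it assumes a standard JDK):

            import java.util.Locale;

            public class CodePointDemo {
                public static void main(String[] args) {
                    // Latin "A" (U+0041) and Cyrillic "А" (U+0410) look the same but are
                    // distinct code points, so each lowercases correctly without a locale.
                    System.out.println("A".toLowerCase(Locale.ROOT));      // a (U+0061)
                    System.out.println("\u0410".toLowerCase(Locale.ROOT)); // а (U+0430)

                    // The Latin capital I is the case where that information is missing,
                    // so the locale has to be supplied from outside.
                    System.out.println("I".toLowerCase(Locale.forLanguageTag("tr"))); // ı (U+0131)
                }
            }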

    • fhars 4 hours ago

      Without it, there would not have been a transition phase.

      • newpavlov 3 hours ago

        I call BS. Without a series of MAJOR blunders, Unicode was destined to succeed. Once the rest of the world had migrated to Unicode, I am more than certain that the Turks would've migrated as well. Yes, they may have complained for several years and would've spent a minuscule amount of resources to adopt the conversion software, but that's it; a decade or two later everyone would've forgotten about it.

        I believe that even the addition of emoji was completely unnecessary, despite the pressure from Japanese telecoms. Today's landscape of messengers only confirms that.