Comment by alexvitkov 8 days ago

So taking the first character of a word and uppercasing it is wrong because you'd get "dzen" -> "DZen".
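A minimal Python sketch of that failure mode, using the legacy dž digraph codepoints (U+01C4 Ǆ, U+01C5 ǅ, U+01C6 ǆ):

```python
# U+01C6 (ǆ) is a single legacy codepoint for the "dž" digraph.
word = "\u01C6en"  # "ǆen"

# Naive approach: uppercase the first codepoint.
# ǆ uppercases to Ǆ (U+01C4), which capitalizes *both* letters.
naive = word[0].upper() + word[1:]
print(naive)  # Ǆen

# Unicode defines a separate titlecase mapping, ǅ (U+01C5),
# which str.title() applies.
print(word.title())  # ǅen
```

(Since Python 3.8, str.capitalize() also uses the titlecase mapping for the first character.)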

I really wish the Unicode consortium would learn to say "No". If you added a three-letter letter to your alphabet, you can probably make do with three letters in your text files.

There's so many characters with little to no utility and weird properties that seem to exist just to trip up programs attempting to commit the unforgivable sin of basic text manipulation.

ccppurcell 8 days ago

This is just your monoculture speaking. Transliterations between alphabets are actually mentioned in the article - did you read it? Nobody added anything to their alphabet; alphabets are invented and then grow and shrink organically.

  • alexvitkov 8 days ago

    Bringing up "monoculture" here is hilarious, as this whole situation is a direct consequence of a people attempting to enforce just that by replacing their native Cyrillic alphabet with the Latin one.

    My native language also happens to use a Cyrillic alphabet and has letters that would translate to multiple ones in the Latin alphabet:

      ш -> sh
      щ -> sht
      я -> ya
    
    Somehow we manage to get by without special sh, sht, and ya Unicode characters, weird.
    • int_19h 7 days ago

      The native alphabet for most Southern Slavs would be Glagolitic - indeed, Croatians still occasionally used it in religious contexts as late as the 19th century. The Cyrillic alphabet is more or less Glagolitic with its new and distinct letter shapes replaced by Greek ones, so it is in and of itself a product of the same process that you are complaining about; it just happened a few centuries earlier than the transition to Latin, so you're accustomed to its outcome being the norm.

      I should also note that it's not like Cyrillic doesn't have its share of digraphs - that's what combinations like нь effectively are, since they signify a single phoneme. And, conversely, it's pretty obvious that you can have a Latin-based orthography with no digraphs at all, just diacritics.

      This whole situation has to do with legacy encodings and not much else.

      • alexvitkov 7 days ago

        > The native alphabet for most Southern Slavs would be Glagolitic

        That's a bit of an exaggeration - the Glagolitic script was only ever used by scholars, and the earliest Cyrillic writings are not even 50 years younger than the Glagolitic ones.

        You're right that the Cyrillic alphabet is indeed much closer to the Greek alphabet than to the Glagolitic, despite being named after Cyril. I'm not complaining about the "forsaking of culture"; I just found it interesting that I was being "mono-cultural" for disagreeing with the existence of a few weird Unicode code-points that are themselves a direct result of someone's attempt to move towards a "mono-culture".

        What I'm complaining about, if anything, is overly complex standards. This is just one of probably 100 different quirks you need to be aware of when working with Unicode text, and this one could've been easily avoided by just not including a few useless characters.

        • int_19h 7 days ago

          Unicode is supposed to be able to represent basically everything humans ever wrote, that's why we have things like https://en.wikipedia.org/wiki/Phaistos_Disc_(Unicode_block) in there, and why it's inevitably so complex. These aren't even particularly weird codepoints when you look at some other scripts like Arabic or traditional Mongolian.

          Correctly supporting the entirety of Unicode in this sense has been out of reach for your average app for a very long time now, IMO, so it's fine to just do the best you can (i.e., usually, defer as much as you can to libraries) for the audience you actually have or anticipate for convoluted stuff like this. I don't think that correctly handling casing for legacy digraph codepoints is something many people need in practice - not even speakers of the languages those Unicode digraphs came from.

          It's still a massive improvement for interop because at least you can be sure that any two apps that need the symbol will use the same encoding for it and will be able to exchange that data, even if nobody truly supports the whole thing.

    • notpushkin 7 days ago

      This exactly. Digraphs should just be deprecated and normalized to two code points.
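For what it's worth, the standard already defines that mapping: the digraph codepoints carry compatibility decompositions, so NFKC normalization splits them into their constituent letters. A quick check in Python:

```python
import unicodedata

# U+01C6 (ǆ) decomposes to d + ž under compatibility normalization.
split = unicodedata.normalize("NFKC", "\u01C6")
print(split)       # dž
print(len(split))  # 2  (two codepoints: U+0064, U+017E)
```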

  • f1shy 8 days ago

    There are other ways around this without making the standard impossible to get right. Great, we have a standard that can cope with any alphabet... oh, pity that it's impossible to write programs that use it correctly.

    • ks2048 8 days ago

      It's tricky, but that's why nearly all of the time you should use standard libraries. E.g., in Python, ".upper()" and ".capitalize()" do the work for you.

      • [removed] 7 days ago
        [deleted]
      • f1shy 7 days ago

        Does it have titleize() ?

int_19h 7 days ago

In practice, languages that use digraphs and trigraphs generally don't use distinct Unicode codepoints for them (and Unicode specifically marks those codepoints as legacy, so this is an officially blessed practice). They exist because one of the explicit goals of Unicode as originally designed was to be able to losslessly round-trip the many existing national encodings. So digraphs that were already in a national encoding for whatever reason ended up in Unicode as legacy codepoints, while those that were not, did not.
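That legacy status is visible in the character data itself - the digraph codepoints carry a "<compat>" decomposition tag, queryable via Python's unicodedata module:

```python
import unicodedata

dz = "\u01C6"  # ǆ
print(unicodedata.name(dz))  # LATIN SMALL LETTER DZ WITH CARON

# The "<compat>" tag marks this as a compatibility (legacy) codepoint
# that decomposes to 0064 (d) + 017E (ž).
print(unicodedata.decomposition(dz))  # <compat> 0064 017E
```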

zokier 8 days ago

While I do have some reservations about Unicode, I think it's important to note that nobody forces you to deal with all of it. I think programmers should embrace the idea of picking subsets of Unicode that they know how to handle correctly, instead of trying (and failing) to handle everything. DIN 91379 is one good example: https://en.wikipedia.org/wiki/DIN_91379

Incidentally, I believe this is kinda also the approach HN takes; there is at least some Unicode filtering going on here.

ks2048 8 days ago

I agree in some cases, but note that lots of the ugly and weird things in Unicode are there for backwards compatibility with older encodings.

AlotOfReading 8 days ago

The purpose of Unicode is to encode written text. There's an inherent level of complexity that comes with that, like the fact that not all languages obey the same rules as English. If you don't want to deal with text from other systems, don't accept anything except ASCII/the basic Latin block and be upfront about it.