Comment by alexvitkov 8 days ago

So taking the first character of a word and uppercasing it is wrong because you'd get "dzen" -> "DZen".
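A minimal Python sketch of that failure mode, using the legacy dž digraph codepoints (U+01C4 Ǆ, U+01C5 ǅ, U+01C6 ǆ):

```python
# U+01C6 (ǆ) is a single legacy codepoint for the "dž" digraph.
word = "\u01C6en"  # "ǆen"

# Naive approach: uppercase the first codepoint.
# ǆ uppercases to Ǆ (U+01C4), which capitalizes *both* letters.
naive = word[0].upper() + word[1:]
print(naive)  # Ǆen

# Unicode defines a separate titlecase mapping, ǅ (U+01C5),
# which str.title() applies.
print(word.title())  # ǅen
```

(Since Python 3.8, str.capitalize() also uses the titlecase mapping for the first character.)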

I really wish the Unicode consortium would learn to say "No". If you added a three-letter letter to your alphabet, you can probably make do with three letters in your text files.

There's so many characters with little to no utility and weird properties that seem to exist just to trip up programs attempting to commit the unforgivable sin of basic text manipulation.

ccppurcell 8 days ago

This is just your monoculture speaking. Transliterations between alphabets are actually mentioned in the article - did you read it? Nobody added anything to their alphabet; alphabets are invented and then grow and shrink organically.

  • alexvitkov 8 days ago

    Bringing up "monoculture" here is hilarious, as this whole situation is a direct consequence of a people attempting to enforce just that by replacing their native Cyrillic alphabet with the Latin one.

    My native language also happens to use a Cyrillic alphabet and has letters that would translate to multiple ones in the Latin alphabet:

      ш -> sh
      щ -> sht
      я -> ya
    
    Somehow we manage to get by without special sh, sht, and ya Unicode characters, weird.
    • int_19h 7 days ago

      The native alphabet for most Southern Slavs would be Glagolitic - indeed, Croatians still occasionally used it in religious contexts as late as the 19th century. The Cyrillic alphabet is more or less Glagolitic with its new and distinct letter shapes replaced by Greek ones, so it is in and of itself a product of the same process that you are complaining about; it just happened a few centuries earlier than the transition to Latin, so you're accustomed to its outcome being the norm.

      I should also note that it's not like Cyrillic doesn't have its share of digraphs - that's what combinations like нь effectively are, since they signify a single phoneme. And, conversely, it's pretty obvious that you can have a Latin-based orthography with no digraphs at all, just diacritics.

      This whole situation has to do with legacy encodings and not much else.

      • alexvitkov 7 days ago

        > The native alphabet for most Southern Slavs would be Glagolitic

        That's a bit of an exaggeration - the Glagolitic script was only ever used by scholars, and the earliest Cyrillic writings are not even 50 years younger than the Glagolitic ones.

        You're right that the Cyrillic alphabet is indeed much closer to the Greek alphabet than to the Glagolitic, despite being named after Cyril. I'm not complaining about the "forsaking of culture"; I just found it interesting that I was being "mono-cultural" for disagreeing with the existence of a few weird Unicode code-points that are themselves a direct result of someone's attempt to move towards a "mono-culture".

        What I'm complaining about, if anything, is overly complex standards. This is just one of probably 100 different quirks you need to be aware of when working with Unicode text, and this one could've been easily avoided by just not including a few useless characters.

        • int_19h 7 days ago

          Unicode is supposed to be able to represent basically everything humans ever wrote, that's why we have things like https://en.wikipedia.org/wiki/Phaistos_Disc_(Unicode_block) in there, and why it's inevitably so complex. These aren't even particularly weird codepoints when you look at some other scripts like Arabic or traditional Mongolian.

          Correctly supporting the entirety of Unicode in this sense has been out of reach for your average app for a very long time now, IMO, so it's fine to just do the best you can (i.e., usually, defer as much as you can to libraries) for the audience you actually have or anticipate for convoluted stuff like this. I don't think that correctly handling casing for legacy digraph codepoints is something many people need in practice - not even speakers of the languages those Unicode digraphs came from.

          It's still a massive improvement for interop because at least you can be sure that any two apps that need the symbol will use the same encoding for it and will be able to exchange that data, even if nobody truly supports the whole thing.

    • notpushkin 7 days ago

      This exactly. Digraphs should just be deprecated and normalized to two code points.
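For what it's worth, the standard already defines that mapping: the digraph codepoints carry compatibility decompositions, so NFKC normalization splits them into their constituent letters. A quick check in Python:

```python
import unicodedata

# U+01C6 (ǆ) decomposes to d + ž under compatibility normalization.
split = unicodedata.normalize("NFKC", "\u01C6")
print(split)       # dž
print(len(split))  # 2  (two codepoints: U+0064, U+017E)
```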

  • f1shy 8 days ago

    There are other ways around this without making the standard impossible to get right. Great, we have a standard that can cope with any alphabet... oh, pity that it's impossible to write programs that use it correctly.

    • ks2048 8 days ago

      It's tricky, but that's why nearly all of the time you should use standard libraries. E.g., in Python, ".upper()" and ".capitalize()" do the work for you.

      • [removed] 7 days ago
        [deleted]
      • f1shy 7 days ago

        Does it have titleize() ?

int_19h 7 days ago

In practice, languages that use digraphs and trigraphs generally don't use distinct Unicode codepoints for them (and Unicode specifically marks those codepoints as legacy, so this is an officially blessed practice). They exist because one of the explicit goals of Unicode as originally designed was to be able to losslessly round-trip the many existing national encodings. So digraphs that were already in a national encoding for whatever reason ended up in Unicode as legacy codepoints, while those that were not, did not.
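That legacy status is visible in the character data itself - the digraph codepoints carry a "<compat>" decomposition tag, queryable via Python's unicodedata module:

```python
import unicodedata

dz = "\u01C6"  # ǆ
print(unicodedata.name(dz))  # LATIN SMALL LETTER DZ WITH CARON

# The "<compat>" tag marks this as a compatibility (legacy) codepoint
# that decomposes to 0064 (d) + 017E (ž).
print(unicodedata.decomposition(dz))  # <compat> 0064 017E
```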

zokier 8 days ago

While I do have some reservations about Unicode, I think it's important to note that nobody forces you to deal with all of it. I think programmers should embrace the idea of picking subsets of Unicode that they know how to handle correctly, instead of trying (and failing) to handle everything. DIN 91379 is one good example: https://en.wikipedia.org/wiki/DIN_91379

Incidentally, I believe this is kinda also the approach HN takes; there is at least some Unicode filtering going on here.

ks2048 8 days ago

I agree in some cases, but note that lots of the ugly and weird things in Unicode are there for backwards compatibility with older encodings.

AlotOfReading 8 days ago

The purpose of Unicode is to encode written text. There's an inherent level of complexity that comes with that, like the fact that not all languages obey the same rules as English. If you don't want to deal with text from other systems, don't accept anything except ASCII/the basic Latin block and be upfront about it.