Comment by Waterluvian 14 hours ago

I’m frustrated by things like Unicode where it’s “good” except… you need to know which characters to exclude. Unicode feels like a wild jungle of complexity, an understandable consequence of trying to formalize so many ways to write language. But it really sucks to have to reason about some characters being special compared to others.

The only sanity I’ve found is to treat Unicode strings as if they’re some proprietary data unit format. You can accept them, store them, render them, and compare them with each other for (data, not semantic) equality. But you just don’t ever try to reason about their content. Heck I’m not even comfortable trying to concatenate them or anything like that.
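A minimal Python sketch of why that caution is warranted: the two strings below render identically but are not data-equal unless you normalize them first (here with NFC, via the standard unicodedata module).

```python
import unicodedata

# "café" spelled two ways: precomposed U+00E9 vs. "e" + combining acute U+0301
a = "caf\u00e9"
b = "cafe\u0301"

print(a == b)  # False: they look alike but differ code point for code point
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True
```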

csande17 14 hours ago

Unicode really is an impossibly bottomless well of trivia and bad decisions. As another example, the article's RFC warns against allowing legacy ASCII control characters on the grounds that they can be confusing to display to humans, but says nothing about the Explicit Directional Overrides characters that https://www.unicode.org/reports/tr9/#Explicit_Directional_Ov... suggests should "be avoided wherever possible, because of security concerns".

  • weinzierl 12 hours ago

    I wouldn’t be so harsh. I think the Unicode Consortium not only started with good intentions but also did excellent work for the first decade or so.

    I just think they got distracted when the problems got harder; instead of tackling them head-on, they now waste a lot of their resources on busywork, good intentions notwithstanding. Sure, it’s more fun standardizing sparkling disco balls than dealing with real-world pain points. The fact that OpenType is a good and powerful standard that masks some of Unicode’s shortcomings doesn’t really help either.

    It’s not too late, and I hope they will find their way back to their original mission and be braver in solving long-standing issues.

    • zahlman 11 hours ago

      A big part of the problem is that the reaction to early updates was so bad that they promised they would never un-assign or re-assign a code point ever again, making it impossible for them to actually correct any mistakes (not even typos in the official standard names given to characters).

      The versioning is actually almost completely backwards by semver reasoning; 1.1 should have been 2.0, 2.0 should have been 3.0 and we should still be on 3.n now (since they have since kept the promise not to remove anything).

    • yk 11 hours ago

      I would. The original sin of Unicode is really their manifold idea: at that point they stopped trying to write a string standard and started to become a kind of general description of what string standards should look like, with the hope that string standards more or less conforming to that description are interoperable, provided you remember which direction "string".decode() and "string".encode() go.

    • socalgal2 11 hours ago

      What could be better? Human languages are complex

      • weinzierl 11 hours ago

        Yes, exactly, human languages are complex, and in my opinion Unicode used to be on a good track to tackle these complexities. I just think that nowadays they are not doing enough to help people around the world solve these problems.

      • pas 6 hours ago

        sure, but they have both human and machine stuff in the same "universe". Again, sure, it made sense, but maybe it would also make sense to have a parser that helps recover the "human stuff" from the "machine gibberish" (i.e. filter out the presentation and control stuff). But of course some in-band logic makes sense after all, for the combinations (diacritics, emoji skin color, and so on).

  • estebank 13 hours ago

    The security concerns are those of "Trojan source", where the displayed text doesn't correspond to the bytes on the wire.[1]

    I don't think a wire protocol should necessarily restrict them, for the sake of compatibility with the existing text corpora out there, but it's a fair observation.

    1: https://trojansource.codes/
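    A minimal Python sketch of that displayed-text/wire-bytes mismatch, using a made-up filename in the spirit of the Trojan Source examples:

```python
# U+202E (RIGHT-TO-LEFT OVERRIDE) forces what follows to display right to
# left, so what a reader sees need not match the code points on the wire.
# The filename below is a hypothetical illustration.
s = "file\u202egnp.exe"   # many bidi-aware UIs display this as "fileexe.png"
print("\u202e" in s)      # True: the invisible override is real data
print(len(s))             # 12 code points, one of which the user never sees
```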

    • yencabulator 12 hours ago

      The enforcement is an app-level issue, depending on the semantics of the field. I agree it doesn't belong in the low-level transport protocol.

      The rules for "username", "display name", "biography", "email address", "email body" and "contents of uploaded file with name foo.txt" are not all going to be the same.

      • Waterluvian 11 hours ago

        Can a regular expression be used to restrict Unicode chars like the ones described?

        I’m imagining a listing of regex rules for the various gotchas, and then a validation-level use that unions the ones you want.

        • fluoridation 4 hours ago

          Why would you need a regular expression for that? It's just a list of characters.
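          It is indeed just a list, which also answers the regex question; a Python sketch, where the exact set of rejected characters is an assumed policy choice for illustration, not anything a standard mandates:

```python
import re

# UAX #9's "avoid" warning targets the explicit overrides LRO (U+202D) and
# RLO (U+202E). Whether to also reject the embeddings (U+202A..U+202C) and
# the isolates (U+2066..U+2069) is up to the application; both character
# classes are shown here purely for illustration.
OVERRIDES = re.compile("[\u202d\u202e]")
ALL_BIDI_CONTROLS = re.compile("[\u202a-\u202e\u2066-\u2069]")

def contains_override(s: str) -> bool:
    return OVERRIDES.search(s) is not None

print(contains_override("file\u202egnp.exe"))  # True
print(contains_override("plain text"))         # False
```

Unioning such classes per field, as suggested above, is then just combining character classes.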

  • arp242 12 hours ago

    I always thought you kind of need those directional control characters to correctly render bidi text? e.g. if you write something in Hebrew but include a Latin word/name (or the reverse).

    • dcrazy 12 hours ago

      This is the job of the Bidi Algorithm: https://www.unicode.org/reports/tr9/

      Of course, this is an “annex”, not part of the core Unicode spec. So in situations where you can’t rely on the presentation layer’s (correct) implementation of the Bidi algorithm, you can fall back to directional override/embedding characters.

      • acdha 11 hours ago

        Over the years I’ve run into a few situations where the rules around neutral characters didn’t produce the right result, so we had to use the override characters to force the correct display. It’s completely niche, but very handy when you’re mixing quotes within a complex text.

    • layer8 11 hours ago

      Read the parent’s link. The characters “to be avoided” are a particular special-purpose subset, not directional control characters in general.

Etheryte 14 hours ago

As a simple example off the top of my head, if the first string ends in an orphaned emoji modifier and the second one starts with a modifiable emoji, you're already going to have trouble. It's only downhill from there with more exotic stuff.
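A rough Python sketch of the code-point side of that failure mode (how it actually renders depends on the font and renderer):

```python
import unicodedata

hand = "\U0001F44B"   # WAVING HAND SIGN
tone = "\U0001F3FD"   # EMOJI MODIFIER FITZPATRICK TYPE-4; orphaned on its own

joined = hand + tone  # most renderers fuse this into a single toned-hand glyph
print(len(joined))                 # 2: still two code points in the data
print(unicodedata.category(tone))  # Sk: a modifier symbol, not a combining mark
```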

  • kps 14 hours ago

    Unicode combining/modifying/joining characters should have been prefix rather than suffix/infix, in blocks by arity.

    • zahlman 11 hours ago

      They should have at least all used a single system. Instead, we have:

      * European-style combining characters, as well as precomposed versions for some arbitrary subset of legal combinations, and nothing preventing you from stacking them arbitrarily (as in Zalgo text) or on illogical base characters (who knows what your font renderer will do if you ask to put a cedilla on a kanji? It might even work!)

      * Jamo for Hangul that are three pseudo-characters representing the parts of a larger character, that have to be in order (and who knows what you're supposed to do with an invalid jamo sequence)

      * Emoji that are produced by applying a "variation selector" to a normal character

      * Emoji that are just single characters — including ones that used to be normal characters and were retconned to now require the variation selector to get the original appearance

      * Some subset of emoji that can have a skin-tone modifier applied as a direct suffix

      * Some other subset of emoji that are formed by combining other emoji, which requires a zero-width-joiner in between (because they'd also be valid separately), which might be rendered as the base components anyway if no joined glyph is available

      * National flags that use a pair of abstract characters used to spell a country code; neither can be said to be the base vs the modifier (this lets them say that they never removed or changed the meaning of a "character" while still allowing for countries to change their country codes, national flags or existence status)

      * Other flags that use a base flag character, followed by "tag letter" characters that were originally intended for a completely different purpose that never panned out; and also there was temporary disagreement about which base character should be used

      * Other other flags that are vendor-specific but basically work like emoji with ZWJ sequences

      And surely more that I've forgotten about or not learned about yet.
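      The national-flag case above can be sketched in a few lines of Python (the flag helper is a hypothetical name):

```python
# A national flag is just two Regional Indicator Symbols (U+1F1E6..U+1F1FF)
# spelling a country code; neither code point is the base or the modifier.
def flag(country_code: str) -> str:
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in country_code.upper())

print(flag("FR") == "\U0001F1EB\U0001F1F7")  # True
print(len(flag("FR")))                       # 2 code points, one (hoped-for) glyph
```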

    • layer8 13 hours ago

      One benefit of the suffix convention is that strings sort more usefully that way by default, without requiring special handling for those characters.

      Unicode 1.0 also explains: “The convention used by the Unicode standard is consistent with the logical order of other non-spacing marks in Semitic and Indic scripts, the great majority of which follow the base characters with respect to which they are positioned. To avoid the complication of defining and implementing non-spacing marks on both sides of base characters, the Unicode standard specifies that all non-spacing marks must follow their base characters. This convention conforms to the way modern font technology handles the rendering of non-spacing graphical forms, so that mapping from character store to font rendering is simplified.”
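      A small Python sketch of that sorting benefit; the "prefix" spelling below is imaginary, since Unicode only uses the suffix convention:

```python
# Suffix combining marks keep accented forms adjacent to their base word in
# a plain code-point sort; a hypothetical mark-first convention would not.
suffix_style = ["cafe", "cafe\u0301", "caff"]  # real Unicode: mark follows "e"
prefix_style = ["cafe", "caf\u0301e", "caff"]  # imaginary mark-before-base order

print(sorted(suffix_style))  # the accented form sorts right after "cafe"
print(sorted(prefix_style))  # the marked form sorts past "caff" instead
```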

      • kps 13 hours ago

        Sorting is a good point.

        On the other hand, prefix combining characters would have vastly simplified keyboard handling, since that's exactly what typewriter dead keys are.

eviks 13 hours ago

Indeed, though a lot of that complexity, like surrogates and control codes, isn't due to attempts to write language; it's just awful design preserved for posterity.

ivanjermakov 6 hours ago

Unicode sucks, but it sucks less than every other encoding standard.