Comment by Etheryte 14 hours ago

9 replies

As a simple example off the top of my head, if the first string ends in an orphaned emoji modifier and the second one starts with a modifiable emoji, you're already going to have trouble. It's only downhill from there with more exotic stuff.
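
The same kind of boundary trouble shows up without emoji. A minimal sketch, swapping in a combining accent for the emoji modifier (an analogous case, not the emoji one described above; the sample strings are made up): normalizing two strings separately and then concatenating does not give the same result as normalizing the concatenation.

```python
import unicodedata

first = "cafe"        # ends in a plain base letter
second = "\u0301s"    # starts with an orphaned COMBINING ACUTE ACCENT

# Normalize each piece on its own, then concatenate.
separately = (unicodedata.normalize("NFC", first)
              + unicodedata.normalize("NFC", second))

# Concatenate first, then normalize the whole string.
together = unicodedata.normalize("NFC", first + second)

print(len(separately))         # 6 code points: the accent stays separate
print(len(together))           # 5 code points: "e" + accent composed into U+00E9
print(separately == together)  # False
```

The two results render the same, but they no longer compare equal code point by code point, so anything that hashes, compares, or measures the pieces before joining them can disagree with the joined string.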

kps 14 hours ago

Unicode combining/modifying/joining characters should have been prefix rather than suffix/infix, in blocks by arity.

  • zahlman 11 hours ago

    They should have at least all used a single system. Instead, we have:

    * European-style combining characters, as well as precomposed versions for some arbitrary subset of legal combinations, and nothing preventing you from stacking them arbitrarily (as in Zalgo text) or on illogical base characters (who knows what your font renderer will do if you ask to put a cedilla on a kanji? It might even work!)

    * Jamo for Hangul, which are up to three pseudo-characters representing the parts of a larger syllable character and which have to appear in order (and who knows what you're supposed to do with an invalid jamo sequence)

    * Emoji that are produced by applying a "variation selector" to a normal character

    * Emoji that are just single characters — including ones that used to be normal characters and were retconned to now require the variation selector to get the original appearance

    * Some subset of emoji that can have a skin-tone modifier applied as a direct suffix

    * Some other subset of emoji that are formed by combining other emoji, which requires a zero-width-joiner in between (because they'd also be valid separately), which might be rendered as the base components anyway if no joined glyph is available

    * National flags that use a pair of abstract characters used to spell a country code; neither can be said to be the base vs the modifier (this lets them say that they never removed or changed the meaning of a "character" while still allowing for countries to change their country codes, national flags or existence status)

    * Other flags that use a base flag character, followed by "tag letter" characters that were originally intended for a completely different purpose that never panned out; and also there was temporary disagreement about which base character should be used

    * Other other flags that are vendor-specific but basically work like emoji with ZWJ sequences

    And surely more that I've forgotten about or not learned about yet.
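
    For illustration, several of the mechanisms in the list above can be built by hand from code points. This is a rough sketch using only Python's standard `unicodedata` module; the helper names `flag` and `tag_flag` are made up for this example, and how any of these strings render depends on the font and platform.

```python
import unicodedata

# Combining mark vs. precomposed form: two encodings of the same letter.
decomposed = "e\u0301"                           # "e" + COMBINING ACUTE ACCENT
precomposed = unicodedata.normalize("NFC", decomposed)

# Hangul: three conjoining jamo compose into one syllable under NFC.
jamo = "\u1112\u1161\u11ab"                      # HIEUH + A + final NIEUN
syllable = unicodedata.normalize("NFC", jamo)

# Variation selectors: same base character, text vs. emoji presentation.
heart_text = "\u2764\ufe0e"                      # HEAVY BLACK HEART + VS15
heart_emoji = "\u2764\ufe0f"                     # HEAVY BLACK HEART + VS16

# Skin-tone modifier applied as a direct suffix.
thumbs_up = "\U0001F44D" + "\U0001F3FD"

# ZWJ sequence: separate emoji glued together with ZERO WIDTH JOINER.
family = "\u200d".join(["\U0001F468", "\U0001F469", "\U0001F467"])

def flag(country_code):
    """Spell a two-letter country code with regional indicator symbols."""
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in country_code.upper())

def tag_flag(region):
    """Black flag + tag letters + CANCEL TAG, e.g. 'gbsct' for Scotland."""
    tags = "".join(chr(0xE0000 + ord(c)) for c in region.lower())
    return "\U0001F3F4" + tags + "\U000E007F"

print(hex(ord(precomposed)))      # 0xe9
print(hex(ord(syllable)))         # 0xd55c
print(len(thumbs_up))             # 2 code points
print(len(family))                # 5 code points
print(len(flag("JP")))            # 2 code points
print(len(tag_flag("gbsct")))     # 7 code points
```

    Each of these is meant to display as a single glyph, yet the code point counts range from 1 to 7 and every case follows a different composition rule.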

  • layer8 13 hours ago

    One benefit of the suffix convention is that strings sort more usefully that way by default, without requiring special handling for those characters.

    Unicode 1.0 also explains: “The convention used by the Unicode standard is consistent with the logical order of other non-spacing marks in Semitic and Indic scripts, the great majority of which follow the base characters with respect to which they are positioned. To avoid the complication of defining and implementing non-spacing marks on both sides of base characters, the Unicode standard specifies that all non-spacing marks must follow their base characters. This convention conforms to the way modern font technology handles the rendering of non-spacing graphical forms, so that mapping from character store to font rendering is simplified.”
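
    A small sketch of the sorting point, using a naive code point sort (Python's default for strings) rather than real collation. The "prefix" encoding here is purely hypothetical: it is what the same words would look like if the mark came before its base letter.

```python
# Actual Unicode convention: base letter first, combining mark after it.
suffix_words = ["e\u0301clair", "eclipse", "zebra", "a\u0300 la"]

# Hypothetical prefix convention: combining mark first, then the base letter.
prefix_words = ["\u0301eclair", "eclipse", "zebra", "\u0300a la"]

# Python compares strings code point by code point.
suffix_sorted = sorted(suffix_words)
prefix_sorted = sorted(prefix_words)

print(suffix_sorted)  # accented words stay next to words with the same base letter
print(prefix_sorted)  # accented words are pushed after "zebra", grouped by mark
```

    With the suffix order, the first code point compared is the base letter, so "éclair" lands among the other "e" words. With the hypothetical prefix order, the first code point is the mark (U+0300 and up), which sorts after every ASCII letter. Neither result is proper language-aware collation; the suffix order just degrades more gracefully.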

    • kps 13 hours ago

      Sorting is a good point.

      On the other hand, prefix combining characters would have vastly simplified keyboard handling, since that's exactly what typewriter dead keys are.
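
      To make the dead-key comparison concrete, here is a toy sketch of what an input layer has to do under the actual suffix convention: hold the dead key until the base letter arrives, then emit them in swapped order. The `DEAD_KEYS` table and `translate` function are hypothetical and work on already-decoded key labels, which is a large simplification of real keyboard handling.

```python
import unicodedata

# Hypothetical dead-key table: key label -> combining mark it produces.
DEAD_KEYS = {
    "\u00b4": "\u0301",  # acute accent key -> COMBINING ACUTE ACCENT
    "`": "\u0300",       # grave accent key -> COMBINING GRAVE ACCENT
    "^": "\u0302",       # circumflex key   -> COMBINING CIRCUMFLEX ACCENT
}

def translate(keys):
    """Turn a sequence of key labels into text, reordering dead keys."""
    out = []
    pending = None  # combining mark waiting for its base letter
    for key in keys:
        if key in DEAD_KEYS:
            # The mark is typed first but must be stored after the base,
            # so it has to be buffered (a later dead key replaces it).
            pending = DEAD_KEYS[key]
            continue
        out.append(key)
        if pending is not None:
            out.append(pending)
            pending = None
    # Compose to the precomposed form where one exists.
    return unicodedata.normalize("NFC", "".join(out))

print(translate(["c", "a", "f", "\u00b4", "e"]))
```

      Under a prefix convention the `pending` buffer would not be needed, since the mark could be emitted the moment the dead key is pressed.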

      • layer8 13 hours ago

        Keyboard input handling at that level generally isn’t character-based, and instead requires looking at scancodes and modifier keys, and sometimes also distinguishing between keyup and keydown events.

        You generally also don’t want to produce different Unicode sequences depending on whether you have an “é” key you can press or have to use a dead-key “’”.

      • dcrazy 12 hours ago

        Not all input methods use dead keys to emit combining characters.