kps 10 hours ago

Unicode combining/modifying/joining characters should have been prefix rather than suffix/infix, in blocks by arity.

zahlman 7 hours ago

They should at least have all used a single system. Instead, we have:

* European-style combining characters, plus precomposed versions for some arbitrary subset of legal combinations, with nothing preventing you from stacking marks indefinitely (as in Zalgo text) or putting them on illogical base characters (who knows what your font renderer will do if you ask to put a cedilla on a kanji? It might even work!)

* Jamo for Hangul: two or three conjoining pseudo-characters representing the parts of a larger syllable block, which have to appear in order (and who knows what you're supposed to do with an invalid jamo sequence)

* Emoji that are produced by applying a "variation selector" to a normal character

* Emoji that are just single characters — including ones that used to be normal characters and were retconned to now require the variation selector to get the original appearance

* Some subset of emoji that can have a skin-tone modifier applied as a direct suffix

* Some other subset of emoji that are formed by combining other emoji, which requires a zero-width joiner in between (because the components would also be valid separately); the sequence might be rendered as the separate components anyway if no joined glyph is available

* National flags that use a pair of abstract "regional indicator" characters to spell a country code; neither one can be called the base vs the modifier (this lets Unicode claim it never removed or changed the meaning of a "character" while still allowing countries to change their country codes, national flags or existence status)

* Other flags that use a base flag character followed by "tag letter" characters, which were originally intended for a completely different purpose that never panned out; for a while there was also disagreement about which base character should be used

* Other other flags that are vendor-specific but basically work like emoji with ZWJ sequences

And surely more that I've forgotten about or not learned about yet.
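
A quick way to see several of these mechanisms side by side is to dump the code points of some sample strings. A rough Python sketch (the sequences are standard Unicode; how many of them your fonts actually render as single glyphs will vary):

    import unicodedata

    samples = {
        "combining mark":     "e\u0301",                # e + COMBINING ACUTE ACCENT
        "precomposed":        "\u00e9",                 # LATIN SMALL LETTER E WITH ACUTE
        "variation selector": "\u2764\ufe0f",           # HEAVY BLACK HEART + VS16 (emoji style)
        "skin tone":          "\U0001F44D\U0001F3FD",   # THUMBS UP SIGN + Fitzpatrick modifier
        "ZWJ sequence":       "\U0001F468\u200D\U0001F469\u200D\U0001F467",  # family: man, woman, girl
        "regional flag":      "\U0001F1FA\U0001F1F8",   # REGIONAL INDICATOR symbols U + S
        "tag-sequence flag":  "\U0001F3F4\U000E0067\U000E0062\U000E0073\U000E0063\U000E0074\U000E007F",  # black flag + "gbsct" tags + cancel tag
    }

    for label, s in samples.items():
        print(label)
        for c in s:
            print(f"  U+{ord(c):04X}  {unicodedata.name(c, '?')}")

One visual "character" can decompose into anywhere from one to seven code points, with the modifier always trailing, except for the flag pairs, where neither half is the base.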

layer8 9 hours ago

One benefit of the suffix convention is that strings sort more usefully that way by default, without requiring special handling for those characters.

Unicode 1.0 also explains: “The convention used by the Unicode standard is consistent with the logical order of other non-spacing marks in Semitic and Indic scripts, the great majority of which follow the base characters with respect to which they are positioned. To avoid the complication of defining and implementing non-spacing marks on both sides of base characters, the Unicode standard specifies that all non-spacing marks must follow their base characters. This convention conforms to the way modern font technology handles the rendering of non-spacing graphical forms, so that mapping from character store to font rendering is simplified.”
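
For instance, a plain code-point sort keeps suffix-accented words inside their base letter's block, whereas prefix marks would push every accented word past the entire unaccented alphabet. A small illustration with made-up strings (plain Python sorting, no collation library):

    # Suffix marks (actual Unicode): U+0301 follows 'e', so "éa" stays in the e block.
    print(sorted(["ea", "e\u0301a", "ez", "fa"]))    # ['ea', 'ez', 'éa', 'fa']

    # Prefix marks (counterfactual): U+0301 > 'f', so accented words sort past 'f' entirely.
    print(sorted(["ea", "\u0301ea", "ez", "fa"]))    # ['ea', 'ez', 'fa', '\u0301ea']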

  • kps 9 hours ago

    Sorting is a good point.

    On the other hand, prefix combining characters would have vastly simplified keyboard handling, since that's exactly what typewriter dead keys are.

    • layer8 9 hours ago

      Keyboard input handling at that level generally isn’t character-based, and instead requires looking at scancodes and modifier keys, and sometimes also distinguishing between keyup and keydown events.

      You generally also don’t want to produce different Unicode sequences depending on whether you have an “é” key you can press or have to use a dead-key “´”.

      • kps 9 hours ago

        Depends on the system. X11/Wayland do it at a higher level where you have `<dead_acute> <e> : eacute` and keysyms are effectively a superset of Unicode with prefix combiners. (This can lead to weirdness since the choice of Compose rules is orthogonal to the choice of keyboard layout.)

        • layer8 8 hours ago

          I guess your conception is that one could then define

              <dead_acute> : <combining_acute_accent>
          
          instead and use it for arbitrary letters. However, that would fail in locales using a non-Unicode encoding such as ISO-8859-1, which only contains the precomposed character. Unless you have the input system post-process the mapped input again to normalize it to e.g. NFC before passing it on to the application, in which case the combination has to be reparsed anyway. So I don’t see what would be gained with regard to ease of parsing.
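
          A minimal sketch of that post-processing step, using Python’s standard unicodedata module (assuming NFC as the target form):

              import unicodedata

              seq = "e\u0301"                             # base letter + COMBINING ACUTE ACCENT
              nfc = unicodedata.normalize("NFC", seq)
              print(nfc == "\u00e9")                      # True: NFC yields precomposed é
              print(nfc.encode("iso-8859-1"))             # b'\xe9' -- representable only once composed
              print(seq.encode("iso-8859-1", "replace"))  # b'e?' -- the bare combining mark has no Latin-1 byte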

          If you want to define such a key, you can probably still do it; you’ll just have to press it in the opposite order, and use backspace if you want to cancel it.

          The fact that dead keys happen to be prefix is in principle arbitrary; they could just as well be suffix. On physical typewriters, suffix was more customary, I think: you’d backspace over the character you want to accent and type the accent on top of it. To obtain just the accent, you combine it with Space either way.

    • dcrazy 8 hours ago

      Not all input methods use dead keys to emit combining characters.