spyrja a day ago

30 replies

I really hate to rant on about this. But the gymnastics required to parse UTF-8 correctly are truly insane. Besides that, we now see issues such as invisible-glyph injection attacks cropping up all over the place due to this crappy so-called "standard". Maybe we should just go back to the simplicity of ASCII until we can come up with something better?

danhau a day ago

Are you referring to Unicode? Because UTF-8 is simple and relatively straightforward to parse.

Unicode definitely has its faults, but on the whole it's great. I'll take Unicode w/ UTF-8 any day over the mess of encodings we had before it.

Needless to say, Unicode is not a good fit for every scenario.

  • xg15 a day ago

    I think GP is really talking about extended grapheme clusters (at least the mention of invisible glyph injection makes me think that)

    Those really seem hellish to parse, because there seem to be several mutually independent schemes for how characters are combined into clusters, depending on what you're dealing with.

    E.g. modifier characters, tags, zero-width joiners with magic emoji combinations, etc.

    So you need both a copy of the character database and knowledge of the interaction of those various invisible characters.

  • spyrja a day ago

    Just as an example of what I am talking about, this is my current UTF-8 parser, which I have been using for a few years now.

      bool utf_append_plaintext(utf* result, const char* text) {
      #define msk(byte, mask, value) ((byte & mask) == value)
      #define cnt(byte) msk(byte, 0xc0, 0x80)
      #define shf(byte, mask, amount) ((byte & mask) << amount)
        utf_clear(result);
        if (text == NULL)
          return false;
        size_t siz = strlen(text);
        uint8_t* nxt = (uint8_t*)text;
        uint8_t* end = nxt + siz;
        if ((siz >= 3) && (nxt[0] == 0xef) && (nxt[1] == 0xbb) && (nxt[2] == 0xbf))
          nxt += 3;
        while (nxt < end) {
          bool aok = false;
          uint32_t cod = 0;
          uint8_t fir = nxt[0];
          if (msk(fir, 0x80, 0)) {
            cod = fir;
            nxt += 1;
            aok = true;
          } else if ((nxt + 1) < end) {
            uint8_t sec = nxt[1];
            if (msk(fir, 0xe0, 0xc0)) {
              if (cnt(sec)) {
                cod |= shf(fir, 0x1f, 6);
                cod |= shf(sec, 0x3f, 0);
                nxt += 2;
                aok = true;
              }
            } else if ((nxt + 2) < end) {
              uint8_t thi = nxt[2];
              if (msk(fir, 0xf0, 0xe0)) {
                if (cnt(sec) && cnt(thi)) {
                  cod |= shf(fir, 0x0f, 12);
                  cod |= shf(sec, 0x3f, 6);
                  cod |= shf(thi, 0x3f, 0);
                  nxt += 3;
                  aok = true;
                }
              } else if ((nxt + 3) < end) {
                uint8_t fou = nxt[3];
                if (msk(fir, 0xf8, 0xf0)) {
                  if (cnt(sec) && cnt(thi) && cnt(fou)) {
                    cod |= shf(fir, 0x07, 18);
                    cod |= shf(sec, 0x3f, 12);
                    cod |= shf(thi, 0x3f, 6);
                    cod |= shf(fou, 0x3f, 0);
                    nxt += 4;
                    aok = true;
                  }
                }
              }
            }
          }
          if (aok)
            utf_push(result, cod);
          else
            return false;
        }
        return true;
      #undef cnt
      #undef msk
      #undef shf
      }
    
    Not exactly "simple", is it? I am almost embarrassed to say that I thought I had read the spec right. But of course I was obviously wrong and now I have to go back to the drawing board (or else find some other FOSS alternative written in C). It just frustrates me. I do appreciate the level of effort made to come up with an all-encompassing standard of sorts, but it just seems so unnecessarily complicated.
    • simonask a day ago

      That's a reasonable implementation in my opinion. It's not that complicated. You're also apparently insisting on three-letter variable names, and are using a very primitive language to boot, so I don't think you're setting yourself up for "maintainability" here.

      Here's the implementation in the Rust standard library: https://doc.rust-lang.org/stable/src/core/str/validations.rs...

      It even includes an optimized fast path for ASCII, and it works at compile-time as well.

      • spyrja a day ago

        Well it is a pretty old codebase, the whole project is written in C. I haven't done any Rust programming yet but it does seem like a good choice for modern programs. I'll check out the link and see if I can glean any insights into what needs to be done to fix my ancient parser. Thanks!

      • koakuma-chan a day ago

        > You're also apparently insisting on three-letter variable names

        Why are the arguments not three-letter though? I would feel terrible if that was my code.

guappa a day ago

Sure, I'll just write my own language all weird and look like an illiterate so that you are not inconvenienced.

eru a day ago

You could use a standard that always uses e.g. 4 bytes per character; that is much easier to parse than UTF-8.

UTF-8 is so complicated because it wants to be backwards compatible with ASCII.

  • account42 a day ago

    ASCII compatibility isn't the only advantage of UTF-8 over UCS-4. It also

    - requires less memory for most strings, particularly ones that are largely limited to ASCII, as structured text-based formats often are.

    - doesn't need to care about byte order. UTF-8 is always UTF-8 while UTF-16 might either be little or big endian and UCS-4 could theoretically even be mixed endian.

    - doesn't need to care about alignment: If you jump to a random memory position you can find the next and previous UTF-8 characters. This also means that you can use preexisting byte-based string functions like substring search for many UTF-8 operations.

  • degamad a day ago

    It's not just the variable byte length that causes an issue, in some ways that's the easiest part of the problem. You also have to deal with code points that modify other code points, rather than being characters themselves. That's a huge part of the problem.

    • amake a day ago

      That has nothing to do with UTF-8; that's a Unicode issue, and one that's entirely inescapable if you are the Unicode Consortium and your goal is to be compatible with all legacy charsets.

      • degamad 16 hours ago

        Yep, that's the point I was making - that choosing fixed 4-byte code-points doesn't significantly reduce the complexity of capturing everything that Unicode does.

    • bawolff a day ago

      That goes all the way back to the beginning.

      Even ASCII used to use "overstriking", where the backspace character was treated as a joiner character to put accents above letters.

      • degamad 16 hours ago

        Agreed, we just conveniently forget about those when speaking about how complex Unicode is.

  • spyrja a day ago

    True. But then again, backward compatibility isn't really so hard to achieve with ASCII because the MSB is always zero. The problem, I think, is that the original motivation which ultimately led to the complications we now see with UTF-8 was a desire to save a few bits here and there rather than to create a straightforward standard that was easy to parse. I am actually staring at 60+ lines of fairly pristine code I wrote a few years back that ostensibly passed all tests, only to find out that in fact it does not cover all corner cases. (Could have sworn I read the spec correctly, but apparently not!)

    • eru a day ago
      • spyrja 20 hours ago

        In this particular case it was simply a matter of not enough corner cases defined. I was however using property-based testing, doing things like reversing then un-reversing the UTF-8 strings, re-ordering code points, merging strings, etc for verification. The datasets were in a variety of languages (including emojis) and so I mistakenly thought I had covered all the bases.

        But thank you for the link, it's turning out to be a very enjoyable read! There already seem to be a few things I could do better thanks to the article, besides the fact that it codifies a lot of interesting approaches one can take to improve testing in general.

kalleboo a day ago

I think what you meant is we should all go back to the simplicity of Shift-JIS

Ekaros a day ago

Should have just gone with 32-bit characters and no combinations. Utter simplicity.

  • guappa a day ago

    That would be extremely wasteful, every single text file would be 4x larger and I'm sure eventually it would not be enough anyway.

    • Ekaros a day ago

      Maybe we should just have replaced ASCII, a horrible encoding where an entire 25% of it is wasted. And maybe we could have gotten a bit more efficiency by having only one case instead of both lower- and uppercase letters, with a modifier before a letter to capitalize it. That would save a lot of space, as most text could just be lowercase.

      • guappa a day ago

        Yeah, that's how ASCII works… there's 1 bit for lower/upper case.

  • bawolff a day ago

    I think combining characters are a lot simpler than having every single combination ever.

    Especially when you start getting into non latin-based languages.

  • amake a day ago

    What does "no combinations" mean?

    • Ekaros a day ago

      Take Ä, say: it can be either a single code point (Ä), or a combination of a combining diaeresis (¨) and A. Both are supported now, but once more than two such things can go into one character it makes a mess.

      • amake a day ago

        That's fundamental to the mission of Unicode because Unicode is meant to be compatible with all legacy character sets, and those character sets already included combining characters.

        So "no combinations" was never going to happen.