spyrja a day ago

30 replies

I really hate to rant on about this. But the gymnastics required to parse UTF-8 correctly are truly insane. Besides that, we now see issues such as invisible-glyph injection attacks cropping up all over the place due to this crappy so-called "standard". Maybe we should just go back to the simplicity of ASCII until we can come up with something better?

danhau a day ago

Are you referring to Unicode? Because UTF-8 is simple and relatively straightforward to parse.

Unicode definitely has its faults, but on the whole it's great. I'll take Unicode w/ UTF-8 any day over the mess of encodings we had before it.

Needless to say, Unicode is not a good fit for every scenario.

  • xg15 a day ago

    I think GP is really talking about extended grapheme clusters (at least the mention of invisible glyph injection makes me think that)

    Those really seem hellish to parse, because there seem to be several mutually independent schemes for how characters are combined into clusters, depending on what you're dealing with.

    E.g. modifier characters, tags, zero-width joiners with magic emoji combinations, etc.

    So you need both a copy of the character database and knowledge of the interaction of those various invisible characters.

  • spyrja a day ago

    Just as an example of what I am talking about, this is my current UTF-8 parser, which I have been using for a few years now.

      bool utf_append_plaintext(utf* result, const char* text) {
      #define msk(byte, mask, value) ((byte & mask) == value)
      #define cnt(byte) msk(byte, 0xc0, 0x80)
      #define shf(byte, mask, amount) ((byte & mask) << amount)
        utf_clear(result);
        if (text == NULL)
          return false;
        size_t siz = strlen(text);
        uint8_t* nxt = (uint8_t*)text;
        uint8_t* end = nxt + siz;
        if ((siz >= 3) && (nxt[0] == 0xef) && (nxt[1] == 0xbb) && (nxt[2] == 0xbf))
          nxt += 3;
        while (nxt < end) {
          bool aok = false;
          uint32_t cod = 0;
          uint8_t fir = nxt[0];
          if (msk(fir, 0x80, 0)) {
            cod = fir;
            nxt += 1;
            aok = true;
          } else if ((nxt + 1) < end) {
            uint8_t sec = nxt[1];
            if (msk(fir, 0xe0, 0xc0)) {
              if (cnt(sec)) {
                cod |= shf(fir, 0x1f, 6);
                cod |= shf(sec, 0x3f, 0);
                nxt += 2;
                aok = true;
              }
            } else if ((nxt + 2) < end) {
              uint8_t thi = nxt[2];
              if (msk(fir, 0xf0, 0xe0)) {
                if (cnt(sec) && cnt(thi)) {
                  cod |= shf(fir, 0x0f, 12);
                  cod |= shf(sec, 0x3f, 6);
                  cod |= shf(thi, 0x3f, 0);
                  nxt += 3;
                  aok = true;
                }
              } else if ((nxt + 3) < end) {
                uint8_t fou = nxt[3];
                if (msk(fir, 0xf8, 0xf0)) {
                  if (cnt(sec) && cnt(thi) && cnt(fou)) {
                    cod |= shf(fir, 0x07, 18);
                    cod |= shf(sec, 0x3f, 12);
                    cod |= shf(thi, 0x3f, 6);
                    cod |= shf(fou, 0x3f, 0);
                    nxt += 4;
                    aok = true;
                  }
                }
              }
            }
          }
          if (aok)
            utf_push(result, cod);
          else
            return false;
        }
        return true;
      #undef cnt
      #undef msk
      #undef shf
      }
    
    Not exactly "simple", is it? I am almost embarrassed to say that I thought I had read the spec right. But of course I was obviously wrong and now I have to go back to the drawing board (or else find some other FOSS alternative written in C). It just frustrates me. I do appreciate the level of effort made to come up with an all-encompassing standard of sorts, but it just seems so unnecessarily complicated.
    • simonask a day ago

      That's a reasonable implementation in my opinion. It's not that complicated. You're also apparently insisting on three-letter variable names, and are using a very primitive language to boot, so I don't think you're setting yourself up for "maintainability" here.

      Here's the implementation in the Rust standard library: https://doc.rust-lang.org/stable/src/core/str/validations.rs...

      It even includes an optimized fast path for ASCII, and it works at compile-time as well.

      • spyrja a day ago

        Well it is a pretty old codebase, the whole project is written in C. I haven't done any Rust programming yet but it does seem like a good choice for modern programs. I'll check out the link and see if I can glean any insights into what needs to be done to fix my ancient parser. Thanks!

      • koakuma-chan a day ago

        > You're also apparently insisting on three-letter variable names

        Why are the arguments not three-letter though? I would feel terrible if that was my code.

guappa a day ago

Sure, I'll just write my own language all weird and look like an illiterate so that you are not inconvenienced.

eru a day ago

You could use a standard that always uses e.g. 4 bytes per character; that is much easier to parse than UTF-8.

UTF-8 is so complicated because it wants to be backwards compatible with ASCII.

  • account42 a day ago

    ASCII compatibility isn't the only advantage of UTF-8 over UCS-4. It also

    - requires less memory for most strings, particularly ones that are largely limited to ASCII, as structured text-based formats often are.

    - doesn't need to care about byte order. UTF-8 is always UTF-8 while UTF-16 might either be little or big endian and UCS-4 could theoretically even be mixed endian.

    - doesn't need to care about alignment: If you jump to a random memory position you can find the next and previous UTF-8 characters. This also means that you can use preexisting byte-based string functions like substring search for many UTF-8 operations.

  • degamad a day ago

    It's not just the variable byte length that causes an issue, in some ways that's the easiest part of the problem. You also have to deal with code points that modify other code points, rather than being characters themselves. That's a huge part of the problem.

    • amake a day ago

      That has nothing to do with UTF-8; that's a Unicode issue, and one that's entirely inescapable if you are the Unicode Consortium and your goal is to be compatible with all legacy charsets.

      • degamad 16 hours ago

        Yep, that's the point I was making - that choosing fixed 4-byte code-points doesn't significantly reduce the complexity of capturing everything that Unicode does.

    • bawolff a day ago

      That goes all the way back to the beginning.

      Even ASCII used to use "overstriking", where the backspace character was treated as a joiner character to put accents above letters.

      • degamad 16 hours ago

        Agreed, we just conveniently forget about those when speaking about how complex Unicode is.

  • spyrja a day ago

    True. But then again, backward compatibility isn't really so hard to achieve with ASCII because the MSB is always zero. The problem, I think, is that the original motivation which ultimately led to the complications we now see with UTF-8 was a desire to save a few bits here and there rather than to create a straightforward standard that was easy to parse. I am actually staring at 60+ lines of fairly pristine code I wrote a few years back that ostensibly passed all tests, only to find out that in fact it does not cover all corner cases. (Could have sworn I read the spec correctly, but apparently not!)

    • eru a day ago
      • spyrja 20 hours ago

        In this particular case it was simply a matter of not enough corner cases defined. I was however using property-based testing, doing things like reversing then un-reversing the UTF-8 strings, re-ordering code points, merging strings, etc for verification. The datasets were in a variety of languages (including emojis) and so I mistakenly thought I had covered all the bases.

        But thank you for the link, it's turning out to be a very enjoyable read! There already seem to be a few things I could do better thanks to the article, besides the fact that it codifies a lot of interesting approaches one can take to improve testing in general.

kalleboo a day ago

I think what you meant is we should all go back to the simplicity of Shift-JIS

Ekaros a day ago

Should have just gone with 32-bit characters and no combinations. Utter simplicity.

  • guappa a day ago

    That would be extremely wasteful, every single text file would be 4x larger and I'm sure eventually it would not be enough anyway.

    • Ekaros a day ago

      Maybe we should just have replaced ASCII, a horrible encoding where an entire 25% of it is wasted. And maybe we could have gotten a bit more efficiency by having only one case instead of both lower- and uppercase letters, with a modifier before a letter to capitalize it. That would save a lot of space, as most text could just be lowercase.

      • guappa a day ago

        Yeah, that's how ASCII works… there's 1 bit for lower/upper case.

  • bawolff a day ago

    I think combining characters are a lot simpler than having every single combination ever.

    Especially when you start getting into non latin-based languages.

  • amake a day ago

    What does "no combinations" mean?

    • Ekaros a day ago

      Take Ä, say: it can be either a single code point (Ä), or a combination of a combining diaeresis (¨) and A. Both are supported now, but once more than two such things can go into one character it makes a mess.

      • amake a day ago

        That's fundamental to the mission of Unicode because Unicode is meant to be compatible with all legacy character sets, and those character sets already included combining characters.

        So "no combinations" was never going to happen.