eru a day ago

You could use a standard that always uses, e.g., 4 bytes per character; that would be much easier to parse than UTF-8.

UTF-8 is so complicated because it wants to be backwards compatible with ASCII.

account42 a day ago

ASCII compatibility isn't the only advantage of UTF-8 over UCS-4. It also

- requires less memory for most strings, particularly ones that are largely limited to ASCII, as structured text-based formats often are.

- doesn't need to care about byte order. UTF-8 is always UTF-8 while UTF-16 might either be little or big endian and UCS-4 could theoretically even be mixed endian.

- doesn't need to care about alignment: if you jump to a random memory position, you can find the next and previous UTF-8 characters (see the sketch below). This also means that you can use preexisting byte-based string functions like substring search for many UTF-8 operations.
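
For illustration, a minimal sketch of that last point (Python; the function name is made up): finding the start of the code point that contains an arbitrary byte offset just means skipping backwards over continuation bytes, which always carry the bit pattern 0b10xxxxxx.

    def codepoint_start(data: bytes, i: int) -> int:
        """Index of the first byte of the UTF-8 sequence containing
        byte offset i (data is assumed to be well-formed UTF-8)."""
        while i > 0 and (data[i] & 0b1100_0000) == 0b1000_0000:
            i -= 1  # continuation bytes look like 0b10xxxxxx
        return i

    s = "naïve".encode("utf-8")   # b'na\xc3\xafve'
    print(codepoint_start(s, 3))  # 2: byte 3 is the continuation byte of 'ï'

That same property is what makes plain byte-wise substring search safe: the encoding of one code point can never appear in the middle of the encoding of another.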

degamad a day ago

It's not just the variable byte length that causes an issue; in some ways that's the easiest part of the problem. You also have to deal with code points that modify other code points rather than being characters themselves. That's a huge part of the problem.
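
To make that concrete, here is a small sketch (Python, standard library only) of how a combining mark turns what a reader perceives as one character into two code points:

    import unicodedata

    composed   = "é"          # U+00E9, a single precomposed code point
    decomposed = "e\u0301"    # 'e' followed by U+0301 COMBINING ACUTE ACCENT

    print(composed == decomposed)          # False: different code point sequences
    print(len(composed), len(decomposed))  # 1 2
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True

Any code that slices, reverses or compares strings code point by code point will happily split the accent off the 'e' unless it normalizes first or works on grapheme clusters.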

  • amake a day ago

    That has nothing to do with UTF-8; that's a Unicode issue, and one that's entirely inescapable if you are the Unicode Consortium and your goal is to be compatible with all legacy charsets.

    • degamad 16 hours ago

      Yep, that's the point I was making - that choosing fixed 4-byte code-points doesn't significantly reduce the complexity of capturing everything that Unicode does.

  • bawolff a day ago

    That goes all the way back to the beginning.

    Even ASCII used to use "overstriking", where the backspace character was treated as a joiner character to put accents above letters.

    • degamad 16 hours ago

      Agreed, we just conveniently forget about those when speaking about how complex Unicode is.

spyrja a day ago

True. But then again, backward compatibility isn't really that hard to achieve with ASCII, because the MSB is always zero. The problem, I think, is that the original motivation, which ultimately led to the complications we now see with UTF-8, was a desire to save a few bits here and there rather than to create a straightforward standard that was easy to parse. I am actually staring at 60+ lines of fairly pristine code I wrote a few years back that ostensibly passed all tests, only to find out that in fact it does not cover all corner cases. (Could have sworn I read the spec correctly, but apparently not!)
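
For what it's worth, the corner cases that most often slip through hand-rolled decoders are the ones the spec explicitly forbids: overlong encodings, the surrogate range U+D800..U+DFFF, and anything above U+10FFFF. A rough sketch of those checks (written in Python here purely as an illustration, not the code in question):

    def decode_utf8_strict(data: bytes) -> str:
        out, i = [], 0
        while i < len(data):
            b = data[i]
            if b < 0x80:
                cp, n = b, 1
            elif 0xC2 <= b <= 0xDF:      # 0xC0/0xC1 could only encode overlong forms
                cp, n = b & 0x1F, 2
            elif 0xE0 <= b <= 0xEF:
                cp, n = b & 0x0F, 3
            elif 0xF0 <= b <= 0xF4:      # 0xF5 and up would exceed U+10FFFF
                cp, n = b & 0x07, 4
            else:
                raise ValueError(f"invalid lead byte at offset {i}")
            if i + n > len(data):
                raise ValueError("truncated sequence")
            for j in range(1, n):
                c = data[i + j]
                if (c & 0xC0) != 0x80:
                    raise ValueError(f"bad continuation byte at offset {i + j}")
                cp = (cp << 6) | (c & 0x3F)
            if (n == 2 and cp < 0x80) or (n == 3 and cp < 0x800) or (n == 4 and cp < 0x10000):
                raise ValueError("overlong encoding")
            if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
                raise ValueError("surrogate or out-of-range code point")
            out.append(chr(cp))
            i += n
        return "".join(out)

Ordinary multilingual test data never contains those byte sequences, because no correct encoder emits them; they have to be constructed deliberately to show up in tests.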

  • eru a day ago
    • spyrja 20 hours ago

      In this particular case it was simply a matter of not having enough corner cases defined. I was, however, using property-based testing, doing things like reversing then un-reversing the UTF-8 strings, re-ordering code points, merging strings, etc., for verification. The datasets were in a variety of languages (including emojis) and so I mistakenly thought I had covered all the bases.
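
      A stripped-down version of that kind of property test (using the hypothesis library; my_decode is just a placeholder for whichever decoder is actually under test) might look like:

          from hypothesis import given, strategies as st

          def my_decode(data: bytes) -> str:
              # placeholder: swap in the hand-rolled decoder being tested
              return data.decode("utf-8")

          @given(st.text())
          def test_roundtrip(s):
              assert my_decode(s.encode("utf-8")) == s

          @given(st.text(), st.text())
          def test_concatenation(a, b):
              # decoding a concatenation equals concatenating the decodings
              expected = my_decode(a.encode("utf-8")) + my_decode(b.encode("utf-8"))
              assert my_decode((a + b).encode("utf-8")) == expected

          @given(st.binary())
          def test_rejects_garbage_gracefully(data):
              # arbitrary bytes must either decode cleanly or raise, never misbehave
              try:
                  my_decode(data)
              except ValueError:  # UnicodeDecodeError is a subclass of ValueError
                  pass

          if __name__ == "__main__":
              test_roundtrip()
              test_concatenation()
              test_rejects_garbage_gracefully()

      The catch is that round-tripping valid strings can never produce the forbidden byte sequences (overlong forms, stray continuation bytes, and so on), so only the binary-input property actually exercises those paths.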

      But thank you for the link, it's turning out to be a very enjoyable read! There already seem to be a few things I could do better thanks to the article, besides the fact that it codifies a lot of interesting approaches one can take to improve testing in general.