Comment by eru

Comment by eru a day ago

You could use a standard that always uses eg 4 bytes per character, that is much easier to parse than UTF-8.

UTF-8 is so complicated, because it wants to be backwards compatible with ASCII.

ASCII compatibility isn't the only advantage of UTF-8 over UCS-4. It also

- requires less memory for most strings, particular ones that are largely limited to ASCII like structured text-based formats often are.

- doesn't need to care about byte order. UTF-8 is always UTF-8 while UTF-16 might either be little or big endian and UCS-4 could theoretically even be mixed endian.

- doesn't need to care about alignment: If you jump to a random memory position you can find the next and previous UTF-8 characters. This also means that you can use preexisting byte-based string functions like substring search for many UTF-8 operations.

Reply View 0 replies

degamad a day ago

It's not just the variable byte length that causes an issue, in some ways that's the easiest part of the problem. You also have to deal with code points that modify other code points, rather than being characters themselves. That's a huge part of the problem.

Reply View 4 replies

amake a day ago

That has nothing to do with UTF-8; that's a Unicode issue, and one that's entirely unescapable if you are the Unicode Consortium and your goal is to be compatible with all legacy charsets.

Reply View | 1 reply
- degamad 16 hours ago
  
  Yep, that's the point I was making - that choosing fixed 4-byte code-points doesn't significantly reduce the complexity of capturing everything that Unicode does.
  
  Reply View | 0 replies
bawolff a day ago

That goes all the way back to the beginning
Even ascii used to use "overstriking" where the backspace character was treated as a joiner character to put accents above letters.

Reply View | 1 reply
- degamad 16 hours ago
  
  Agreed, we just conveniently forget about those when speaking about how complex Unicode is.
  
  Reply View | 0 replies

spyrja a day ago

True. But then again, backward-compatibility isn't really such a hard to do with ASCII because the MSB is always zero. The problem I think is that the original motivation which ultimately lead to the complications we now see with UTF-8 was based on a desire to save a few bits here and there rather than create a straight-forward standard that was easy to parse. I am actually staring at 60+ lines of fairly pristine code I wrote a few years back that ostensibly passed all tests, only to find out that in fact it does not cover all corner cases. (Could have sworn I read the spec correctly, but apparently not!)

Reply View 2 replies

eru a day ago

You might like https://fsharpforfunandprofit.com/series/property-based-test...

Reply View | 1 reply
- spyrja 20 hours ago
  
  In this particular case it was simply a matter of not enough corner cases defined. I was however using property-based testing, doing things like reversing then un-reversing the UTF-8 strings, re-ordering code points, merging strings, etc for verification. The datasets were in a variety of languages (including emojis) and so I mistakenly thought I had covered all the bases.
  But thank you for the link, it's turning out to be a very enjoyable read! There already seems to be a few things I could do better thanks to the article, besides the fact that it codifies a lot of interesting approaches one can take to improve testing in general.
  
  Reply View | 0 replies