afiori 2 days ago

I would like a UTF-8-optimized bag of bytes where arbitrary byte operations are possible but the buffer keeps track of whether it is valid UTF-8 or not (for every edit of n bytes it should be enough to check about n+8 bytes to revalidate). Then UTF-8 encoding/decoding becomes a no-op, and UTF-8-specific APIs can quickly check whether the string is malformed or not.
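
A minimal sketch of what such a buffer could look like, written in Rust for illustration; the type Utf8TrackedBuf, its splice method, and the exact windowing are editorial assumptions, not an existing library:

```rust
// Sketch: a byte buffer that allows arbitrary splices but tracks whether its
// contents are currently well-formed UTF-8, revalidating only a small window
// around each edit (UTF-8 sequences are at most 4 bytes long).
struct Utf8TrackedBuf {
    bytes: Vec<u8>,
    valid: bool, // is `bytes` currently well-formed UTF-8?
}

// A UTF-8 continuation byte has the bit pattern 10xxxxxx.
fn is_continuation(b: u8) -> bool {
    b & 0xC0 == 0x80
}

impl Utf8TrackedBuf {
    fn new() -> Self {
        Utf8TrackedBuf { bytes: Vec::new(), valid: true }
    }

    /// Replace bytes[start..end] with `new` and update the validity flag.
    fn splice(&mut self, start: usize, end: usize, new: &[u8]) {
        let was_valid = self.valid;

        // Raw byte edit: arbitrary, possibly non-UTF-8 bytes are allowed.
        let tail = self.bytes.split_off(end);
        self.bytes.truncate(start);
        self.bytes.extend_from_slice(new);
        self.bytes.extend_from_slice(&tail);

        if !was_valid {
            // Earlier damage could be anywhere, so fall back to a full check.
            self.valid = std::str::from_utf8(&self.bytes).is_ok();
            return;
        }

        // Fast path: the untouched prefix and suffix come from a valid string,
        // so checking roughly new.len() + 7 bytes around the edit suffices.
        // Window start: back up to the start of the sequence the unedited
        // prefix ends in (at most a lead byte plus three continuations).
        let mut win_start = start;
        if win_start > 0 {
            win_start -= 1;
            for _ in 0..3 {
                if win_start == 0 || !is_continuation(self.bytes[win_start]) {
                    break;
                }
                win_start -= 1;
            }
        }
        // Window end: absorb any continuation bytes the unedited suffix
        // starts with (at most three in a previously valid string).
        let mut win_end = start + new.len();
        for _ in 0..3 {
            if win_end >= self.bytes.len() || !is_continuation(self.bytes[win_end]) {
                break;
            }
            win_end += 1;
        }
        // Everything outside the window is unchanged complete sequences, so
        // the whole buffer is valid iff this window is valid.
        self.valid = std::str::from_utf8(&self.bytes[win_start..win_end]).is_ok();
    }
}
```

With the flag maintained this way, handing the buffer to UTF-8-only APIs reduces to checking the flag rather than rescanning the whole buffer, which is the "encoding/decoding becomes a no-op" part of the comment above.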

account42 2 days ago

But why care if it's malformed UTF-8? And specifically, what do you want to happen when you get a malformed UTF-8 string? Keep in mind that UTF-8 is self-synchronizing, so even if you encode strings into a larger text-based format without verifying them, it will still be possible to decode the document. As a user I normally want my programs to pass on the string without mangling it further. Some tool throwing fatal errors because some string I don't actually care about contains an invalid UTF-8 byte sequence is the last thing I want. With strings being an arbitrary bag of bytes, many programs can support arbitrary encodings, or at least arbitrary ASCII supersets, without any additional effort.

  • afiori 2 days ago

    The main issue I can see is not garbage bytes in text but the mixing of incompatible encodings, e.g. splicing Latin-1 bytes into a UTF-8 string (illustrated below).

    My understanding of the current "always and only UTF-8/Unicode" zeitgeist is that it comes mostly from encoding issues, among them the complexity of detecting an unknown encoding.

    I think that the current status quo is better than what came before, but I also think it could be improved.
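
As a concrete illustration of that splicing problem (an editorial example, not part of the thread): Latin-1 encodes "é" as the single byte 0xE9, which UTF-8 interprets as the lead byte of a three-byte sequence, so pasting Latin-1 bytes into a UTF-8 buffer yields a malformed string that a validity-tracking buffer like the one sketched above would flag immediately.

```rust
fn main() {
    // "café" in Latin-1: 'é' is the single byte 0xE9.
    let latin1 = [0x63, 0x61, 0x66, 0xE9];
    // UTF-8 reads 0xE9 as a lead byte that needs two continuation bytes,
    // so this byte string is rejected as malformed UTF-8.
    assert!(std::str::from_utf8(&latin1).is_err());

    // The same text in UTF-8 encodes 'é' as the two bytes 0xC3 0xA9.
    let utf8 = [0x63, 0x61, 0x66, 0xC3, 0xA9];
    assert!(std::str::from_utf8(&utf8).is_ok());
}
```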