Comment by 0x000xca0xfe 2 days ago
Thanks for your reply. I understand that encoding the character set in the type system is more explicit and can help find bugs.
But forcing all strings to be UTF-8 does not magically fix the issue you described. In practice I've often seen the opposite: now you have to write two code paths, one for UTF-8 and one for everything else, and the second one gets neglected because it is annoying to write. For example, I built the web server project in your other submission (very cool!) and gave it a tar file containing an entry with a non-UTF-8 name. There is no special handling: I simply get "error: invalid UTF-8 was detected in one or more arguments" and the application exits. It just refuses to work with non-UTF-8 files at all -- is this less sloppy?
Forcing UTF-8 does not "fix" compatibility in strange edge cases; it just breaks them all. The best approach is to treat data as opaque bytes unless there is a good reason not to. That is what Go does, so I think it is unfair to blame Go for this rather than the backup applications.
> It just refuses to work with non-UTF-8 files at all -- is this less sloppy?
You can debate whether it is sloppy, but I think a hard error is much better than silently corrupting data.
> The best approach is to treat data as opaque bytes unless there is a good reason not to
This doesn't seem like a good approach when dealing with strings, which are not just blobs of bytes. They have an encoding, and generally you want operations such as converting a string to upper/lowercase.