Comment by xyzzyz a day ago

> If you use 3) to create a &str/String from invalid bytes, you can't safely use that string as the standard library is unfortunately designed around the assumption that only valid UTF-8 is stored.

Yes, and that's a good thing. It lets any code that receives a &str/String assume that the input is valid UTF-8. The alternative would be that every single time you write a function that takes a string as an argument, you have to analyze your code, consider what would happen if the argument was not valid UTF-8, and handle that appropriately. You'd also have to redo the whole analysis every time you modify the function. That's a horrible waste of time. It's much better to:

1) Convert things to String early, and assume validity later, and

2) Make functions that explicitly don't care about validity take &[u8] instead.

This is, of course, exactly what Rust does: I am not aware of a single thing that &str allows you to do that you cannot do with &[u8], except things that do require you to assume it's valid UTF-8.
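To make the two rules concrete, here is a minimal sketch using only the standard library. The function names `count_lines` and `count_scalar_values` are made up for illustration; the first takes &[u8] because it never cares about validity, the second takes &str and leans on the guarantee that validation already happened at the boundary.

```rust
/// Validity-agnostic: counts newline bytes in any byte sequence.
/// (Hypothetical helper, named for this example only.)
fn count_lines(data: &[u8]) -> usize {
    data.iter().filter(|&&b| b == b'\n').count()
}

/// Relies on the &str invariant: `chars()` decodes code points
/// without having to handle malformed input.
/// (Hypothetical helper, named for this example only.)
fn count_scalar_values(text: &str) -> usize {
    text.chars().count()
}

fn main() {
    // "hi\n" followed by "é" (two bytes: 0xC3 0xA9) and another newline.
    let raw: Vec<u8> = vec![b'h', b'i', b'\n', 0xC3, 0xA9, b'\n'];

    // The byte-oriented function works before any validation.
    assert_eq!(count_lines(&raw), 2);

    // Validate once, at the boundary.
    let text = String::from_utf8(raw).expect("input should be valid UTF-8");

    // From here on, code taking &str can assume validity.
    assert_eq!(count_scalar_values(&text), 5);

    // A String can always be viewed as bytes again at no cost.
    assert_eq!(count_lines(text.as_bytes()), 2);

    // Invalid input is rejected at the boundary instead of leaking inward.
    assert!(String::from_utf8(vec![0xFF]).is_err());
}
```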

maxdamantus a day ago

> This is, of course, exactly what Rust does: I am not aware of a single thing that &str allows you to do that you cannot do with &[u8], except things that do require you to assume it's valid UTF-8.

Doesn't this demonstrate my point? If you can do everything with &[u8], what's the point in validating UTF-8? It's just a less universal string type, and your program wastes CPU cycles doing unnecessary validation.

  • matt_kantor a day ago

    > except things that do require you to assume it's valid UTF-8

    That's the point.

    • maxdamantus 18 hours ago

      But no one has demonstrated an actual operation that requires valid UTF-8. The reasoning is always circular: "I require valid UTF-8 because someone else requires valid UTF-8".

      Eventually there should be an underlying operation which can only work on valid UTF-8, but that doesn't exist. UTF-8 was designed such that invalid data can be detected and handled, without affecting the meaning of valid subsequences in the same string.
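As a rough illustration of "detected and handled" in Rust terms, the sketch below uses the standard library's `std::str::from_utf8`, `Utf8Error::valid_up_to`, and `String::from_utf8_lossy`. The helper `valid_prefix` is a hypothetical name invented for this example. It shows that one invalid byte can be located or replaced while the valid text on either side keeps its meaning; it does not settle whether validating up front is worth the cost.

```rust
use std::str;

/// Splits the input into its longest valid UTF-8 prefix and the remaining bytes.
/// (Hypothetical helper, named for this example only.)
fn valid_prefix(bytes: &[u8]) -> (&str, &[u8]) {
    match str::from_utf8(bytes) {
        // Everything is valid: the remainder is an empty slice.
        Ok(s) => (s, &bytes[bytes.len()..]),
        Err(e) => {
            let (good, rest) = bytes.split_at(e.valid_up_to());
            // `good` was just reported valid, so this second check cannot fail.
            (str::from_utf8(good).unwrap(), rest)
        }
    }
}

fn main() {
    // "ab", then the invalid byte 0xFF, then "cd".
    let bytes = b"ab\xFFcd";

    // The error pinpoints where the valid prefix ends.
    let (good, rest) = valid_prefix(bytes);
    assert_eq!(good, "ab");
    assert_eq!(rest, &b"\xFFcd"[..]);

    // Lossy decoding replaces only the invalid byte with U+FFFD;
    // the valid text before and after it is unchanged.
    let lossy = String::from_utf8_lossy(bytes);
    assert_eq!(lossy, "ab\u{FFFD}cd");
}
```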