Comment by maxdamantus 2 days ago

But there's no option to just construct the string with the invalid bytes. 3) is not for this purpose; it is for when you already know the bytes are valid.

If you use 3) to create a &str/String from invalid bytes, you can't safely use that string, as the standard library is unfortunately designed around the assumption that only valid UTF-8 is stored.

https://doc.rust-lang.org/std/primitive.str.html#invariant

> Constructing a non-UTF-8 string slice is not immediate undefined behavior, but any function called on a string slice may assume that it is valid UTF-8, which means that a non-UTF-8 string slice can lead to undefined behavior down the road.
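
A minimal sketch of that distinction, using only the standard library: the checked conversion reports the invalid byte, while the unchecked one simply trusts the caller, and its safety contract is exactly the invariant quoted above.

  fn main() {
      let bytes: &[u8] = &[0x66, 0x6f, 0x6f, 0xff]; // "foo" followed by an invalid byte

      // Checked conversion: the invalid byte is reported as an error.
      assert!(std::str::from_utf8(bytes).is_err());

      // The unchecked conversion compiles, but its safety contract is the
      // invariant quoted above: the caller promises the bytes are valid UTF-8.
      // Doing this with `bytes` would violate that contract, so any later use
      // of the resulting &str may be undefined behavior.
      // let s = unsafe { std::str::from_utf8_unchecked(bytes) };
  }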

gf000 2 days ago

How could any library function work with completely random bytes? Like, how would it iterate over code points? It may want to assume UTF-8's standard rules and e.g. know that after this byte prefix, the next byte is also part of the same code point (excuse me if I'm using the wrong terminology), but now you need complex error handling at every single line, which would be unnecessary if you just made your type represent only valid instances.

Again, this is the same "simplistic" vs. "just the right abstraction" issue; this just smudges the complexity over a much larger surface area.

If you have a byte array that is not utf-8 encoded, then just... use a byte array.
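
As a rough sketch of what that buys you (std only): validate once at the boundary, and iterating over code points is infallible afterwards; all the error handling lives in one place.

  fn main() {
      let bytes: &[u8] = b"caf\xc3\xa9"; // "café" encoded as UTF-8

      // Validate once; everything after the match can assume well-formed UTF-8.
      match std::str::from_utf8(bytes) {
          Ok(s) => {
              for c in s.chars() {
                  println!("{c}"); // no per-iteration error handling needed
              }
          }
          Err(e) => eprintln!("not UTF-8: {e}"), // all the error handling lives here
      }
  }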

  • kragen 2 days ago

    There are a lot of operations that are valid and well-defined on binary strings, such as sorting them, hashing them, writing them to files, measuring their lengths, indexing a trie with them, splitting them on delimiter bytes or substrings, concatenating them, substring-searching them, posting them to ZMQ as messages, subscribing to them as ZMQ prefixes, using them as keys or values in LevelDB, and so on. For binary strings that don't contain null bytes, we can add passing them as command-line arguments and using them as filenames.

    The entire point of UTF-8 (designed, by the way, by the group that designed Go) is to encode Unicode in such a way that these byte string operations perform the corresponding Unicode operations, precisely so that you don't have to care whether your string is Unicode or just plain ASCII, so you don't need any error handling, except for the rare case where you want to do something related to the text that the string semantically represents. The only operation that doesn't really map is measuring the length.
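
    A quick sketch of a few of those operations on plain byte vectors/slices, using only std; none of them care whether the bytes happen to be valid UTF-8:

      use std::collections::HashMap;

      fn main() {
          let a: Vec<u8> = vec![0xff, 0x00, 0x61];       // not valid UTF-8
          let b: Vec<u8> = b"hello world".to_vec();

          let mut keys = vec![a.clone(), b.clone()];
          keys.sort();                                   // sorting (lexicographic byte order)

          let mut map: HashMap<Vec<u8>, u32> = HashMap::new();
          map.insert(a.clone(), 1);                      // hashing / use as a key

          let joined = [a.as_slice(), b.as_slice()].concat(); // concatenation
          println!("{}", joined.len());                  // measuring length (in bytes)

          let parts: Vec<&[u8]> = b.split(|&x| x == b' ').collect(); // splitting on a delimiter byte
          println!("{}", parts.len());
      }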

    • xyzzyz 2 days ago

      > There are a lot of operations that are valid and well-defined on binary strings, such as (...), and so on.

      Every single thing you listed here is supported by the &[u8] type. That's the point: if you want to operate on data without assuming it's valid UTF-8, you just use &[u8] (or the allocating Vec<u8>), and the standard library offers what you'd typically want, except for the functions that assume the string is valid UTF-8 (like e.g. iterating over code points). If you want that, you need to convert your &[u8] to &str, and the process of conversion forces you to check for conversion errors.

      • maxdamantus a day ago

        The problem is that there are so many functions that unnecessarily take `&str` rather than `&[u8]` because the expectation is that textual things should use `&str`.

        So you naturally write another one of these functions that takes a `&str`, so that it can pass it on to another function that only accepts `&str`.

        Fundamentally, no one actually requires validation (i.e., walking over the string an extra time up front); we're just making it part of the contract because something else has made it part of the contract.

        • kragen a day ago

          It's much worse than that—in many cases, such as passing a filename to a program on the Linux command line, correct behavior requires not validating, so erroring out when validation fails introduces bugs. I've explained this in more detail in https://news.ycombinator.com/item?id=44991638.
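
          A sketch of that, assuming a Unix target: filenames are arbitrary bytes (minus NUL and '/'), and Rust's OsStr/Path carry them through without validation, where routing them through &str would have to either error out or mangle the name.

            #[cfg(unix)]
            fn main() {
                use std::ffi::OsStr;
                use std::os::unix::ffi::OsStrExt;
                use std::path::Path;

                // A filename that is not valid UTF-8 but is perfectly legal on Linux.
                let raw: &[u8] = b"report-\xff.txt";
                let path = Path::new(OsStr::from_bytes(raw));

                // No validation and no error path: the bytes are passed through as-is.
                println!("{}", path.display());
            }

            #[cfg(not(unix))]
            fn main() {}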

      • kragen 2 days ago

        That's semantically okay, but giving &str such a short name creates a dangerous temptation to use it for things such as filenames, stdio, and command-line arguments, where that process of conversion introduces errors into code that would otherwise work reliably for any non-null-containing string, as it does in Go. If it were called something like ValidatedUnicodeTextSlice it would probably be fine.

    • gf000 2 days ago

      Then [u8] can surely implement those functions.

adastra22 2 days ago

I don’t understand this complaint. (3) sounds like exactly what you are asking for. And yes, doing unsafe things is unsafe.

  • maxdamantus a day ago

    > I don’t understand this complaint. (3) sounds like exactly what you are asking for. And yes, doing unsafe things is unsafe

    You're meant to use `unsafe` as a way of limiting the scope of reasoning about safety.

    Once you construct a `&str` using `from_utf8_unchecked`, you can't safely pass it to any other function without looking at its code and reasoning about whether it's still safe.

    Also see the actual documentation: https://doc.rust-lang.org/std/primitive.str.html#method.from...

    > Safety: The bytes passed in must be valid UTF-8.
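
    A minimal sketch (hypothetical helper name) of what that limited scope looks like when the contract is actually upheld: the unsafe block is justified by the check right next to it, so the reasoning never has to leave the function.

      fn ascii_only(bytes: &[u8]) -> Option<&str> {
          if bytes.is_ascii() {
              // SAFETY: every byte is ASCII, and ASCII is a subset of valid UTF-8.
              Some(unsafe { std::str::from_utf8_unchecked(bytes) })
          } else {
              None
          }
      }

      fn main() {
          assert_eq!(ascii_only(b"plain ascii"), Some("plain ascii"));
          assert!(ascii_only(&[0xff]).is_none());
      }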

xyzzyz a day ago

> If you use 3) to create a &str/String from invalid bytes, you can't safely use that string as the standard library is unfortunately designed around the assumption that only valid UTF-8 is stored.

Yes, and that's a good thing. It allows any code that gets a &str/String to assume that the input is valid UTF-8. The alternative would be that every single time you write a function that takes a string as an argument, you have to analyze your code, consider what would happen if the argument were not valid UTF-8, and handle that appropriately. You'd also have to redo the whole analysis every time you modify the function. That's a horrible waste of time; it's much better to:

1) Convert things to String early, and assume validity later, and

2) Make functions that explicitly don't care about validity take &[u8] instead.

This is, of course, exactly what Rust does: I am not aware of a single thing that &str allows you to do that you cannot do with &[u8], except things that do require you to assume it's valid UTF-8.
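
A rough sketch of that split, with hypothetical function names and only std: validation happens once at the edge, text-handling code takes &str, and code that doesn't care about validity takes &[u8].

  // 1) Validate once at the edge; everything downstream assumes validity.
  fn handle_input(raw: &[u8]) -> Result<String, std::str::Utf8Error> {
      let text = std::str::from_utf8(raw)?; // the only place a UTF-8 error can arise
      Ok(shout(text))
  }

  // Downstream code takes &str and needs no validity checks of its own.
  fn shout(text: &str) -> String {
      text.trim().to_uppercase()
  }

  // 2) A function that genuinely doesn't care about text takes &[u8] instead.
  fn checksum(raw: &[u8]) -> u32 {
      raw.iter().fold(0u32, |acc, &b| acc.wrapping_add(b as u32))
  }

  fn main() {
      assert_eq!(handle_input(b"  hi  ").unwrap(), "HI");
      let _ = checksum(&[0xff, 0x00]);
  }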

  • maxdamantus a day ago

    > This is, of course, exactly what Rust does: I am not aware of a single thing that &str allows you to do that you cannot do with &[u8], except things that do require you to assume it's valid UTF-8.

    Doesn't this demonstrate my point? If you can do everything with &[u8], what's the point in validating UTF-8? It's just a less universal string type, and your program wastes CPU cycles doing unnecessary validation.

    • matt_kantor a day ago

      > except things that do require you to assume it's valid UTF-8

      That's the point.

      • maxdamantus 18 hours ago

        But no one has demonstrated an actual operation that requires valid UTF-8. The reasoning is always circular: "I require valid UTF-8 because someone else requires valid UTF-8".

        For the reasoning to bottom out, there would eventually have to be some underlying operation that can only work on valid UTF-8, but no such operation exists. UTF-8 was designed such that invalid data can be detected and handled without affecting the meaning of the valid subsequences in the same string.
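
        For instance, a small sketch using the std lossy conversion: the invalid byte is detected and substituted without disturbing the valid text around it.

          fn main() {
              // Valid "café", then one invalid byte, then valid " bar".
              let bytes: &[u8] = b"caf\xc3\xa9 \xff bar";
              let fixed = String::from_utf8_lossy(bytes);
              assert_eq!(fixed, "café \u{FFFD} bar");
              println!("{fixed}");
          }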