Comment by xyzzyz 2 days ago

> There are a lot of operations that are valid and well-defined on binary strings, such as (...), and so on.

Every single thing you listed here is supported by the &[u8] type. That's the point: if you want to operate on data without assuming it's valid UTF-8, you just use &[u8] (or the allocating Vec<u8>), and the standard library offers what you'd typically want, except for the functions that assume the string is valid UTF-8 (e.g. iterating over code points). If you want those, you need to convert your &[u8] to &str, and the process of conversion forces you to check for conversion errors.
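A minimal sketch of that split, assuming nothing beyond the standard library: byte-level operations run on `&[u8]` with no validation, while anything that needs code points goes through `std::str::from_utf8`, which returns a `Result` the caller has to handle. The helper name `count_code_points` is made up for illustration.

```rust
use std::str;

/// Counts code points, which is only possible after the bytes
/// have been checked to be valid UTF-8.
fn count_code_points(bytes: &[u8]) -> Result<usize, str::Utf8Error> {
    let text = str::from_utf8(bytes)?; // validation happens here
    Ok(text.chars().count())
}

fn main() {
    let raw: &[u8] = b"caf\xc3\xa9"; // "café" encoded as UTF-8

    // Byte-level operations need no validation at all.
    println!("length in bytes: {}", raw.len());
    println!("starts with caf: {}", raw.starts_with(b"caf"));

    // Code-point operations force the caller to deal with invalid input.
    for input in [raw, &b"\x80"[..]] {
        match count_code_points(input) {
            Ok(n) => println!("{} code points", n),
            Err(e) => println!("not UTF-8: {}", e),
        }
    }
}
```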

maxdamantus a day ago

The problem is that there are so many functions that unnecessarily take `&str` rather than `&[u8]` because the expectation is that textual things should use `&str`.

So you naturally write another one of these functions that takes a `&str` so that it can pass to another function that only accepts `&str`.

Fundamentally no one actually requires validation (i.e., walking over the string an extra time up front); we're just making it part of the contract because something else has made it part of the contract.

  • kragen a day ago

    It's much worse than that—in many cases, such as passing a filename to a program on the Linux command line, correct behavior requires not validating, so erroring out when validation fails introduces bugs. I've explained this in more detail in https://news.ycombinator.com/item?id=44991638.

kragen 2 days ago

That's semantically okay, but giving &str such a short name creates a dangerous temptation to use it for things such as filenames, stdio, and command-line arguments, where that process of conversion introduces errors into code that would otherwise work reliably for any non-null-containing string, as it does in Go. If it were called something like ValidatedUnicodeTextSlice it would probably be fine.

  • adastra22 a day ago

    I'd agree if it was &[bytes] or whatever. But &[u8] isn't that much different from &str.

    • kragen a day ago

      Isn't &[u8] what you should be using for command-line arguments and filenames and whatnot? In that case you'd want its name to be short, like &[u8], rather than long like &[bytes] or &[raw_uncut_byte] or something.

      • adastra22 a day ago

        OsStr/OsString is what you would use in those circumstances. Path/PathBuf specifically for filenames or paths, which I think uses OsStr/OsString internally. I've never looked at OsStr's internals but I wouldn't be surprised if it is a wrapper around &[u8].

        Note that &[u8] would allow things like null bytes, and maybe other edge cases.
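For what it's worth, on Unix-like targets the standard library exposes the byte view directly through the `std::os::unix::ffi::OsStrExt` trait, so an `OsStr` or `Path` can be built from arbitrary bytes and turned back into them. The sketch below is Unix-only, and the helper name `path_from_bytes` is hypothetical. It also shows that a path with an interior null byte can be constructed but is rejected when you try to open it.

```rust
use std::ffi::OsStr;
use std::fs::File;
use std::os::unix::ffi::OsStrExt; // Unix-only extension trait
use std::path::Path;

/// Borrows arbitrary bytes as a Path without any UTF-8 validation.
fn path_from_bytes(bytes: &[u8]) -> &Path {
    Path::new(OsStr::from_bytes(bytes))
}

fn main() {
    let p = path_from_bytes(b"\x80");

    // The path is not valid UTF-8, so the checked conversion yields None...
    println!("as &str: {:?}", p.to_str());
    // ...but the original bytes are still available unchanged.
    println!("as bytes: {:?}", p.as_os_str().as_bytes());

    // A null byte can be stored in the path, but opening it fails
    // before any file is touched.
    let with_nul = path_from_bytes(b"a\0b");
    println!("open with NUL failed: {}", File::open(with_nul).is_err());
}
```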

  • xyzzyz a day ago

    It's actually extremely hard to introduce problems like that, precisely because Rust's standard library is very well designed. Can you give an example scenario where it would be a problem?

    • kragen a day ago

      Well, for example, the extremely exotic scenario of passing command-line arguments to a program on little-known operating systems like Linux and FreeBSD; https://doc.rust-lang.org/book/ch12-01-accepting-command-lin... recommends:

        use std::env;
      
        fn main() {
            let args: Vec<String> = env::args().collect();
            ...
        }
      
      When I run this code, a literal example from the official manual, with this filename I have here, it panics:

          $ ./main $'\200'
          thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: "\x80"', library/std/src/env.rs:805:51
          note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
      
      ($'\200' is bash's notation for a single byte with the value 128. We'll see it below in the strace output.)

      So, literally any program anyone writes in Rust will crash if you attempt to pass it that filename, if it uses the manual's recommended way to accept command-line arguments. It might work fine for a long time, in all kinds of tests, and then blow up in production when a wild file appears with a filename that fails to be valid Unicode.

      This C program I just wrote handles it fine:

        #include <unistd.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
      
        char buf[4096];
      
        void
        err(char *s)
        {
          perror(s);
          exit(-1);
        }
      
        int
        main(int argc, char **argv)
        {
          int input, output;
          if ((input = open(argv[1], O_RDONLY)) < 0) err(argv[1]);
          if ((output = open(argv[2], O_WRONLY | O_CREAT, 0666)) < 0) err(argv[2]);
          for (;;) {
            ssize_t size = read(input, buf, sizeof buf);
            if (size < 0) err("read");
            if (size == 0) return 0;
            ssize_t size2 = write(output, buf, (size_t)size);
            if (size2 != size) err("write");
          }
        }
      
      (I probably should have used O_TRUNC.)

      Here you can see that it does successfully copy that file:

          $ cat baz
          cat: baz: No such file or directory
          $ strace -s4096 ./cp $'\200' baz
          execve("./cp", ["./cp", "\200", "baz"], 0x7ffd7ab60058 /* 50 vars */) = 0
          brk(NULL)                               = 0xd3ec000
          brk(0xd3ecd00)                          = 0xd3ecd00
          arch_prctl(ARCH_SET_FS, 0xd3ec380)      = 0
          set_tid_address(0xd3ec650)              = 4153012
          set_robust_list(0xd3ec660, 24)          = 0
          rseq(0xd3ecca0, 0x20, 0, 0x53053053)    = 0
          prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=9788*1024, rlim_max=RLIM64_INFINITY}) = 0
          readlink("/proc/self/exe", ".../cp", 4096) = 22
          getrandom("\xcf\x1f\xb7\xd3\xdb\x4c\xc7\x2c", 8, GRND_NONBLOCK) = 8
          brk(NULL)                               = 0xd3ecd00
          brk(0xd40dd00)                          = 0xd40dd00
          brk(0xd40e000)                          = 0xd40e000
          mprotect(0x4a2000, 16384, PROT_READ)    = 0
          openat(AT_FDCWD, "\200", O_RDONLY)      = 3
          openat(AT_FDCWD, "baz", O_WRONLY|O_CREAT, 0666) = 4
          read(3, "foo\n", 4096)                  = 4
          write(4, "foo\n", 4)                    = 4
          read(3, "", 4096)                       = 0
          exit_group(0)                           = ?
          +++ exited with 0 +++
          $ cat baz
          foo
      
      The Rust manual page linked above explains why they think introducing this bug by default into all your programs is a good idea, and how to avoid it:

      > Note that std::env::args will panic if any argument contains invalid Unicode. If your program needs to accept arguments containing invalid Unicode, use std::env::args_os instead. That function returns an iterator that produces OsString values instead of String values. We’ve chosen to use std::env::args here for simplicity because OsString values differ per platform and are more complex to work with than String values.

      I don't know what's "complex" about OsString, but for the time being I'll take the manual's word for it.
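For comparison, here is a rough sketch of the same copy program written against `std::env::args_os`, as the manual's note suggests. It is an illustration, not the manual's code: the `copy_file` helper and the usage message are invented here, and `File::create` truncates the destination, which differs slightly from the C version above.

```rust
use std::env;
use std::ffi::OsString;
use std::fs::File;
use std::io;
use std::path::Path;
use std::process;

/// Copies `src` to `dst`. The paths are never converted to `String`,
/// so no UTF-8 validation takes place.
fn copy_file(src: &Path, dst: &Path) -> io::Result<u64> {
    let mut input = File::open(src)?;
    let mut output = File::create(dst)?; // creates or truncates
    io::copy(&mut input, &mut output)
}

fn main() {
    // args_os() yields OsString values and does not panic on
    // arguments that are not valid Unicode.
    let args: Vec<OsString> = env::args_os().collect();
    if args.len() != 3 {
        eprintln!("usage: cp SRC DST");
        process::exit(2);
    }
    if let Err(e) = copy_file(Path::new(&args[1]), Path::new(&args[2])) {
        eprintln!("copy: {}", e);
        process::exit(1);
    }
}
```

The only changes from the `String` version are the element type of `args` and passing the arguments through `Path::new` instead of treating them as text.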

      So, Rust's approach evidently makes it extremely hard not to introduce problems like that, even in the simplest programs.

      Go's approach doesn't have that problem; this program works just as well as the C program, without the Rust footgun:

        package main
      
        import (
                "io"
                "log"
                "os"
        )
      
        func main() {
                src, err := os.Open(os.Args[1])
                if err != nil {
                        log.Fatalf("open source: %v", err)
                }
      
                dst, err := os.OpenFile(os.Args[2], os.O_CREATE|os.O_WRONLY, 0666)
                if err != nil {
                        log.Fatalf("create dest: %v", err)
                }
      
                if _, err := io.Copy(dst, src); err != nil {
                        log.Fatalf("copy: %v", err)
                }
        }
      
      (O_CREATE makes me laugh. I guess Ken did get to spell "creat" with an "e" after all!)

      This program generates a much less clean strace, so I am not going to include it.

      You might wonder how such a filename could arise other than as a deliberate attack. The most common scenario is when the filenames are encoded in a non-Unicode encoding like Shift-JIS or Latin-1, followed by disk corruption, but the deliberate attack scenario is nothing to sneeze at either. You don't want attackers to be able to create filenames that your tools can't see, or that turn your tools to stone when they examine them, like Medusa.

      Note that the log message on error also includes the ill-formed Unicode filename:

        $ ./cp $'\201' baz
        2025/08/22 21:53:49 open source: open ζ: no such file or directory
      
      But it didn't say ζ. It actually emitted a byte with value 129, making the error message ill-formed UTF-8. This is obviously potentially dangerous, depending on where that logfile goes, because it can include arbitrary terminal escape sequences. But note that Rust's UTF-8 validation won't protect you from that, or from things like this:

        $ ./cp $'\n2025/08/22 21:59:59 oh no' baz
        2025/08/22 21:59:09 open source: open 
        2025/08/22 21:59:59 oh no: no such file or directory
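Since validation doesn't help with this, the usual mitigation in any language is to escape the filename before it reaches the log. A minimal sketch in Rust, assuming the `Debug` formatting of `OsStr` is acceptable as the escaped form (it quotes the value and escapes control characters and, on Unix, non-UTF-8 bytes); the `loggable` helper is hypothetical, and the exact escape syntax is whatever the standard library produces:

```rust
use std::ffi::OsStr;

/// Renders a filename for a log line using its Debug representation,
/// so newlines and other control characters come out escaped rather
/// than being written to the log raw.
fn loggable(name: &OsStr) -> String {
    format!("{:?}", name)
}

fn main() {
    // The same hostile filename as in the shell example above.
    let name = OsStr::new("\n2025/08/22 21:59:59 oh no");

    // The whole message stays on one line because the newline is escaped.
    println!(
        "open source: open {}: no such file or directory",
        loggable(name)
    );
}
```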
      
      I'm not bagging on Rust. There are a lot of good things about Rust. But its string handling is not one of them.

      • anarki8 a day ago

        There might be potential improvements, like using OsString by default for `env::args()` but I would pick Rust's string handling over Go’s or C's any day.

        • kragen 18 hours ago

          It's reasonable to argue that C's string handling is as bad as Rust's, or worse.