Comment by xyzzyz a day ago

You can easily treat `&str` as bytes: just call `.as_bytes()` and you get `&[u8]`, no questions asked. The reason you don't want to treat `&str` as just bytes by default is that it's almost always the wrong thing to do. Moreover, it's the worst kind of wrong, because it actually works correctly 99% of the time, so you might not even realize you have a bug until much too late.

If your API takes &str, and tries to do byte-based indexing, it should almost certainly be taking &[u8] instead.
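
To make the "works 99% of the time" point concrete, here is a minimal sketch comparing byte length and char count. The helper name `byte_and_char_len` is made up for illustration; `as_bytes` and `chars` are the standard library methods.

```rust
/// Returns (byte length, char count) for a string slice.
/// Hypothetical helper, only here to show where the two diverge.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.as_bytes().len(), s.chars().count())
}

fn main() {
    // ASCII only: one byte per char, so byte-based logic appears to work.
    println!("{:?}", byte_and_char_len("asd"));
    // U+1234 takes three bytes in UTF-8, so the two counts differ.
    println!("{:?}", byte_and_char_len("\u{1234}2345"));
}
```

With ASCII-only input both numbers agree, which is why byte-based code on `&str` tends to pass tests and then fail later on real text.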

inferiorhuman a day ago

  If your API takes &str, and tries to do byte-based indexing, it should
  almost certainly be taking &[u8] instead.
`&str` is indexed by bytes. That's the issue.
  • xyzzyz 12 hours ago

    As a matter of fact, you cannot do

      let s = "asd";
      println!("{}", s[0]);
    
    You will get a compiler error telling you that you cannot index into &str.
    • inferiorhuman 10 hours ago

      Right, you have to give it a usize range. And that will index by bytes. This:

        fn main() {
            let s = "12345";
            println!("{}", &s[0..1]);
        }
      
      compiles and prints out "1".

      This:

        fn main() {
            let s = "\u{1234}2345";
            println!("{}", &s[0..1]);
        }
      
      compiles and panics with the following error:

        byte index 1 is not a char boundary; it is inside 'ሴ' (bytes 0..3) of `ሴ2345`
      
      To get the nth char (a Unicode scalar value; `nth` is zero-based, so this prints the second one):

        fn main() {
            let s = "\u{1234}2345";
            println!("{}", s.chars().nth(1).unwrap());
        }
      
      To get a substring:

        fn main() {
            let s = "\u{1234}2345";
            println!("{}", s.chars().skip(0).take(1).collect::<String>());
        }
      
      To actually get the bytes you'd have to call `.as_bytes()`, which works with both scalar and range indices, e.g.:

        fn main() {
            let s = "\u{1234}2345";
            println!("{:02X?}", &s.as_bytes()[0..1]);
            println!("{:02X}", &s.as_bytes()[0]);
        }
      
      
      IMO it's less intuitive than it should be but still less bad than e.g. Go's two types of nil because it will fail in a visible manner.
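
For anyone who wants the range lookup without the panic, a minimal sketch using `str::get` and `str::is_char_boundary` from the standard library. The wrapper name `try_slice` is just for illustration.

```rust
/// Returns the sub-slice for a byte range, or None if the range is out of
/// bounds or either end falls inside a multi-byte character.
/// `try_slice` is a hypothetical wrapper around `str::get`.
fn try_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    let s = "\u{1234}2345";
    // Byte 1 is inside the three-byte encoding of U+1234, so this is None
    // where `&s[0..1]` would panic.
    println!("{:?}", try_slice(s, 0, 1));
    // Byte 3 is the first boundary after U+1234.
    println!("{:?}", try_slice(s, 0, 3));
    println!("{} {}", s.is_char_boundary(1), s.is_char_boundary(3));
}
```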
      • xyzzyz 5 hours ago

        It's actually somewhat hard to hit that panic in a realistic scenario. This is because you are unlikely to be using slice indices that are not on a character boundary. Where would you even get them from? All the standard library functions will return byte indices on a character boundary. For example, if you try to do something like slice the string between first occurrence of character 'a', and of character 'z', you'll do something like

          let start = s.find('a')?;
          let end = s.find('z')?;
          let sub = &s[start..end];
        
        and it will never hit that char-boundary panic, because find will never return something that's not on a char boundary. (It can still panic if 'z' occurs before 'a', since then start > end.)
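
A self-contained version of the snippet above, as a sketch. The function name `between` is hypothetical, and it swaps the indexing expression for `str::get` so that the case where the second character comes before the first returns `None` instead of panicking.

```rust
/// Slice from the first `from` up to (not including) the first `to`.
/// Hypothetical helper; returns None if either char is missing or if
/// `to` occurs before `from`.
fn between(s: &str, from: char, to: char) -> Option<&str> {
    // `find` returns byte offsets that always sit on char boundaries.
    let start = s.find(from)?;
    let end = s.find(to)?;
    // `get` also covers start > end, which `&s[start..end]` would not.
    s.get(start..end)
}

fn main() {
    println!("{:?}", between("xa\u{1234}bz", 'a', 'z'));
    println!("{:?}", between("zebra", 'a', 'z'));
}
```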
        • inferiorhuman 4 hours ago

            Where would you even get them from?
          
          In my case it was in parsing text where a numeric value had a two character prefix but a string value did not. So I was matching on 0..2 (actually 0..2.min(string.len()) which doubly highlights the indexing issue) which blew up occasionally depending on the string values. There are perhaps smarter ways to do this (e.g. splitn on a space, regex, giant if-else statement, etc, etc) but this seemed at first glance to be the most efficient way because it all fit neatly into a match statement.
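
One way to keep the match statement without the panic, as a rough sketch: match on `s.get(0..2)` instead of `&s[0..2]`. The `"n:"` prefix, the `Value` enum, and `classify` are all made up here, since the actual input format isn't given.

```rust
/// Hypothetical result type: a prefixed numeric field or plain text.
#[derive(Debug, PartialEq)]
enum Value<'a> {
    Number(&'a str),
    Text(&'a str),
}

/// Classifies a field by an assumed two-byte ASCII prefix "n:".
fn classify(s: &str) -> Value<'_> {
    // `get` yields None when the string is shorter than two bytes or when
    // byte 2 falls inside a multi-byte character, so nothing here panics.
    match s.get(0..2) {
        // Safe to slice: `get(0..2)` succeeded, so 2 is a valid boundary.
        Some("n:") => Value::Number(&s[2..]),
        _ => Value::Text(s),
    }
}

fn main() {
    println!("{:?}", classify("n:42"));
    println!("{:?}", classify("\u{1234}2345"));
    println!("{:?}", classify("a"));
}
```

This also removes the need for the `.min(string.len())` clamp, because a too-short string simply falls through to the default arm.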

          The inverse was also a problem: laying out text with a monospace font knowing that every character took up the same number of pixels along the x-axis (e.g. no odd emoji or whatever else). Gotta make sure to call `.count()` on `.chars()` instead of `.len()` on the string itself, as some of the text (Windows-1250 encoded) got converted into multi-byte UTF-8 sequences.
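
A small sketch of the layout case, assuming one fixed-width glyph per `char` (so no combining marks or double-width characters, as described above). The function name `text_width_px` and the glyph width parameter are hypothetical.

```rust
/// Pixel width of `s` in a monospace font, assuming every char maps to
/// exactly one glyph of `glyph_px` pixels. Hypothetical helper.
fn text_width_px(s: &str, glyph_px: usize) -> usize {
    // Count chars, not bytes: `s.len()` would overcount non-ASCII text.
    s.chars().count() * glyph_px
}

fn main() {
    // U+017E and U+0165 exist in Windows-1250 as single bytes but take
    // two bytes each in UTF-8.
    let s = "\u{17E}lu\u{165}";
    println!("bytes: {}, chars: {}", s.len(), s.chars().count());
    println!("width: {} px", text_width_px(s, 8));
}
```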