inferiorhuman a day ago

Problems arise when you try to take a slice of a string and end up picking an index (perhaps based on length) that would split a code point. String/str offers an abstraction over Unicode scalars (code points) via the chars iterator, but it all feels a bit messy to have the byte based abstraction more or less be the default.

FWIW the docs indicate that working with grapheme clusters will never end up in the standard library.
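
(A minimal sketch of what grapheme iteration looks like, assuming the third-party unicode-segmentation crate and its `graphemes` API; this is an illustration, not part of std:)

  // Sketch only: unicode-segmentation is an external crate, not std.
  use unicode_segmentation::UnicodeSegmentation;

  fn main() {
      let s = "e\u{301}x"; // 'e' + combining acute accent, then 'x'
      // Three Unicode scalars, but only two grapheme clusters.
      println!("{}", s.chars().count());         // 3
      println!("{}", s.graphemes(true).count()); // 2
  }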

xyzzyz a day ago

You can easily treat `&str` as bytes: just call `.as_bytes()` and you get `&[u8]`, no questions asked. The reason you don't want to treat &str as just bytes by default is that it's almost always the wrong thing to do. Moreover, it's the worst kind of wrong, because it actually works correctly 99% of the time, so you might not even realize you have a bug until much too late.

If your API takes &str, and tries to do byte-based indexing, it should almost certainly be taking &[u8] instead.
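
A sketch of that last point, with hypothetical function names of my own rather than anything from the thread: if the logic is genuinely byte-oriented, take &[u8] and let &str callers opt in explicitly with .as_bytes().

  // Hypothetical byte-oriented API: it takes &[u8], not &str.
  fn checksum(data: &[u8]) -> u8 {
      data.iter().fold(0u8, |acc, b| acc.wrapping_add(*b))
  }

  fn main() {
      // &str callers make the byte view explicit at the call site.
      println!("{}", checksum("héllo".as_bytes()));
      println!("{}", checksum(b"raw bytes"));
  }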

  • inferiorhuman a day ago

      If your API takes &str, and tries to do byte-based indexing, it should
      almost certainly be taking &[u8] instead.
    
    Str is indexed by bytes. That's the issue.
    • xyzzyz 6 hours ago

      As a matter of fact, you cannot do

        let s = "asd";
        println!("{}", s[0]);
      
      You will get a compiler error telling you that you cannot index into &str.
      • inferiorhuman 4 hours ago

        Right, you have to give it a usize range. And that will index by bytes. This:

          fn main() {
              let s = "12345";
              println!("{}", &s[0..1]);
          }
        
        compiles and prints out "1".

        This:

          fn main() {
              let s = "\u{1234}2345";
              println!("{}", &s[0..1]);
          }
        
        compiles and panics with the following error:

          byte index 1 is not a char boundary; it is inside 'ሴ' (bytes 0..3) of `ሴ2345`
        
        To get the nth char (scalar codepoint):

          fn main() {
              let s = "\u{1234}2345";
              println!("{}", s.chars().nth(1).unwrap());
          }
        
        To get a substring:

          fn main() {
              let s = "\u{1234}2345";
              println!("{}", s.chars().skip(0).take(1).collect::<String>());
          }
        
        To actually get the bytes you'd have to call `.as_bytes()`, which works with both single and range indices, e.g.:

          fn main() {
              let s = "\u{1234}2345";
              println!("{:02X?}", &s.as_bytes()[0..1]);
              println!("{:02X}", &s.as_bytes()[0]);
          }
        
        
        IMO it's less intuitive than it should be, but still less bad than, e.g., Go's two types of nil, because it will fail in a visible manner.
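
        (For completeness, a small sketch of the non-panicking route, which isn't raised above: str::get with a range returns an Option instead of panicking on a non-boundary index.)

          fn main() {
              let s = "\u{1234}2345";
              // get() returns None instead of panicking at a non-boundary index.
              println!("{:?}", s.get(0..1));         // None
              println!("{:?}", s.get(0..3));         // Some("ሴ")
              println!("{}", s.is_char_boundary(1)); // false
          }
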
toast0 a day ago

> but it all feels a bit messy to have the byte based abstraction more or less be the default.

I mean, really neither should be the default. You should have to pick chars or bytes on use, but I don't think that's palatable; most languages have chosen one or the other as the preferred form. Or some have the joy of having been forward thinking in the 90s and built around UCS-2, later extended to UTF-16, so you've got 16-bit 'characters' and some code points that take two of them. Of course, dealing with operating systems means dealing with whatever they have as well as what the language prefers (or, as discussed elsewhere in this thread, pretending it doesn't exist to make easy things easier and hard things harder).
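
(A quick sketch of the UTF-16 point, my own example rather than toast0's: a code point outside the BMP is one char in Rust but two UTF-16 code units, i.e. a surrogate pair.)

  fn main() {
      // U+1D11E MUSICAL SYMBOL G CLEF: one code point, one char,
      // two UTF-16 code units, four UTF-8 bytes.
      let s = "\u{1D11E}";
      println!("{}", s.chars().count());        // 1
      println!("{}", s.encode_utf16().count()); // 2
      println!("{}", s.len());                  // 4 (bytes)
  }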