xyzzyz a day ago

This also epitomizes the issue. What's the point of having a `string` type at all, if it doesn't allow you to make any extra assumptions about the contents beyond `[]byte`? The answer is that they planned to make conversion to `string` error out when it's invalid UTF-8, and then assume that `string`s are valid UTF-8, but then it caused problems elsewhere, so they dropped it for immediate practical convenience.
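
To make that concrete (a rough sketch; `b` and `s` are just illustrative names): the conversion below never complains, and only an explicit, opt-in check from unicode/utf8 reveals that the "string" isn't text at all.

    b := []byte{0xff, 0xfe, 0xfd}    // bytes that can never appear in valid UTF-8
    s := string(b)                   // conversion succeeds: no error, no validation
    fmt.Println(utf8.ValidString(s)) // false; the check lives in unicode/utf8 and is opt-in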

tialaramex a day ago

Rust apparently got relatively close to not having &str as a primitive type and instead only providing a library alias to &[u8] when Rust 1.0 shipped.

Score another for Rust's Safety Culture. It would be convenient to just have &str as an alias for &[u8], but if that mistake had been allowed, all the safety checking that Rust now does centrally would have to be owned by every single user forever. Instead of a few dozen checks overseen by experts, there'd be myriad checks sprinkled across every project, always ready to bite you.

  • inferiorhuman a day ago

    Even so, you end up with paper cuts like `len`, which returns the number of bytes.

    • toast0 a day ago

      The problem with string length is that there are probably at least four concepts that could conceivably be called length, and few people are happy with whichever one ends up being `len`.

      Off the top of my head, in order of likely difficulty to calculate: byte length, number of code points, number of graphemes/characters, and height/width to display.

      Maybe it would be best for Str not to have len at all. It could have bytes, code_points, graphemes. And every use would be precise.
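
      In Go terms, the first two are already separate operations (a rough sketch; grapheme counting isn't in the standard library and would need a third-party segmentation package):

          s := "héllo"
          fmt.Println(len(s))                    // 6: byte length ("é" is 2 bytes in UTF-8)
          fmt.Println(utf8.RuneCountInString(s)) // 5: number of code points
          // graphemes and display height/width need extra context or libraries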

      • stouset a day ago

        > The problem with string length is that there are probably at least four concepts that could conceivably be called length.

        The answer here isn't to throw up your hands, pick one, and other cases be damned. It's to expose them all and let the engineer choose. To not beat the dead horse of Rust, I'll point out that Ruby gets this right too.

            * String#length                   # count Unicode code points
            * String#bytes#length             # count bytes
            * String#grapheme_clusters#length # count grapheme clusters
        
        Similarly, each of those "views" lets you slice, index, etc. across those concepts naturally. Golang's strings are the worst of them all. They're nominally UTF-8, but nothing actually enforces it; really they're just buckets of bytes, unless you send them to APIs that silently require them to be UTF-8 and then drop them on the floor or misbehave when they're not.

        Height/width to display is font-dependent, so can't just be on a "string" but needs an object with additional context.

      • branko_d a day ago

        You could also have the number of code UNITS, which is the route C# took.

      • inferiorhuman a day ago

        Problems arise when you try to take a slice of a string and end up picking an index (perhaps based on length) that would split a code point. String/str offers an abstraction over Unicode scalars (code points) via the chars iterator, but it all feels a bit messy to have the byte-based abstraction more or less be the default.

        FWIW the docs indicate that working with grapheme clusters will never end up in the standard library.

  • adastra22 a day ago

    . (early morning brain fart -- I need my coffee)

    • tialaramex a day ago

      So it's true that technically the primitive type is str, and indeed it's even possible to make a &mut str, though it's quite rare that you'd want to mutably borrow the string slice.

      However, no, &str is not "an alias for &&String", and I can't quite imagine how you'd think that. String doesn't exist in Rust's core; it comes from alloc and thus wouldn't be available if you don't have an allocator.

      • zozbot234 a day ago

        str is not really a "primitive type", it only exists abstractly as an argument to type constructors - treating the & operator as a "type constructor" for that purpose, but including Box<>, Rc<>, Arc<> etc. So you can have Box<str> or Arc<str> in addition to &str or perhaps &mut str, but not really 'str' in isolation.

0x000xca0xfe a day ago

Why not use utf8.ValidString in the places it is needed? Why burden one of the most basic data types with highly specific format checks?

It's far better to get the occasional � when working with messy data than to have applications refuse to work and error out left and right.
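
In Go that's a couple of lines at the boundary where it actually matters (a rough sketch; `input` is just whatever messy data arrived):

    if !utf8.ValidString(input) {
        // degrade gracefully: swap invalid byte runs for U+FFFD instead of refusing the data
        input = strings.ToValidUTF8(input, "\uFFFD")
    }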

  • const_cast a day ago

    IMO UTF-8 isn't a highly specific format; it's universal for text. Every ASCII string you'd write in C or C++ or whatever is already valid UTF-8.

    So that means for 99% of scenarios, there is no difference between char[] and a proper UTF-8 string. They have the same data representation and memory layout.

    The problem comes in when people start using string like they use string in PHP. They just use it to store random bytes or other binary data.

    This makes no sense with the string type. String is text, but now we don't have text. That's a problem.

    We should use byte[] or something for this instead of string. That's an abuse of string. I don't think requiring strings to be text is too constraining - that's what a string is!

    • kragen a day ago

      The approach you are advocating is the approach that was abandoned, for good reasons, in the Unix filesystem in the 70s and in Perl in the 80s.

      One of the great advances of Unix was that you don't need separate handling for binary data and text data; they are stored in the same kind of file and can be contained in the same kinds of strings (except, sadly, in C). Occasionally you need to do some kind of text-specific processing where you care, but the rest of the time you can keep all your code 8-bit clean so that it can handle any data safely.

      Languages that have adopted the approach you advocate, such as Python, frequently have bugs like exception tracebacks they can't print (because stdout is set to ASCII) or filenames they can't open when they're passed in on the command line (because they aren't valid UTF-8).

    • adastra22 a day ago

      Not all text is UTF-8, and there are real world contexts (e.g. Windows) where this matters a lot.

      • const_cast a day ago

        Yes, Windows text is broken in its own special way.

        We can try to shove it into objects that work on other text but this won't work in edge cases.

        Like if I take text on Linux and try to write a Windows file with that text, it's broken. And vice versa.

        Go lets you do the broken thing. In Rust, or even using libraries in most languages, you can't. You have to specifically convert between them.

        That's what I mean when I say "storing random binary data as text". Sure, Windows' almost-UTF-16 abomination is kind of text, but not really. It's its own thing. That requires a different type of string OR converting it to a normal string.

roncesvalles a day ago

I've always thought the point of the string type was for indexing. One index of a string is always one character, but characters are sometimes composed of multiple bytes.

  • crazygringo a day ago

    Yup. But to be clear, in Unicode a string will index code points, not characters. E.g. a single emoji can be made of multiple code points, as well as certain characters in certain languages. The Unicode name for a character like this is a "grapheme", and grapheme splitting is so complicated it generally belongs in a dedicated Unicode library, not a general-purpose string object.
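
    For example (a rough Go-flavoured sketch): a thumbs-up emoji with a skin-tone modifier looks like one character but is two code points.

        s := "👍🏽" // U+1F44D THUMBS UP SIGN + U+1F3FD skin-tone modifier
        fmt.Println(len(s))                    // 8 bytes
        fmt.Println(utf8.RuneCountInString(s)) // 2 code points, one visible "character"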

  • birn559 a day ago

    You can't do that in a performant way, and going that route can lead to problems, because characters (= graphemes in the language of Unicode) don't always behave the way developers assume.

assbuttbuttass a day ago

string is just an immutable []byte. It's actually one of my favorite things about Go that strings can contain invalid UTF-8, so you don't end up with the Rust mess of String vs OsString vs PathBuf vs Vec<u8>. It's all just string.

  • zozbot234 a day ago

    Rust &str and String are specifically intended for UTF-8 valid text. If you're working with arbitrary byte sequences, that's what &[u8] and Vec<u8> are for in Rust. It's not a "mess", it's just different from what Golang does.

    • gf000 a day ago

      If anything, that will make Rust programs more likely to be correct under any strange text input, while Go might just handle the happy path of ASCII inputs.

      Stuff like this matters a great deal on the standard library level.

    • maxdamantus a day ago

      It's never been clear to me where such a type is actually useful. In what cases do you really need to restrict it to valid UTF-8?

      You should always be able to iterate the code points of a string, whether or not it's valid Unicode. The iterator can either silently replace any errors with replacement characters, or denote the errors by returning, e.g., `Result<char, Utf8Error>`, depending on the use case.
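
      Go's unicode/utf8 supports both styles over its unvalidated strings (a rough sketch; `s` is whatever string you're scanning):

          for i := 0; i < len(s); {
            r, size := utf8.DecodeRuneInString(s[i:])
            if r == utf8.RuneError && size == 1 {
              // a genuine encoding error (a real U+FFFD decodes with size 3):
              // replace it, report it, or keep the raw byte s[i], as the use case demands
            }
            i += size
          }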

      All languages that have tried restricting strings to valid Unicode have, afaik, ended up adding workarounds for the fact that real-world "text" sometimes has encoding errors, and it's often better to just preserve the errors instead of corrupting the data through replacement characters, or refusing to accept some inputs and crashing the program.

      In Rust there's bstr/ByteStr (currently being added to std); it's awkward having to decide which string type to use.

      In Python there's PEP-383/"surrogateescape", which works because Python strings are not guaranteed valid (they're potentially ill-formed UTF-32 sequences, with a range restriction). It's awkward figuring out when to actually use it.

      In Raku there's UTF8-C8, which is probably the weirdest workaround of all (left as an exercise for the reader to try to understand .. oh, and it also interferes with valid Unicode that's not normalized, because that's another stupid restriction).

      Meanwhile the Unicode standard itself specifies Unicode strings as being sequences of code units [0][1], so Go is one of the few modern languages that actually implements Unicode (8-bit) strings. Note that at least two out of the three inventors of Go also basically invented UTF-8.

      [0] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

      > Unicode string: A code unit sequence containing code units of a particular Unicode encoding form.

      [1] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

      > Unicode strings need not contain well-formed code unit sequences under all conditions. This is equivalent to saying that a particular Unicode string need not be in a Unicode encoding form.

      • xyzzyz a day ago

        The way Rust handles this is perfectly fine. The String type promises that its contents are valid UTF-8. When you create one from an array of bytes, you have three options: 1) ::from_utf8, which forces you to handle the invalid-UTF-8 error; 2) ::from_utf8_lossy, which replaces invalid byte sequences with the replacement character; and 3) ::from_utf8_unchecked, which skips the validity check and is explicitly marked unsafe.

      • empath75 a day ago

        > It's never been clear to me where such a type is actually useful. In what cases do you really need to restrict it to valid UTF-8?

        Because 99.999% of the time you want it to be valid and would like an error if it isn't? If you want to work with invalid UTF-8, that should be a deliberate choice.

naikrovek a day ago

I think maybe you've forgotten about the rune type. Rune does make assumptions.

[]rune is for sequences of Unicode characters (code points). rune is an alias for int32. string, I think, is an alias for []byte.

  • TheDong a day ago

    `string` is not an alias for []byte.

    Consider:

        s := string([]byte{226, 150, 136, 226, 150, 136}) // two U+2588 runes, 3 bytes each
        for i, chr := range s {
          fmt.Printf("%d = %v\n", i, chr)
          // note, s[i] != chr: s[i] is a single byte, chr is a whole rune
        }
    
    How many times does that loop over 6 bytes iterate? The answer is it iterates twice, with i=0 and i=3.

    There are also quite a few standard APIs that behave weirdly if a string is not valid UTF-8, which wouldn't be the case if it were just a bag of bytes.