Comment by tialaramex a day ago
I don't think Rust gets this horribly wrong. &str is some bytes which we've agreed are UTF-8 encoded text. So, it's not a sequence of graphemes, though it does promise that it could be interpreted that way, and it is a sequence of bytes but not just any bytes.
In Rust "AbcdeF"[1] isn't a thing, it won't compile, but "AbcdeF"[1..=1] says we want the UTF-8 substring starting from byte 1 through to byte 1 and that compiles, and it'll work because that string does have a valid UTF-8 substring there, it's "b" -- However it'll panic if we try to "€300"[1..=1] because that's no longer a valid UTF-8 substring, that's nonsense.
For app dev this is too low level, but it's nice to have a string abstraction that's at home on a small embedded device, where it doesn't matter whether I can interpret flags, emoji with skin tone modifiers, or whatever else as a distinct single grapheme in Unicode, but we would like to do a bit better than "Only ASCII works in this device" in 2025.
> I don't think Rust gets this horribly wrong
> In Rust "AbcdeF"[1] isn't a thing, it won't compile, but "AbcdeF"[1..=1] says we want the UTF-8 substring starting from byte 1 through to byte 1 and that compiles, and it'll work because that string does have a valid UTF-8 substring there, it's "b" -- However it'll panic if we try to "€300"[1..=1]
I disagree. IMO, an API that uses byte offsets to take substrings of Unicode text (code points, or even larger units?) is already a bad idea, but on top of that, having it panic when the byte offsets don't happen to fall on code point/(extended) grapheme cluster boundaries?
How are you supposed to use that when, as you say, "we would like to do a bit better than 'Only ASCII works in this device' in 2025"?
I see there’s a better API that doesn’t throw (https://doc.rust-lang.org/std/primitive.str.html#method.get), but IMO it still isn’t as nice as Swift’s choice, because it still uses byte offsets.
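For reference, the non-panicking form from that link looks roughly like this (a small sketch):

    fn main() {
        // `str::get` returns Option<&str> instead of panicking,
        // giving None when the byte range doesn't fall on char boundaries.
        assert_eq!("AbcdeF".get(1..=1), Some("b"));
        assert_eq!("€300".get(1..=1), None); // byte 1 is inside the 3-byte '€'
    }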