Comment by xigoi

Comment by xigoi a day ago

I prefer languages where strings are simply sequences of bytes and you get to decide how to interpret them.

zahlman a day ago

Such languages do not have strings. Definitionally a string is a sequence of characters, and more than 256 characters exist. A byte sequence is just an encoding; if you are working with that encoding directly and have to do the interpretation yourself, you are not using a string.

But if you do want a sequence of bytes for whatever reason, you can trivially obtain that in any version of Python.
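
As a rough illustration of that point (a minimal sketch, assuming Python 3; the sample text and temporary file are made up for the example), a byte sequence can come from encoding a `str` explicitly or from reading a file in binary mode:

```python
import os
import tempfile

text = "naïve café"

# str -> bytes: the encoding is always chosen explicitly.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Same characters, different byte sequences.
char_count = len(text)          # counts code points
utf8_len = len(utf8_bytes)      # counts bytes; ï and é take two each
latin1_len = len(latin1_bytes)  # one byte per character here

# A file opened in binary mode hands back bytes with no decoding step.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(utf8_bytes)
    path = tmp.name
with open(path, "rb") as fh:
    raw = fh.read()
os.remove(path)
```

Nothing is decoded unless `.decode()` is called, so the interpretation of `raw` stays up to the caller.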

capitainenemo a day ago

My personal experience with python3 (and repeated interactions with about a dozen Python programmers, including core contributors) is that python3 does not let you trivially work with streams of bytes, especially if you need to do character set conversions: a tiny python2 script that I have used for decades to convert character streams in terminals has repeatedly proved unportable to python3. The last attempt was much larger and still failed; they thought they could probably do it, but it would require far more code and was not worth their effort.
I'll probably just use rust for that script if python2 ever gets dropped by my distro. Reminds me of https://gregoryszorc.com/blog/2020/01/13/mercurial%27s-journ...

- zahlman a day ago
  
  > a tiny python2 script that I have used for decades to convert character streams in terminals has repeatedly proved unportable to python3.
  Show me.
  
  capitainenemo 21 hours ago
  
  Heh. It always starts this way... then they confidently send me something that breaks on testing it, then half a dozen more iterations, then "python2 is doing the wrong thing" or, "I could get this working but it isn't worth the effort" but sure, let's do this one more time. Could be they were all missing something obvious - wouldn't know, I avoid python personally, apart from when necessary like with LLM glue. https://pastebin.com/j4Lzb5q1
  This is a script created by someone on #nethack a long time ago. It works great with other things as well like old BBS games. It was intended to transparently rewrite single byte encodings to multibyte with an optional conversion array.
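
To make the "single-byte encoding to multibyte, with an optional conversion array" idea concrete, here is a minimal Python 3 sketch. It is not the linked script and makes no attempt at whatever terminal handling that script does; `make_transcoder`, the `cp437` default, and the `overrides` mapping are all hypothetical names and choices for illustration.

```python
import codecs


def make_transcoder(source_encoding="cp437", overrides=None):
    """Return a function that maps raw input chunks to UTF-8 bytes.

    overrides: optional {decoded_char: replacement_str} applied after
    decoding (a stand-in for the "conversion array" idea).
    """
    decoder = codecs.getincrementaldecoder(source_encoding)(errors="replace")
    table = {ord(src): dst for src, dst in (overrides or {}).items()}

    def transcode(chunk: bytes) -> bytes:
        # bytes in, bytes out; str only exists in between
        return decoder.decode(chunk).translate(table).encode("utf-8")

    return transcode


# Hypothetical usage: map the CP437 medium-shade block to '#'.
transcode = make_transcoder(overrides={"\u2592": "#"})
out = transcode(b"\xb0\xb1\xb2 ok")
```

In a real filter the chunks would presumably come from something like `sys.stdin.buffer` and go to `sys.stdout.buffer`, which both deal in bytes rather than `str`.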
  
afiori a day ago

I would like a UTF-8-optimized bag of bytes where arbitrary byte operations are possible but the buffer keeps track of whether it is valid UTF-8 or not (for every edit of n bytes it should be enough to check about n+8 bytes to validate). Then UTF-8 encoding/decoding becomes a no-op, and UTF-8-specific APIs can quickly check whether the string is malformed.
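
A rough sketch of what such a buffer's interface might look like in Python. `TrackedBuffer` and its methods are hypothetical names, and for brevity this version revalidates the whole buffer lazily after an edit rather than implementing the localized "n+8 bytes" check described above.

```python
class TrackedBuffer:
    """Byte buffer that remembers whether its contents are valid UTF-8."""

    def __init__(self, data: bytes = b""):
        self._buf = bytearray(data)
        self._valid = None  # None means "unknown, recheck on demand"

    def splice(self, start: int, end: int, replacement: bytes) -> None:
        """Arbitrary byte edit; any edit drops the cached answer."""
        self._buf[start:end] = replacement
        self._valid = None

    @property
    def is_valid_utf8(self) -> bool:
        # Simplification: full revalidation, cached until the next edit.
        if self._valid is None:
            try:
                self._buf.decode("utf-8")
                self._valid = True
            except UnicodeDecodeError:
                self._valid = False
        return self._valid

    def to_bytes(self) -> bytes:
        return bytes(self._buf)


buf = TrackedBuffer("héllo".encode("utf-8"))
valid_at_start = buf.is_valid_utf8

buf.splice(1, 2, b"")          # drop the lead byte of 'é'
valid_after_cut = buf.is_valid_utf8

buf.splice(1, 2, b"e")         # replace the orphaned continuation byte
valid_after_fix = buf.is_valid_utf8
```

Byte-level edits are never rejected; only the validity flag changes, so UTF-8-specific APIs can consult it cheaply.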

account42 a day ago

But why care if it's malformed UTF-8? And specifically, what do you want to happen when you get a malformed UTF-8 string? Keep in mind that UTF-8 is self-synchronizing, so even if you encode strings into a larger text-based format without verifying them, it will still be possible to decode the document. As a user I normally want my programs to pass on the string without mangling it further. Some tool throwing fatal errors because some string I don't actually care about contains an invalid UTF-8 byte sequence is the last thing I want. With strings being an arbitrary bag of bytes, many programs can support arbitrary encodings, or at least arbitrary ASCII supersets, without any additional effort.
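
Both behaviours can be seen in a short Python sketch (the sample bytes are made up): with `errors="replace"` the damage stays local because decoding resynchronizes after the bad bytes, and with `errors="surrogateescape"` the bad bytes survive a decode/encode round trip unchanged.

```python
raw = b"ok \xff\xfe caf\xc3\xa9"   # two invalid bytes, then valid UTF-8

# Damage stays local: everything after the bad bytes still decodes.
lossy = raw.decode("utf-8", errors="replace")

# Pass-through: undecodable bytes become lone surrogates in the str
# and are restored exactly when encoding with the same handler.
text = raw.decode("utf-8", errors="surrogateescape")
round_tripped = text.encode("utf-8", errors="surrogateescape")
```

A tool that only needs to shuffle such strings around can use the second form and never raise on malformed input.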

- afiori a day ago
  
  The main issue I can see is not garbage bytes in text but mixing of incompatible encodings, e.g. splicing latin-1 bytes into a utf-8 string.
  My understanding of the current "always and only utf-8/unicode" zeitgeist is that it comes mostly from encoding issues, among them the complexity of detecting encodings.
  I think that the current status quo is better than what came before, but I also think it could be improved.
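
A small Python sketch of the splicing problem (sample strings invented for the example): the mixed buffer is rejected as UTF-8, yet decodes "successfully" as latin-1 with the UTF-8 part turned into mojibake, which is why detection is hard.

```python
utf8_part = "naïve ".encode("utf-8")
latin1_part = "café".encode("latin-1")
mixed = utf8_part + latin1_part      # two encodings in one byte string

try:
    mixed.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False          # the lone 0xE9 is not valid UTF-8

# latin-1 accepts any byte, so this never fails; it just garbles 'ï'.
as_latin1 = mixed.decode("latin-1")
```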
  

bawolff a day ago

Me too.

The languages that I really don't get are those that force valid UTF-8 everywhere but don't enforce NFC, which is most of them, but it seems like the worst of both worlds.

Non-normalized Unicode is just as problematic as non-validated Unicode, imo.
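
For example, in Python (a minimal sketch using the standard `unicodedata` module), two strings can both be perfectly valid UTF-8, render identically, and still compare unequal until they are normalized:

```python
import unicodedata

composed = "\u00e9"       # 'é' as a single code point
decomposed = "e\u0301"    # 'e' followed by a combining acute accent

# Both encode to valid UTF-8, but to different bytes.
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

equal_before = composed == decomposed
equal_after = unicodedata.normalize("NFC", decomposed) == composed
```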

jibal a day ago

Python has byte arrays that allow for that, in addition to strings consisting of arrays of Unicode code points.

account42 a day ago

Yes, I always roll my eyes when people complain that C strings or C++'s std::string/string_view don't have Unicode support. They are bags of bytes with support for concatenation. Any other transformation isn't going to have a "correct" way to do it so you need to be aware of what you want anyway.

astrange a day ago

C strings are not bags of bytes because they can't contain 0x00.
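
The truncation is easy to see from Python via `ctypes` (a small sketch; the payload is made up): the buffer holds every byte, but reading it as a NUL-terminated C string stops at the first `0x00`.

```python
import ctypes

payload = b"ab\x00cd"
buf = ctypes.create_string_buffer(payload)   # adds a trailing NUL

as_c_string = buf.value   # read up to the first NUL, like strlen/strcpy
all_bytes = buf.raw       # the full underlying memory
```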
