Comment by xg15 a day ago

It gets more complicated if you do substring operations.

If I do s.charAt(x) or s.codePointAt(x) or s.substring(x, y), I'd like to know which values for x and y are valid and which aren't.

arcticbull a day ago

Substring operations (and more generally the universe of operations where there is more than one string involved) are a whole other kettle of fish. Unicode, being a byte code format more than what you think of as a logical 'string' format, has multiple ways of representing the same strings.

If you take a substring of a(bc) and compare it to string (bc) are you looking for bitwise equivalence or logical equivalence? If the former it's a bit easier (you can just memcmp) but if the latter you have to perform a normalization to one of the canonical forms.
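
A minimal Python sketch of the two kinds of comparison. It assumes NFC as the normalization form (the comment above does not name one), and the sample strings and the helper name `canonically_equal` are made up for illustration:

```python
import unicodedata

# Two spellings of the same text (illustrative sample data):
composed = "caf\u00e9"      # 'é' as a single precomposed code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)

# "Bitwise" comparison: the underlying sequences differ, so this is False.
bitwise_equal = composed == decomposed


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare after normalizing both sides to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# "Logical" comparison: equal once both sides are in the same canonical form.
logical_equal = canonically_equal(composed, decomposed)

# The same applies to a substring compared against an independent string.
substring_equal = canonically_equal(decomposed[3:], "\u00e9")
```

Normalizing on every comparison has a cost, which is presumably why the plain comparison is the easier path when bitwise equality is all you need.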

jibal a day ago

"Unicode, being a byte code format"
UTF-8 is a byte code format; Unicode is not. In Python, where all strings are arrays of Unicode code points, substrings are likewise arrays of Unicode code points.
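
A quick sketch of what that looks like in Python 3, as far as I understand it (the sample string is made up for illustration):

```python
# 'ï' written as 'i' + combining diaeresis (U+0308), then an emoji outside the BMP.
s = "nai\u0308ve \U0001F600"

# len() and indexing count code points, not bytes or UTF-16 code units.
code_points = len(s)
utf8_bytes = len(s.encode("utf-8"))
utf16_units = len(s.encode("utf-16-le")) // 2

# Indexing returns a one-code-point string, even for a non-BMP code point.
emoji = s[7]

# A slice is again a string of code points. Slicing does not look at
# combining sequences, so this cut separates 'i' from its diaeresis.
head = s[:3]
tail_starts_with_combining_mark = s[3:].startswith("\u0308")
```

The encoded lengths differ from `len(s)` because UTF-8 and UTF-16 are encodings of the code point sequence, not the sequence itself.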

- zahlman a day ago
  
  The point is that not all sequences of characters ("code point" means the integer value, whereas "character" means the thing that number represents) are valid.
  
  jibal 20 hours ago
  
  non sequitur ... I simply pointed out a mistaken claim and your comment is about something quite different.
  (Also that's not what "character" means in the Unicode framework--some code points correspond to characters and some don't.)
  P.S. Everything about the response to this comment is wrong, especially the absurd baseless claim that I misunderstood the claim that I quoted and corrected (that's the only claim I responded to).
  
  zahlman 19 hours ago
  
  > I simply pointed out a mistaken claim and your comment is about something quite different.
  My comment explains that you have misunderstood what the claim is. "Byte code format" was nonsensical (Unicode is not interpreted by a VM), but the point that comment was trying to make (as I understood it) is that not all subsequences of a valid sequence of (assigned) code points are valid.
  > Also that's not what "character" means in the Unicode framework--some code points correspond to characters and some don't.
  My definition does not contradict that. A code point is an integer in the Unicode code space which may correspond to a character. When it does, "character" trivially means the thing that the code point corresponds to, i.e., represents, as I said.
  
setr a day ago

I’m fairly positive the answer is trivially logical equivalence for pretty much any substring operation. I can’t imagine bitwise equivalence ever being the “normal” use case, except for an implementer who sees it as the simpler/faster operation.
I feel like if you’re looking for bitwise equivalence or similar, you should have to cast to some kind of byte array and use the corresponding operations there.

- arcticbull a day ago
  
  Yep, for a substring against its parent or other substrings of the same parent that’s definitely true, but I think this question generalizes, because the case where you’re comparing strings solely within themselves is an optimization path for the more general case. I’m just thinking out loud.
  
account42 a day ago

> s.charAt(x) or s.codePointAt(x)

Neither of these is really useful unless you are implementing a font renderer or a low-level Unicode algorithm, and even then you usually only want the next code point rather than the one at an arbitrary position.

mseepgood a day ago

The values for x and y shouldn't come from your brain, though (with the exception of 0). They should come from previous index operations like s.indexOf(...) or s.search(regex), etc.
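
A rough sketch of that pattern in Python, using `str.find` and `re.search` as stand-ins for `indexOf` and `search` (the input string and variable names are made up for illustration):

```python
import re

s = "name=Zo\u00eb; id=42"   # illustrative input containing a non-ASCII character

# Offsets come from search operations on the same string, not from guesses.
eq = s.find("=")
semi = s.find(";", eq + 1)
name = s[eq + 1:semi] if eq != -1 and semi != -1 else None

# Same idea with a regex: the match object supplies valid start/end offsets.
m = re.search(r"\d+", s)
digits = s[m.start():m.end()] if m else None
```

Because the offsets are produced and consumed by the same string type, they stay consistent whatever unit that type happens to count in.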

xg15 a day ago

Indeed. Or s.length, whatever that represents.
