Comment by jibal a day ago

"Unicode, being a byte code format"

UTF-8 is a byte code format; Unicode is not. In Python, where all strings are arrays of Unicode code points, substrings are likewise arrays of Unicode code points.
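
A minimal sketch of that distinction in Python 3, offered as an illustration only: the sample string and the helper name `describe` are invented here. It shows that slicing a `str` counts code points, while slicing its UTF-8 encoding counts bytes and can cut through a multi-byte sequence.

```python
def describe(text: str) -> dict:
    """Summarize a str as code points and as UTF-8 bytes (hypothetical helper)."""
    encoded = text.encode("utf-8")
    return {
        "code_points": [f"U+{ord(c):04X}" for c in text],
        "length_in_code_points": len(text),
        "length_in_utf8_bytes": len(encoded),
    }

s = "na\u00efve"           # "naive" with U+00EF (i with diaeresis) as one code point
sub = s[1:3]               # slice indices count code points: 'a' and U+00EF
info = describe(sub)

# The same idea applied to the UTF-8 bytes can split the two-byte
# encoding of U+00EF (C3 AF), leaving bytes that no longer decode.
raw = s.encode("utf-8")
try:
    raw[:3].decode("utf-8")
    byte_slice_decodes = True
except UnicodeDecodeError:
    byte_slice_decodes = False

print(info, byte_slice_decodes)
```

The substring is itself a `str`, that is, another sequence of code points; only the byte-level slice runs into an encoding problem.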

zahlman a day ago

The point is that not all sequences of characters ("code point" means the integer value, whereas "character" means the thing that number represents) are valid.

  • jibal 19 hours ago

    Non sequitur. I simply pointed out a mistaken claim and your comment is about something quite different.

    (Also that's not what "character" means in the Unicode framework--some code points correspond to characters and some don't.)

    P.S. Everything about the response to this comment is wrong, especially the absurd baseless claim that I misunderstood the claim that I quoted and corrected (that's the only claim I responded to).

    • zahlman 19 hours ago

      > I simply pointed out a mistaken claim and your comment is about something quite different.

      My comment explains that you have misunderstood what the claim is. "Byte code format" was nonsensical (Unicode is not interpreted by a VM), but the point that comment was trying to make (as I understood it) is that not all subsequences of a valid sequence of (assigned) code points are valid.

      > Also that's not what "character" means in the Unicode framework--some code points correspond to characters and some don't.

      My definition does not contradict that. A code point is an integer in the Unicode code space which may correspond to a character. When it does, "character" trivially means the thing that the code point corresponds to, i.e., represents, as I said.
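
Both points in this exchange can be checked with Python's standard `unicodedata` module. This is a rough sketch under stated assumptions: the helper name `classify` and the sample code points are chosen here for illustration, and "valid" is read loosely as "forms a complete combining sequence", which is only one possible reading of the comment above.

```python
import unicodedata

def classify(cp: int) -> str:
    """Return the General_Category of an integer code point (hypothetical helper)."""
    return unicodedata.category(chr(cp))

# Some code points correspond to characters and some do not:
# U+0041 is an assigned letter, U+D800 is a surrogate code point,
# and U+FFFF is a noncharacter (reported as unassigned, "Cn").
for cp in (0x0041, 0xD800, 0xFFFF):
    print(f"U+{cp:04X}", classify(cp), unicodedata.name(chr(cp), "<no name>"))

# A subsequence of a sensible code point sequence need not be sensible
# on its own: slicing off the base letter leaves a lone combining mark.
decomposed = "e\u0301"            # 'e' followed by COMBINING ACUTE ACCENT
tail = decomposed[1:]             # still a str, but only the combining mark
composed = unicodedata.normalize("NFC", decomposed)
print(unicodedata.name(tail), len(decomposed), len(composed))
```

Python accepts all of these as `str` values, so any notion of an invalid subsequence here comes from Unicode-level rules rather than from the string type itself.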