Comment by arcticbull a day ago

Taking this one step further -- there's no such thing as the context-free length of a string.

Strings should be thought of more like opaque blobs, and you should derive their length exclusively in the context in which you intend to use it. It's an API anti-pattern to have a context-free length property associated with a string because it implies something about the receiver that just isn't true for all relevant usages and leads you to make incorrect assumptions about the result.

Refining your list, the things you usually want are:

- Number of bytes in a given encoding when saving or transmitting (edit: or more generally, when serializing).

- Number of code points when parsing.

- Number of grapheme clusters for advancing the cursor back and forth when editing.

- Bounding box in pixels or points for display with a given font.

Context-free length is something we inherited from ASCII where almost all of these happened to be the same, but that's not the case anymore. Unicode is better thought of as compiled bytecode than something you can or should intuit anything about.

It's like asking "what's the size of this JPEG." Answer is it depends, what are you trying to do?
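
To make the divergence concrete: a minimal sketch in Rust of the first three measures on the same string (assuming the third-party unicode-segmentation crate for the grapheme count; the bounding box would additionally need a font and a shaping engine):

  use unicode_segmentation::UnicodeSegmentation; // third-party crate

  fn main() {
      let s = "e\u{0301}"; // one "é", written as 'e' + combining acute accent
      println!("{}", s.len());                   // 3 UTF-8 bytes
      println!("{}", s.chars().count());         // 2 code points
      println!("{}", s.graphemes(true).count()); // 1 grapheme cluster
  }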

ramses0 a day ago

"Unicode is JPG for ASCII" is an incredibly great metaphor.

size(JPG) == bytes? sectors? colors? width? height? pixels? inches? dpi?

account42 a day ago

> Number of code points when parsing.

You shouldn't really ever care about the number of code points. If you do, you're probably doing something wrong.

  • josephg a day ago

    It’s a bit of a niche use case, but I use the codepoint counts in CRDTs for collaborative text editing.

    Grapheme cluster counts can’t be used because they’re unstable across Unicode versions. Some algorithms use UTF8 byte offsets - but I think that’s a mistake because they make input validation much more complicated.

    Using byte offsets, there’s a whole lot of invalid states you can represent easily. Eg maybe insert “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint. Then inserting at position 2 is valid again.

    If you send me an operation which happened at some earlier point in time, I don’t necessarily have the text document you were inserting into handy. So figuring out if your insertion (and deletion!) positions are valid at all is a very complex and expensive operation.

    Codepoints are way easier. I can just accept any integer up to the length of the document at that point in time.
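
    Something like this in Rust (apply_insert is a hypothetical helper for illustration, not from any particular CRDT library): the whole validity check is a range check, and the code point index only becomes a byte offset at splice time.

      // hypothetical helper; a sketch, not a full CRDT operation
      fn apply_insert(doc: &mut String, cp_index: usize, text: &str) -> Result<(), &'static str> {
          // the entire validity check: any integer up to the code point
          // count of the document at that version is a legal position
          if cp_index > doc.chars().count() {
              return Err("insert position out of range");
          }
          // translate the code point index to a byte offset for the splice;
          // the result is always a valid char boundary
          let byte_index = doc
              .char_indices()
              .nth(cp_index)
              .map(|(i, _)| i)
              .unwrap_or(doc.len());
          doc.insert_str(byte_index, text);
          Ok(())
      }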

    • account42 a day ago

      > Eg maybe insert “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint.

      You have the same problem with code points, it's just hidden better. Inserting "a" between U+0065 and U+0308 may result in a "valid" string but is still as nonsensical as inserting "a" between UTF-8 bytes 0xC3 and 0xAB.

      This makes code points less suitable than UTF-8 bytes as mistakes are more likely to not be caught during development.
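
      A few lines of Rust showing that splice: every intermediate string is valid UTF-8, yet the combining mark ends up attached to the wrong letter.

        fn main() {
            let s = "e\u{0308}";  // "ë": 'e' followed by U+0308 combining diaeresis
            let mut t = String::new();
            t.push_str(&s[..1]);  // the 'e' (one byte, a valid char boundary)
            t.push('a');
            t.push_str(&s[1..]);  // the combining diaeresis
            println!("{t}");      // "ea\u{0308}" renders as "eä" -- still "valid", still wrong
        }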

      • josephg a day ago

        I hear your point, but invalid codepoint sequences are way less of a problem than strings with invalid UTF8. Text rendering engines deal with weird Unicode just fine. They have to since Unicode changes over time. Invalid UTF8 on the other hand is completely unrepresentable in most languages. I mean, unless you use raw byte arrays and convert to strings at the edge, but that’s a terrible design.

        > This makes code points less suitable than UTF-8 bytes as mistakes are more likely to not be caught during development.

        Disagree. Allowing 2 kinds of bugs to slip through to runtime doesn’t make your system more resilient than allowing 1 kind of bug. If you’re worried about errors like this, checksums are a much better idea than letting your database become corrupted.

  • torstenvl 21 hours ago

    I really wish people would stop giving this bad advice, especially so stridently.

    Like it or not, code points are how Unicode works. Telling people to ignore code points is telling people to ignore how data works. It's of the same philosophy that results in abstraction built on abstraction built on abstraction, with no understanding.

    I vehemently dissent from this view.

    • shiomiru 8 hours ago

      > Telling people to ignore code points

      Nobody is saying that; the point is that if you're parsing Unicode by counting codepoints, you're doing it wrong. The way you actually parse Unicode text (in 99% of cases) is by iterating through the codepoints, and then the actual count is fairly irrelevant: it's just a stream.

      Other uses of codepoint length are also questionable: for measurement it's useless, for bounds checking (random access) it's inefficient. It may be useful in some edge cases, but TFA's point is that a general purpose language's default string type shouldn't optimize for edge cases.
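
      To illustrate the stream idea, a toy Rust scanner that walks the code points without ever asking for their count:

        // toy word counter: iterates the code point stream lazily;
        // the total number of code points is never needed
        fn count_words(s: &str) -> usize {
            let mut words = 0;
            let mut in_word = false;
            for c in s.chars() {
                if c.is_whitespace() {
                    in_word = false;
                } else if !in_word {
                    in_word = true;
                    words += 1;
                }
            }
            words
        }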

    • dcrazy 20 hours ago

      You’re arguing against a strawman. The advice wasn’t to ignore learning about code points; it’s that if your solution to a problem involves reasoning about code points, you’re probably doing it wrong and are likely to make a mistake.

      Trying to handle code points as atomic units fails even in trivial and extremely common cases like diacritics, before you even get to more complicated situations like emoji variants. Solving pretty much any real-world problem involving a Unicode string requires factoring in canonical forms, equivalence classes, collation, and even locale. Many problems can’t even be solved at the _character_ (grapheme) level—text selection, for example, has to be handled at the grapheme _cluster_ level. And even then you need a rich understanding of those graphemes to know whether to break them apart for selection (ligatures like fi) or keep them intact (Hangul jamo).

      Yes, people should learn about code points. Including why they aren’t the level they should be interacting with strings at.
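
      A small Rust sketch of the canonical-forms point (assuming the third-party unicode-normalization crate): the two spellings of "é" differ as code point sequences and only compare equal after normalization.

        use unicode_normalization::UnicodeNormalization; // third-party crate

        fn main() {
            let precomposed = "\u{00E9}";  // "é" as a single code point
            let decomposed  = "e\u{0301}"; // "é" as 'e' + combining acute accent
            assert_ne!(precomposed, decomposed); // unequal as code point sequences
            let a: String = precomposed.nfc().collect();
            let b: String = decomposed.nfc().collect();
            assert_eq!(a, b); // canonically equivalent after NFC normalization
        }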

      • torstenvl 17 hours ago

        > You’re arguing against a strawman.

        Ironic.

        > The advice wasn’t to ignore learning about code points

        I didn't say "learning about."

        Look man. People operate at different levels of abstraction, depending on what they're doing.

        If you're doing front-end web dev, sure, don't worry about it. If you're hacking on a text editor in C, then you probably ought to be able to take a string of UTF-8 bytes, decode them into code points, and apply the grapheme clustering algorithm to them, taking into account your heuristics about what the terminal supports. Then you'll probably either print them to the screen (if it seems like they're supported) or print out a representation of the code points. So yeah, you kind of have to know.

        So don't sit there and presume to tell others what they should or should not reason about, based solely on what you assume their use case is.
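
        For what it's worth, that pipeline is short to sketch in Rust (assuming the third-party unicode-segmentation crate; the terminal-support heuristics are left out):

          use unicode_segmentation::UnicodeSegmentation; // third-party crate

          fn main() {
              // raw UTF-8 bytes off the wire: 'e', U+0308, U+1F44D
              let bytes: &[u8] = &[0x65, 0xCC, 0x88, 0xF0, 0x9F, 0x91, 0x8D];
              // decode bytes -> code points (lossy: bad sequences become U+FFFD)
              let text = String::from_utf8_lossy(bytes);
              for c in text.chars() {
                  print!("U+{:04X} ", c as u32); // the decoded code points
              }
              println!();
              // cluster code points -> graphemes, the unit a cursor moves over
              for g in text.graphemes(true) {
                  println!("cluster: {g:?}");
              }
          }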

      • [removed] 19 hours ago
        [deleted]
    • eviks 12 hours ago

      > Telling people to ignore code points is telling people to ignore how data works.

      No, it's telling people that they don't understand how data works; otherwise they'd be using a different unit of measurement.

  • [removed] a day ago
    [deleted]