josephg a day ago

It’s a bit of a niche use case, but I use codepoint counts in CRDTs for collaborative text editing.

Grapheme cluster counts can’t be used because they’re unstable across Unicode versions. Some algorithms use UTF-8 byte offsets - but I think that’s a mistake, because they make input validation much more complicated. Using byte offsets, there are a whole lot of invalid states you can represent easily. Eg maybe insert “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint. Then inserting at position 2 is valid again. If you send me an operation which happened at some earlier point in time, I don’t necessarily have the text document you were inserting into handy. So figuring out whether your insertion (and deletion!) positions are valid at all is a very complex and expensive operation.

Codepoints are way easier. I can just accept any integer up to the length of the document at that point in time.
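
A rough sketch of the difference (Rust here purely for illustration; the helper names are made up, not from any particular CRDT library):

    // Validating an insert position given as a UTF-8 byte offset needs the
    // document text as it existed at that point in time.
    fn byte_offset_ok(doc_at_that_time: &str, pos: usize) -> bool {
        pos <= doc_at_that_time.len() && doc_at_that_time.is_char_boundary(pos)
    }

    // Validating a codepoint index only needs the document's length
    // (in codepoints) at that point in time.
    fn codepoint_index_ok(len_in_codepoints: usize, pos: usize) -> bool {
        pos <= len_in_codepoints
    }

    fn main() {
        let doc = "é!";                       // 'é' is 2 bytes, 1 codepoint
        assert!(byte_offset_ok(doc, 0));
        assert!(!byte_offset_ok(doc, 1));     // middle of 'é' - needs the text to detect
        assert!(byte_offset_ok(doc, 2));      // valid again

        let len = doc.chars().count();        // 2 codepoints
        assert!((0..=len).all(|pos| codepoint_index_ok(len, pos)));
    }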

  • account42 a day ago

    > Eg maybe insert “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint.

    You have the same problem with code points, it's just hidden better. Inserting "a" between U+0065 and U+0308 may result in a "valid" string but is still as nonsensical as inserting "a" between UTF-8 bytes 0xC3 and 0xAB.
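
    For illustration, a small sketch of that failure mode (Rust; nothing here is specific to any library):

        fn main() {
            // U+0065 'e' followed by U+0308 (combining diaeresis) renders as "ë".
            let s = "e\u{0308}";

            // Insert 'a' at codepoint index 1: the index is "valid", but it lands
            // between the base letter and its combining mark.
            let mut cps: Vec<char> = s.chars().collect();
            cps.insert(1, 'a');
            let edited: String = cps.into_iter().collect();

            // Still well-formed UTF-8, but the diaeresis now attaches to the 'a'.
            assert_eq!(edited, "ea\u{0308}");
        }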

    This makes code points less suitable than UTF-8 bytes, as mistakes are more likely to go uncaught during development.

    • josephg a day ago

      I hear your point, but invalid codepoint sequences are way less of a problem than strings with invalid UTF-8. Text rendering engines deal with weird Unicode just fine - they have to, since Unicode changes over time. Invalid UTF-8, on the other hand, is completely unrepresentable in most languages. I mean, unless you use raw byte arrays and convert to strings at the edge, but that’s a terrible design.
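
      For instance, a quick sketch of the asymmetry (Rust, illustrative only):

          fn main() {
              // Invalid UTF-8 can't even become a String - it's rejected at the edge.
              let bytes = vec![0xC3];                       // a lone UTF-8 lead byte
              assert!(String::from_utf8(bytes).is_err());

              // An odd-but-valid codepoint sequence is just an ordinary String.
              let weird = "a\u{0308}\u{0301}".to_string();  // stacked combining marks
              assert_eq!(weird.chars().count(), 3);
          }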

      > This makes code points less suitable than UTF-8 bytes, as mistakes are more likely to go uncaught during development.

      Disagree. Allowing 2 kinds of bugs to slip through to runtime doesn’t make your system more resilient than allowing 1 kind of bug. If you’re worried about errors like this, checksums are a much better idea than letting your database become corrupted.

torstenvl a day ago

I really wish people would stop giving this bad advice, especially so stridently.

Like it or not, code points are how Unicode works. Telling people to ignore code points is telling people to ignore how data works. It's of the same philosophy that results in abstraction built on abstraction built on abstraction, with no understanding.

I vehemently dissent from this view.

  • shiomiru 9 hours ago

    > Telling people to ignore code points

    Nobody is saying that; the point is that if you're parsing Unicode by counting codepoints, you're doing it wrong. The way you actually parse Unicode text (in 99% of cases) is by iterating through the codepoints, and then the actual count is fairly irrelevant; it's just a stream.

    Other uses of codepoint length are also questionable: for measurement it's useless, for bounds checking (random access) it's inefficient. It may be useful in some edge cases, but TFA's point is that a general-purpose language's default string type shouldn't optimize for edge cases.
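
    A tiny sketch of what "just a stream" means in practice (Rust, illustrative only):

        // Scanning a string by streaming its codepoints; the total count never matters.
        fn first_digit(s: &str) -> Option<char> {
            s.chars().find(|c| c.is_ascii_digit())
        }

        fn main() {
            assert_eq!(first_digit("naïve test №7"), Some('7'));
        }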

  • dcrazy a day ago

    You’re arguing against a strawman. The advice wasn’t to ignore learning about code points; it’s that if your solution to a problem involves reasoning about code points, you’re probably doing it wrong and are likely to make a mistake.

    Trying to handle code points as atomic units fails even in trivial and extremely common cases like diacritics, before you even get to more complicated situations like emoji variants. Solving pretty much any real-world problem involving a Unicode string requires factoring in canonical forms, equivalence classes, collation, and even locale. Many problems can’t even be solved at the _character_ (grapheme) level—text selection, for example, has to be handled at the grapheme _cluster_ level. And even then you need a rich understanding of those graphemes to know whether to break them apart for selection (ligatures like fi) or keep them intact (Hangul jamo).
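
    Even the trivial diacritic case shows it - a quick illustrative sketch (plain Rust, no normalization library involved):

        fn main() {
            let composed = "\u{00E9}";       // "é" as a single code point (NFC)
            let decomposed = "e\u{0301}";    // "é" as base letter + combining acute (NFD)

            // Canonically equivalent text, yet every code-point-level view differs.
            assert_ne!(composed, decomposed);
            assert_ne!(composed.chars().count(), decomposed.chars().count()); // 1 vs 2
        }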

    Yes, people should learn about code points. Including why they aren’t the level they should be interacting with strings at.

    • torstenvl 18 hours ago

      > You’re arguing against a strawman.

      Ironic.

      > The advice wasn’t to ignore learning about code points

      I didn't say "learning about."

      Look man. People operate at different levels of abstraction, depending on what they're doing.

      If you're doing front-end web dev, sure, don't worry about it. If you're hacking on a text editor in C, then you probably ought to be able to take a string of UTF-8 bytes, decode them into code points, and apply the grapheme clustering algorithm to them, taking into account your heuristics about what the terminal supports. And then probably either printing them to the screen (if it seems like they're supported) or printing out a representation of the code points. So yeah, you kind of have to know.

      So don't sit there and presume to tell others what they should or should not reason about, based solely on what you assume their use case is.

  • eviks 14 hours ago

    > Telling people to ignore code points is telling people to ignore how data works.

    No, it's telling people that they don't understand how data works; otherwise they'd be using a different unit of measurement.