Comment by craftkiller a day ago

3 replies

> Notably Rust did the correct thing

In addition to separate string types, they have separate iterator types that let you explicitly get the value you want. So:

  String.len() == number of bytes
  String.bytes().count() == number of bytes
  String.chars().count() == number of unicode scalar values
  String.graphemes(true).count() == number of extended grapheme clusters (requires the unicode-segmentation crate, which is not in the stdlib)
  String.lines().count() == number of lines

Really, my only complaint is that I don't think String.len() should exist; it's too ambiguous. We should have to explicitly state what we want/mean via the iterators.

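To make the list above concrete, here is a minimal sketch using only the standard library. The `Counts` struct and `count_units` function are hypothetical names made up for illustration; the grapheme count is left out because it needs the external unicode-segmentation crate.

```rust
/// Hypothetical helper type: the same string measured in different units.
#[derive(Debug, PartialEq)]
struct Counts {
    bytes: usize,
    scalars: usize,
    lines: usize,
}

/// Measure `s` in UTF-8 bytes, Unicode scalar values, and lines.
fn count_units(s: &str) -> Counts {
    Counts {
        bytes: s.len(),              // UTF-8 bytes, same as s.bytes().count()
        scalars: s.chars().count(),  // Unicode scalar values
        lines: s.lines().count(),    // lines split on \n or \r\n
    }
}

fn main() {
    // "café" and "naïve" on two lines, written with escapes for clarity.
    let s = "caf\u{e9}\nna\u{ef}ve";
    let c = count_units(s);
    // len() and bytes().count() always agree.
    assert_eq!(s.len(), s.bytes().count());
    println!("bytes={} scalars={} lines={}", c.bytes, c.scalars, c.lines);
}
```

The byte and scalar counts differ here because é and ï each take two bytes in UTF-8, which is the ambiguity the comment is pointing at.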
pron a day ago

Similar to Java:

   String.chars().count(), String.codePoints().count(), and, for historical reasons, String.getBytes(StandardCharsets.UTF_8).length

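For comparison, a rough Rust counterpart of those three Java counts might look like the sketch below. Java's `chars()` streams UTF-16 code units, so the closest Rust analogue is `encode_utf16()`. The function names are hypothetical, and the mapping is approximate: a Rust `&str` holds only Unicode scalar values and cannot contain the unpaired surrogates a Java `String` may hold.

```rust
/// Rough analogue of Java's String.chars().count(): UTF-16 code units.
fn utf16_units(s: &str) -> usize {
    s.encode_utf16().count()
}

/// Rough analogue of Java's String.codePoints().count(): scalar values.
fn scalar_values(s: &str) -> usize {
    s.chars().count()
}

/// Analogue of getBytes(StandardCharsets.UTF_8).length: UTF-8 bytes.
fn utf8_bytes(s: &str) -> usize {
    s.len()
}

fn main() {
    // 'a' followed by U+1F600, which lies outside the Basic Multilingual Plane.
    let s = "a\u{1F600}";
    println!(
        "utf16={} scalars={} utf8={}",
        utf16_units(s),
        scalar_values(s),
        utf8_bytes(s)
    );
}
```

A character outside the Basic Multilingual Plane needs a surrogate pair in UTF-16 and four bytes in UTF-8, so all three counts diverge for this input.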
westurner a day ago

  String.graphemes().count()

That's a real nice API. (Similarly, Python has the `@` operator for matmul, but there is no matmul implementation in the stdlib. NumPy provides one, so the `@` operator works on its arrays.)

ugrapheme and ucwidth are one way to get the grapheme count from a string in Python.

It's probably possible to get the grapheme cluster count from a string containing emoji characters with ICU?

  • dhosek a day ago

    Any correctly designed grapheme cluster implementation handles emoji characters. It’s part of the spec (says the guy who wrote a Unicode segmentation library for Rust).
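As a small illustration of why emoji need grapheme-aware handling, the std-only sketch below takes apart a family emoji built as a zero-width-joiner (ZWJ) sequence. The constant and function names are hypothetical. Under the extended grapheme cluster rules in UAX #29, the whole sequence is meant to be a single cluster, so a conforming segmenter (for example `graphemes(true)` from the unicode-segmentation crate) would be expected to count it as 1; that call is not exercised here because it needs the external crate.

```rust
/// Man + ZWJ + woman + ZWJ + girl: rendered as one family emoji.
const FAMILY: &str = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";

/// Number of Unicode scalar values in `s`.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

/// Number of zero-width joiners (U+200D) gluing the sequence together.
fn zwj_count(s: &str) -> usize {
    s.chars().filter(|&c| c == '\u{200D}').count()
}

fn main() {
    // One visible glyph, but several scalar values and many bytes.
    println!(
        "bytes={} scalars={} joiners={}",
        FAMILY.len(),
        scalar_count(FAMILY),
        zwj_count(FAMILY)
    );
}
```

None of the byte, UTF-16, or scalar counts matches what a reader sees as one character, which is why the grapheme count has to come from a segmentation library.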