Comment by bigstrat2003 a day ago

I have never wanted any of the things you said. I have, on the other hand, always wanted the string length. I'm not saying that we shouldn't have methods like what you state, we should! But your statement that people don't actually want string length is untrue because it's overly broad.

zahlman a day ago

> I have, on the other hand, always wanted the string length.

In an environment that supports advanced Unicode features, what exactly do you do with the string length?

  • PapaPalpatine 18 hours ago

    I don’t know about advanced Unicode features… but I use string lengths all the time as a backend developer to validate input data.

    I want to make sure that the password is between a given number of characters. Same with phone numbers, email addresses, etc.

    This seems to have always been known as the length of the string.

    This thread sounds like a bunch of scientists trying to make a simple concept a lot harder to understand.

    • crazygringo an hour ago

      Practically speaking, for maximum lengths, you generally want to limit code points or bytes, not characters. You don't want to allow some ZALGO monstrosity that is 5 characters but 500 bytes.

      For exact lengths, you often have a restricted character set (like for phone numbers) and can validate both characters and length with a regex. Or the length in bytes works for 0–9.

      Unless you're involved in text layout, you actually usually don't wind up needing the exact length in characters of arbitrary UTF-8 text.
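To make the suggestion above concrete, here is a minimal Python sketch of that kind of validation. The limits, the ten-digit phone format, and the function names are all hypothetical, chosen only for illustration; a real system would pick its own rules.

```python
import re

# Hypothetical limits, chosen only for illustration.
MAX_PASSWORD_CODE_POINTS = 64
MAX_PASSWORD_BYTES = 72

# Hypothetical phone format: exactly ten ASCII digits.
# [0-9] is used instead of \d because \d also matches non-ASCII digits
# when applied to Python str patterns.
PHONE_RE = re.compile(r"[0-9]{10}")


def password_length_ok(password: str) -> bool:
    """Cap both the code point count and the UTF-8 byte count."""
    return (
        len(password) <= MAX_PASSWORD_CODE_POINTS
        and len(password.encode("utf-8")) <= MAX_PASSWORD_BYTES
    )


def phone_ok(phone: str) -> bool:
    """Validate character set and exact length in a single regex."""
    return PHONE_RE.fullmatch(phone) is not None


# Five visible letters, each followed by 49 combining acute accents.
zalgo = "".join(ch + "\u0301" * 49 for ch in "hello")
```

Here `zalgo` looks like five characters on screen but is 250 code points and 495 UTF-8 bytes, so `password_length_ok(zalgo)` rejects it, while an ordinary short password passes.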

    • int_19h 17 hours ago

      If you restrict the input to ASCII, then it makes sense to talk about "string length" in this manner. But we're not talking about Unicode strings at all then.

      If you do allow Unicode characters in whatever it is you're validating, then your approach is almost certainly wrong for some valid input.

    • zahlman 18 hours ago

      > I want to make sure that the password is between a given number of characters. Same with phone numbers, email addresses, etc.

      > This seems to have always been known as the length of the string.

      Sure. And by this definition, the string discussed in TFA (that consists of a facepalm emoji with a skin tone set) objectively has 5 characters in it, and therefore a length of 5. And it has always had 5 characters in it, since it was first possible to create such a string.

      Similarly, "é" has one character in it, but "é" has two despite appearing visually identical. Furthermore, those two strings will not compare equal in any sane programming language without explicit normalization (unless HN's software has normalized them already). If you allow passwords or email addresses to contain things like this, then you have to reckon with that brute fact.

      None of this is new. These things have fundamentally been true since the introduction of Unicode in 1991.
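The "é" comparison above can be checked with Python's standard `unicodedata` module. This is only a sketch: the two strings are written with escapes so that no software in between can normalize them, and the helper `same_text` is a hypothetical name, not something from the thread.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point (U+00E9)
decomposed = "e\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

# Python's len() counts code points.
composed_len = len(composed)        # one code point
decomposed_len = len(decomposed)    # two code points

# Without normalization the two strings are different values.
raw_equal = composed == decomposed


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


normalized_equal = same_text(composed, decomposed)
```

`raw_equal` is `False` and `normalized_equal` is `True`, which is the point being made: anything that stores or compares such input has to choose a normalization form explicitly.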

      • kiitos 17 hours ago

        "character" is not a well defined concept in the context of this discussion

        do you mean "byte"? or "rune"?

wredcoll 19 hours ago

Which length? Bytes? Code points? Graphemes? Pixels?
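As a rough illustration of how far those answers diverge, here is a Python sketch that measures one string three ways. It assumes the facepalm emoji discussed in the thread is the sequence U+1F926 U+1F3FC U+200D U+2642 U+FE0F; the function name `lengths` is made up for this example. Grapheme counting is left out because the Python standard library has no grapheme segmenter, and pixel width depends on the font and renderer.

```python
def lengths(s: str) -> dict:
    """Return several different 'lengths' of the same string."""
    return {
        "code_points": len(s),                           # what Python's len() reports
        "utf8_bytes": len(s.encode("utf-8")),            # storage size in UTF-8
        "utf16_units": len(s.encode("utf-16-le")) // 2,  # what UTF-16 based languages report
    }


# Assumed sequence: face palm, skin tone modifier, zero width joiner,
# male sign, variation selector-16. It renders as one emoji.
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

result = lengths(facepalm)
# result == {"code_points": 5, "utf8_bytes": 17, "utf16_units": 7}
```

A single visible emoji gives 5, 17, or 7 depending on the unit, so "the length" only means something once the unit is named.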

justsomehnguy 17 hours ago

Guessing from the other comments you missed the byte length for the codepoints.

When I'm comparing the human-readable strings I want the length. In all other cases I want sizeof(string) and it's... quite a variable thing.