Comment by DavidPiper a day ago

I think that string length is one of those things that people (including me) don't realise they never actually want. In a production system, I have never actually wanted string length. I have wanted:

- Number of bytes this will be stored as in the DB

- Number of monospaced font character blocks this string will take up on the screen

- Number of bytes that are actually being stored in memory

"String length" is just a proxy for something else, and whenever I'm thinking shallowly enough to want it (small scripts, mostly-ASCII, mostly-English, mostly-obvious failure modes, etc) I like grapheme cluster being the sensible default thing that people probably expect, on average.

arcticbull a day ago

Taking this one step further -- there's no such thing as the context-free length of a string.

Strings should be thought of more like opaque blobs, and you should derive their length exclusively in the context in which you intend to use it. It's an API anti-pattern to have a context-free length property associated with a string because it implies something about the receiver that just isn't true for all relevant usages and leads you to make incorrect assumptions about the result.

Refining your list, the things you usually want are:

- Number of bytes in a given encoding when saving or transmitting (edit: or more generally, when serializing).

- Number of code points when parsing.

- Number of grapheme clusters for advancing the cursor back and forth when editing.

- Bounding box in pixels or points for display with a given font.

Context-free length is something we inherited from ASCII where almost all of these happened to be the same, but that's not the case anymore. Unicode is better thought of as compiled bytecode than something you can or should intuit anything about.

It's like asking "what's the size of this JPEG." Answer is it depends, what are you trying to do?
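
A rough sketch of how those measures diverge for the facepalm-with-skin-tone emoji discussed in the article (Rust; the grapheme count assumes the unicode-segmentation crate, which is not in the stdlib):

  use unicode_segmentation::UnicodeSegmentation;

  fn main() {
      // U+1F926 facepalm, U+1F3FC skin tone, U+200D ZWJ, U+2642 male sign, U+FE0F variation selector
      let s = "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}";
      println!("UTF-8 bytes:       {}", s.len());                   // 17
      println!("code points:       {}", s.chars().count());         // 5
      println!("grapheme clusters: {}", s.graphemes(true).count()); // 1
      // the bounding box in pixels or points can only come from the font and layout engine
  }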

  • ramses0 a day ago

    "Unicode is JPG for ASCII" is an incredibly great metaphor.

    size(JPG) == bytes? sectors? colors? width? height? pixels? inches? dpi?

  • account42 a day ago

    > Number of code points when parsing.

    You shouldn't really ever care about the number of code points. If you do, you're probably doing something wrong.

    • josephg a day ago

      It’s a bit of a niche use case, but I use the codepoint counts in CRDTs for collaborative text editing.

      Grapheme cluster counts can’t be used because they’re unstable across Unicode versions. Some algorithms use UTF8 byte offsets - but I think that’s a mistake because they make input validation much more complicated. Using byte offsets, there’s a whole lot of invalid states you can represent easily. Eg maybe insert “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint. Then inserting at position 2 is valid again. If you send me an operation which happened at some earlier point in time, I don’t necessarily have the text document you were inserting into handy. So figuring out if your insertion (and deletion!) positions are valid at all is a very complex and expensive operation.

      Codepoints are way easier. I can just accept any integer up to the length of the document at that point in time.
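
      A minimal sketch of that difference (Rust; the function names are illustrative, not from any particular CRDT library):

        // Code point positions: validity is a plain bounds check against a length
        // you can track per document version without keeping the text around.
        fn codepoint_insert_ok(doc_len_in_codepoints: usize, pos: usize) -> bool {
            pos <= doc_len_in_codepoints
        }

        // Byte offsets: the position must also not split a multi-byte UTF-8
        // sequence, which requires the document text as it was at that version.
        fn byte_insert_ok(doc_at_that_version: &str, pos: usize) -> bool {
            doc_at_that_version.is_char_boundary(pos)
        }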

      • account42 a day ago

        > Eg maybe insert “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint.

        You have the same problem with code points, it's just hidden better. Inserting "a" between U+0065 and U+0308 may result in a "valid" string but is still as nonsensical as inserting "a" between UTF-8 bytes 0xC3 and 0xAB.

        This makes code points less suitable than UTF-8 bytes as mistakes are more likely to not be caught during development.

    • torstenvl a day ago

      I really wish people would stop giving this bad advice, especially so stridently.

      Like it or not, code points are how Unicode works. Telling people to ignore code points is telling people to ignore how data works. It's of the same philosophy that results in abstraction built on abstraction built on abstraction, with no understanding.

      I vehemently dissent from this view.

      • shiomiru 9 hours ago

        > Telling people to ignore code points

        Nobody is saying that, the point is that if you're parsing Unicode by counting codepoints you're doing it wrong. The way you actually parse Unicode text (in 99% of cases) is by iterating through the codepoints, and then the actual count is fairly irrelevant, it's just a stream.

        Other uses of codepoint length are also questionable: for measurement it's useless, for bounds checking (random access) it's inefficient. It may be useful in some edge cases, but TFA's point is that a general purpose language's default string type shouldn't optimize for edge cases.

      • dcrazy a day ago

        You’re arguing against a strawman. The advice wasn’t to ignore learning about code points; it’s that if your solution to a problem involves reasoning about code points, you’re probably doing it wrong and are likely to make a mistake.

        Trying to handle code points as atomic units fails even in trivial and extremely common cases like diacritics, before you even get to more complicated situations like emoji variants. Solving pretty much any real-world problem involving a Unicode string requires factoring in canonical forms, equivalence classes, collation, and even locale. Many problems can’t even be solved at the _character_ (grapheme) level—text selection, for example, has to be handled at the grapheme _cluster_ level. And even then you need a rich understanding of those graphemes to know whether to break them apart for selection (ligatures like fi) or keep them intact (Hangul jamo).

        Yes, people should learn about code points. Including why they aren’t the level they should be interacting with strings at.

      • eviks 14 hours ago

        > Telling people to ignore code points is telling people to ignore how data works.

        No, it's telling people that they don't understand how data works, otherwise they'd be using a different unit of measurement

    • [removed] a day ago
      [deleted]
baq a day ago

ASCII is very convenient when it fits in the solution space (it’d better be, it was designed for a reason), but in the global international connected computing world it doesn’t fit at all. The problem is that all the tutorials, especially low-level ones, assume ASCII, both 1) so you can print something to the console and 2) to avoid mentioning that strings are hard, so folks don’t get discouraged.

Notably Rust did the correct thing by defining multiple slightly incompatible string types for different purposes in the standard library and regularly gets flak for it.

  • craftkiller a day ago

    > Notably Rust did the correct thing

    In addition to separate string types, they have separate iterator types that let you explicitly get the value you want. So:

      String.len() == number of bytes
      String.bytes().count() == number of bytes
      String.chars().count() == number of unicode scalar values
      String.graphemes(true).count() == number of grapheme clusters (requires the unicode-segmentation crate, not in the stdlib)
      String.lines().count() == number of lines
    
    Really my only complaint is I don't think String.len() should exist, it's too ambiguous. We should have to explicitly state what we want/mean via the iterators.
    • pron a day ago

      Similar to Java:

         String.chars().count(), String.codePoints().count(), and, for historical reasons, String.getBytes(StandardCharsets.UTF_8).length
    • westurner a day ago

        String.graphemes().count()
      
      That's a real nice API. (Similarly, Python has @ for matmul but there is no implementation of matmul in the stdlib. NumPy has a matmul implementation so that the `@` operator works.)

      ugrapheme and ucwidth are one way to get the grapheme count from a string in Python.

      It's probably possible to get the grapheme cluster count from a string containing emoji characters with ICU?

      • dhosek a day ago

        Any correctly designed grapheme cluster implementation handles emoji characters. It’s part of the spec (says the guy who wrote a Unicode segmentation library for Rust).

  • account42 a day ago

    > in the global international connected computing world it doesn’t fit at all.

    I disagree. Not all text is human prose. For example, there is nothing wrong with a programming language that only allows ASCII in the source code, and there are many downsides to allowing non-ASCII characters outside string constants or comments.

    • andriamanitra 11 hours ago

      > For example, there is nothing wrong with a programming language that only allows ASCII in the source code, and there are many downsides to allowing non-ASCII characters outside string constants or comments.

      That's a tradeoff you should carefully consider because there are also downsides to disallowing non-ASCII characters. The downsides of allowing non-ASCII mostly stem from assigning semantic significance to upper/lowercase (which is itself a tradeoff you should consider when designing a language). The other issue I can think of is homographs but it seems to be more of a theoretical concern than a problem you'd run into in practice.

      When I first learned programming I used my native language (Finnish, which uses 3 non-ASCII letters: åäö) not only for strings and comments but also identifiers. Back then UTF-8 was not yet universally adopted (ISO 8859-1 character set was still relatively common) so I occasionally encountered issues that I had no means to understand at the time. As programming is being taught to younger and younger audiences it's not reasonable to expect kids from (insert your favorite non-English speaking country) to know enough English to use it for naming. Naming and, to an extent, thinking in English requires a vocabulary orders of magnitude larger than knowing the keywords.

      By restricting source code to ASCII only you also lose the ability to use domain-specific notation like mathematical symbols/operators and Greek letters. For example in Julia you may use some mathematical operators (eg. ÷ for Euclidean division, ⊻ for exclusive or, ∈/∉/∋ for checking set membership) and I find it really makes code more pleasant to read.

    • eviks 14 hours ago

      The "nothing wrong" is, of course, this huge issue of not being able to use your native language, especially important when learning something by avoiding the extra language barrier on top of another language barrier

      Now list anything as important from your list of downsides that's just as unfixable

    • simonask a day ago

      This is American imperialism at its worst. I'm serious.

      Lots of people around the world learn programming from sources in their native language, especially early in their career, or when software development is not their actual job.

      Enforcing ASCII is the same as enforcing English. How would you feel if all cooking recipes were written in French? If all music theory was in Italian? If all industrial specifications were in German?

      It's fine to have a dominant language in a field, but ASCII is a product of technical limitations that we no longer have. UTF-8 has been an absolute godsend for human civilization, despite its flaws.

      • 0x000xca0xfe a day ago

        Well I'm not American and I can tell you that we do not see English source code as imperialism.

        In fact it's awesome that we have one common very simple character set and language that works everywhere and can do everything.

        I have only encountered source code using my native language (German) in comments or variable names in highly unprofessional or awful software and it is looked down upon. You will always get an ugly mix and have to mentally stop to figure out which language a name is in. It's simply not worth it.

        Please stop pushing this UTF-8 everywhere nonsense. Make it work great on interactive/UI/user facing elements but stop putting UTF-8-only restrictions in low-level software. Example: Copied a bunch of ebooks to my phone, including one with a mangled non-UTF-8 name. It was ridiculously hard to delete the file as most Android graphical and console tools either didn't recognize it or crashed.

      • jibal a day ago

        It's neither American nor imperialism -- those are both category mistakes.

        Andreas Rumpf, the designer of Nim, is Austrian. All the keywords of Nim are in English, the library function names are in English, the documentation is in English, Rumpf's book Mastering Nim is in English, the other major book for the language, Nim In Action (written by Dominik Picheta, nationality unknown but not American) is in English ... this is not "American imperialism" (which is a real thing that I don't defend), it's for easily understandable pragmatic reasons. And the language parser doesn't disallow non-ASCII characters but it doesn't treat them linguistically, and it has special rules for casefolding identifiers that only recognize ASCII letters, hobbling the use of non-ASCII identifiers because case distinguishes between types and other identifiers. The reason for this lack of handling of Unicode linguistically is simply to make the lexer smaller and faster.

      • account42 a day ago

        Actually, it would be great to have a lingua franca in every field that all participants can understand. Are you also going to complain that biologists and doctors are expected to learn some rudimentary Latin? English being dominant in computing is absolutely a strength and we gain nothing by trying to combat that. Having support for writing your code in other languages is not going to change the fact that most libraries will use English, most documentation will be in English, and most people you can ask for help will understand English. If you want to participate and refuse to learn English you are only shooting yourself in the foot - and if you are going to learn English you may as well do it from the beginning. Also, due to the dominance of English and ASCII in computing history, most languages already have ASCII alternatives for their writing, so even if you need to refer to non-English names you can do that using only ASCII.

      • flohofwoe a day ago

        Calm down, ASCII is a UNICODE compatible encoding for the first 128 UNICODE code points (which map directly to the entire ASCII range). If you need to go beyond that, just 'upgrade' to UTF-8 encoding.

        UNICODE is essentially a superset of ASCII, and the UTF-8 encoding also contains ASCII as a compatible subset (e.g. for the first 128 UNICODE code points, a UTF-8 encoded string is byte-by-byte compatible with the same string encoded in ASCII).

        Just don't use any of the Extended ASCII flavours (e.g. "8-bit ASCII with codepages") - or any of the legacy 'national' multibyte encodings (Shift-JIS etc...) because that's how you get the infamous `?????` or `♥♥♥♥♥` mismatches which are commonly associated with 'ASCII' (but this is not ASCII, but some flavour of Extended ASCII decoded with the wrong codepage).

      • ksenzee a day ago

        I don’t see much difference between the amount of Italian you need for music and the amount of English you need for programming. You can have a conversation about it in your native language, but you’ll be using a bunch of domain-specific terms that may not be in your native language.

        • simonask 7 hours ago

          I agree, but we're talking about identifiers in code you write yourself here. Not the limited vocabulary of keywords, which are easy to memorize in any language. Standard libraries may trip you up, but documentation for those may be available in your native language.

      • nkrisc a day ago

        There was a time when most scientific literature was written in French. People learned French. Before that it was Latin. People learned Latin.

  • bigstrat2003 a day ago

    > in the global international connected computing world it doesn’t fit at all.

    Most people aren't living in that world. If you're working at Amazon or some business that needs to interact with many countries around the globe, sure, you have to worry about text encoding quite a bit. But the majority of software is being written for a much narrower audience, probably for one single language in one single country. There is simply no reason for most programmers to obsess over text encoding the way so many people here like to.

    • arp242 a day ago

      No one is "obsessing" over anything. Reality is there are very few cases where you can use a single 8-bit character set and not run in to problems sooner or later. Say your software is used only in Greece so you use ISO-8859-7 for Greek. That works fine, but now you want to talk to your customer Günther from Germany who has been living in Greece for the last five years, or Clément from France, or Seán from Ireland and oops, you can't.

      Even plain English text can't be represented with plain ASCII (although ISO-8859-1 goes a long way).

      There are some cases where just plain ASCII is okay, but there are quite few of them (and even those are somewhat controversial).

      The solution is to just use UTF-8 everywhere. Or maybe UTF-16 if you really have to.

    • rileymat2 a day ago

      Except, this is a response to emoji support, which does have encoding issues even if your user base is in the US and only speaks English. Additionally, it is easy to have issues with data that your users use from other sources via copy and paste.

    • wat10000 21 hours ago

      Which audience makes it so you don’t have to worry about text encodings?

    • raverbashing a day ago

      This is naive at best

      Here's a better analogy: in the 70s "nobody planned" for names with 's in them. SQL injections, separators, "not in the alphabet", whatever. In the US, where a lot of people with 's in their names live... or double-barrelled names.

      It's a much simpler problem and it still tripped up a lot of people.

      And then you have to support a user with a "funny name" or a business with "weird characters", or you expand your startup to Canada/Mexico and lo and behold...

      • ryandrake a day ago

        Yea, I cringe when I hear the phrase "special characters." They're only special because you, the developer, decided to treat them as special, and that's almost surely going to come back to haunt you at some point in the form of a bug.

  • flohofwoe a day ago

    ASCII is totally fine as an encoding for the first 128 UNICODE code points. If you need to go above those 128 code points, use a different encoding like UTF-8.

    Just never ever use Extended ASCII (8-bits with codepages).

  • [removed] a day ago
    [deleted]
  • eru a day ago

    Python 3 deals with this reasonably sensibly, too, I think. They use UTF-8 by default, but allow you to specify other encodings.

    • ynik a day ago

      Python 3 internally uses UTF-32. When exchanging data with the outside world, it uses the "default encoding" which it derives from various system settings. This usually ends up being UTF-8 on non-Windows systems, but on weird enough systems (and almost always on Windows), you can end up with a default encoding other than UTF-8. "UTF-8 mode" (https://peps.python.org/pep-0540/) fixes this but it's not yet enabled by default (this is planned for Python 3.15).

      • arcticbull a day ago

        Apparently Python uses a variety of internal representations depending on the string itself. I looked it up because I saw UTF-32 and thought there's no way that's what they do -- it's pretty much always the wrong answer.

        It uses Latin-1 for ASCII strings, UCS-2 for strings that contain code points in the BMP and UCS-4 only for strings that contain code points outside the BMP.

        It would be pretty silly for them to explode all strings to 4-byte characters.

    • xigoi a day ago

      I prefer languages where strings are simply sequences of bytes and you get to decide how to interpret them.

      • zahlman a day ago

        Such languages do not have strings. Definitionally a string is a sequence of characters, and more than 256 characters exist. A byte sequence is just an encoding; if you are working with that encoding directly and have to do the interpretation yourself, you are not using a string.

        But if you do want a sequence of bytes for whatever reason, you can trivially obtain that in any version of Python.

      • afiori a day ago

        I would like a UTF-8-optimized bag of bytes where arbitrary byte operations are possible but the buffer keeps track of whether it is valid UTF-8 or not (for every edit of n bytes it should be enough to check about n+8 bytes to validate). Then UTF-8 encoding/decoding becomes a no-op and UTF-8-specific APIs can quickly check whether the string is malformed or not.

      • bawolff a day ago

        Me too.

        The languages that I really don't get are those that force valid UTF-8 everywhere but don't enforce NFC. Which is most of them, but it seems like the worst of both worlds.

        Non-normalized Unicode is just as problematic as non-validated Unicode imo.

      • jibal a day ago

        Python has byte arrays that allow for that, in addition to strings consisting of arrays of Unicode code points.

      • account42 a day ago

        Yes, I always roll my eyes when people complain that C strings or C++'s std::string/string_view don't have Unicode support. They are bags of bytes with support for concatenation. Any other transformation isn't going to have a "correct" way to do it so you need to be aware of what you want anyway.

        • astrange a day ago

          C strings are not bags of bytes because they can't contain 0x00.

xelxebar a day ago

> Number of monospaced font character blocks this string will take up on the screen

Even this has to deal with the halfwidth/fullwidth split in CJK. Even worse, Devanagari has complex rendering rules that actually depend on font choices. AFAIU, the only globally meaningful category here is rendered bounding box, which is obviously font-dependent.

But I agree with the general sentiment. What we really care about is how much space these text blobs take up, whether that be in a DB, in memory, or on the screen.
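
For the monospaced-columns case specifically, there are libraries that implement the Unicode East Asian Width rules; a small sketch in Rust, assuming the unicode-width crate (and, as noted above, Devanagari shaping and font fallback are beyond what any such width table can capture):

  use unicode_width::UnicodeWidthStr;

  fn main() {
      println!("{}", "hello".width());  // 5 terminal columns
      println!("{}", "日本語".width()); // 6 columns: fullwidth CJK counts as 2 each
  }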

xg15 a day ago

It gets more complicated if you do substring operations.

If I do s.charAt(x) or s.codePointAt(x) or s.substring(x, y), I'd like to know which values for x and y are valid and which aren't.

  • arcticbull a day ago

    Substring operations (and more generally the universe of operations where there is more than one string involved) are a whole other kettle of fish. Unicode, being a byte code format more than what you think of as a logical 'string' format, has multiple ways of representing the same strings.

    If you take a substring of a(bc) and compare it to string (bc) are you looking for bitwise equivalence or logical equivalence? If the former it's a bit easier (you can just memcmp) but if the latter you have to perform a normalization to one of the canonical forms.
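
    A small sketch of bitwise vs. logical comparison (Rust, assuming the unicode-normalization crate):

      use unicode_normalization::UnicodeNormalization;

      fn main() {
          let a = "e\u{0301}"; // "é" as 'e' plus a combining acute accent
          let b = "\u{00E9}";  // "é" as a single precomposed code point
          assert_ne!(a, b); // bitwise: different byte sequences
          let (na, nb): (String, String) = (a.nfc().collect(), b.nfc().collect());
          assert_eq!(na, nb); // logical: equal after normalizing both to NFC
      }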

    • jibal a day ago

      "Unicode, being a byte code format"

      UTF-8 is a byte code format; Unicode is not. In Python, where all strings are arrays of Unicode code points, substrings are likewise arrays of Unicode code points.

      • zahlman a day ago

        The point is that not all sequences of characters ("code point" means the integer value, whereas "character" means the thing that number represents) are valid.

    • setr a day ago

      I’m fairly positive the answer is trivially logical equivalence for pretty much any substring operation. I can’t imagine bitwise equivalence to ever be the “normal” use case, except to the implementer looking at it as a simpler/faster operation

      I feel like if you’re looking for bitwise equivalence or similar, you should have to cast to some kind of byte array and access the corresponding operations accordingly

      • arcticbull a day ago

        Yep, for a substring against its parent or other substrings of the same parent that’s definitely true, but I think this question generalizes, because the case where you’re comparing strings solely within themselves is an optimization path for the more general one. I’m just thinking out loud.

  • account42 a day ago

    > s.charAt(x) or s.codePointAt(x)

    Neither of these are really useful unless you are implementing a font renderer or low level Unicode algorithm - and even then you usually only want to get the next code point rather than one at an arbitrary position.

  • mseepgood a day ago

    The values for x and y shouldn't come from your brain, though (with the exception of 0). They should come from previous index operations like s.indexOf(...) or s.search(regex), etc.

    • xg15 a day ago

      Indeed. Or s.length, whatever that represents.

jlarocco a day ago

It's definitely worth thinking about the real problem, but I wouldn't say it's never helpful.

The underlying issue is unit conversion. "length" is a poor name because it's ambiguous. Replacing "length" with three functions - "lengthInBytes", "lengthInCharacters", and "lengthCombined" - would make it a lot easier to pick the right thing.

perching_aix 19 hours ago

> Number of monospaced font character blocks this string will take up on the screen

To predict the pixel width of a given text, right?

One thing I ran into is that despite certain fonts being monospace, characters from different Unicode blocks would have unexpected lengths. Like I'd have expected half-width CJK letters to render to the same pixel dimensions as Latin letters do, but they don't. It's ever so slightly off. Same with full-width CJK letters vs two Latin letters.

I'm not sure if this is due to some font fallback. I'd have expected e.g. VS Code to be able to render Japanese and English monospace in an aligned way without any fallbacks. Maybe once I have energy again to waste on this I'll look into it deeper.

  • oefrha 18 hours ago

    (Some but not all) terminal emulators are capable of rendering CJK perfectly aligned with Latin even when mixing fonts. Browsers are fundamentally incapable of that because aligning characters in different fonts wasn’t a goal at all. VS Code being a webview under the hood means it inherited this fundamental incapability.* Therefore, don’t hold your breath.

    * I'm talking about the DOM route, not <canvas> obviously. VS Code is powered by Monaco, which is DOM-based, not canvas-based. You can use "Developer: Toggle Developer Tools" to see the DOM structure under the hood.

    ** I should further qualify my statement as browsers are fundamentally incapable of this if you use native text node rendering. I have built a perfectly monospace mixed CJK and Latin interface myself by wrapping each full width character in a separate span. Not exactly a performance-oriented solution. Also IIRC Safari doesn’t handle lengths in fractional pixels very well.

    • perching_aix 18 hours ago

      That's very informative, thanks! I guess it really wasn't a mirage or me messing up in the end then.

guappa a day ago

What if you need to find 5 letter words to play wordle? Why do you care how many bytes they occupy or how large they are on screen?

  • xigoi a day ago

    In the case of Wordle, you know the exact set of letters you’re going to be using, which easily determines how to compute length.

    • guappa a day ago

      No no, I want to create tomorrow's puzzle.

      • tomsmeding a day ago

        As the parent said:

        > In the case of Wordle, you know the exact set of letters you’re going to be using

        This holds for the generator side too. In fact, you have a fixed word list, and the fixed alphabet tells you what a "letter" is, and thus how to compute length. Because this concerns natural language, this will coincide with grapheme clusters, and with English Wordle, that will in turn correspond to byte length because it won't give you words with é (I think). In different languages the grapheme clusters might be larger than 1 byte (e.g. [1], where they're codepoints).

  • taneq a day ago

    If you're playing at this level, you need to define:

    - letter

    - word

    - 5 :P

    • guappa a day ago

      Eh, in Macedonian they have some letters that in Russian are just 2 separate letters.

      • CorrectHorseBat a day ago

        In German you have the same, only within one language. ß can be written as ss if it isn't available in a font, and only in 2017 did they add a capital version. So depending on the font and the Unicode version, the number of letters can differ.

      • int_19h 19 hours ago

        That's not really any different than the distinction (or lack thereof) between "ae" and "æ". For that matter, in Russian there is a letter "ы" which is historically a digraph consisting of two separate letters "ъ" and "i" that just happens to have been treated as a single letter for so long that few people would even recognize it as a digraph. This kind of stuff is all language-specific, which is why for Wordle etc. you always need to be aware of the context, and this context will then unambiguously decide what constitutes a single letter.

Semaphor a day ago

FWIW, I frequently want the string length. Not for anything complicated, but our authors have ranges of characters they are supposed to stay in. Luckily no one uses emojis or weird unicode symbols, so in practice there’s no problem getting the right number by simply ignoring all the complexities.

  • tomsmeding a day ago

    It's not unlikely that what you would ideally use here is the number of grapheme clusters. What is the length of "ë"? Either 1 or 2 codepoints depending on the encoding (combining [1] or single codepoint [2]), and either 1 byte (Latin-1), 2 bytes (UTF-8 single-codepoint) or 3 bytes (UTF-8 combining).

    The metrics you care about are likely number of letters from a human perspective (1) or the number of bytes of storage (depends), possibly both.

    [1]: https://tomsmeding.com/unicode#U+65%20U+308 [2]: https://tomsmeding.com/unicode#U+EB
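
    A quick illustration of those counts (Rust, standard library only):

      fn main() {
          let combining   = "e\u{0308}"; // [1]: 'e' plus a combining diaeresis
          let precomposed = "\u{00EB}";  // [2]: 'ë' as one precomposed code point
          println!("{} vs {} code points", combining.chars().count(), precomposed.chars().count()); // 2 vs 1
          println!("{} vs {} UTF-8 bytes", combining.len(), precomposed.len());                     // 3 vs 2
      }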

capitainenemo a day ago

FWIW, the cheap lazy way to get "number of bytes in DB" from JS, is unescape(encodeURIComponent("ə̀")).length

TZubiri 19 hours ago

How about iterating over every character in a string in order to find a specific character combination? I need (or the iterator needs) to know the length of the string and where the boundaries of each character are.

bluecalm a day ago

What about implementing text algorithms like prefix search or a suffix tree to mention the simplest ones? Don't you need a string length at various points there?

  • account42 a day ago

    With UTF-8 you can implement them on top of bytes.

    • jlarocco a day ago

      That's basically what a string data type is for.

zwnow a day ago

I actually want string length. Just give me the length of a word. My human brain wants a human way to think about problems. While programming I never think about bytes.

  • dwb a day ago

    The whole point is that string length doesn’t necessarily give you the “length” of a “word”, and both of those terms are not well enough defined.

  • jibal a day ago

    The point is that those terms are ambiguous ... and if you mean the length in grapheme clusters, it can be quite expensive to calculate it, and isn't the right number if you're dealing with strings as objects that are chunks of memory.

    • zwnow a day ago

      [flagged]

      • zahlman a day ago

        It is not possible to write correct code without understanding what you dismiss here.

      • tomsmeding a day ago

        If the validation rules don't specify (either explicitly or implicitly) what the "length" in the rule corresponds to (if it concerns a natural-language field, it's probably grapheme clusters), then either you should fix the rule, or you care only about checking the box of "I checked the validation rules", in which case it's a people problem and not a technology problem.

      • dwb a day ago

        You are in the wrong job if you don’t want to think about “nerd shit” while programming.

  • int_19h 19 hours ago

    Humans speak many different languages. Not all of them are English, and not all of them have writing systems which make it meaningful to talk about "string length" without disambiguating further.

thrdbndndn a day ago

I see where you're coming from, but I disagree on some specifics, especially regarding bytes.

Most people care about the length of a string in terms of the number of characters.

Treating it as a proxy for the number of bytes has been incorrect ever since UTF-8 became the norm (basically forever), especially if you're dealing with anything beyond ASCII (which you really should be, since East Asian users alone number in the billions).

Same goes for "string width".

Yes, Unicode scalar values can combine into a single glyph and cause discrepancies, as the article mentions, but that is a much rarer edge case than simply handling non-ASCII text.

  • account42 a day ago

    It's not rare at all - multi-code point emojis are pretty standard these days.

    And before that, the only thing the relative rarity did for you was that bugs in code working on UTF-8 bytes got fixed, while bugs that assumed UTF-16 units or 32-bit code points represent a character were left to linger for much longer.

bigstrat2003 a day ago

I have never wanted any of the things you said. I have, on the other hand, always wanted the string length. I'm not saying that we shouldn't have methods like what you state, we should! But your statement that people don't actually want string length is untrue because it's overly broad.

  • zahlman a day ago

    > I have, on the other hand, always wanted the string length.

    In an environment that supports advanced Unicode features, what exactly do you do with the string length?

    • PapaPalpatine 20 hours ago

      I don’t know about advanced Unicode features… but I use them all the time as a backend developer to validate data input.

      I want to make sure that the password is within a given range of characters. Same with phone numbers, email addresses, etc.

      This seems to have always been known as the length of the string.

      This thread sounds like a bunch of scientists trying to make a simple concept a lot harder to understand.

      • crazygringo 3 hours ago

        Practically speaking, for maximum lengths, you generally want to limit code points or bytes, not characters. You don't want to allow some ZALGO monstrosity in a password that is 5 characters but 500 bytes.

        For exact lengths, you often have a restricted character set (like for phone numbers) and can validate both characters and length with a regex. Or the length in bytes works for 0–9.

        Unless you're involved in text layout, you actually usually don't wind up needing the exact length in characters of arbitrary UTF-8 text.
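
        A minimal sketch of that kind of cap (Rust; the limits here are made-up, not a recommendation):

          // Cap by code points and by UTF-8 bytes rather than by grapheme clusters,
          // so a 5-"character" Zalgo monstrosity that weighs 500 bytes is still rejected.
          fn password_length_ok(pw: &str) -> bool {
              let code_points = pw.chars().count();
              (8..=128).contains(&code_points) && pw.len() <= 512
          }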

      • int_19h 19 hours ago

        If you restrict the input to ASCII, then it makes sense to talk about "string length" in this manner. But we're not talking about Unicode strings at all then.

        If you do allow Unicode characters in whatever it is you're validating, then your approach is almost certainly wrong for some valid input.

      • zahlman 19 hours ago

        > I want to make sure that the password is between a given number of characters. Same with phone numbers, email addresses, etc.

        > This seems to have always been known as the length of the string.

        Sure. And by this definition, the string discussed in TFA (that consists of a facepalm emoji with a skin tone set) objectively has 5 characters in it, and therefore a length of 5. And it has always had 5 characters in it, since it was first possible to create such a string.

        Similarly, "é" has one character in it, but "é" has two despite appearing visually identical. Furthermore, those two strings will not compare equal in any sane programming language without explicit normalization (unless HN's software has normalized them already). If you allow passwords or email addresses to contain things like this, then you have to reckon with that brute fact.

        None of this is new. These things have fundamentally been true since the introduction of Unicode in 1991.

  • wredcoll 20 hours ago

    Which length? Bytes? Code points? Graphemes? Pixels?

  • justsomehnguy 19 hours ago

    Guessing from the other comments you missed the byte length for the codepoints.

    When I'm comparing human-readable strings I want the length. In all other cases I want sizeof(string) and it's... quite a variable thing.

sigmoid10 a day ago

I have wanted string length many times in production systems for language processing. And it is perfectly fine as long as whatever you are using is consistent. I rarely care how many bytes an emoji actually is unless I'm worried about extreme efficiency in storage or how many monospace characters it uses unless I do very specific UI things. This blog is more of a cautionary tale what can happen if you unconsciously mix standards e.g. by using one in the backend and another in the frontend. But this is not a problem of string lengths per se, they are just one instance where modern implementations are all over the place.