Comment by zahlman

Comment by zahlman 20 hours ago

> there are values in the former that are absent in the latter, and again, this is why encoding to utf8 or any utf encoding is fallible in Python.

Yes, yes, the `str` type may contain data that doesn't represent a valid string. I've already explained elsewhere ITT that this is a feature.

And sure, pedantically it should be "UCS-4" rather than UTF-32 in my post, since a `str` object can be created which contains surrogates. But Python does not use surrogate pairs in representing text. It can only store surrogates as individual code points, and it treats those as invalid at encoding time.
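A quick sketch of what I mean, assuming CPython 3 and the default strict error handler (the variable names are just for illustration):

```python
# A str can hold a lone surrogate as a single code point...
lone = "\ud800"
lone_len = len(lone)

# ...but strict encoding rejects it instead of emitting bytes for it.
try:
    lone.encode("utf-8")
    encoded_ok = True
except UnicodeEncodeError:
    encoded_ok = False

# A character outside the BMP is stored as one code point,
# not as a surrogate pair.
astral = "\U0001F926"
astral_len = len(astral)

print(lone_len, encoded_ok, astral_len)
```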

Whenever a `str` represents a valid string without surrogates, it will reliably encode to any UTF encoding. And when bytes are decoded, surrogates are not produced except where explicitly requested for error handling.
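To make the "explicitly requested" part concrete, here is a small sketch using the standard `surrogateescape` error handler. The sample bytes are made up for the example:

```python
# Invalid UTF-8: 0xFF can never appear in a well-formed UTF-8 stream.
raw = b"abc\xff"

# Strict decoding raises; it does not quietly invent surrogates.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Only when asked does the decoder smuggle the bad byte through
# as a lone surrogate (0xFF becomes U+DCFF).
escaped = raw.decode("utf-8", errors="surrogateescape")
has_surrogate = any(0xD800 <= ord(c) <= 0xDFFF for c in escaped)

# The same handler restores the original bytes on the way out.
round_trip = escaped.encode("utf-8", errors="surrogateescape")

print(strict_ok, has_surrogate, round_trip == raw)
```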

> The number of Unicode scalar values in the string. (If the string were encoded in UTF-32, the length of that array.)

Ah.

Good news: since Python doesn't use surrogate pairs to represent valid text, these are the same whenever the `str` contents represent a valid text string in Python. And the cases where they differ are rare and more or less must be deliberately crafted. You don't even get them from malicious user input if you process input in the obvious ways.
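If you want to check that yourself, something like the following should do, assuming CPython 3. The helper names `utf32_units` and `scalar_count` are mine, not anything from the standard library:

```python
def utf32_units(s: str) -> int:
    """Length of the array if s were encoded as UTF-32 (no BOM)."""
    return len(s.encode("utf-32-le")) // 4


def scalar_count(s: str) -> int:
    """Number of Unicode scalar values: code points that are not surrogates."""
    return sum(1 for c in s if not 0xD800 <= ord(c) <= 0xDFFF)


# For valid text, all three measures agree.
samples = ["hello", "na\u00efve", "\u65e5\u672c\u8a9e", "\U0001F600!"]
all_agree = all(
    len(s) == utf32_units(s) == scalar_count(s) for s in samples
)

# They only diverge for a deliberately crafted str with a lone surrogate.
crafted = "a\ud800"
crafted_len = len(crafted)
crafted_scalars = scalar_count(crafted)

print(all_agree, crafted_len, crafted_scalars)
```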

> The Unicode definition of "character" is not a technical definition, it's just there to help humans.

You're missing the point. The facepalm emoji has 5 characters in it. The Unicode Consortium says so. And they are, indisputably, the ones who get to decide what a "character" is in the context of Unicode.
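For reference, a sketch that lists them. I'm assuming the emoji in question is the usual "man facepalming, medium-light skin tone" sequence, written out with escapes below; the exact sequence is my assumption, and the names printed depend on the Unicode database shipped with your Python:

```python
import unicodedata

# Assumed sequence: face palm, skin tone modifier, zero width joiner,
# male sign, variation selector.
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

code_points = [f"U+{ord(c):04X}" for c in facepalm]
names = [unicodedata.name(c, "<unnamed>") for c in facepalm]

n_code_points = len(facepalm)
n_utf16_units = len(facepalm.encode("utf-16-le")) // 2
n_utf8_bytes = len(facepalm.encode("utf-8"))

for cp, name in zip(code_points, names):
    print(cp, name)
print(n_code_points, n_utf16_units, n_utf8_bytes)
```

So `len()` reports 5 here, while the UTF-16 and UTF-8 sizes come out as 7 code units and 17 bytes respectively.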

I linked to the glossary on unicode.org. I don't understand how it could get any more official than that.

Or do you know another word for "the thing that an assigned Unicode code point has been assigned to"? cf. also the definition of "encoded character" at https://www.unicode.org/glossary/#encoded_character, and note that definition 2 for "character" is "synonym of abstract character".