Comment by zahlman
>"Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type"
"the Python string type" is neither "UTF-16" nor "well-formed", and there are very deliberate design decisions behind this.
Since the introduction of https://peps.python.org/pep-0393/ in Python 3.3, Python does not use anything that can be called "UTF-16", regardless of compilation options. (Before that, from Python 2.2 onward, the behaviour followed https://peps.python.org/pep-0261/ : you could compile either a "narrow" build using proper UTF-16 with surrogate pairs, or a "wide" build using UTF-32.)
Instead, every code point is now represented as a separate storage element (as it would be in UTF-32), except that the storage is chosen dynamically per string as 1, 2 or 4 bytes per element, depending on the widest code point the string contains. (It furthermore sets a flag on 1-byte-per-element strings according to whether they are pure ASCII or contain code points in the 128..255 range.)
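You can see the 1/2/4-byte storage directly with sys.getsizeof. (A quick sketch; subtracting the size of a one-character string of the same kind cancels out the fixed per-object overhead, which varies between builds, leaving just the per-element cost.)
>>> import sys
>>> sys.getsizeof('a' * 1000) - sys.getsizeof('a')  # ASCII: 1 byte per element
999
>>> sys.getsizeof('\u0100' * 1000) - sys.getsizeof('\u0100')  # BMP beyond latin-1: 2 bytes
1998
>>> sys.getsizeof('\U00010000' * 1000) - sys.getsizeof('\U00010000')  # astral: 4 bytes
3996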
Meanwhile, `str` can store surrogates even though Python doesn't use them normally; errors only surface at encoding time:
>>> x = '\ud800\udc00'
>>> x
'\ud800\udc00'
>>> print(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
They're even disallowed for an explicit encode to UTF-16:
>>> x.encode('utf-16')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16' codec can't encode character '\ud800' in position 0: surrogates not allowed
But this can be overridden:
>>> x.encode('utf-16-le', 'surrogatepass')
b'\x00\xd8\x00\xdc'
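The same handler works with the UTF-8 codec too, which encodes each surrogate as its own three-byte sequence rather than combining the pair:
>>> x.encode('utf-8', 'surrogatepass')
b'\xed\xa0\x80\xed\xb0\x80'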
Which subsequently allows for decoding that automatically interprets surrogate pairs:
>>> y = x.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
>>> y
'𐀀'
>>> len(y)
1
>>> ord(y)
65536
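That matches the standard UTF-16 arithmetic: take the low 10 bits of each surrogate and add 0x10000:
>>> hi, lo = 0xD800, 0xDC00
>>> 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
65536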
Storing surrogates in `str` is used for smuggling binary data through a string. For example, the runtime does it so that it can try to interpret command line arguments as UTF-8 by default, but still allow arbitrary (non-null) bytes to be passed (since that's a thing on Linux):
$ cat cmdline.py
#!/usr/bin/python
import binascii, sys
for arg in sys.argv[1:]:
    # argv was decoded with the filesystem encoding and 'surrogateescape',
    # so encoding the same way recovers the original bytes
    abytes = arg.encode(sys.getfilesystemencoding(), 'surrogateescape')
    ahex = binascii.hexlify(abytes)
    print(ahex.decode('ascii'))
$ ./cmdline.py foo
666f6f
$ ./cmdline.py 日本語
e697a5e69cace8aa9e
$ ./cmdline.py $'\x01\x00\x02'
01
$ ./cmdline.py $'\xff'
ff
$ ./cmdline.py ÿ
c3bf
It does this by decoding with the same 'surrogateescape' error handler that the script above needs when re-encoding:
>>> b'\xff'.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
>>> b'\xff'.decode('utf-8', 'surrogateescape')
'\udcff'
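And re-encoding with 'surrogateescape' reproduces the original bytes exactly, which is what makes the smuggling lossless:
>>> raw = b'\xff\xfeplain ascii'
>>> smuggled = raw.decode('utf-8', 'surrogateescape')
>>> smuggled
'\udcff\udcfeplain ascii'
>>> smuggled.encode('utf-8', 'surrogateescape') == raw
True
(On POSIX, os.fsdecode and os.fsencode are the stdlib shorthand for this decode/encode pair, using the filesystem encoding.)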