Comment by orangeboats 13 hours ago
Sometimes it's not just "your code". Strings are often interchanged and sent to many other parties.
And some of the codepoints, such as the surrogate codepoints (which MUST come in pairs in properly encoded UTF-16), may not break your code but break poorly-written spaghetti-ridden UTF-16-based hellholes that do not expect unpaired surrogates.
Something like:
1. You send a UTF-8 string containing normal characters and an unpaired surrogate, "Hello \uDEADworld", to FooApp.
2. FooApp converts the UTF-8 string to UTF-16 and saves it in a file. All without validation, so no crashes will actually occur; worst case scenario, the unpaired surrogate is rendered by the frontend as "�".
3. The next time FooApp reads the file, it expects well-formed UTF-16, and it crashes because of the unpaired surrogate.
(A more fatal failure mode of (3) is an out-of-bounds memory read if the unpaired surrogate falls at the end of the string.)
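For what it's worth, here is a rough Python sketch of the three steps above. FooApp is imaginary, so `simulate_fooapp`, `read_back_strict`, and `contains_surrogate` are made-up names, and the `surrogatepass` error handler is only standing in for "no validation"; a real UTF-16 codebase would look different.

```python
def contains_surrogate(text: str) -> bool:
    """Return True if the string holds any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def simulate_fooapp(text: str) -> bytes:
    """Hypothetical FooApp: accept the string and store it as UTF-16, unvalidated."""
    # Step 1: the sender ships "UTF-8" that smuggles a surrogate through.
    wire = text.encode("utf-8", errors="surrogatepass")
    # Step 2: FooApp decodes and re-encodes without checking anything.
    received = wire.decode("utf-8", errors="surrogatepass")
    return received.encode("utf-16-le", errors="surrogatepass")


def read_back_strict(stored: bytes) -> str:
    """Step 3: a later read that assumes well-formed UTF-16."""
    return stored.decode("utf-16-le")  # raises UnicodeDecodeError on a lone surrogate


if __name__ == "__main__":
    message = "Hello \udeadworld"
    stored = simulate_fooapp(message)  # no error here, the bad data is now on disk
    try:
        read_back_strict(stored)
    except UnicodeDecodeError as exc:
        print("strict read failed:", exc)
    # A lenient reader substitutes U+FFFD instead of failing.
    print(ascii(stored.decode("utf-16-le", errors="replace")))
```

The cheap fix is to validate at the boundary: either call something like `contains_surrogate` on incoming text, or just use the strict codecs, since a plain `message.encode("utf-8")` already refuses the lone surrogate.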
I had a GitHub Action with a phrase like 'filter: \directory\u02filename.txt', or something close to it, and the filename got interpreted as a UTF-8 character rather than a string literal, causing the application to throw an error about invalid UTF-8 in the path. I had to set it up to quote the strings differently, but you get to see a lot of these issues in the wild.