Comment by quectophoton

Comment by quectophoton 9 hours ago

47 replies

Having the continuation bytes always start with the bits `10` also make it possible to seek to any random byte, and trivially know if you're at the beginning of a character or at a continuation byte like you mentioned, so you can easily find the beginning of the next or previous character.

If the characters were instead encoded like EBML's variable size integers[1] (but inverting 1 and 0 to keep ASCII compatibility for the single-byte case), and you do a random seek, it wouldn't be as easy (or maybe not even possible) to know if you landed on the beginning of a character or in one of the `xxxx xxxx` bytes.

[1]: https://www.rfc-editor.org/rfc/rfc8794#section-4.4

Animats 9 hours ago

Right. That's one of the great features of UTF-8. You can move forwards and backwards through a UTF-8 string without having to start from the beginning.

Python has had troubles in this area. Because Python strings are indexable by character, CPython used wide characters. At one point you could pick 2-byte or 4-byte characters when building CPython. Then that switch was made automatic at run time. But it's still wide characters, not UTF-8. One emoji and your string size quadruples.

I would have been tempted to use UTF-8 internally. Indices into a string would be an opaque index type which behaved like an integer to the extent that you could add or subtract small integers, and that would move you through the string. If you actually converted the opaque type to a real integer, or tried to subscript the string directly, an index to the string would be generated. That's an unusual case. All the standard operations, including regular expressions, can work on a UTF-8 representation with opaque index objects.

  • nostrademons 8 hours ago

    PyCompactUnicodeObject was introduced with Python 3.3, and uses UTF-8 internally. It's used whenever both size and max code point are known, which is most cases where it comes from a literal or bytes.decode() call. Cut memory usage in typical Django applications by 2/3 when it was implemented.

    https://peps.python.org/pep-0393/

    I would probably use UTF-8 and just give up on O(1) string indexing if I were implementing a new string type. It's very rare to require arbitrary large-number indexing into strings. Most use-cases involve chopping off a small prefix (eg. "hex_digits[2:]") or suffix (eg. "filename[-3:]"), and you can easily just linear search these with minimal CPU penalty. Or they're part of library methods where you want to have your own custom traversals, eg. .find(substr) can just do Boyer-Moore over bytes, .split(delim) probably wants to do a first pass that identifies delimiter positions and then use that to allocate all the results at once.

    • barrkel 5 hours ago

      You usually want O(1) indexing when you're implementing views over a large string. For example, a string containing a possibly multi-megabyte text file and you want to avoid copying out of it, and work with slices where possible. Anything from editors to parsing.

      I agree though that usually you only need iteration, but string APIs need to change to return some kind of token that encapsulates both logical and physical index. And you probably want to be able to compute with those - subtract to get length and so on.

      • ori_b 4 hours ago

        You don't particularly want indexing for that, but cursors. A byte offset (wrapped in an opaque type) is sufficient for that need.

      • naniwaduni 3 hours ago

        You really just very rarely want codepoint indexing. A byte index is totally fine for view slices.

      • nostrademons 4 hours ago

        Sure, but for something like that whatever constructs the view can use an opaque index type like Animats suggested, which under the hood is probably a byte index. The slice itself is kinda the opaque index, and then it can just have privileged access to some kind of unsafe_byteIndex accessor.

        There are a variety of reasons why unsafe byte indexing is needed anyway (zero-copy?), it just shouldn’t be the default tool that application programmers reach for.

  • btown 9 hours ago

    This is Python; finding new ways to subscript into things directly is a graduate student’s favorite pastime!

    In all seriousness I think that encoding-independent constant-time substring extraction has been meaningful in letting researchers outside the U.S. prototype, especially in NLP, without worrying about their abstractions around “a 5 character subslice” being more complicated than that. Memory is a tradeoff, but a reasonably predictable one.

  • johncolanduoni 6 hours ago

    Your solution is basically what Swift does. Plus they do the same with extended grapheme clusters (what a human would consider distinct characters mostly), and that’s the default character type instead of Unicode code point. Easily the best Unicode string support of any programming language.

  • kccqzy 6 hours ago

    Indices into a Unicode string is a highly unusual operation that is rarely needed. A string is Unicode because it is provided by the user or a localized user-facing string. You don't generally need indices.

    Programmer strings (aka byte strings) do need indexing operations. But such strings usually do not need Unicode.

    • mjevans 4 hours ago

      They can happen to _be_ Unicode. Composition operations (for fully terminated Unicode strings) should work, but require eventual normalization.

      That's the other part of the resume UTF8 strings mid way, even combining broken strings still results in all the good characters present.

      Substring operations are more dicey; those should be operating with known strings. In pathological cases they might operate against portions of Unicode bits... but that's as silly as using raw pointers and directly mangling the bytes without any protection or design plans.

  • zahlman 5 hours ago

    > If you actually converted the opaque type to a real integer, or tried to subscript the string directly, an index to the string would be generated.

    What conversion rule do you want to use, though? You either reject some values outright, bump those up or down, or else start with a character index that requires an O(N) translation to a byte index.

sparkie 4 hours ago

VLQ/LEB128 are a bit better than the EBML's variable size integers. You test the MSB in the byte - `0` means it's the end of a sequence and the next byte is a new sequence. If the MSB is `1`, to find the start of the sequence you walk back until you find the first zero MSB at the end of the previous sequence (or the start of the stream). There are efficient SIMD-optimized implementations of this.

The difference between VLQ and LEB128 is endianness, basically whether the zero MSB is the start or end of a sequence.

    0xxxxxxx                   - ASCII
    1xxxxxxx 0xxxxxxx          - U+0080 .. U+3FFF
    1xxxxxxx 1xxxxxxx 0xxxxxxx - U+4000 .. U+10FFFD

                      0xxxxxxx - ASCII
             0xxxxxxx 1xxxxxxx - U+0080 .. U+3FFF
    0xxxxxxx 1xxxxxxx 1xxxxxxx - U+4000 .. U+10FFFD
It's not self-synchronizing like UTF-8, but it's more compact - any unicode codepoint can fit into 3 bytes (which can encode up to 0x1FFFFF), and ASCII characters remain 1 byte. Can grow to arbitrary sizes. It has a fixed overhead of 1/8, whereas UTF-8 only has overhead of 1/8 for ASCII and 1/3 thereafter. Could be useful compressing the size of code that uses non-ASCII, since most of the mathematical symbols/arrows are < U+3FFF. Also languages like Japanese, since Katakana and Hiragana are also < U+3FFF, and could be encoded in 2 bytes rather than 3.
  • kstenerud 3 hours ago

    Unfortunately, VLQ/LEB128 is slow to process due to all the rolling decision points (one decision point per byte, with no ability to branch predict reliably). It's why I used a right-to-left unary code in my stuff: https://github.com/kstenerud/bonjson/blob/main/bonjson.md#le...

      | Header     | Total Bytes | Payload Bits |
      | ---------- | ----------- | ------------ |
      | `.......1` |      1      |       7      |
      | `......10` |      2      |      14      |
      | `.....100` |      3      |      21      |
      | `....1000` |      4      |      28      |
      | `...10000` |      5      |      35      |
      | `..100000` |      6      |      42      |
      | `.1000000` |      7      |      49      |
      | `10000000` |      8      |      56      |
      | `00000000` |      9      |      64      |
    
    The full value is stored little endian, so you simply read the first byte (low byte) in the stream to get the full length, and it has the exact same compactness of VLQ/LEB128 (7 bits per byte).

    Even better: modern chips have instructions that decode this field in one shot (callable via builtin):

    https://github.com/kstenerud/ksbonjson/blob/main/library/src...

        static inline size_t decodeLengthFieldTotalByteCount(uint8_t header) {
            return (size_t)__builtin_ctz(header) + 1;
        }
    
    After running this builtin, you simply re-read the memory location for the specified number of bytes, then cast to a little-endian integer, then shift right by the same number of bits to get the final payload - with a special case for `00000000`, although numbers that big are rare. In fact, if you limit yourself to max 56 bit numbers, the algorithm becomes entirely branchless (even if your chip doesn't have the builtin).

    https://github.com/kstenerud/ksbonjson/blob/main/library/src...

    It's one of the things I did to make BONJSON 35x faster to decode/encode compared to JSON.

    https://github.com/kstenerud/bonjson

    If you wanted to maintain ASCII compatibility, you could use a 0-based unary code going left-to-right, but you lose a number of the speed benefits of a little endian friendly encoding (as well as the self-synchronization of UTF-8 - which admittedly isn't so important in the modern world of everything being out-of-band enveloped and error-corrected). But it would still be a LOT faster than VLQ/LEB128.

    • sparkie 2 hours ago

      We can do better than one branch per byte - we can have it per 8-bytes at least.

      We'd use `vpmovb2m`[1] on a ZMM register (64-bytes at a time), which fills a 64-bit mask register with the MSB of each byte in the vector.

      Then process the mask register 1 byte at a time, using it as an index into a 256-entry jump table. Each entry would be specialized to process the next 8 bytes without branching, and finish with conditional branch to the next entry in the jump table or to the next 64-bytes. Any trailing ones in each byte would simply add them to a carry, which would be consumed up to the most significant zero in the next eightbytes.

      [1]:https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

      • kstenerud an hour ago

        Sure, but with the above algorithm you could do it in zero branches, and in parallel if you like.

        • sparkie 42 minutes ago

          Decoding into integers may be faster, but it's kind of missing the point why I suggested VLQs as opposed to EBML's variable length integers - they're not a good fit for string handling. In particular, if we wanted to search for a character or substring we'd have to start from the beginning of the stream and traverse linearly, because there's no synchronization - the payload bytes are indistinguishable from header bytes, making a parallel search not practical.

          While you might be able to have some heuristic to determine whether a character is a valid match, it may give false positives and it's unlikely to be as efficient as "test if the previous byte's MSB is zero". We can implement parallel search with VLQs because we can trivially synchronize the stream to next nearest character in either direction - it's partially-synchronizing.

          Obviously not as good as UTF-8 or UTF-16 which are self-synchronizing, but it can be implemented efficiently and cut encoding size.

deepsun 9 hours ago

That's assuming the text is not corrupted or maliciously modified. There were (are) _numerous_ vulnerabilities due to parsing/escaping of invalid UTF-8 sequences.

Quick googling (not all of them are on-topic tho):

https://www.rapid7.com/blog/post/2025/02/13/cve-2025-1094-po...

https://www.cve.org/CVERecord/SearchResults?query=utf-8

  • s1mplicissimus 7 hours ago

    I was just wondering a similar thing: If 10 implies start of character, doesn't that require 10 to never occur inside the other bits of a character?

    • gavinsyancey 6 hours ago

      Generally you can assume byte-aligned access. So every byte of UTF-8 either starts with 0 or 11 to indicate an initial byte, or 10 to indicate a continuation byte.

    • pklausler 6 hours ago

      10 never implies the start of a character; those begin with 0 or 11.

    • dbaupp 6 hours ago

      UTF-8 encodes each character into a whole number of bytes (8, 16, 24, or 32 bits), and the 10 continuation marker is only at the start of the extra continuation bytes, it is just data when that pattern occurs within a byte.

      You are correct that it never occurs at the start of a byte that isn’t a continuation bytes: the first byte in each encoded code point starts with either 0 (ASCII code points) or 11 (non-ASCII).

PaulHoule 9 hours ago

It's not uncommon when you want variable length encodings to write the number of extension bytes used in unary encoding

https://en.wikipedia.org/wiki/Unary_numeral_system

and also use whatever bits are left over encoding the length (which could be in 8 bit blocks so you write 1111/1111 10xx/xxxx to code 8 extension bytes) to encode the number. This is covered in this CS classic

https://archive.org/details/managinggigabyte0000witt

together with other methods that let you compress a text + a full text index for the text into less room than text and not even have to use a stopword list. As you say, UTF-8 does something similar in spirit but ASCII compatible and capable of fast synchronization if data is corrupted or truncated.

spankalee 7 hours ago

Wouldn't you only need to read backwards at most 3 bytes to see if you were currently at a continuation byte? With a max multi-byte size of 4 bytes, if you don't see a multi-byte start character by then you would know it's a single-byte char.

I wonder if a reason is similar though: error recovery when working with libraries that aren't UTF-8 aware. If you slice naively slice an array of UTF-8 bytes, a UTf-8 aware library can ignore malformed leading and trailing bytes and get some reasonable string out of it.

  • Sharlin 3 hours ago

    It’s not always possible to read backwards.

jridgewell 5 hours ago

This isn't quite right. In invalid UTF8, a continuation byte can also emit a replacement char if it's the start of the byte sequence. Eg, `0b01100001 0b10000000 0b01100001` outputs 3 chars: a�a. Whether you're at the beginning of an output char depends on the last 1-3 bytes.

  • rockwotj 5 hours ago

    > outputs 3 chars

    You mean codepoints or maybe grapheme clusters?

    Anyways yeah it’s a little more complicated but the principle of being able to truncate a string without splitting a codepoint in O(1) is still useful

    • jridgewell 5 hours ago

      Yah, I was using char interchangeably with code point. I also used byte instead of code unit.

      > truncate a string without splitting a codepoint in O(1) is still useful

      Agreed!

jancsika 7 hours ago

> Having the continuation bytes always start with the bits `10` also make it possible to seek to any random byte, and trivially know if you're at the beginning of a character or at a continuation byte like you mentioned, so you can easily find the beginning of the next or previous character.

Given four byte maximum, it's a similarly trivial algo for the other case you mention.

The main difference I see is that UTF8 increases the chance of catching and flagging an error in the stream. E.g., any non-ASCII byte that is missing from the stream is highly likely to cause an invalid sequence. Whereas with the other case you mention the continuation bytes would cause silent errors (since an ASCII character would be indecipherable from continuation bytes).

Encoding gurus-- am I right?

procaryote 8 hours ago

also, the redundancy means that you get a pretty good heuristic for "is this utf-8". Random data or other encodings are pretty unlikely to also be valid utf-8, at least for non-tiny strings

1oooqooq 9 hours ago

so you replace one costly sweeping with a costly sweeping. i wouldn't call that an advantage in any way over junping n bytes.

what you describe is the bare minimum so you even know what you are searching for while you scan pretty much everything every time.

  • hk__2 9 hours ago

    What do you mean? What would you suggest instead? Fixed-length encoding? It would take a looot of space given all the character variations you can have.

    • gertop 8 hours ago

      UTF-16 is both simpler to parse and more compact than utf-8 when writing non-english characters.

      UTF-8 didn't win on technical merits, it won becausw it was mostly backwards compatible with all American software that previously used ASCII only.

      When you leave the anglosphere you'll find that some languages still default to other encodings due to how large utf-8 ends up for them (Chinese and Japanese, to name two).

      • jcranmer 6 hours ago

        > UTF-16 is both simpler to parse and more compact than utf-8 when writing non-english characters.

        UTF-8 and UTF-16 take the same number of characters to encode a non-BMP character or a character in the range U+0080-U+07FF (which includes most of the Latin supplements, Greek, Cyrillic, Arabic, Hebrew, Aramaic, Syriac, and Thaana). For ASCII characters--which includes most whitespace and punctuation--UTF-8 takes half as much space as UTF-16, while characters in the range U+0800-U+FFFF, UTF-8 takes 50% more space than UTF-16. Thus, for most European languages, and even Arabic (which ain't European), UTF-8 is going to be more compact than UTF-16.

        The Asian languages (CJK-based languages, Indic languages, and South-East Asian, largely) are the ones that are more compact in UTF-16 than UTF-8, but if you embed those languages in a context likely to have significant ASCII content--such as an HTML file--well, it turns out the UTF-8 still wins out!

        > When you leave the anglosphere you'll find that some languages still default to other encodings due to how large utf-8 ends up for them (Chinese and Japanese, to name two).

        You'll notice that the encodings that are used are not UTF-16 either. Also, my understanding is that China generally defaults to UTF-8 nowadays despite a government mandate to use GB18030 instead, so it's largely Japan that is the last redoubt of the anti-Unicode club.

      • ISV_Damocles 8 hours ago

        UTF-16 is also just as complicated as UTF-8 requiring multibyte characters to cover the entirety of Unicode, so it doesn't avoid the issue you're complaining about for the newest languages added, and it has the added complexity of a BOM being required to be sure you have the pairs of bytes in the right order, so you are more vulnerable to truncated data being unrecoverable versus UTF-8.

        UTF-32 would be a fair comparison, but it is 4 bytes per character and I don't know what, if anything, uses it.

      • adgjlsfhk1 8 hours ago

        With BOM issues, UTF-16 is way more complicated. For Chinese and Japenese, UTF8 is a maximum of 50% bigger, but can actually end up smaller if used within standard file formats like JSON/HTML since all the formatting characters and spaces are single bytes.

      • cyphar 7 hours ago

        UTF-16 is absolutely not easier to work with. The vast majority of bugs I remember having to fix that were directly related to encoding were related to surrogate pairs. I suspect most programs do not handle them correctly because they come up so rarely but the bugs you see are always awful. UTF-8 doesn't have this problem and I think that's enough reason to avoid UTF-16 (though "good enough" compatibility with programs that only understand 8-bit-clean ASCII is an even better practical reason). Byte ordering is also a pernicious problem (with failure modes like "all of my documents are garbled") that UTF-8 also completely avoids.

        It is 33% more compact for most (but not all) CJK characters, but that's not the case for all non-English characters. However, one important thing to remember is that most computer-based documents contain large amounts of ASCII text purely because the formats themselves use English text and ASCII punctuation. I suspect that most UTF-8 files with CJK contents are much smaller than UTF-16 files, but I'd be interested in an actual analysis from different file formats.

        The size argument (along with a lot of understandable contention around UniHan) is one of the reasons why UTF-8 adoption was slower in Japan and Shift-JIS is not completely dead (though mainly for esoteric historical reasons like the 漢検 test rather than active or intentional usage) but this is quite old history at this point. UTF-8 now makes up 99% of web pages.

        • cyphar 5 hours ago

          I went through a Japanese ePUB novel I happened to have on hand (the Japanese translation of 1984) and 65% of the bytes are ASCII bytes. So in this case UTF-16 would end up resulting in something like 53% more bytes (going by napkin math).

          You could argue that because it will be compressed (and UTF-16 wastes a whole NUL byte for all ASCII) that the total file-size for the compressed version would be better (precisely because there are so many wasted bytes) but there are plenty of examples where files aren't compressed and most systems don't have compressed memory so you will pay the cost somewhere.

          But in the interest of transparency, a very crude test of the same ePUB yields a 10% smaller file with UTF-16. I think a 10% size penalty (in a very favourable scenario for UTF-16) in exchange for all of the benefits of UTF-8 is more than an acceptable tradeoff, and the incredibly wide proliferation of UTF-8 implies most people seem to agree.

      • syncsynchalt 7 hours ago

        UTF-16 has endian concerns and surrogates.

        Both UTF-8 and UTF-16 have negatives but I don't think UTF-16 comes out ahead.

        • Mikhail_Edoshin 9 minutes ago

          Here is what an UTF-8 decoder needs to handle:

          1. Invalid bytes. Some bytes cannot appear in an UTF-8 string at all. There are two ranges of these.

          2. Conditionally invalid continuation bytes. In some states you read a continuation byte and extract the data, but in some other cases the valid range of the first continuation byte is further restricted.

          3. Surrogates. They cannot appear in a valid UTF-8 string, so if they do, this is an error and you need to mark it so. Or maybe process them as in CESU but this means to make sure they a correctly paired. Or maybe process them as in WTF-8, read and let go.

          4. Form issues: an incomplete sequence or a continuation byte without a starting byte.

          It is much more complicated than UTF-16. UTF-16 only has surrogates that are pretty straightforward.

      • kbolino 7 hours ago

        Thanks to UTF-16, which came out after UTF-8, there are 2048 wasted 3-byte sequences in UTF-8.

        And unlike the short-sighted authors of the first version of Unicode, who thought the whole world's writing systems could fit in just 65,536 distinct values, the authors of UTF-8 made it possible to encode up to 2 billion distinct values in the original design.

      • kccqzy 6 hours ago

        Two decades ago the typical simplified Chinese website did in fact use GB2312 and not UTF-8; traditional Chinese website used Big5; Japanese sites used Shift JIS. These days that's not true at all. Your comment is twenty years out of date.

      • Iwan-Zotow 3 hours ago

        There are no sane Chinese Japanese people who uses old encodings. None

thesz 9 hours ago

> so you can easily find the beginning of the next or previous character.

It is not true [1]. While it is not UTF-8 problem per se, it is a problem of how UTF-8 is being used.

[1] https://paulbutler.org/2025/smuggling-arbitrary-data-through...