Comment by Ygg2

Comment by Ygg2 4 days ago

12 replies

Theoretically yes. Practically there is character escaping.

That kills any non-allocation dreams. The moment you have "Hi \uxxxx isn't the UTF nice?" you will probably have to allocate. If the source is read-only, you have to allocate. If the source is mutable, you have to waste CPU rewriting the string.
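The trade-off described here is often expressed as "borrow when nothing is escaped, allocate otherwise". The following is a minimal sketch of that idea, not taken from any particular parser: `unescape` is a hypothetical helper, and it assumes JSON-style escapes while handling only `\n`, `\t`, `\\`, `\"` and non-surrogate `\uXXXX`.

```rust
use std::borrow::Cow;

/// Hypothetical helper: decodes the body of a JSON-like string (the text between the quotes).
/// Sketch only: supports `\n`, `\t`, `\\`, `\"` and non-surrogate `\uXXXX`.
fn unescape(raw: &str) -> Result<Cow<'_, str>, String> {
    // Fast path: without a backslash the input can be borrowed as-is, no allocation.
    if !raw.contains('\\') {
        return Ok(Cow::Borrowed(raw));
    }
    // Slow path: the decoded text differs from the input, so it needs its own storage.
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('\\') => out.push('\\'),
            Some('"') => out.push('"'),
            Some('u') => {
                // Take the four hex digits that follow `\u`.
                let hex: String = chars.by_ref().take(4).collect();
                if hex.len() != 4 || !hex.chars().all(|h| h.is_ascii_hexdigit()) {
                    return Err(format!("bad \\u escape: {:?}", hex));
                }
                let code = u32::from_str_radix(&hex, 16).map_err(|e| e.to_string())?;
                let ch = char::from_u32(code)
                    .ok_or_else(|| "surrogate halves are not handled in this sketch".to_string())?;
                out.push(ch);
            }
            other => return Err(format!("unsupported escape: {:?}", other)),
        }
    }
    Ok(Cow::Owned(out))
}
```

With this shape, a caller only pays for an allocation on inputs that actually contain a backslash; whether that counts as "zero allocation" depends on how common escapes are in the data.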

deaddodo 4 days ago

I'm confused why this would be a problem. UTF-8 and UTF-16 (the only two common Unicode encodings) are a maximum of 4 bytes wide per character (and, most commonly, 1 or 2 in English text). The ASCII representation you gave is 6 bytes wide. I don't know of many ASCII escape representations that have a smaller byte width than the native Unicode encoding.

Same goes for other characters such as \n, \0, \t, \r, etc. All are half the size in their native byte representation.
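The size argument can be checked with a little arithmetic. The sketch below assumes JSON-style `\uXXXX` escapes, where a character outside the Basic Multilingual Plane is written as a surrogate pair (two escapes, 12 bytes); `escaped_vs_utf8` is a hypothetical name used only for illustration.

```rust
/// Hypothetical helper: returns (bytes in `\uXXXX` escape form, bytes in UTF-8)
/// for a single character, assuming JSON-style escapes.
fn escaped_vs_utf8(c: char) -> (usize, usize) {
    // One UTF-16 code unit needs one 6-byte `\uXXXX` escape;
    // a surrogate pair needs two of them (12 bytes).
    let escaped = c.len_utf16() * 6;
    (escaped, c.len_utf8())
}
```

Under these assumptions the escaped form is always at least twice as long as the decoded UTF-8: 6 versus 1 to 3 bytes for BMP characters, and 12 versus 4 bytes beyond the BMP. The two-byte escapes such as `\n` likewise decode to a single byte.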

lelanthran 4 days ago

> Moment you have "Hi \uxxxx isn't the UTF nice?" you will probably have to allocate.

Depends on what you are doing with it. If you aren't displaying it (and typically you are not in a server application), you don't need to unescape it.

  • mpyne 3 days ago

    And this is indeed something that the C++ Glaze library supports, to allow for parsing into a string_view pointing into the original input buffer.
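Leaving the escapes untouched means the parser only has to find where the string ends. The sketch below shows that idea with a Rust `&str` slice standing in for a C++ `string_view`; it is not Glaze's API, and `raw_string` is a hypothetical helper that assumes the input starts at the opening quote.

```rust
/// Hypothetical helper: given input that starts at an opening quote, returns the
/// still-escaped string body as a slice of the input, plus whatever follows the
/// closing quote. Nothing is copied or decoded.
fn raw_string(input: &str) -> Option<(&str, &str)> {
    let body = input.strip_prefix('"')?;
    let bytes = body.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            // Skip the byte after a backslash so that `\"` does not end the string.
            b'\\' => i += 2,
            // An unescaped quote closes the string; both slices borrow from `input`.
            b'"' => return Some((&body[..i], &body[i + 1..])),
            _ => i += 1,
        }
    }
    None // unterminated string
}
```

The returned body still contains sequences such as `\u00e9` verbatim, so this only helps when the consumer can work with the escaped form, for example when forwarding or comparing values rather than displaying them.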

_3u10 3 days ago

It’s just two pointers: the current place to write and the current place to read. Escapes are always more characters than they represent, so there’s no danger of the write pointer overtaking the read pointer. If you support compression this can become somewhat of an issue, but you simply support a max block size, which is usually defined by the compression algorithm anyway.
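A minimal sketch of the two-pointer approach, assuming the caller already owns a mutable byte buffer holding the escaped string body. `unescape_in_place` is a hypothetical helper; it handles only `\n`, `\t`, `\\`, `\"` and non-surrogate `\uXXXX`, and it returns `None` on anything else.

```rust
/// Hypothetical helper: decodes escapes inside `buf` itself and returns the
/// length of the decoded prefix. Sketch only: no surrogate pairs.
fn unescape_in_place(buf: &mut [u8]) -> Option<usize> {
    let (mut read, mut write) = (0, 0);
    while read < buf.len() {
        if buf[read] != b'\\' {
            // Ordinary byte: copy it down to the write position.
            buf[write] = buf[read];
            read += 1;
            write += 1;
            continue;
        }
        let kind = *buf.get(read + 1)?;
        match kind {
            b'n' | b't' | b'\\' | b'"' => {
                // Two input bytes become one output byte.
                buf[write] = match kind {
                    b'n' => b'\n',
                    b't' => b'\t',
                    other => other,
                };
                read += 2;
                write += 1;
            }
            b'u' => {
                let hex = std::str::from_utf8(buf.get(read + 2..read + 6)?).ok()?;
                if !hex.bytes().all(|h| h.is_ascii_hexdigit()) {
                    return None;
                }
                let code = u32::from_str_radix(hex, 16).ok()?;
                let ch = char::from_u32(code)?; // rejects surrogate halves
                // A 6-byte escape becomes at most 3 UTF-8 bytes here,
                // so `write` never passes `read`.
                let mut tmp = [0u8; 4];
                let encoded = ch.encode_utf8(&mut tmp).as_bytes();
                buf[write..write + encoded.len()].copy_from_slice(encoded);
                read += 6;
                write += encoded.len();
            }
            _ => return None,
        }
    }
    Some(write)
}
```

The function itself allocates nothing, but it relies on the buffer being writable and on the whole string being present in it; a read-only source or an escape split across two read chunks would need extra handling that this sketch does not attempt.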

  • Ygg2 3 days ago

    If you have a place to write, then it's not zero allocation. You did an allocation.

    And usually if you want maximum performance, buffered read is the way to go, which means you need a write slab allocation.

    • lelanthran 3 days ago

      > If you have a place to write, then it's not zero allocation. You did an allocation.

      Where did that allocation happen? You can write into the buffer you're reading from, because the replacement data is shorter than the original data.

      • Ygg2 2 days ago

        You have a read buffer and somewhere where you have to write to.

        Even if we pretend that the read buffer is not allocating (plausible), you will have to allocate for the write destination in the general case (think GiB or TiB of XML or JSON).

topspin 3 days ago

> Practically there is character escaping

The voice of experience appears. Upvoted.

It is conceivable to deal with escaping in-place, and thus remain zero-alloc. It's hideous to think about, but I'll bet someone has done it. Dreams are powerful things.