Comment by ComputerGuru 3 days ago

18 replies

Rust is missing an abstraction over non-contiguous chunks of contiguous allocations of data that would make handling ropes seamless and more natural even for smaller sizes.

C# has the concept of “Sequences”, which is basically a generalization of a deque, with associated classes and APIs such as ReadOnlySequence and SequenceReader to encourage reduced allocations, reuse of existing buffers/slices even for composition, etc.

Knowing the rust community, I wouldn’t be surprised if there’s already an RFC for something like this.

gpm 3 days ago

I think you might be looking for the bytes crate, which is pretty widely used in networking code: https://docs.rs/bytes/latest/bytes/index.html

In general this sort of structure is the sort of thing I'd expect to see in an external crate in rust, not the standard library. So it's unlikely there's any RFCs, and more likely there's a few competing implementations lying around.

  • zamalek 3 days ago

    Bytes is essentially multiple slices over an optimistically single contiguous Arc buffer. It's basically the inverse of what the root comment is after (an array of buffers). It's a rather strange crate because network IO doesn't actually need contiguous memory.

    std does actually have a vague version of what the root comment wants: https://doc.rust-lang.org/std/io/struct.IoSlice.html and its sibling IoSliceMut (slicing, appending, inserting, etc. is out of scope for both - so not usable for rope stuff)
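
A minimal sketch of what `IoSlice` does provide: gathering several separately allocated chunks into one vectored write without first copying them into a single buffer. The helper name `write_chunks` is hypothetical; `IoSlice::new` and `Write::write_vectored` are the std APIs. Since `write_vectored` may write fewer bytes than requested, the sketch loops until everything is out.

```rust
use std::io::{self, IoSlice, Write};

/// Hypothetical helper: writes all `chunks` to `out` using vectored writes,
/// retrying after partial writes.
fn write_chunks<W: Write>(out: &mut W, chunks: &[&[u8]]) -> io::Result<usize> {
    let mut total = 0;
    let mut idx = 0; // index of the first chunk not fully written
    let mut offset = 0; // bytes of chunks[idx] already written
    loop {
        // Skip chunks that are empty or already fully written.
        while idx < chunks.len() && offset == chunks[idx].len() {
            idx += 1;
            offset = 0;
        }
        if idx == chunks.len() {
            return Ok(total);
        }
        // Build the slice list: the rest of the current chunk, then the others.
        let mut slices = vec![IoSlice::new(&chunks[idx][offset..])];
        slices.extend(chunks[idx + 1..].iter().map(|c| IoSlice::new(c)));
        let written = out.write_vectored(&slices)?;
        if written == 0 {
            return Err(io::ErrorKind::WriteZero.into());
        }
        total += written;
        // Advance (idx, offset) past the bytes that were accepted.
        let mut left = written;
        while left > 0 {
            let step = left.min(chunks[idx].len() - offset);
            offset += step;
            left -= step;
            if offset == chunks[idx].len() && left > 0 {
                idx += 1;
                offset = 0;
            }
        }
    }
}

fn main() -> io::Result<()> {
    let mut out = Vec::new();
    let n = write_chunks(&mut out, &["hello, ".as_bytes(), "world".as_bytes()])?;
    println!("{} bytes: {}", n, String::from_utf8_lossy(&out));
    Ok(())
}
```

Note that this only covers submission; as the comment says, slicing, appending and inserting are not part of `IoSlice`.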

    • Arnavion 3 days ago

      The bytes crate does support what ComputerGuru asked for via the Buf trait. The trait can be implemented over a sequence of buffers but still provides functions that are common with single buffers. For example the hyper crate uses the trait in exactly this way - it has an internal type that is a VecDeque of chunks but also implements the Buf trait.

      https://docs.rs/bytes/1.9.0/bytes/buf/trait.Buf.html

      https://github.com/hyperium/hyper/blob/3817a79b213f840302d7e...
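
A rough, std-only sketch of the idea described above: a queue of chunks that exposes the same three core operations the `Buf` trait requires (`remaining`, `chunk`, `advance`). The type `ChunkQueue` is hypothetical and is not hyper's internal type; it uses inherent methods rather than implementing `bytes::Buf` so that it has no external dependencies.

```rust
use std::collections::VecDeque;

/// Hypothetical buffer made of several non-contiguous chunks.
struct ChunkQueue {
    chunks: VecDeque<Vec<u8>>,
    pos: usize, // read offset into the front chunk
}

impl ChunkQueue {
    fn new() -> Self {
        ChunkQueue { chunks: VecDeque::new(), pos: 0 }
    }

    /// Appends a chunk without copying its bytes. Empty chunks are dropped
    /// so that `chunk()` is non-empty whenever data remains.
    fn push(&mut self, chunk: Vec<u8>) {
        if !chunk.is_empty() {
            self.chunks.push_back(chunk);
        }
    }

    /// Total unread bytes across all chunks.
    fn remaining(&self) -> usize {
        self.chunks.iter().map(Vec::len).sum::<usize>() - self.pos
    }

    /// The next contiguous run of unread bytes (may be shorter than `remaining`).
    fn chunk(&self) -> &[u8] {
        match self.chunks.front() {
            Some(c) => &c[self.pos..],
            None => &[],
        }
    }

    /// Marks `cnt` bytes as consumed, crossing chunk boundaries as needed.
    fn advance(&mut self, mut cnt: usize) {
        assert!(cnt <= self.remaining(), "advance past end of buffer");
        while cnt > 0 {
            let left = self.chunks[0].len() - self.pos;
            if cnt < left {
                self.pos += cnt;
                return;
            }
            cnt -= left;
            self.chunks.pop_front();
            self.pos = 0;
        }
    }

    /// Single-buffer style convenience built only on the three methods above.
    fn copy_to_slice(&mut self, dst: &mut [u8]) {
        assert!(dst.len() <= self.remaining(), "not enough data");
        let mut filled = 0;
        while filled < dst.len() {
            let src = self.chunk();
            let n = src.len().min(dst.len() - filled);
            dst[filled..filled + n].copy_from_slice(&src[..n]);
            filled += n;
            self.advance(n);
        }
    }
}

fn main() {
    let mut q = ChunkQueue::new();
    q.push(b"ab".to_vec());
    q.push(b"cde".to_vec());
    let mut head = [0u8; 3];
    q.copy_to_slice(&mut head); // reads across the chunk boundary
    println!("{:?} then {:?}", head, q.chunk());
}
```

The point of the design is that callers use `copy_to_slice` and similar helpers as if there were one buffer, while the chunks themselves are never merged.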

    • derefr 3 days ago

      > It's a rather strange crate because network IO doesn't actually need contiguous memory.

      Network IO doesn't need contiguous memory, no, but each side of the duplex kind of benefits from it in its own way:

      1. on receive, you can treat a contiguous received network datagram as its own little memory arena — write code that sends sliced references to the contents of the datagram to other threads to work with, where those references keep the datagram arena itself alive for as long as it's being worked with; and then drop the whole thing when the handling of the datagram is complete.

      (This is somewhat akin to the Erlang approach — where the received message is a globally-shared binary; it gets passed by refcount into an actor started just for handling that request; that actor is spawned with its own preallocated memory arena; into that arena, the actor spits any temporaries related to copying/munging the slices of the shared binary, without having to grow the arena; the actor quickly finishes and dies; the arena is deallocated without ever having had to GC, and the refcount of the shared binary goes to zero — unless non-copied slices of it were async-forwarded to other actors for further processing.)

      Also note that the whole premise here is zero-copy networking (as the bytes docs say: https://docs.rs/bytes/1.9.0/bytes/#bytes). The "message" being received here isn't a copy of the one from the network card, but literally the same physical wired memory the PHY sees as being part of its IO ring-buffer — just also mapped into your process's memory on (zero-copy) receive. If this data came chunked, you'd need to copy some of it to assemble those chunks into a contiguous string or data structure. But since it arrives contiguously, you can just slice it, and cast the resulting slice into whatever type you like.

      2. on send — presuming you're doing non-blocking IO — it's nice to once again have a preallocated arena into which you can write out byte-sequences before flinging them at the kernel as [vectors of] large, contiguous DMA requests, without having to stop to allocate. (This removes the CPU as a bottleneck from IO performance — think writev(2).)

      The ideal design here is that you allocate fixed-sized refcounted buffers; fill them up until the next thing you want to write doesn't fit†; and then intentionally drop the current buffer, switching your write_arena reference to point to a freshly-allocated buffer; and repeating. Each buffer then lives until all its slice-references get consumed. This forms kind of a "memory-lifetime-managed buffer-persisted message queue" — with the backing buffers of your messages living until all the messages held in them get "ACKed" [i.e. dropped by the receiving threads.]

      Also, rather than having the buffers deallocate when you "use them up" — requiring you to allocate the next time you need a buffer — you can instead have the buffer's destructor release the memory it's holding into a buffer pool; and then have your next-buffer-please logic pull from that pool in preference to allocating. But then you'll want a higher-level "writable stream that is actually a mempool + current write_arena reference" type. (Hey, that's BufMut!)

      † And at that point, when the next message doesn't fit, you do not split the message. That violates the whole premise of vectorizing the writes. Instead, you leave some of the buffer unused, and push the large message into a fresh buffer, so that the message will still correspond to a single vectorized-write element / io_uring call / DMA request / etc. If the message is so large it won't fit in your default buffer size, you allocate a buffer just for that one message, or better yet, you utilize a special second pool of larger fixed-size buffers. "Jumbo" buffers, so to speak.

      (Get it yet? Networking hardware is also doing exactly what I'm describing here to pack and unpack your packets into frames. For a NIC or switch, the buffers are the [bodies of the] frames; a jumbo buffer is an Ethernet jumbo frame; and so on.)
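
The send-side scheme above (fill a fixed-size buffer, never split a message, rotate to a fresh buffer when the next message does not fit) could be sketched roughly as follows. `WriteArena` and `MsgRef` are hypothetical names. Refcounting, buffer pooling and the separate jumbo pool are left out to keep the sketch short; an oversized message simply gets a dedicated buffer.

```rust
/// Hypothetical handle to one message: which buffer it lives in, and where.
#[derive(Debug, PartialEq, Clone, Copy)]
struct MsgRef {
    buf: usize,
    start: usize,
    len: usize,
}

/// Hypothetical write arena built from fixed-size buffers.
struct WriteArena {
    buf_size: usize,
    buffers: Vec<Vec<u8>>,
    current: usize, // index of the buffer currently being filled
}

impl WriteArena {
    fn new(buf_size: usize) -> Self {
        WriteArena {
            buf_size,
            buffers: vec![Vec::with_capacity(buf_size)],
            current: 0,
        }
    }

    /// Stores `msg` contiguously and returns a handle to it.
    fn write(&mut self, msg: &[u8]) -> MsgRef {
        // Oversized message: give it a buffer of its own (assumption: a real
        // implementation would draw from a pool of larger buffers instead).
        if msg.len() > self.buf_size {
            self.buffers.push(msg.to_vec());
            return MsgRef { buf: self.buffers.len() - 1, start: 0, len: msg.len() };
        }
        // Does not fit in the rest of the current buffer: leave the tail
        // unused and rotate to a fresh buffer rather than splitting.
        if self.buffers[self.current].len() + msg.len() > self.buf_size {
            self.buffers.push(Vec::with_capacity(self.buf_size));
            self.current = self.buffers.len() - 1;
        }
        let start = self.buffers[self.current].len();
        self.buffers[self.current].extend_from_slice(msg);
        MsgRef { buf: self.current, start, len: msg.len() }
    }

    /// Resolves a handle back to the message bytes.
    fn get(&self, m: &MsgRef) -> &[u8] {
        &self.buffers[m.buf][m.start..m.start + m.len]
    }
}

fn main() {
    let mut arena = WriteArena::new(8);
    let a = arena.write(b"abcde");
    let b = arena.write(b"fgh"); // still fits in the first buffer
    let c = arena.write(b"ij"); // does not fit: goes to a fresh buffer
    println!("{:?} {:?} {:?}", a, b, c);
    println!("{:?}", arena.get(&c));
}
```

Because every message is one contiguous range of one buffer, each handle maps to exactly one element of a vectored write.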

      • zamalek 3 days ago

        > Get it yet

        I'm not sure if your comment was meant to be condescending, but it really does come across as that. I'm very well versed in this domain.

        Having a per-request/connection arena isn't the only option. What I have seen/use, which is still zero copy (as far as IO zero copy can be in Rust without resorting to bytemuck/blittable types), is to have a pool of buffers of a specific length - typically page-sized by default and definitely page-aligned. These buffers can come from a single large contiguous allocation. If you run out of space in a buffer you grab a new/reused one from the pool, add it to your vec of buffers, and carry on. At the end of the story you would use vectored IO to submit all of them at once - all the way down to the NIC and everything.

        This approach is more widespread mainly due to historical reasons: it's really easy to fragment 32bit address space, so allocating jumbo buffers simply wasn't an option if you didn't want your server OOMing with 1GB of available (but non-contiguous) memory.

        https://man7.org/linux/man-pages/man3/iovec.3type.html

        https://learn.microsoft.com/en-us/windows/win32/api/ws2def/n...
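
A rough sketch of the pooled approach described in this comment: fixed-length buffers leased from a pool, data spilling into the next buffer when one fills up, and a single vectored write at the end. `Pool` and `Outgoing` are hypothetical names. Page alignment and carving the buffers out of one large allocation are not modeled here; each buffer is an ordinary `Vec<u8>`.

```rust
use std::io::{self, IoSlice, Write};

/// Hypothetical pool of fixed-length buffers that are reused after release.
struct Pool {
    buf_size: usize,
    free: Vec<Vec<u8>>,
}

impl Pool {
    fn new(buf_size: usize) -> Self {
        Pool { buf_size, free: Vec::new() }
    }

    /// Reuses a released buffer if one exists, otherwise allocates.
    fn lease(&mut self) -> Vec<u8> {
        let size = self.buf_size;
        self.free.pop().unwrap_or_else(|| Vec::with_capacity(size))
    }

    /// Returns a buffer to the pool, keeping its allocation.
    fn give_back(&mut self, mut buf: Vec<u8>) {
        buf.clear();
        self.free.push(buf);
    }
}

/// Hypothetical outgoing message: a vec of leased buffers.
struct Outgoing {
    bufs: Vec<Vec<u8>>,
}

impl Outgoing {
    /// Appends `data`, grabbing another buffer whenever the last one is full.
    fn append(&mut self, pool: &mut Pool, mut data: &[u8]) {
        while !data.is_empty() {
            let need_new = self.bufs.last().map_or(true, |b| b.len() == pool.buf_size);
            if need_new {
                self.bufs.push(pool.lease());
            }
            let buf = self.bufs.last_mut().unwrap();
            let n = data.len().min(pool.buf_size - buf.len());
            buf.extend_from_slice(&data[..n]);
            data = &data[n..];
        }
    }

    /// One `IoSlice` per buffer, ready for a vectored write.
    fn io_slices(&self) -> Vec<IoSlice<'_>> {
        self.bufs.iter().map(|b| IoSlice::new(b)).collect()
    }
}

fn main() -> io::Result<()> {
    let mut pool = Pool::new(4); // tiny size so the spill-over is visible
    let mut msg = Outgoing { bufs: Vec::new() };
    msg.append(&mut pool, b"abcdefghij");

    let mut sink = Vec::new(); // stands in for a socket
    // A real socket may accept fewer bytes; production code would retry.
    let n = sink.write_vectored(&msg.io_slices())?;
    println!("{} bytes in {} buffers", n, msg.bufs.len());

    for buf in msg.bufs.drain(..) {
        pool.give_back(buf);
    }
    Ok(())
}
```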

        • derefr 3 days ago

          > I'm very well versed in this domain.

          Apologies, I wasn't really responding to you directly; I was just taking the opportunity to write an educational-blog-post-as-comment aimed at the average HN reader (who has likely never considered what an Ethernet frame even is, or how a device that uses what are essentially DSPs does TDM packet scheduling) — with your comment being the parent because it's the necessary prerequisite reading to motivate the lesson.

          > Having a per-request/connection arena isn't the only option. What I have seen/use, which is still zero copy (as far as IO zero copy can be in Rust without resorting to bytemuck/blittable types), is to have a pool of buffers of a specific length - typically page-sized by default and definitely page-aligned. These buffers can come from a single large contiguous allocation. If you run out of space in a buffer you grab a new/reused one from the pool, add it to your vec of buffers, and carry on. At the end of the story you would use vectored IO to submit all of them at once - all the way down to the NIC and everything.

          I think you're focusing too much on the word "arena" here, because AFAICT we're both describing the same concept.

          In your model (closer to the one used in actual switching), there's a single global buffer pool that all concurrent requests lease from; in my model, there's global heap memory, and then a per-thread/actor/buf-object elastic buffer pool that allocates from the global heap every once in a while, but otherwise reuses buffers internally.

          I would say that your model is probably the one used in most zero-copy networking frameworks like DPDK, while my model is probably the one used in most language runtimes, especially managed + garbage-collected runtimes, where contending over a global language-exposed pool can be more expensive than "allocating" (especially when the runtime has its own buffer pool and "allocation" rarely hits the kernel).

          But both models are essentially the same from the perspective of someone using the buffer ADT and trying to understand why it's designed the way it is, what it gets them, etc. :)

          > it's really easy to fragment 32bit address space, so allocating jumbo buffers simply wasn't an option if you didn't want your server OOMing with 1GB of available (but non-contiguous) memory.

          Maybe you're imagining something else here, but when I say "jumbo buffer", I don't mean custom buffers allocated on demand and right-sized to hold one message; rather, I'm speaking of something very closely resembling actual jumbo frames — i.e. another pre-allocated pool containing a smaller number of larger, fixed-size MTU-slot buffers.

          With this kind of jumbo-buffer-pool, when your messages get big, you switch over from filling regular buffers to filling jumbo buffers — which holds off message fragmentation, but also means new messages go "out the door" a bit slower, maybe "platoon" a bit and potentially overwhelm the recipient with each burst, etc (which is why you don't just use the larger buffer pool as the only pool.)

          But if your messages can be bigger than your set jumbo-buffer size, then there's nowhere to go from there; you still need to have a way to split messages across frames.

          (Luckily, in the case of `bytes`, splitting a message across frames just means the message now needs multiple iovec-list entries to submit, rather than implying a framing protocol / L2 message encoding with a continuation marker / sequence ID / etc.)

      • BeeOnRope 3 days ago

        How does the bytes crate, or anyone else, offer zero-copy receive from kernel (as opposed to kernel-bypass) sockets?

        As far as I know that is not possible: there's always a copy.

    • cmrdporcupine 3 days ago

      Yah, I'd say Bytes' chief use is avoiding copies when dealing with distinct portions of (contiguous) buffers.

      It is not a tool for composing disparate pieces into one (while avoiding copies)
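
A minimal std-only sketch of that use case: several cheap views into one contiguous, reference-counted buffer, where the buffer stays alive until the last view is dropped. `SharedSlice` is a hypothetical type meant only to illustrate the idea; it is not how the `bytes` crate is implemented.

```rust
use std::ops::Range;
use std::sync::Arc;
use std::thread;

/// Hypothetical refcounted view into a shared contiguous buffer.
#[derive(Clone)]
struct SharedSlice {
    buf: Arc<[u8]>,
    range: Range<usize>,
}

impl SharedSlice {
    /// Takes ownership of the data. Moving a Vec into an Arc<[u8]> copies
    /// once; every slice taken afterwards is copy-free.
    fn new(data: Vec<u8>) -> Self {
        let buf: Arc<[u8]> = data.into();
        let range = 0..buf.len();
        SharedSlice { buf, range }
    }

    /// Returns a sub-view (relative to this view) sharing the same buffer.
    fn slice(&self, r: Range<usize>) -> Self {
        assert!(r.start <= r.end && r.end <= self.range.len(), "out of range");
        let start = self.range.start + r.start;
        SharedSlice {
            buf: Arc::clone(&self.buf),
            range: start..start + (r.end - r.start),
        }
    }

    fn as_bytes(&self) -> &[u8] {
        &self.buf[self.range.clone()]
    }
}

fn main() {
    let datagram = SharedSlice::new(b"HDRpayload".to_vec());
    let header = datagram.slice(0..3);
    let payload = datagram.slice(3..10);
    drop(datagram); // the views keep the underlying buffer alive

    // Hand one view to another thread; no bytes are copied to do so.
    let worker = thread::spawn(move || payload.as_bytes().len());
    println!("header {:?}, payload len {}", header.as_bytes(), worker.join().unwrap());
}
```

What this does not give you is the reverse operation: joining separately allocated pieces into one logical value without copying.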

caconym_ 3 days ago

I wrote a utf-8 capable (but also fully generic over element type) rope implementation in Rust last fall (edit: 2023) and the main issue I ran into was the lack of a suitable regex library capable of working across slice boundaries. With some finagling I did manage to get it to work with most/all of the other relevant iterator/reader traits IIRC, and it benchmarked fairly well from a practical perspective, though it's not as fast as some of the other explicitly performance-focused implementations out there.

I'm afraid I might not have that much free time again for a long time, but maybe when I do, somebody will have solved the regex issue for me...

deathanatos 3 days ago

Hmm. It's similar to, but not fully, a `BufRead`? Maybe a `BufRead + Seek`. The slicing ability isn't really covered by those traits, but I think you could wrap a BufRead+Seek in something that effectively slices it.

A `BufRead + Seek` need not be backed by memory, though, except in the midst of being read. (A buffered normal file implements `BufRead + Seek`, for example.)
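
One way such a slicing wrapper might look, using only std: seek to the start of the window, then cap the reader with `Read::take`. The function name `window` is hypothetical. `Take<R>` still implements `BufRead` when `R` does, but it does not implement `Seek`, so this sketch gives a forward-only slice.

```rust
use std::io::{self, BufRead, Cursor, Read, Seek, SeekFrom, Take};

/// Hypothetical helper: restricts `reader` to the byte range
/// `start..start + len`.
fn window<R: BufRead + Seek>(mut reader: R, start: u64, len: u64) -> io::Result<Take<R>> {
    reader.seek(SeekFrom::Start(start))?;
    Ok(reader.take(len))
}

fn main() -> io::Result<()> {
    // A Cursor stands in for any BufRead + Seek source, e.g. a buffered file.
    let data = Cursor::new(b"line one\nline two\nline three\n".to_vec());
    let middle = window(data, 9, 9)?;
    for line in middle.lines() {
        println!("{}", line?);
    }
    Ok(())
}
```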

I feel like either Iterator or in some rare case of requiring generic indexing, Index, are more important than "it is composed of some number of linked memory allocations"?

A ReadOnlySequence seems to imply a linked-list of memory sections though; I'm not sure a good rope is going to be able to non-trivially interface with that, since the rope is a tree; walking the nodes in sequence is possible, but it's a tree walk, and something like ReadOnlySequenceSegment::Next() is then a bit tricky. (You could gather the set of nodes into an array ahead of time, but now merely turning it into that is O(nodes) which is sad.)

(And while it might be tempting to say "have the leaf nodes be a LL", I don't think you want to, as it means that inserts need to adjust those links, and I think you would rather have mutations produce a cheaply made but entirely new tree, which I don't think permits a LL of the leafs. You want this to make undo/redo cheap: it's just "go back to the last rope", and then all the ropes share the underlying character data that's not changing rope to rope. The rope in the OP seems to support this: "Cloning ropes is extremely cheap. Rope clones share data,")
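
A toy sketch of the two points above, under simplifying assumptions (no rebalancing, byte offsets that must fall on char boundaries, all names hypothetical, and not the implementation of the rope in the OP). Inserting builds a new tree that shares every untouched subtree with the old one, so "undo" is just keeping the old root; and walking the leaves in order is a tree walk driven by an explicit stack rather than a `Next` pointer.

```rust
use std::rc::Rc;

enum Node {
    Leaf(String),
    Branch { left: Rc<Node>, right: Rc<Node>, len: usize },
}

fn leaf(s: &str) -> Rc<Node> {
    Rc::new(Node::Leaf(s.to_string()))
}

fn node_len(n: &Node) -> usize {
    match n {
        Node::Leaf(s) => s.len(),
        Node::Branch { len, .. } => *len,
    }
}

fn concat(left: Rc<Node>, right: Rc<Node>) -> Rc<Node> {
    let len = node_len(&left) + node_len(&right);
    Rc::new(Node::Branch { left, right, len })
}

/// Returns a new rope with `text` inserted at byte offset `at`.
/// Only the nodes on the path to the edit are rebuilt.
fn insert(node: &Rc<Node>, at: usize, text: &str) -> Rc<Node> {
    match &**node {
        Node::Leaf(s) => {
            let (a, b) = s.split_at(at); // copies only this one leaf
            concat(concat(leaf(a), leaf(text)), leaf(b))
        }
        Node::Branch { left, right, .. } => {
            let l = node_len(left);
            if at <= l {
                concat(insert(left, at, text), Rc::clone(right))
            } else {
                concat(Rc::clone(left), insert(right, at - l, text))
            }
        }
    }
}

/// In-order leaf iterator: the "next segment" step is a tree walk.
struct Leaves<'a> {
    stack: Vec<&'a Node>,
}

impl<'a> Iterator for Leaves<'a> {
    type Item = &'a str;
    fn next(&mut self) -> Option<&'a str> {
        while let Some(n) = self.stack.pop() {
            match n {
                Node::Leaf(s) => return Some(s.as_str()),
                Node::Branch { left, right, .. } => {
                    self.stack.push(&**right);
                    self.stack.push(&**left);
                }
            }
        }
        None
    }
}

fn text(root: &Node) -> String {
    Leaves { stack: vec![root] }.collect()
}

fn main() {
    let v1 = concat(leaf("hello "), leaf("world"));
    let v2 = insert(&v1, 6, "big ");
    println!("{} / {}", text(&v1), text(&v2)); // v1 is the undo state
}
```

With a linked list threaded through the leaves, the insert above would have to patch the neighbours' links, which is what prevents the old and new trees from sharing them.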