Comment by pron 16 hours ago

> The RAM trade off is excellent for normal sizes but if you scale enormously the trade off eventually reverses

I don't think you've watched the talk. The minimum RAM per core is quite high, and that RAM often sits there unused even though it could be used to reduce the usage of the more expensive CPU: you pay for RAM that you could use to reduce CPU utilisation and then don't use it. What you want to aim for is a RAM/CPU usage ratio that matches the RAM/CPU ratio on the machine, as that's what you pay for. Doubling the CPU often doubles your cost, but doubling RAM costs much less than that (5-15%).
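
To make the asymmetry concrete, here's a back-of-the-envelope sketch with made-up prices (the numbers and class name are purely illustrative, not any provider's actual pricing):

    // Hypothetical prices: a core-hour costs ~10x a GB-hour.
    public class RamVsCpuCost {
        public static void main(String[] args) {
            double corePerHour = 0.04;   // made-up $/core-hour
            double gbPerHour   = 0.004;  // made-up $/GB-hour

            double baseline  = 8 * corePerHour + 8 * gbPerHour;   // 8 cores, 8GB
            double doubleCpu = 16 * corePerHour + 8 * gbPerHour;  // doubling cores
            double doubleRam = 8 * corePerHour + 16 * gbPerHour;  // doubling RAM

            System.out.printf("baseline: $%.3f/h%n", baseline);
            System.out.printf("2x CPU:   $%.3f/h (+%.0f%%)%n",
                    doubleCpu, 100 * (doubleCpu / baseline - 1));
            System.out.printf("2x RAM:   $%.3f/h (+%.0f%%)%n",
                    doubleRam, 100 * (doubleRam / baseline - 1));
        }
    }

With these illustrative numbers, doubling the cores roughly doubles the bill while doubling the RAM adds about 9%, which is the kind of 5-15% gap I mean.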

If two implementations of an algorithm use different amounts of memory (assuming they're reasonable implementations), then the one using less memory has to use more CPU (e.g. it could be compressing the memory or freeing and reusing it more frequently). Using more CPU to save on memory that you've already paid for is just wasteful.
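
A generic illustration of that trade (nothing specific to any GC or to the talk; the names here are just for the example): a memoised computation spends RAM to avoid recomputation, and the only way to use less RAM for the same work is to spend CPU recomputing.

    import java.util.HashMap;
    import java.util.Map;

    // Generic RAM-for-CPU trade: the memoised path keeps results alive (more
    // RAM, less CPU); the uncached path saves RAM only by recomputing.
    public class MemoDemo {
        private static final Map<Integer, Long> cache = new HashMap<>();

        // Stand-in for some expensive, pure computation.
        static long expensive(int n) {
            long acc = 0;
            for (int i = 1; i <= 50_000_000; i++) acc += (acc ^ i) % n;
            return acc;
        }

        static long memoised(int n) {
            return cache.computeIfAbsent(n, MemoDemo::expensive); // pay RAM once
        }

        public static void main(String[] args) {
            for (int i = 0; i < 10; i++) memoised(7);   // 1 computation, 9 lookups
            for (int i = 0; i < 10; i++) expensive(7);  // 10 computations, no cache
        }
    }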

Another way to think about it is to consider the extreme case (though it works for any intermediate value) where a program, say a short-running one, uses 100% of the CPU. While that program runs, no other program can use the machine anyway, so if you don't use up to 100% of the machine's RAM to reduce the program's duration, you're wasting it.

As the talk says, it's hard to find less than 1GB per core, so if a program uses computational resources that correspond to a full core yet uses less than 1GB, it's wasteful in the sense that it's spending more of a more expensive resource to save on a less expensive one. The same applies if it uses 50% of a core and less than 500MB of RAM.

Of course, if you're looking at kernels or drivers or VMs or some sorts of agents - things that are effectively pure overhead (rather than direct business value) - then their economics could be different.

> Second thing though: Unpredictability. GC means you can't be sure when reclamation happens.

What you say may have been true with older generations of GCs (or even something like Go's GC, which is basically Java's old CMS, recently removed after two newer GC generations). OpenJDK's current GCs, like ZGC, do zero work in stop-the-world pauses. Their work is more evenly spread out and predictable, and even their latency is more predictable than what you'd get with something like Rust's reference-counting GC. C#'s GC isn't that stellar either, but most important server-side software uses Java, anyway.

The one area where manual memory management still beats the efficiency of a modern tracing GC (although maybe not for long) is when a very regular memory usage pattern can be exploited with arenas, which is another reason why I find Zig particularly interesting - it's most powerful where modern GCs are weakest.
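
Roughly, the arena pattern looks like this - a toy Java sketch of the idea (in Zig you'd reach for std.heap.ArenaAllocator instead; the class and method names here are just for illustration, with no bounds checking):

    import java.nio.ByteBuffer;

    // Toy bump-pointer arena: each allocation is just a pointer bump, and the
    // whole region is reclaimed at once with reset() - no per-object free, no
    // tracing, and very predictable costs.
    final class Arena {
        private final ByteBuffer region;

        Arena(int capacity) {
            this.region = ByteBuffer.allocateDirect(capacity);
        }

        // Hands out a slice of the region; O(1), no per-object bookkeeping.
        ByteBuffer alloc(int size) {
            ByteBuffer slice = region.slice(region.position(), size);
            region.position(region.position() + size);
            return slice;
        }

        // Frees everything allocated so far in one go.
        void reset() {
            region.clear();
        }
    }

    class ArenaDemo {
        public static void main(String[] args) {
            Arena arena = new Arena(1 << 20);            // 1 MiB region
            for (int request = 0; request < 1000; request++) {
                ByteBuffer scratch = arena.alloc(1024);  // per-request scratch space
                scratch.putLong(0, System.nanoTime());   // ... the request's work ...
                arena.reset();                           // drop it all at once
            }
        }
    }

This wins when lifetimes follow a regular pattern (per request, per frame), which is exactly the regular memory usage pattern described above.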

By design or happy accident, Zig is very focused on where the problems are: the biggest security issue for low-level languages is out-of-bounds access, and Zig focuses on that; the biggest shortcoming of modern tracing GCs is arena-like memory usage, and Zig focuses on that. When it comes to the importance of UAF, compilation times, and language complexity, I think the jury is still out, and Rust and Zig obviously make very different tradeoffs here. Zig's bottom-line impact, like that of Rust, may still be too low for widespread adoption, but at least I find it more interesting.

> As I understand it this one is why Microsoft are rewriting Office backend stuff in Rust after writing it originally in C#

The rate at which MS is doing that is nowhere near where it would be if there were some significant economic value. You can compare that to the rate of adoption of other programming languages or even techniques like unit testing or code review. With any new product, you can expect some noise and experimentation, but the adoption of products that offer a big economic value is usually very, very fast, even in programming.

zozbot234 15 hours ago

> What you want to aim for is a RAM/CPU usage that matches the RAM/CPU ratio on the machine, as that's what you pay for.

This totally ignores the role of memory bandwidth, which is often the key bottleneck on multicore workloads. It turns out that using more RAM costs you more CPU, too, because the CPU time is being wasted waiting for DRAM transfers. Manual memory management (augmented with optional reference counting and "borrowed" references - not the pervasive refcounting of Swift, which performs less well than modern tracing GC) still wins unless you're dealing with the messy kind of workload where your reference graphs are totally unpredictable and spaghetti-like. That's the kind of problem that GC was really meant for. It's no coincidence that tracing GC was originally developed in combination with LISP, the language of graph-intensive GOFAI.

  • pron 14 hours ago

    > It turns out that using more RAM costs you more CPU

    Yes, memory bandwidth adds another layer of complication, but it doesn't matter so much once your live set is much larger than your L3 cache. I.e. a 200MB live set and a 100GB live set are likely to require the same bandwidth. Add to that the fact that tracing GCs' compaction can also help (with prefetching) and the situation isn't so clear.

    > That's the kind of problem that GC was really meant for.

    Given the huge strides in tracing GCs over the past ten and even five years, and their incredible performance today, I don't think it matters what those of 40+ years ago were meant for. But I agree there are still some workloads - not just anything that isn't spaghetti-like, but specifically arena-style ones - where manual management is more efficient than tracing GCs (the young gen works a little like an arena, but not quite), which is why GCs are now turning their attention to that kind of workload, too. The point remains that it's very useful to have a memory management approach that can turn the RAM you've already paid for into reduced CPU consumption.

    Indeed, we're not seeing any kind of abandonment of tracing GC at a rate that is even close to suggesting some significant economic value in abandoning them (outside of very RAM-constrained hardware, at least).

    • zozbot234 14 hours ago

      > The point remains that it's very useful to have a memory management approach that can turn the RAM you've already paid for to reduce CPU consumption.

      That approach is specifically arenas: if you can put useful bounds on the maximum size of your "dead" data, it can pay to allocate everything in an arena and free it all in one go. This saves you the memory traffic of both manual management and tracing GC. But coming up with such bounds involves manual choices, of course.

      It goes without saying that memory compaction involves a whole lot of extra traffic on the memory subsystem, so it's unlikely to help when memory bandwidth is the key bottleneck. Your claim that a 200MB working set and a 100GB working set (or, for that matter, a 500MB or 1GB one, which is more in the ballpark of real-world comparisons) are impacted by the memory bottleneck in roughly the same way is also one I have some trouble understanding - especially since you've been arguing for using more memory for the exact same workload.

      Your broader claim wrt. memory makes a whole lot of sense in the context of how to tune an existing tracing GC when that's a forced choice anyway (which, AIUI, is also what the talk is about!) but it just doesn't seem all that relevant to the merits of tracing GC vs. manual memory management.

      > we're not seeing any kind of abandonment of tracing GC at a rate that is even close to suggesting some significant economic value in abandoning them

      We're certainly seeing a lot of "economic value" being put on modern concurrent GCs that can at least perform tolerably well even without a lot of memory headroom. That's how the Golang GC works, after all.

      • pron 14 hours ago

        > It goes without saying that memory compaction involves a whole lot of extra traffic on the memory subsystem

        It doesn't go without saying that compaction involves a lot of memory traffic, because memory is utilised to reduce the frequency of GC cycles and only live objects are copied. The whole point of tracing collection is that extra RAM is used to reduce the total amount of memory management work. If we ignore the old generation (which the talk covers separately), the idea is that you allocate more and more in the young gen, and when it's exhausted you compact only the remaining live objects (which is a constant for the app); the more memory you assign to the young gen, the less frequently you need to do even that work. There is no work for dead objects.
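
        To put made-up but illustrative numbers on that (a toy model, not measurements from any particular GC): with allocation rate A, live set L, and young-gen size Y, a young collection happens roughly every (Y - L) / A seconds and copies only the live L bytes.

            // Back-of-the-envelope model (illustrative numbers, not measurements):
            // copying work per second is about L / ((Y - L) / A) = A * L / (Y - L),
            // so it falls as the young gen grows, and dead objects cost nothing.
            public class YoungGenModel {
                public static void main(String[] args) {
                    double allocRateGBps = 1.0;   // hypothetical allocation rate, GB/s
                    double liveSetGB     = 0.2;   // hypothetical data surviving each cycle

                    for (double youngGenGB : new double[] {0.5, 1, 2, 4, 8}) {
                        double secondsBetweenGCs = (youngGenGB - liveSetGB) / allocRateGBps;
                        double copiedGBps        = liveSetGB / secondsBetweenGCs;
                        System.out.printf("young gen %4.1f GB -> GC every %5.2f s, copies %5.3f GB/s%n",
                                youngGenGB, secondsBetweenGCs, copiedGBps);
                    }
                }
            }

        In this toy model, growing the young gen from 0.5GB to 8GB cuts the copying work per second by more than an order of magnitude for the same allocation rate - the RAM-for-CPU trade in a nutshell.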

        > when it comes to how it's impacted by the memory bottleneck is one that I have some trouble understanding also - especially since you've been arguing for using up more memory for the exact same workload.

        Memory bandwidth - at least as far as latency is concerned - comes into play when you have a cache miss. Once your live set is much bigger than your L3 cache, you get cache misses no matter how you access it. If you have good temporal locality (few cache misses), it doesn't matter how big your live set is, and the same is true if you have bad temporal locality (many cache misses).

        > which, AIUI, is also what the talk is about

        The talk focuses on tracing GCs, but the argument applies equally to manual memory management (as discussed in the Q&A): using less memory for the same algorithm requires more CPU work regardless of whether memory management is manual or automatic.

        > when that's a forced choice

        I don't think tracing GCs are ever a forced choice. They keep getting chosen over and over for heavy workloads on machines with >= 1GB/core because they offer a more attractive tradeoff than other approaches for some of the most popular application domains. There's little reason for that to change unless the economics of DRAM/CPU change significantly.