Comment by pron a day ago

> It goes without saying that memory compaction involves a whole lot of extra traffic on the memory subsystem

It doesn't go without saying that compaction involves a lot of memory traffic, because memory is utilised to reduce the frequency of GC cycles, and only live objects are copied. The whole point of tracing collection is that extra RAM is used to reduce the total amount of memory-management work. If we ignore the old generation (which the talk covers separately), the idea is that you allocate more and more in the young gen, and when it's exhausted you compact only the remaining live objects (whose size is roughly constant for the app); the more memory you assign to the young gen, the less frequently you need to do even that work. There is no work at all for dead objects.
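
A back-of-envelope sketch of that tradeoff (the class name and every number below are invented for illustration; the model is just the one described above - per-cycle work proportional to the live set, cycle frequency falling as the young gen grows):

    // Hypothetical model: constant allocation rate, constant live set,
    // copy cost proportional to surviving bytes. All numbers invented.
    class YoungGenCostSketch {
        public static void main(String[] args) {
            double allocRateGBps = 1.0;     // GB/s of short-lived allocation
            double liveSetGB = 0.2;         // survivors copied each cycle
            double copyCpuSecPerGB = 0.05;  // CPU-seconds to copy 1 GB
            for (double youngGenGB : new double[] {1, 2, 4, 8}) {
                // A cycle fires when the young gen's free space is exhausted.
                double cyclesPerSec = allocRateGBps / (youngGenGB - liveSetGB);
                // Dead objects cost nothing; only the live set is copied.
                double gcCpuCores = cyclesPerSec * liveSetGB * copyCpuSecPerGB;
                System.out.printf("young gen %.0f GB -> %.2f cycles/s, GC CPU %.4f cores%n",
                        youngGenGB, cyclesPerSec, gcCpuCores);
            }
        }
    }

In this toy model, growing the young gen from 1 GB to 8 GB cuts the GC's CPU share roughly tenfold for the same workload, because the copying work per cycle is fixed and only the cycle frequency changes.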

> when it comes to how it's impacted by the memory bottleneck is one that I have some trouble understanding also - especially since you've been arguing for using up more memory for the exact same workload.

Memory bandwidth - at least as far as latency is concerned - is consumed when you have a cache miss. Once your live set is much bigger than your L3 cache, you get cache misses whenever you read it. If you have good temporal locality (few cache misses), it doesn't matter how big your live set is, and the same is true if you have bad temporal locality (many cache misses).
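
A minimal, machine-dependent sketch of that point (class name and constants are hypothetical): two walks over the same live set where only the access order differs - the sequential walk stays cache-friendly, while the random walk misses the caches on nearly every read once the array dwarfs L3:

    import java.util.Random;
    import java.util.function.LongSupplier;

    // Sketch: identical live set, very different cache behaviour.
    // Run with a big enough heap (e.g. -Xmx1g); timings are illustrative.
    class LocalitySketch {
        static String time(LongSupplier body) {
            long t0 = System.nanoTime();
            long sink = body.getAsLong(); // keep the result so the JIT can't drop the loop
            return (System.nanoTime() - t0) / 1_000_000 + " ms (sink=" + sink + ")";
        }

        public static void main(String[] args) {
            int n = 1 << 26;              // 64M ints = 256 MB, far bigger than any L3
            int[] data = new int[n];
            int[] order = new int[n];
            for (int i = 0; i < n; i++) order[i] = i;
            Random rnd = new Random(42);  // Fisher-Yates shuffle -> random access order
            for (int i = n - 1; i > 0; i--) {
                int j = rnd.nextInt(i + 1);
                int t = order[i]; order[i] = order[j]; order[j] = t;
            }
            System.out.println("sequential: " + time(() -> {
                long s = 0; for (int i = 0; i < n; i++) s += data[i]; return s;
            }));
            System.out.println("random:     " + time(() -> {
                long s = 0; for (int i = 0; i < n; i++) s += data[order[i]]; return s;
            }));
        }
    }

Both loops touch exactly the same bytes, so the (typically large) gap between the two timings is pure memory-subsystem cost, not live-set size.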

> which, AIUI, is also what the talk is about

The talk focuses on tracing GCs, but it applies equally to manual memory management (as discussed in the Q&A): using less memory for the same algorithm requires more CPU work, regardless of whether the memory management is manual or automatic.

> when that's a forced choice

I don't think tracing GCs are ever a forced choice. They keep getting chosen over and over for heavy workloads on machines with >= 1GB/core because they offer a more attractive tradeoff than other approaches for some of the most popular application domains. There's little reason for that to change unless the economics of DRAM/CPU change significantly.

Ygg2 19 hours ago

> It doesn't go without saying that compaction involves a lot of memory traffic

It definitely tracks with my experience. Did you see Chrome on an AMD EPYC with 2TB of memory? It reached like 10% memory utilisation but over 46% CPU at around 6000 tabs. Memory usage climbed steeply at first but was overtaken by CPU usage.

  • pron 18 hours ago

    I have no idea what it's spending its CPU on, whether that has anything to do with memory management, or what memory-management algorithm is in use. Obviously, the GC doesn't need to do any compaction if the program isn't allocating, and the program can only allocate if it's actually doing some computation. Also, I don't know the ratio of live set to total heap. A tracing GC needs to do very little work if most of the heap is garbage (i.e. the ratio of live set to total memory is low), but any form of memory management - tracing or manual - needs to do a lot of work if that ratio is high. Remember, a tracing-moving GC doesn't spend any cycles on garbage; it spends cycles on live objects only. Giving it more heap (assuming the same allocation rate and live set) means more garbage and less CPU consumption, as GC cycles become less frequent.

    All you know is that CPU is exhausted before the RAM is, which, if anything, means that it may have been useful for Chrome to use more RAM (and reduce the liveset-to-heap ratio) to reduce CPU utilisation, assuming this CPU consumption has anything to do with memory management.

    There is no situation in which, given the same allocation rate and live set, adding more heap to a tracing GC makes it work more. That's why in the talk he says that a DIMM is a hardware accelerator for memory management if you have a tracing-moving collector: increase the heap and voila, less CPU is spent on memory management.
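
    To put made-up numbers on that: with a 1 GB/s allocation rate and a 1 GB live set, a 4 GB heap forces a GC cycle roughly every 3 seconds, while a 16 GB heap forces one roughly every 15 seconds. The copying work per cycle is unchanged (only the 1 GB live set is copied), so the GC's CPU share drops about fivefold just by adding RAM.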

    That's why tracing-moving garbage collection is a great choice for any program that spends a significant amount of CPU on memory management: you can reduce that work by adding more RAM, which is cheaper than adding more CPU (assuming you're not running on a RAM-constrained machine, such as a small embedded device).