Advent of Compiler Optimisations 2025 (xania.org)
368 points by vismit2000 a day ago
The “understand one layer below where you work” advice is something my professors at uni told us 10+ years ago. Not sure where it originated, but I really think it benefited me in my career. E.g., understanding the JVM when dealing with Java helped me optimize code in a relatively heavyweight medical software package.
And also, it’s just fun to understand the lower layers.
https://cacm.acm.org/research/always-measure-one-level-deepe... This one has been assigned again and again in my grad classes; it's a classic.
This is apparently such a common misunderstanding that it was put at the bottom of the C++ iceberg.
I _think_ so, but this could all be some kind of simulation, I guess? :)
After 25 years of software development, I still wonder whether I'm using the best possible compiler flags.
What I've learned is that fewer flags is the best path for any long-lived project.
-O2 is basically all you usually need. As you update your compiler, it'll keep tweaking exactly what that general optimization level does, based on what its authors know today.
Because that's the thing about these flags, you'll generally set them once at the beginning of a project. Compiler authors will reevaluate them way more than you will.
Also, a trap I've observed is setting flags based on bad benchmarks. This applies more to the JVM than to a C++ compiler, but nevertheless, a system's current state is somewhat random. Fluctuations of 1-2% in performance for even the same app are normal. A lot of people don't realize that and end up adding flags based on those fluctuations.
But further, how code is currently laid out can affect performance. You may see a speed boost not because you tweaked the loop-unrolling variable, but because your tweak relocated a hot path to be slightly more cache friendly. A change in the code structure can eliminate that benefit.
That's great if you're compiling for use on the same machine or those exactly like it. If you're compiling binaries for wider distribution it will generate code that some machines can't run and won't take advantage of features in others.
To support multiple architecture levels in the same binary, I think you still need to do the manual work of annotating the specific functions for which several versions should be generated and dispatched at runtime.
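With GCC and Clang that annotation can be done with the target_clones attribute: the compiler emits one body per listed target plus a resolver that picks the best one at load time from the CPU's feature bits. A minimal sketch (the function and the chosen targets are just illustrative):

    // One clone per target is generated; an ifunc-style resolver
    // selects the right one at load time based on CPU features.
    __attribute__((target_clones("default", "avx2", "avx512f")))
    double dot(const double *a, const double *b, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; ++i)
            sum += a[i] * b[i];  // vectorizable loop; wider SIMD helps here
        return sum;
    }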
Doesn't -O2 still exclude any CPU features from the past ~15 years (like AVX)?
If you know the architecture and the oldest CPU model, we're better served by adding a bunch more flags, no?
I wish I could compile my server code to target CPUs released on/after a particular date, like:
-O2 -cpu-newer-than=2019
A CPU produced after a certain date is not guaranteed to have every ISA extension, e.g. SVE for Arm chips. Hence things like the microarchitecture levels for x86-64.
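The closest existing knob I know of is targeting one of those levels directly (supported by GCC 11+ and Clang 12+), e.g. for CPUs with AVX2/FMA (roughly Haswell, 2013, and newer):

    gcc -O2 -march=x86-64-v3 server.c -o server

(server.c is just a placeholder.)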
You should at a minimum add flags to enable dead code and data elimination (-ffunction-sections and -fdata-sections for compilation and -Wl,--gc-sections for the linker).
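A minimal invocation, with main.c as a placeholder:

    # each function and data item gets its own section...
    gcc -O2 -ffunction-sections -fdata-sections -c main.c
    # ...so the linker can discard the unreferenced ones
    gcc -Wl,--gc-sections main.o -o app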
Historically, -O3 has been a bit less stable (producing incorrect code) and more experimental (doesn't always make things faster).
Flags from -O3 often flow down into -O2 as they are proven generally beneficial.
That said, I don't think -O3 has the problems it once did.
Compiler speed matters. I'll confess to less practical knowledge of -O3, but -O2 is usually reasonably fast to compile.
For cases where -O2 is too slow to compile, dropping a single nasty TU down to -O1 is often beneficial. -O0 is usually not useful: while faster for tiny TUs, -O1 is still pretty fast for them, and for anything larger, the binary bloat of -O0 is likely to kill your link time compared to -O1's slimness.
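Concretely, that can be as simple as compiling the one offender differently (file names are placeholders):

    g++ -O2 -c widget.cpp render.cpp        # normal TUs stay at -O2
    g++ -O1 -c huge_generated_tables.cpp    # the nasty TU drops to -O1
    g++ widget.o render.o huge_generated_tables.o -o app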
Also, debuggability matters. GCC's `-O2` is quite debuggable once you learn how to work past the occasional <optimized out> (going up a frame or dereferencing a cast register is often all you need); this is unlike Clang, which, every time I check, still gives up entirely.
The real argument is -O1 vs -O2 (since -O1 is a major improvement over -O0 and -O3 is a negligible improvement over -O2) ... I suppose originally I defaulted to -O2 because that's what's generally used by distributions, which compile rarely but run the code often. This differs from development ... but does mean you're staying on the best-tested path (hitting an ICE is pretty common as it is); also, defaulting to -O2 means you know when one of your TUs hits the nasty slowness.
While mostly obsolete now, I have also heard of cases where 32-bit x86 inline asm has difficulty fulfilling constraints under register pressure at low optimization levels.
Yeah, -O3 generally performs well in small benchmarks because of aggressive loop unrolling and inlining. But in large programs that face icache pressure, it can end up being slower. Sometimes -Os is even better for the same reason, but -O2 is usually a better default.
Most people use -O2 and so if you use -O3 you risk some bug in the optimizer that nobody else noticed yet. -O2 is less likely to have problems.
In my experience a team of 200 developers will see 1 compiler bug affect them every 10 years. This isn't scientific, but it is a good rule of thumb and may put the above in perspective.
People keep saying "O3 has bugs," but that's not true. At least no more bugs than O2. It did and does more aggressively expose UB code, but that isn't why people avoid O3.
You generally avoid O3 because it's slower. Slower to compile, and slower to run. Aggressively unrolling loops and larger inlining windows bloat code size to the degree it impacts icache.
The optimization levels aren't "how fast do you want to code to go", they're "how aggressive do you want the optimizer to be." The most aggressive optimizations are largely unproven and left in O3 until they are generally useful, at which point they move to O2.
40 years later I still have nightmares of long sessions debugging Lattice C.
I am personally interested in the code amalgamation technique that SQLite uses[0]. It seems like a free 5-10% performance improvement, as claimed by the SQLite folks. It'd be nice if he addressed it in one of the sessions.
This is a pretty standard topic, and not really a compiler optimization. It's usually called a unity build.
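In its simplest form, a unity build is one TU that textually includes the rest, which gives the compiler cross-file visibility similar to (though cruder than) LTO. A sketch with placeholder file names:

    // unity.cpp: build only this file, e.g. `g++ -O2 unity.cpp -o app`.
    // Everything lands in one TU, so the compiler can inline across
    // what would otherwise be separate object files.
    #include "parser.cpp"
    #include "vm.cpp"
    #include "util.cpp"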
At my company, we have not seen any performance benefits from LTO on a GCC cross-compiled Qt application.
GCC version: 11.3
Target: Cortex-A9
Qt version: 5.15
I think we tested single core and quad core, also possibly a newer GCC version, but I'm not sure. Just wanted to add my two cents.
I would expect a little benefit from devirt (but maybe in-TU optimizations are getting that already?), but if a program is pessimized enough, LTO's improvements won't be measurable.
And programs full of pointer-chasing are quite pessimized; highly-OO code is a common example, which includes almost all GUIs, even in C++.
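As a toy sketch of the devirt case (the types here are made up): when the compiler can see every possible override at a call site, whether via LTO or the `final` keyword, it can turn the indirect vtable call into a direct, inlinable one.

    struct Widget {
        virtual ~Widget() = default;
        virtual int size() const = 0;
    };

    // `final` means no further overrides can exist, so calls through a
    // Button (or, with LTO/whole-program visibility, through a Widget
    // known to be a Button) can be devirtualized and inlined.
    struct Button final : Widget {
        int size() const override { return 4; }
    };

    int total(const Widget &w) { return w.size(); }  // indirect unless devirtualized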
Advent of Code for compiler nerds. Love this format - daily bite-sized optimization lessons build intuition far better than dense textbooks. Understanding what compilers do and why they do it makes you a better programmer in any language.
There's a link to the AoCO2025 tag for his blog posts in the OP.
Thanks for sharing; I've always found optimization a really interesting field. I'll keep a close eye on this!
Matt is amazing. After checking out his compiler optimizations, maybe check out the recent interview I did with him.
https://corecursive.com/godbolt-rule-matt-godbolt/
Also, this article in acmqueue by Matt is not new at all, but it's a super great introduction to these types of optimizations.
https://queue.acm.org/detail.cfm?id=3372264