What Every Hacker Should Know About TLB Invalidation [pdf]
(grsecurity.net)90 points by abhi9u 4 days ago
90 points by abhi9u 4 days ago
It's fixed there, though. I am also in the thread having replied several times when I was working on unicorn.
I'd be interested to see this written for reading (as opposed to presenting). Inferring the speaker's meaning from raw slides is a little challenging. I've had to interact with the minutiae of the x86 TLB in a past life, but happily have forgotten most of that.
Elaborate memory management (paging) systems need caching of lookups for high performance. But they can go wrong. The post was made in a security/safety context but did I miss something, because it didn't seems to make clear what the dangers are?
I only know x86/64, but I assume most page table caching would be somewhat similar.
Basically, if you don't handle the TLB properly, the CPU will not know that page mappings and/or page permissions have changed. So if you had a page mapped RW, and then changed the mapping to a RO page (such as setting up COW), but failed to flush the TLB (or at least call INVLPG to flush the entry), the CPU might use those stale permissions and grant write access on that page when it shouldn't. The same could happen for changing a region of the VA space to use a different physical page, where the next bit of code would hit the old page (and who knows what state it might be in or what it could be being used for).
The TLB is not super-complicated, but it has some quirks (it's been so long since I've done anything with it, the PCID handling rules were new to me; didn't even support it back when).
The article (towards its end) discusses a serious bug in the INVLPG instruction of the Intel Gracemont processor cores (which are the E-cores used in Alder Lake, Raptor Lake, Raptor Lake Refresh, Alder Lake N, Amston Lake, Twin Lake), which fails to invalidate all the entries that it should invalidate, in certain circumstances.
I'm no expert on TLB invalidation bugs but generally they allow for an attacker to read/write arbitrary memory.
https://googleprojectzero.blogspot.com/2019/01/taking-page-f...
I don't mean to be a pedant, so someone please correct me if I'm wrong, but I don't think TLB mishandling would result in arbitrary memory access (I suppose in the strictest sense arbitrary can just mean random, but generally I have understood it to imply that the address can be attacker controlled, which a stale TLB wouldn't allow).
Unless you're like Microsoft (from your link) and accidentally leave the page tables writable from userspace for 2 months. But that's not really a TLB error, that's just L-O-L, wow!
Random access is arbitrary access, given enough time. You can try over and over again until you get lucky.
Imagine I'm a user with local shell access trying to read a secret owned by root. Maybe I can't read the secret, but I can do something which makes another program read the secret. If I can make that program swap (perhaps by wasting a bunch of RAM to create memory pressure), and swapping has some probability of triggering a TLB invalidation bug that lets me see the old page, I win, although it might take awhile.
It looks like the last 20 or so pages of the PDF contain two case studies. I read the first one, which lead to (nondeterministic) kernel errors.
Perhaps “hacker” should be “crazy bug debugger”, but anybody who is working with TLB issues is a hacker in my book.
There is no “CVE” vulnerability in the slides, for sure.
And us machine emulators too, like Fabrice Bellard (QEMU) and me (and my OP post detailed the failing of emulated TLB in QEMU as discovered by Unicorn emulator).
Unicorn emulator - https://github.com/unicorn-engine/unicorn
Within QEMU amd64/x86 emulation (not KVM) mode, the IMUL opcode still fails its IMUL opcode emulation after an XOR opcode (also within the same TLB cached page) modified its neighbor IMUL operand(s).
TLB got double-invalidated yet some say never invalidated. The crux is a glitch within a singlr entire TLB invalidation operation thereby negating XOR opcode's ability to self-modify the neighboring IMUL operand. (Double ROT13, anyone?). I assert double-invalidation because within the same TLB invalidation stroke, XOR operation got performed ... twice, as opposed to retrieving and restoring original IMUL operand value after such invalidation thereby negating XOR computed result EITHER WAY.
A failure of self-modifying code within QEMU amd64/x86 emulation mode could be a useful test to determine if one is under QEMU emulation mode, of course if the page allows read-write-execute as often found in JavaScript, Java, Python and Dalvik (Android) bytecode memory regions.
Fabrice Bellard, author of QEMU, acknowledged the basic of above but failed amd64/x86 IMUL/XOR self-modify premise in emulation (not KVM) mode of QEMU.
https://github.com/unicorn-engine/unicorn/issues/364