Comment by bri3d 3 months ago

There are various compiler options like -ftrivial-auto-var-init to initialize uninitialized variables to specific (or random) values in some situations, but overall, randomizing (or zeroing) the full content of the stack in each function call would be a horrendous performance regression and isn't done for this reason.
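To make the per-variable scope of that option concrete, here is a minimal C sketch of what zero auto-initialization amounts to for a single local. It assumes the `zero` mode of the flag (`-ftrivial-auto-var-init=zero`); `struct request`, `all_bytes_zero`, and `handle_request` are hypothetical names used only for illustration, and the explicit `memset` is a hand-written stand-in for the store the compiler would emit, not its actual output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical struct, used only for illustration. */
struct request {
    int id;
    char name[32];
    void *ctx;
};

/* Returns 1 if every byte of the object is zero, 0 otherwise. */
static int all_bytes_zero(const void *obj, size_t len) {
    const unsigned char *p = obj;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) return 0;
    }
    return 1;
}

/* Hand-written stand-in for zero auto-init of one local (assumption:
 * roughly what the compiler would insert; real codegen may differ). */
static int handle_request(void) {
    struct request r;         /* without the flag, this is indeterminate */
    memset(&r, 0, sizeof r);  /* one store sequence for this variable only */
    return all_bytes_zero(&r, sizeof r);
}
```

Only the bytes of `r` are written here, not the rest of the frame, which is the distinction being drawn above between initializing variables and wiping the whole stack on every call.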

neuroelectron 3 months ago

There are fast instructions (e.g., REP STOSx, AVX zero stores, dc zva) and tricks (MTE, zero pages), but no magic CPU instruction exists that transparently and efficiently randomizes or zeros the stack on function calls. You would think there would be one, and I bet there is on some specialized high-security systems, but I'm not even sure where you would find such a product. Telecom certainly isn't it.

db48x 3 months ago

There are proposed CPU architectures that work that way, like the Mill <https://millcomputing.com/>. Where most CPUs support multiple calling conventions, the Mill enforces a single calling convention in hardware. There is a hardware `call` instruction that does all the work directly, along with a corresponding `ret` instruction for returning from a function call. It also uses its equivalent of the TLB to ensure that each function is only granted permission to read from that portion of the stack which contains its arguments; any attempt to read outside that region would result in a permission error that causes the read to return a NaR (Not a Result, akin to a floating point NaN).
As an additional protection, new stack frames are implicitly zeroed as they are created. I assume this is done by filling the CPU cache with zeros for those addresses before continuing to execute the called function. No need to wait for actual zeros to be written to main memory.
https://millcomputing.com/wiki/Protection#Protecting_Stacks

- smj-edison 3 months ago
  
  This is really interesting—how do stack references work in this design?
  
  db48x 3 months ago
  
  Technically I think you can read the “whole” stack; it’s only reads off of the ends of the stack as a whole that are prevented. However, note that the start of your current stack may not really be the start of the real stack.
  Consider the case of a system call, such as `read`. You’re in user space and you have some stack frames on the stack as usual. You allocate a buffer on the stack (there’s a cpu instruction for that; it basically just extends your “turf¹” to include more of the stack page, and zeros it as mentioned) to hold the data you want to read. You then call `read` with the `call` instruction, including the address of the buffer and the buffer size as arguments. So far everything is very straightforward.
  But `read` is actually in a different protection domain; it’s part of the kernel. The CPU uses metadata previously set up by the kernel to turn this into a “portal call”. After the portal call your thread will be given a different protection domain. In principle this is the kernel’s protection domain, but in reality the kernel might split that up in many complicated ways. What is relevant here is that the turf of this protection domain has been modified to include this new stack frame. From the perspective of `read`, the stack has just started; there are no prior frames. The reality is that this stack frame is still part of the stack of the caller, it’s only the turf that has changed. Those prior stack frames still exist, but they are unreadable. Worse, the buffer is also unreadable; it’s located at an address that is not part of the kernel’s turf.
  So obviously there needs to be another set of instructions for modifying turfs. The full set of obvious modifications is available, but the relevant one here is a temporary grant of read and/or write permissions to a function you are about to call. You would insert a `pass` instruction to pass along access to the buffer for the duration of the call. This access is automatically revoked after the call returns. (Ideally you wouldn’t actually have to do this manually for every portal call; instead you would call a non-portal `read` function in libc. This function’s job is to make the portal call, and whoever wrote it makes sure to include the `pass` instruction.)
  ¹ A turf is the set of addresses that a given thread running in a given protection domain can read and/or write.
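  
  The grant-and-revoke behaviour described above can be mimicked with a toy software model in C. This is only a sketch of the idea, not Mill code: `struct turf`, `checked_load`, and `portal_call` are hypothetical names, memory is a plain byte array, and a NaR is represented by a boolean flag. The callee body is reduced to a single load so the permission check is the only thing being shown.
  
  ```c
  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>
  
  #define MAX_GRANTS 4
  
  /* A contiguous range of addresses (hypothetical model type). */
  struct region { size_t base, len; };
  
  /* A turf: the set of regions a protection domain may read. */
  struct turf { struct region grants[MAX_GRANTS]; size_t count; };
  
  /* Result of a load; is_nar models "Not a Result". */
  struct load_result { bool is_nar; uint8_t value; };
  
  static bool turf_allows(const struct turf *t, size_t addr) {
      for (size_t i = 0; i < t->count; i++) {
          const struct region *g = &t->grants[i];
          if (addr >= g->base && addr - g->base < g->len) return true;
      }
      return false;
  }
  
  /* A load outside the turf yields a NaR instead of data. */
  static struct load_result checked_load(const struct turf *t,
                                         const uint8_t *mem, size_t addr) {
      struct load_result r = { true, 0 };
      if (turf_allows(t, addr)) {
          r.is_nar = false;
          r.value = mem[addr];
      }
      return r;
  }
  
  /* Models a portal call whose callee reads one byte at addr.
   * If passed is non-NULL it plays the role of `pass`: the region is
   * added to the callee's turf for the call and revoked afterwards. */
  static struct load_result portal_call(struct turf *callee,
                                        const struct region *passed,
                                        const uint8_t *mem, size_t addr) {
      size_t saved = callee->count;
      if (passed != NULL && callee->count < MAX_GRANTS) {
          callee->grants[callee->count++] = *passed;
      }
      struct load_result r = checked_load(callee, mem, addr);
      callee->count = saved;  /* temporary grant revoked on return */
      return r;
  }
  ```
  
  In this model, calling `portal_call` without a passed region makes the callee's read of the caller's buffer come back as a NaR, while passing the buffer's region makes the same read succeed for the duration of that one call.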
  
mjevans 3 months ago

You couldn't do random, but with a predictable performance hit to memory, cache, and write-line use, stack addresses COULD be isolated for a program, for a library, etc.
It'd be expensive though; every context switch would require its own stack and pushing / restoring one more register. There's GOOD reason programs don't work that way and are supposed to not rely on values outside of properly initialized (and not later clobbered) memory.

- neuroelectron 3 months ago
  
  It should be efficient though, that's the point. Specialized hardware or instructions should be able to zero the stack in a single cycle; instead it's much more expensive. Of course the problem with this is that it could be used to hide things just as easily, making it impossible to reverse engineer an unknown exploit.
  
  mjevans 3 months ago
  
  Why would a specialized instruction be necessary? 'the stack' is stored in memory just like everything else.
  What's expensive is the (very slow for modern CPUs) operation of _writing_ that change in value out to memory, which is distant and slow compared to the speed the CPU operates at, as well as the overhead of synchronizing that write to any other caches of those memory locations.
  Maybe you're thinking of the trick of a brand new page of memory-mapped memory that is 'zeroed' but is in reality just a special 'all zeros' page in the virtual to physical memory lookup table? Those still need to be zeroed by real writes at some point, if they're ever used.
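  
  The observable half of that trick can be checked from user space. The sketch below assumes a Linux-like system where `mmap` with `MAP_ANONYMOUS` returns zero-filled memory; `fresh_mapping_reads_zero` is a hypothetical helper name. It can only show that the pages read as zero, not how the kernel backs them before the first write.
  
  ```c
  #define _DEFAULT_SOURCE  /* exposes MAP_ANONYMOUS on glibc in strict modes */
  #include <assert.h>
  #include <stddef.h>
  #include <sys/mman.h>
  
  /* Returns 1 if a fresh anonymous mapping of len bytes reads as all
   * zeros, 0 if any byte is nonzero, and -1 if the mapping fails. */
  static int fresh_mapping_reads_zero(size_t len) {
      unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED) return -1;
  
      int ok = 1;
      for (size_t i = 0; i < len; i++) {
          if (p[i] != 0) { ok = 0; break; }
      }
  
      /* The first write is the point where a real, private, zeroed
       * page has to exist behind this address. */
      p[0] = 1;
  
      munmap(p, len);
      return ok;
  }
  ```
  
  Calling `fresh_mapping_reads_zero(4096)` exercises one typical page; the cost being discussed shows up at the write, not at the `mmap` call.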
  
dwattttt 3 months ago

CPUs already special-case xor reg,reg as zeroing out the register, breaking any data dependency on it. If zeroing bits of the stack were common enough, I'd believe CPUs could be made that handled it efficiently (they already special-case the stack: push/pop).

smarks 3 months ago

I'm a bit distant from this stuff, but it looks like C++26 will have something like -ftrivial-auto-var-init enabled by default. See the "safe by default" section of [1].

For reference, the actual proposal that was accepted into C++26 is [2]. It discusses performance only in general, and it refers to an earlier analysis [3] for more details. This last reference describes regressions of around 0.5% in time and in code size. Earlier prototypes suggested larger regressions (perhaps even "horrendous") but more emphasis on compiler optimizations has brought the regression down considerably.

Of course one's mileage may vary, and one might also consider a 0.5% regression unacceptable. However, the C++ committee seems to have considered this to be an acceptable tradeoff to remove a frequent cause of undefined behavior from C++.

[1]: https://herbsutter.com/2024/08/07/reader-qa-what-does-it-mea...

[2]: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p27...

[3]: https://open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2723r1...

canucker2016 3 months ago

Microsoft's Visual C++ compiler has the /Ge compiler option (see https://learn.microsoft.com/en-us/cpp/build/reference/ge-ena... ), deprecated since VC2005.

This compiler option causes the compiler to emit a call to a stack probe function to ensure that a sufficient amount of stack space is available.

Rather than just probe once for each stack page used, you can substitute a function that *FILLS* the stack frame with a particular value, something like 0xBAADF00D. You could set the value to anything you wanted at runtime.

This would get you similar behaviour to gcc/clang's -ftrivial-auto-var-init.
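
As a rough illustration of the fill step only, here is a C sketch of a routine that stamps a repeating 32-bit pattern over a block of memory. `fill_with_pattern` is a hypothetical name; this is not the MSVC stack probe routine or its interface, and hooking such a function into the probe call is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fills len bytes at dst with a repeating 32-bit pattern, using the
 * machine's native byte order so aligned words read back as the
 * pattern itself. A trailing partial word gets the leading bytes. */
static void fill_with_pattern(void *dst, size_t len, uint32_t pattern) {
    unsigned char *out = dst;
    unsigned char bytes[sizeof pattern];
    memcpy(bytes, &pattern, sizeof pattern);
    for (size_t i = 0; i < len; i++) {
        out[i] = bytes[i % sizeof pattern];
    }
}
```

Because the pattern is an ordinary argument, it can be chosen at runtime, for example `fill_with_pattern(frame, frame_size, 0xBAADF00Du)` where `frame` and `frame_size` stand for whatever region the caller wants poisoned.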

Windows has started to auto-initialize most stack variables in the Windows kernel and several other areas.

    The following types are automatically initialized:
    
        Scalars (arrays, pointers, floats)
        Arrays of pointers
        Structures (plain-old-data structures)
    
    The following are not automatically initialized:
    
        Volatile variables
        Arrays of anything other than pointers (i.e. array of int, array of structures, etc.)
        Classes that are not plain-old-data


    During initial testing where we forcibly initialized all types of data on the stack we saw performance regressions of over 10% in several key scenarios.

    With POD structures only, performance was more reasonable. Compiler optimizations to eliminate redundant stores (both inside basic blocks and between basic blocks) were able to further drop the regression caused by POD structures from observable to noise-level for most tests.

    We plan on revisiting zero initializing all types (especially now that our optimizer has more powerful optimizations), we just haven’t gotten to it yet.

see https://web.archive.org/web/20200518153645/https://msrc-blog...
