Comment by dist-epoch

Comment by dist-epoch a day ago

Do we really know that LEA is using the hardware memory address computation units? What if the CPU frontend just redirects it to the standard integer add units/execution ports? What if the hardware memory address units use those too?

It would be weird to have 2 sets of different adders.

adrian_b a day ago

The modern Intel/AMD CPUs have distinct ALUs (arithmetic-logic units, where additions and other integer operations are done; usually between 4 ALUs and 8 ALUs in recent CPUs) and AGUs (address generation units, where the complex addressing modes used in load/store/LEA are computed; usually 3 to 5 AGUs in recent CPUs).

Modern CPUs can execute up to between 6 and 10 instructions within a clock cycle, and up to between 3 and 5 of those may be load and store instructions.

So they have a set of execution units that allow the concurrent execution of a typical mix of instructions. Because a large fraction of the instructions generate load or store micro-operations, there are dedicated units for address computation, to not interfere with other concurrent operations.

Reply View 3 replies

krackers 6 hours ago

https://news.ycombinator.com/item?id=23514072 and https://news.ycombinator.com/item?id=12354494 seem to contradict this and claim that modern intel processors don't use separate AGU for LEA...
Not too versed here, but given that ADD seems to have more execution ports to pick from (e.g. on Skylake), I'm not sure that's an argument in favor of lea. I'd guess that LEA not touching flags and consuming fewer uops (comparing a single simple LEA to 2 ADDs) might be better for out of order execution though (no dependencies, friendlier to reorder buffer)

Reply View | 0 replies
dist-epoch a day ago

But can the frontend direct these computations based on what's available? If it sees 10 LEA instructions in a row, and it has 5 AGU units, can it dispatch 5 of those LEA instructions to other ALUs?
Or is it guaranteed that a LEA instruction will always execute on an AGU, and an ADD instruction always on an ALU?

Reply View | 1 reply
- adrian_b a day ago
  
  This can vary from CPU model to CPU model.
  No recent Intel/AMD CPU executes directly LEA or other instructions, they are decoded into 1 or more micro-operations.
  The LEA instructions are typically decoded into either 1 or 2 micro-operations. The addressing modes that add 3 components are usually decoded into 2 micro-operations, like also the obsolete 16-bit addressing modes.
  The AGUs probably have some special forwarding paths for the results towards the load/store units, which do not exist in ALUs. So it is likely that 1 of the up to 2 LEA micro-operations are executed only in AGUs. On the other hand, when there are 2 micro-operations it is likely that 1 of them can be executed in any ALU. It is also possible for the micro-operations generated by a LEA to be different from those of actual load/store instructions, so that they may also be executed in ALUs. This is decided by the CPU designer and it would not be surprising if LEAs are processed differently in various CPU models.
  
  Reply View | 0 replies

toast0 19 hours ago

> It would be weird to have 2 sets of different adders.

Not really. CPUs often have limited address math available separately from the ALU. On simple cores, it looks like a separate incrementer for the Program Counter, on x86 you have a lot of addressing modes that need a little bit of math; having address units for these kinds of things allows more effective pipelining.

> Do we really know that LEA is using the hardware memory address computation units?

There are ways to confirm. You need an instruction stream that fully loads the ALUs, without fully loading dispatch/commit, so that ALU throughput is the limit on your loop; then if you add an LEA into that instruction stream, it shouldn't increase the cycle count because you're still bottlenecked on ALU throughput and the LEA does address math separately.

You might be able to determine if LEAs can be dispatched to the general purpose ALUs if your instruction stream is something like all LEAs... if the throughput is higher than what could be managed with only address units, it must also use ALUs. But you may end up bottlenecked on instruction commit rather than math.

Reply View 0 replies