Comment by mabster

Thanks for the references - that was interesting reading, particularly that initialisation can be good for instruction pipelining.

A trick we were using with SSE was something like:

__m128 zero = _mm_undefined_ps(); zero = _mm_xor_ps(zero, zero);

We were really careful to view our ops in terms of data dependencies so that we could reason about pipelining efficiency, but our profiling tools were not measuring pipeline behaviour.

We did avoid _mm_set1_ps(0.0f), which was actually showing up as cache misses.
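
For anyone comparing the options, here is a minimal sketch of the three ways of getting a zeroed register mentioned above. The function names are made up for illustration, and I am assuming the "set" call was the broadcast form _mm_set1_ps. As far as I know, compilers typically lower _mm_setzero_ps to the same xorps reg,reg zeroing idiom as the XOR trick, whereas a broadcast constant may end up as a load from memory if it is not folded away, which would be one possible source of the cache misses.

```c
#include <immintrin.h>  /* SSE intrinsics; _mm_undefined_ps needs a fairly recent compiler */
#include <string.h>     /* memcmp */

/* The trick from the comment: XOR an uninitialised register with itself. */
static __m128 zero_via_xor(void) {
    __m128 z = _mm_undefined_ps();
    return _mm_xor_ps(z, z);
}

/* The dedicated zeroing intrinsic. */
static __m128 zero_via_setzero(void) {
    return _mm_setzero_ps();
}

/* Broadcasting a 0.0f constant (assumed to be what the "set" call was). */
static __m128 zero_via_broadcast(void) {
    return _mm_set1_ps(0.0f);
}

/* Hypothetical helper: returns 1 if all four lanes are bitwise zero. */
static int all_lanes_zero(__m128 v) {
    float lanes[4];
    unsigned char zeros[sizeof lanes] = {0};
    _mm_storeu_ps(lanes, v);  /* unaligned store, no alignment assumptions */
    return memcmp(lanes, zeros, sizeof lanes) == 0;
}
```

All three should produce the same value; the only question is what instructions the compiler emits for each, so it is worth checking the disassembly on the actual toolchain rather than trusting this sketch.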

I wonder if we actually ended up slower, having optimised for cache misses simply because they are something we can measure?!