Comment by bdash
SSVE instructions are executed by the SME engine, which trades latency for throughput. SSVE is really intended to support use of SME, rather than as a replacement for Advanced SIMD on the CPU core itself.
The Apple Silicon CPU Optimization Guide has a lot of great information on SME and SSVE, along with more general information on optimizing for Apple's CPUs
A few quotes from Apple's guide that are particularly relevant to SSVE, from "SSVE Vector Execution Unit Optimization":
> Broadly, this unit is designed to support long vector and matrix operations performed on ZA storage _in the SME Processing Grid_.
> Recommendation: Use SSVE in a supporting role to enable high throughput SME grid computation.
> [Magnitude: High | Applicability: High] SSVE offers wide 64B vectors. While the ISA includes instructions that can operate on multi-vectors, the throughput is often only one 64B vector per cycle. Use SSVE to enable SME, which offers higher parallelism.
> Because of non-speculative execution, communication latencies, and in some cases long memory and computation latencies, SME engine instructions trail execution in the core by dozens to thousands of cycles. Any core compute instructions that consume data produced by the SME engine may have to wait an indeterminate (but long) amount of time for the data to arrive.
I was struck by the "Magnitude: High | Applicability: High" bit. Who writes like this? More importantly, who reads like this? The V4 doc (which I have yet to read, but I did a text search) has 64 occurences of this sort of phrasing; not actually all that many, given that there's 293 pages, but enough to be interesting. I wonder if this extra stuff is there to make LLMs pay particular attention.