Comment by crackalamoo 10 hours ago
This is true in principle, yes. In practice, the way this usually works is by converting inputs to bits and bytes, and then computing the result as a digital circuit (AND, OR, XOR).
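To make that concrete, here is a minimal plaintext sketch of what "computing the result as a digital circuit" looks like: an 8-bit adder built from nothing but AND/OR/XOR. In a gate-level FHE scheme like TFHE, each of these gate calls would instead operate on encrypted bits, so even one 8-bit add costs about 40 homomorphic gate evaluations:

    AND = lambda a, b: a & b
    OR  = lambda a, b: a | b
    XOR = lambda a, b: a ^ b

    def full_adder(a, b, cin):
        # one bit of addition: five gate calls
        s = XOR(a, b)
        return XOR(s, cin), OR(AND(a, b), AND(s, cin))

    def add8(x, y):
        # add two 8-bit integers: one full adder per bit, ~40 gates total
        out, carry = 0, 0
        for i in range(8):
            s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
            out |= s << i
        return out

    assert add8(100, 55) == 155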
Doing this under encryption is very slow: without hardware acceleration or special tricks, running the circuit is about a million times slower than the unencrypted computation, on the order of 1ms per gate. (https://www.jeremykun.com/2024/05/04/fhe-overview/)
When you count the individual logic gates involved in even a single matrix multiplication, then scale that up to a diffusion model or a large transformer, the approach becomes infeasible very quickly.
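A rough back-of-the-envelope shows why (the 1,000-gates-per-MAC figure below is an assumption for illustration, not a measured number):

    n = 512                        # multiply two n x n matrices
    macs = n ** 3                  # multiply-accumulate operations
    gates_per_mac = 1_000          # assumed cost of an 8-bit multiply + add circuit
    seconds = macs * gates_per_mac * 1e-3   # ~1ms per encrypted gate
    print(seconds / 86_400 / 365)  # ~4.3 years for a single 512x512 matmul

And a transformer does thousands of matrix multiplications per token.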
There are FHE schemes that operate on encrypted arithmetic rather than binary gates (cf. CKKS), but they come with their own tradeoff: they can only evaluate additions and multiplications, so every nonlinear activation function has to be replaced by a polynomial approximation. Still, they are much better than the binary-FHE schemes for workloads like neural networks, and most hardware accelerators in the pipeline right now are targeting CKKS and similar schemes for this reason.
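For a feel of what that polynomial-approximation constraint means, here is a toy fit of ReLU on a fixed interval (plain least-squares just for illustration; production CKKS work uses carefully constructed minimax or composite polynomials):

    import numpy as np

    x = np.linspace(-4, 4, 2001)
    relu = np.maximum(x, 0)
    for degree in (3, 7, 15):
        p = np.polynomial.Polynomial.fit(x, relu, degree)  # valid on [-4, 4] only
        err = np.abs(p(x) - relu).max()
        print(f"degree {degree:2d}: max error {err:.3f}")
    # higher degree = lower error, but more ciphertext multiplications,
    # each of which burns multiplicative depth / noise budget

Note the approximation only holds on the fitted interval, which is why these pipelines also need the inputs to each activation to stay in a known range.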
To put some numbers on it: a ResNet-20 inference can be done in CKKS in roughly 5 minutes on CPU. With custom changes to the architecture you can get under a minute, and in my view HW acceleration will improve that by another factor of 10-100 at least; dividing 60s by that factor lands in the 0.6-6s range, so I'd expect ~1s inference for these (still small) networks within the next year or two.
LLMs, however, are still going to be unreasonably slow for a long time.