Comment by timschmidt

Comment by timschmidt 3 months ago

It's perfectly possible to run LLMs quickly on CPUs. Memory bandwidth is the limiting factor, and an Epyc or Xeon with 12 memory channels achieves memory bandwidth similar to a 4090's. Engineering-sample Epycs in kits with motherboard and RAM are even available on AliExpress for reasonable prices.
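For readers who want to check the bandwidth comparison, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: DDR5-4800 on a 64-bit (8-byte) channel, the roughly 1008 GB/s published peak for an RTX 4090, and the simplification that decoding reads every active weight once per token. The function names are made up for this example, and the results are theoretical peaks, not measurements.

```python
def ddr_peak_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s.

    channels * MT/s * bytes per transfer, converted from MB/s to GB/s.
    Sustained bandwidth on real hardware is lower.
    """
    return channels * mega_transfers_per_s * bus_bytes / 1000


def decode_tokens_per_s(bandwidth_gbs, active_params_billion, bytes_per_param):
    """Upper bound on single-stream decode speed if every active weight
    is streamed from memory once per generated token (assumed model)."""
    gigabytes_per_token = active_params_billion * bytes_per_param
    return bandwidth_gbs / gigabytes_per_token


# Assumed configurations, not benchmarks.
one_socket = ddr_peak_gbs(channels=12, mega_transfers_per_s=4800)
two_sockets = 2 * one_socket
rtx_4090 = 1008.0  # GB/s, published peak figure (assumption)

if __name__ == "__main__":
    print(f"12ch DDR5-4800, one socket:  {one_socket:.1f} GB/s")
    print(f"12ch DDR5-4800, two sockets: {two_sockets:.1f} GB/s")
    print(f"RTX 4090 (published peak):   {rtx_4090:.1f} GB/s")
    # Hypothetical: 37B active parameters stored at 1 byte each.
    print(f"decode bound, one socket: "
          f"{decode_tokens_per_s(one_socket, 37, 1):.1f} tok/s")
```

Under these assumptions, whether the CPU lands near the GPU figure depends on the memory speed and on whether one or two sockets are counted.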

ein0p 3 months ago

Did I say it wasn't? If your context is short and your model is small, it is possible to run LLMs on high-end CPUs that support 12 channels of high-spec DDR5 RDIMMs. It's not possible to run them as fast as they'd run on a GPU equipped with HBM, though, nor would it be even remotely as energy efficient. It's also not possible to run LLMs quickly on a CPU if your context is long, because CPUs do not have the requisite FLOPS to process long context quickly.

And before you bring MoE into the conversation: MoE only affects the feedforward part of each transformer block, and the full memory bandwidth and compute savings are only realized at batch size 1, sequence length 1, AKA the most inefficient mode, which nobody other than Ollama users uses in practice. Sequence length 8 (common for speculative decoding) could be using up to 8x37B parameters (assuming you want to run DeepSeek, the strongest available open-weights model). A batch size of even 2 with sequence length 8 could use almost all parameters if you're particularly unlucky. Prompt processing will almost certainly use all parameters and will slam into the FLOPS wall of your EPYC's ALUs.

So can LLMs (with an emphasis on "Large") be run on CPUs? Yes. Are you going to have a good time running them this way? No.
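One rough way to put numbers on the MoE point is to count how many distinct routed experts a group of tokens touches in a single MoE layer. The sketch below is not from the thread. It assumes uniformly random, independent routing (real routers are neither), and the defaults of 256 routed experts with 8 selected per token are assumed values for a DeepSeek-style model. Shared experts, attention weights, and embeddings are ignored, and the function names are hypothetical.

```python
def expected_expert_fraction(tokens, num_experts=256, top_k=8):
    """Expected fraction of routed experts touched in one layer.

    Assumes each token picks top_k distinct experts uniformly at random,
    independently of other tokens. A given expert is missed by one token
    with probability (1 - top_k / num_experts), so it is missed by all
    tokens with that probability raised to the number of tokens.
    """
    miss_one_token = 1.0 - top_k / num_experts
    return 1.0 - miss_one_token ** tokens


def worst_case_expert_fraction(tokens, num_experts=256, top_k=8):
    """Upper bound: every token selects experts no other token selected."""
    return min(tokens * top_k, num_experts) / num_experts


if __name__ == "__main__":
    # tokens = batch size * sequence length processed in one forward pass
    for tokens in (1, 8, 16, 64, 512):
        print(tokens,
              round(expected_expert_fraction(tokens), 3),
              round(worst_case_expert_fraction(tokens), 3))
```

The fraction of expert weights that must be read grows with the number of tokens in flight, which is why batching and long prompts erode the bandwidth savings; how quickly it grows depends on the routing assumptions above.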

timschmidt 3 months ago

llamafile contains AVX512-specific optimizations for prompt processing that address exactly this issue: https://justine.lol/matmul/ (about a 10x speedup over llama.cpp)
Somewhere between 8 and 192 cores I'm sure there's enough AVX512 to get the job done. And we've managed to reinvent Intel's Larrabee / Knights concept.
Sadly, the highly optimized AVX512 kernels of llamafile don't support these exotic floats yet as far as I know.
Yes, energy efficiency per query will be terrible compared to a hyperscaler. However, privacy will be perfect, and flexibility will be higher than with other options, since running on the CPU is almost always possible, even with new algorithms and experimental models.
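Whether "enough AVX512" exists for a given prompt can be estimated with a simple compute-bound model. This is a sketch under stated assumptions, not a benchmark: about 2 FLOPs per active parameter per prompt token (attention cost ignored), a per-core FLOPs-per-cycle figure that you have to look up for your own CPU, and an `efficiency` factor for how close the kernels get to peak. All names and the example numbers are hypothetical.

```python
def cpu_peak_tflops(cores, ghz, flops_per_cycle_per_core):
    """Theoretical peak throughput in TFLOPS.

    flops_per_cycle_per_core depends on SIMD width, number of FMA units,
    and data type; it is an input here, not something this code knows.
    """
    return cores * ghz * flops_per_cycle_per_core / 1000


def prefill_seconds(prompt_tokens, active_params_billion, tflops,
                    efficiency=0.5):
    """Compute-bound estimate of prompt processing time.

    Uses the common approximation of 2 FLOPs per active parameter per
    token for the matrix multiplies. efficiency is the assumed fraction
    of peak FLOPS actually achieved.
    """
    total_flops = 2 * active_params_billion * 1e9 * prompt_tokens
    return total_flops / (tflops * 1e12 * efficiency)


if __name__ == "__main__":
    # Hypothetical machine: 192 cores at 2.0 GHz, 32 FLOPs/cycle/core.
    peak = cpu_peak_tflops(cores=192, ghz=2.0, flops_per_cycle_per_core=32)
    print(f"peak: {peak:.2f} TFLOPS")
    for tokens in (1_000, 10_000, 100_000):
        secs = prefill_seconds(tokens, active_params_billion=37, tflops=peak)
        print(f"{tokens} prompt tokens: about {secs:.0f} s")
```

Plugging in a GPU's peak TFLOPS instead gives the other side of the comparison under the same simplifications.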

- ein0p 3 months ago
  
  At 192 cores you're way better off buying a Mac Studio, though.
  
  bigyabai 2 months ago
  
  [flagged]
  