Comment by zackangelo 9 months ago

We used candle[0], which uses cudarc and the metal crate under the hood. That means we run on Nvidia hardware in production and can test locally on MacBooks with smaller models.
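If it helps, here's a minimal sketch of what that backend switching looks like with candle's Device API. The "cuda" and "metal" feature names match candle's Cargo features; the fallback order is just a convention for illustration, not anything candle prescribes:

```rust
use candle_core::{Device, Result, Tensor};

// Pick CUDA when built with the "cuda" feature (Nvidia in production),
// Metal when built with the "metal" feature (local MacBook testing),
// and fall back to CPU otherwise.
fn pick_device() -> Result<Device> {
    if cfg!(feature = "cuda") {
        Device::new_cuda(0)
    } else if cfg!(feature = "metal") {
        Device::new_metal(0)
    } else {
        Ok(Device::Cpu)
    }
}

fn main() -> Result<()> {
    let device = pick_device()?;
    // The same tensor code runs unchanged on either backend.
    let x = Tensor::randn(0f32, 1f32, (2, 4), &device)?;
    let w = Tensor::randn(0f32, 1f32, (4, 3), &device)?;
    println!("{}", x.matmul(&w)?);
    Ok(())
}
```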

I would certainly like to use non-Nvidia hardware, but at this point it's not a priority. The subset of tensor operations needed to run the forward pass of LLMs isn't as large as you'd think, though.
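To give a flavor of how small that subset is: a decoder layer is mostly matmuls plus a handful of elementwise and reduction ops. A rough sketch (function names and shapes here are hypothetical, not candle's actual model code):

```rust
use candle_core::{Result, Tensor, D};
use candle_nn::ops::softmax;

// Illustrative only: the core math of one attention head and one MLP,
// showing how few distinct ops an LLM forward pass needs (matmul,
// transpose, scale, softmax, one activation).
fn attention(q: &Tensor, k: &Tensor, v: &Tensor, head_dim: usize) -> Result<Tensor> {
    let scores = (q.matmul(&k.t()?)? / (head_dim as f64).sqrt())?; // (seq, seq)
    let probs = softmax(&scores, D::Minus1)?;
    probs.matmul(v) // weighted sum over values
}

fn mlp(x: &Tensor, w_up: &Tensor, w_down: &Tensor) -> Result<Tensor> {
    x.matmul(w_up)?.silu()?.matmul(w_down) // up-project, activate, down-project
}
```

Add a normalization (RMSNorm/LayerNorm), rotary embeddings, and an embedding lookup, and you've covered most of what a decoder-only architecture needs.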

[0] https://github.com/huggingface/candle