Comment by embedding-shape 2 days ago
Not parent, but a frequent user of GPT-OSS; I've tried all the different ways of running it. The choice goes something like this:
- Need batching + highest total throughput? vLLM. Complicated to deploy and install though, and you need special versions for top performance with GPT-OSS (a minimal sketch of the Python API is below this list)
- Easiest to manage + fast enough: llama.cpp. Easier to deploy as well (just a binary) and super fast; I'm getting ~260 tok/s on an RTX Pro 6000 for the 20B version (see the client sketch after the list)
- Easiest for people who aren't used to running shell commands, or who need a GUI and don't care much about performance: Ollama
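
For the vLLM route, here's a minimal sketch using vLLM's offline Python API. The model ID `openai/gpt-oss-20b` is an assumption about which Hugging Face repo you're pulling, and everything (GPU memory settings, quantization) is left at defaults:

```python
# Minimal vLLM sketch (assumes a recent vLLM build with GPT-OSS support
# and that "openai/gpt-oss-20b" is the model repo you want).
from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-20b")
params = SamplingParams(temperature=0.7, max_tokens=256)

# Batching is where vLLM shines: pass a whole list of prompts at once
# and let the scheduler interleave them.
prompts = [
    "Summarize the tradeoffs of speculative decoding.",
    "Explain KV caching in one paragraph.",
]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```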
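For llama.cpp: once the bundled `llama-server` binary is running (it exposes an OpenAI-compatible endpoint, on port 8080 by default), any OpenAI client works against it. A rough sketch; the base URL, port, and model name are assumptions about your local setup:

```python
# Sketch of talking to a local llama-server instance; the URL/port and
# model name are placeholders for whatever your setup actually uses.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="gpt-oss-20b",  # llama-server serves whatever model it loaded
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```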
Then if you really wanna go fast, try to get TensorRT running on your setup; I think that's pretty much the fastest GPT-OSS can go currently.