Comment by lacoolj
you can run the 120b model on an 8GB GPU? or are you running this on CPU with the 64GB RAM?
I'm about to try this out lol
The 20b model is not great, so I'm hoping 120b is the golden ticket.
> had better results with the 20b model, over the 120b model
The quality and accuracy of the responses differ vastly between the two, though, if tok/s isn't your biggest priority, especially when using reasoning_effort "high". 20B works great for small-ish text summarization and title generation, but on even moderately difficult programming tasks, 20B fails repeatedly while 120B gets it right on the first try.
Hmmm...now that you say that, it might have been the 20b model.
And like a dumbass I accidentally deleted the directory, and I didn't have a backup or have it under version control.
Either way, I do know for a fact that the gpt-oss-XXb model beat ChatGPT by one answer: it was 46/50 at 6 minutes and 47/50 at 1+ hour. I remember because I was blown away that I could get that type of result running locally, and I had texted a friend about it.
I was really impressed, but disappointed at the huge time disparity between the two.
With MoE models like gpt-oss, you can run some layers on the CPU (and some on GPU): https://github.com/ggml-org/llama.cpp/discussions/15396
Mentions 120b is runnable on 8GB VRAM too: "Note that even with just 8GB of VRAM, we can adjust the CPU layers so that we can run the large 120B model too"
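For reference, that kind of split can be sketched as a llama-server invocation (a sketch only, assuming a local llama.cpp build with GPU support and an already-downloaded GGUF; the model filename and the layer count of 30 are illustrative, not tuned values):

```shell
# Offload all layers to the GPU first, then push the MoE expert weights
# of the first 30 layers back into system RAM so the rest fits in 8GB VRAM.
# Raise --n-cpu-moe until the model stops running out of GPU memory.
./llama-server \
  -m gpt-oss-120b-Q4_K_M.gguf \
  --n-gpu-layers 999 \
  --n-cpu-moe 30 \
  -c 8192
```

Because the expert weights dominate the model size but only a few experts fire per token, keeping them in RAM costs far less speed than offloading whole dense layers would.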