Comment by embedding-shape 5 days ago

Yeah, no way I'd do this if I paid per token. Next experiment will probably be local-only together with GPT-OSS-120b which according to my own benchmarks seems to still be the strongest local model I can run myself. It'll be even cheaper then (as long as we don't count the money it took to acquire the hardware).

mercutio2 5 days ago

What toolchain are you going to use with the local model? I agree that's a strong model, but it's so slow for me with large contexts that I've stopped using it for coding.

  • embedding-shape 5 days ago

    I have my own agent harness, and the inference backend is vLLM.
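
    Roughly, the shape of it looks like the sketch below (not the actual harness, just a minimal example against vLLM's OpenAI-compatible endpoint, assuming the server was started with "vllm serve openai/gpt-oss-120b" and is listening on the default port):

        # Minimal sketch of an agent-style call against a local vLLM server.
        # vLLM exposes an OpenAI-compatible chat API on port 8000 by default.
        from openai import OpenAI

        client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

        def ask(prompt: str) -> str:
            resp = client.chat.completions.create(
                model="openai/gpt-oss-120b",
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content

        # A real harness would loop here: parse tool calls out of the reply,
        # run them, and append the results before calling the model again.
        print(ask("Summarize the failing tests and propose a fix."))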

    • mercutio2 4 days ago

      Can you tell me more about your agent harness? If it’s open source, I’d love to take it for a spin.

      I would happily use local models if I could get them to perform, but they’re super slow if I bump their context window high, and I haven’t seen good orchestrators that keep context limited enough.

    • storystarling 5 days ago

      Curious how you handle sharding and KV cache pressure for a 120b model. I guess you are doing tensor parallelism across consumer cards, or is it a unified memory setup?

      • embedding-shape 5 days ago

        I don't; it fits on my card with the full context. I think the native MXFP4 weights take ~70GB of VRAM (out of 96GB available on an RTX Pro 6000), so I still have room to spare to run GPT-OSS-20B alongside for smaller tasks, plus Wayland+GNOME :)
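
        In case it helps anyone reproduce the idea, here's a rough sketch with vLLM's Python API of how the big model's memory budget gets capped so the small one fits beside it. The numbers are illustrative, not my exact config:

            # Illustrative sketch, not the exact setup: cap the 120B model's
            # memory budget so a second vLLM process (openai/gpt-oss-20b) and
            # the desktop still fit on the same 96GB card.
            from vllm import LLM, SamplingParams

            llm = LLM(
                model="openai/gpt-oss-120b",   # native MXFP4 weights, ~70GB
                gpu_memory_utilization=0.8,    # leave headroom for the 20B
                max_model_len=131072,          # full context window
            )

            outputs = llm.generate(
                ["Explain KV cache pressure in one short paragraph."],
                SamplingParams(max_tokens=256),
            )
            print(outputs[0].outputs[0].text)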