Comment by storystarling 5 days ago

That 19 EUR figure is basically subscription arbitrage. If you ran that volume through the API with xhigh reasoning the cost would be significantly higher. It doesn't seem scalable for non-interactive agents unless you can stay on the flat-rate consumer plan.
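
Rough arithmetic shows why. A sketch with entirely made-up prices and volumes (none of these figures are published rates), just to show how quickly metered pricing outruns a ~19 EUR flat rate at agent-level volume:

    # Back-of-the-envelope metered-API cost. Every number below is a
    # hypothetical placeholder, not a real price or usage figure.
    input_tokens  = 50_000_000   # assumed monthly input volume
    output_tokens = 10_000_000   # assumed monthly output, incl. reasoning tokens

    price_in  = 2.00             # USD per 1M input tokens (placeholder)
    price_out = 8.00             # USD per 1M output tokens (placeholder)

    api_cost = input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out
    print(f"metered: ${api_cost:,.2f}/mo vs ~19 EUR flat")  # -> $180.00/mo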

embedding-shape 5 days ago

Yeah, no way I'd do this if I paid per token. Next experiment will probably be local-only with GPT-OSS-120b, which according to my own benchmarks still seems to be the strongest local model I can run myself. It'll be even cheaper then (as long as we don't count the money it took to acquire the hardware).

  • mercutio2 5 days ago

    What toolchain are you going to use with the local model? I agree it's a strong model, but it's so slow for me with large contexts that I've stopped using it for coding.

    • embedding-shape 5 days ago

      I have my own agent harness, and the inference backend is vLLM.
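
      A minimal sketch of what one turn of such a setup can look like (a generic example, not the harness described above): vLLM exposes an OpenAI-compatible API, so an agent loop can drive it with the standard client. It assumes something like `vllm serve openai/gpt-oss-120b` is listening on localhost:8000.

        from openai import OpenAI

        # Point the standard OpenAI client at the local vLLM server.
        client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

        messages = [{"role": "user", "content": "Summarize the failing test output."}]
        resp = client.chat.completions.create(
            model="openai/gpt-oss-120b",  # model name as served by vLLM
            messages=messages,
            max_tokens=512,
        )
        print(resp.choices[0].message.content)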

      • mercutio2 4 days ago

        Can you tell me more about your agent harness? If it’s open source, I’d love to take it for a spin.

        I would happily use local models if I could get them to perform, but they’re super slow if I bump their context window high, and I haven’t seen good orchestrators that keep context limited enough.
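
        One generic way an orchestrator can keep context limited is to evict the oldest turns against a fixed token budget. A sketch (count_tokens is a crude stand-in for a real tokenizer, and the budget is arbitrary):

          def count_tokens(msg: dict) -> int:
              # Rough ~4 chars/token estimate; swap in a real tokenizer.
              return len(msg["content"]) // 4

          def trim_to_budget(messages: list[dict], budget: int = 8_000) -> list[dict]:
              # Keep the first (system) message; drop oldest turns until under budget.
              system, rest = messages[:1], messages[1:]
              while rest and sum(map(count_tokens, system + rest)) > budget:
                  rest.pop(0)
              return system + rest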

      • storystarling 5 days ago

        Curious how you handle sharding and KV cache pressure for a 120b model. Are you doing tensor parallelism across consumer cards, or is it a unified memory setup?
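
        (For reference, the common vLLM recipe for a model this size is tensor parallelism plus a capped context length. A sketch with guessed values, since the actual hardware isn't stated, not necessarily the setup being asked about:)

          from vllm import LLM

          llm = LLM(
              model="openai/gpt-oss-120b",
              tensor_parallel_size=4,       # shard weights across 4 GPUs
              gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may claim
              max_model_len=32_768,         # cap context to bound KV cache size
              kv_cache_dtype="fp8",         # shrink KV cache vs fp16, if supported
          )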