Comment by weitendorf a day ago

For most people, before it makes sense to buy all the hardware yourself, you should probably be renting GPUs by the hour from the various providers serving that need. On Modal, I think it should cost about $72/hr to serve Kimi K2: https://modal.com/pricing

Once that's running, it can serve many users/clients simultaneously. It'd be too expensive and underutilized for almost any individual to use regularly, but it's not unreasonable to spin it up in short intervals just to play around with it. And it might actually be reasonable for a small group of students or coworkers to share a ~$70/hr deployment for ~40 hr/week in a lot of cases; in other cases, that expense could be spread across a large number of coworkers or product users who each use it somewhat infrequently.
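To make the cost-sharing concrete, here's the back-of-the-envelope math (a minimal sketch; the $72/hr rate comes from the Modal figure above, while the 40 hr/week schedule and team sizes are illustrative assumptions):

    # Per-user cost of sharing one rented deployment.
    # Rate is the ~$72/hr Modal figure cited above; the schedule and
    # team sizes below are illustrative assumptions, not measurements.

    HOURLY_RATE = 72.0     # $/hr for the deployment
    HOURS_PER_WEEK = 40    # shared "business hours" schedule

    weekly_cost = HOURLY_RATE * HOURS_PER_WEEK  # $2,880/week

    for team_size in (5, 10, 20, 50):
        per_user = weekly_cost / team_size
        print(f"{team_size:>3} users -> ${per_user:,.2f} per user per week")

So a team of 50 gets it down to roughly $58 per user per week, while a group of 5 is paying closer to $576 each; the economics really do depend on how many people share the deployment and how many hours it stays up.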

So maybe you won't host it at home, but self-hosting is actually quite feasible. And is it ever really worth physically hosting anything at home, except as a hobby?

apitman 11 hours ago

How does multi-user serving work, and how many users could it handle concurrently? My only experience is running much smaller models, which easily peg my GPU at ~90 tokens/s. So maybe I could run 5-10 users at <10 t/s each? Does software like llama.cpp and ollama handle this?
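(Making the arithmetic in that question explicit, as a sketch: the 90 tok/s single-stream figure is from the comment, and the user counts are assumptions. Note this naive even split is a pessimistic lower bound; servers built around continuous batching, e.g. vLLM, typically do better, since single-stream decoding is memory-bandwidth-bound and aggregate throughput grows with batch size.)

    # Naive "even split" throughput estimate from the question above.
    # SINGLE_STREAM_TPS is the observed single-user rate cited in the
    # comment; the user counts are assumptions for illustration.

    SINGLE_STREAM_TPS = 90.0  # tokens/s with one user

    for users in (1, 5, 10):
        naive_per_user = SINGLE_STREAM_TPS / users
        print(f"{users:>2} concurrent users -> ~{naive_per_user:.0f} tok/s each (naive split)")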