zeroxfe 2 days ago

I've been using this model (as a coding agent) for the past few days, and it's the first time I've felt that an open source model really competes with the big labs. So far it's been able to handle most things I've thrown at it. I'm almost hesitant to say that this is as good as Opus.

rubslopes a day ago

Also my experience. I've been going back and forth between Opus and Kimi for the last few days, and, at least for my CRUD webapps, I would say they are both on the same level.

armcat 2 days ago

Out of curiosity, what kind of specs do you have (GPU / RAM)? I saw the requirements and it's beyond my budget, so I am "stuck" with smaller Qwen coders.

  • zeroxfe 2 days ago

    I'm not running it locally (it's gigantic!); I'm using the API at https://platform.moonshot.ai

    • BeetleB 2 days ago

      Just curious - how does it compare to GLM 4.7? Ever since they gave the $28/year deal, I've been using it for personal projects and am very happy with it (via opencode).

      https://z.ai/subscribe

      • InsideOutSanta 2 days ago

        There's no comparison. GLM 4.7 is fine and reasonably competent at writing code, but K2.5 is right up there with something like Sonnet 4.5. It's the first time I can use an open-source model and not immediately tell the difference between it and top-end models from Anthropic and OpenAI.

        • [removed] a day ago
          [deleted]
      • Alifatisk a day ago

        Kimi K2.5 is a beast: it speaks in a very human-like way (K2 was also good at this) and completes whatever I throw at it. However, the GLM quarterly coding plan is too good of a deal. The Christmas deal ends today, so I'd still suggest sticking with it. There will always be a better model coming.

      • zeroxfe 2 days ago

        It's waaay better than GLM 4.7 (which was the open model I was using earlier)! Kimi was able to quickly and smoothly finish some very complex tasks that GLM completely choked on.

      • segmondy 2 days ago

        The old Kimi K2 is better than GLM 4.7.

      • cmrdporcupine 2 days ago

        From what people say, it's better than GLM 4.7 (and I guess DeepSeek 3.2)

        But it's also like... 10x the price per output token on any of the providers I've looked at.

        I don't feel it's 10x the value. It's still much cheaper than paying by the token for Sonnet or Opus, but if you already have a subscription plan from the Big 3 (OpenAI, Anthropic, Google), that plan is much better value for the money.

        It comes down to ethical or openness reasons for using it, I guess.

        • esafak 2 days ago

          Exactly. For the price it has to beat Claude and GPT, unless you have budget for both. I just let GLM solve whatever it can and reserve my Claude budget for the rest.

      • akudha 2 days ago

        Is the Lite plan enough for your projects?

        • BeetleB 2 days ago

          Very much so. I'm using it for small personal stuff on my home PC. Nothing grand. Not having to worry about token usage has been great (previously was paying per API use).

          I haven't stress tested it with anything large. Both at work and home, I don't give much free rein to the AI (e.g. I examine and approve all code changes).

          The Lite plan doesn't have vision, so you cannot copy/paste an image there. But I can always switch models when I need to.

    • HarHarVeryFunny 21 hours ago

      It is possible to run locally, though ... I saw a video of someone running one of the heavily quantized versions on a Mac Studio, and it performed pretty well in terms of speed.

      I'm guessing a 256GB Mac Studio, costing $5-6K, but that wouldn't be an outrageous amount to spend for a professional tool if the model capability justified it.

      • tucnak 21 hours ago

        > It is possible to run locally though

        > running one of the heavily quantized versions

        There is a night-and-day difference in generation quality between even something like 8-bit and "heavily quantized" versions. Why not quantize to 1-bit anyway? Would that qualify as "running the model"? Food for thought. Don't get me wrong: there's plenty of stuff you can actually run on a 96GB Mac Studio (let alone on 128/256GB ones), but 1T-class models are not in that category, unfortunately. Unless you put four of them in a rack or something.

    • rc1 2 days ago

      How long until this can be run on consumer-grade hardware and a domestic electricity supply, I wonder.

      Anyone have a projection?

      • johndough 2 days ago

        You can run it on consumer-grade hardware right now, but it will be rather slow. NVMe SSDs these days have a read speed of 7 GB/s (EDIT: or even faster than that! Thank you @hedgehog for the update), so that gives you roughly one token every three seconds while crunching through the 32 billion active parameters, which are natively quantized to 4 bits each. If you want to run it faster, you have to spend more money.
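
        As a rough sketch of where that "one token every few seconds" figure comes from (it ignores caching, KV-cache reads, and any part of the model already held in RAM):

            # Back-of-the-envelope: per-token weight reads when streaming from NVMe.
            active_params = 32e9       # MoE active parameters per token
            bits_per_param = 4         # natively quantized to 4 bits
            ssd_read_gb_per_s = 7      # typical fast NVMe sequential read

            gb_read_per_token = active_params * bits_per_param / 8 / 1e9   # ~16 GB
            seconds_per_token = gb_read_per_token / ssd_read_gb_per_s      # ~2.3 s
            print(f"~{seconds_per_token:.1f} s per token (roughly 3 s with overhead)")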

        Some people in the localllama subreddit have built systems which run large models at more decent speeds: https://www.reddit.com/r/LocalLLaMA/

      • segmondy 2 days ago

        You can run it on a Mac Studio with 512GB RAM; that's the easiest way. I run it at home on a multi-GPU rig with partial offload to RAM.

        • johndough 2 days ago

          I was wondering whether multiple GPUs make it go appreciably faster when limited by VRAM. Do you have some tokens/sec numbers for text generation?

      • heliumtera 2 days ago

        You need 600GB of VRAM + RAM (+ disk) to fit the full model, or 240GB for the ~1-bit quantized version. Of course this will be slow.

        Through the Moonshot API it is pretty fast (much, much faster than Gemini 3 Pro and Claude Sonnet, probably faster than Gemini Flash), though. To get a similar experience locally, they say you need at least 4x H200.

        If you don't mind running it super slow, you still need around 600GB of VRAM plus fast RAM.

        It's already possible to run 4x H200 in a domestic environment (it would be near-instantaneous for most tasks, unbelievable speed). It's just very, very expensive and probably challenging for most users, though manageable for the average Hacker News crowd.

        High-end GPUs are expensive AND hard to source. If you manage to source them at the old prices, it's around $200K to get maximum speed, I guess; alternatively, you could probably get it running decently on a bunch of high-end machines for, let's say, $40K (slow).

    • jgalt212 a day ago

      What's the point of using an open source model if you're not self-hosting?

      • oefrha 20 hours ago

        Open source models can be hosted by any provider; in particular, plenty of educational institutions host open-source models. You get to choose whichever provider you trust. For instance, I used DeepSeek R1 a fair bit last year, but never on deepseek.com or through its API.

      • dimava a day ago

        Open source model costs are determined only by electricity usage, as anyone can rent a GPU and host them. Closed source models cost 10x more just because they can. A simple example is Claude Opus, which costs ~1/10 as much (if not less) via Claude Code, which doesn't have that price multiplier.

      • elbear a day ago

        * It's cheaper than proprietary models

        * Maybe you don't want to have your conversations used for training. The providers listed on OpenRouter mention whether they do that or not.

  • Carrok 2 days ago

    Not OP, but OpenCode and DeepInfra seem like an easy way.

  • kristianp 10 hours ago

    Note that the Kimi K2.x models are natively 4-bit int, which reduces the memory requirements somewhat.
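
    Rough weight-size math, assuming a ~1T total parameter count (the "1T-class" figure mentioned elsewhere in this thread); the published ~630GB checkpoint is a bit larger than the naive 4-bit number, presumably because some tensors are kept in higher precision:

        # Approximate storage for the weights of a ~1T-parameter model.
        total_params = 1e12                       # assumption: ~1T total parameters
        size_gb = lambda bits: total_params * bits / 8 / 1e9

        print(f"bf16 (16-bit): ~{size_gb(16):.0f} GB")   # ~2000 GB
        print(f"int4  (4-bit): ~{size_gb(4):.0f} GB")    # ~500 GB vs ~630 GB published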

  • observationist a day ago

    API costs for these big models on private hosts tend to be a lot less than API calls to the big four American platforms. You definitely get more bang for your buck.

  • tgrowazay 2 days ago

    Just pick up any >240GB VRAM GPU off your local BestBuy to run a quantized version.

    > The full Kimi K2.5 model is 630GB and typically requires at least 4× H200 GPUs.

    • CamperBob2 2 days ago

      You could run the full, unquantized model at high speed with 8 RTX 6000 Blackwell boards.

      I don't see a way to put together a decent system of that scale for less than $100K, given RAM and SSD prices. A system with 4x H200s would cost more like $200K.
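
      As a quick capacity sanity check (assuming the 96GB-per-card figure for the RTX 6000 Blackwell; the leftover VRAM is what you'd have for KV cache and activations):

          # Does the 630GB checkpoint fit across 8 RTX 6000 Blackwell boards?
          cards = 8
          vram_per_card_gb = 96      # assumption: 96GB per RTX 6000 Blackwell
          model_gb = 630             # full native 4-bit Kimi K2.5 checkpoint

          total_vram_gb = cards * vram_per_card_gb          # 768 GB
          print(f"{total_vram_gb} GB total VRAM, "
                f"{total_vram_gb - model_gb} GB left for KV cache and activations")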

      • ttul a day ago

        That would be quite the space heater, too!

timwheeler a day ago

Did you use Kimi Code or some other harness? I used it with OpenCode and it was bumbling through some tasks that Claude handles with ease.

  • zedutchgandalf a day ago

    Are you on the latest version? They pushed an update yesterday that greatly improved Kimi K2.5’s performance. It’s also free for a week in OpenCode, sponsored by their inference provider.

    • ekabod a day ago

      But it may be a quantized model for the free version.

thesurlydev 2 days ago

Can you share how you're running it?

  • eknkc 2 days ago

    I've been using it with OpenCode. You can use either your Kimi Code subscription (flat fee), a moonshot.ai API key (per token), or OpenRouter to access it. OpenCode works beautifully with the model.

    Edit: as a side note, I only installed OpenCode to try this model and I gotta say it is pretty good. I didn't think it'd be as good as Claude Code, but it's just fine. Been using it with Codex too.
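
    For anyone curious what the per-token option looks like under the hood: Moonshot exposes an OpenAI-compatible chat completions endpoint, so a minimal sketch is just the standard OpenAI client pointed at their base URL. (The model name below is a placeholder; check the model list on platform.moonshot.ai for the exact K2.5 identifier.)

        # Minimal sketch: calling Kimi through Moonshot's OpenAI-compatible API.
        import os
        from openai import OpenAI

        client = OpenAI(
            api_key=os.environ["MOONSHOT_API_KEY"],   # per-token billing key
            base_url="https://api.moonshot.ai/v1",    # OpenAI-compatible endpoint
        )

        resp = client.chat.completions.create(
            model="kimi-k2.5",  # placeholder name; verify against the platform's model list
            messages=[{"role": "user", "content": "Summarize this diff: ..."}],
        )
        print(resp.choices[0].message.content)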

    • Imustaskforhelp 2 days ago

      I tried to use OpenCode for Kimi K2.5 too, but recently they changed their pricing from 200 tool requests per 5 hours to token-based pricing.

      I can only speak to the tool-request-based pricing, but anecdotally OpenCode took something like 10 requests in 3-4 minutes where the Kimi CLI took 2-3.

      So I personally like and stick with the Kimi CLI for Kimi coding. I haven't tested OpenCode again with the new token-based pricing, but I do think OpenCode might burn through more tokens.

      Kimi Cli's pretty good too imo. You should check it out!

      https://github.com/MoonshotAI/kimi-cli

      • nl a day ago

        I like Kimi-cli but it does leak memory.

        I was using it for multi-hour tasks scripted via a self-written orchestrator on a small VM, and I ended up switching away from it because it would run slower and slower over time.

  • zeroxfe 2 days ago

    Running it via https://platform.moonshot.ai -- using OpenCode. They have super cheap monthly plans at kimi.com too, but I'm not using them because I already have Codex and Claude monthly plans.

  • JumpCrisscross a day ago

    > Can you share how you're running it?

    Not OP, but I've been running it through Kagi [1]. Their AI offering is probably the best-kept secret in the market.

    [1] https://help.kagi.com/kagi/ai/assistant.html

    • deaux a day ago

      Doesn't list Kimi 2.5 and seems to be chat-only, not API, correct?

      • lejalv 21 hours ago

        > Doesn't list Kimi 2.5 and seems to be chat-only, not API, correct?

        Yes, it is chat only, but that list is out of date - Kimi 2.5 (with or without reasoning) is available, as are ChatGPT 5.2, Gemini 3 Pro (Preview), etc.

  • explorigin 2 days ago
    • KolmogorovComp 2 days ago

      To save everyone a click

      > The 1.8-bit (UD-TQ1_0) quant will run on a single 24GB GPU if you offload all MoE layers to system RAM (or a fast SSD). With ~256GB RAM, expect ~10 tokens/s. The full Kimi K2.5 model is 630GB and typically requires at least 4× H200 GPUs. If the model fits, you will get >40 tokens/s when using a B200. To run the model in near full precision, you can use the 4-bit or 5-bit quants, or go higher just to be safe. For strong performance, aim for >240GB of unified memory (or combined RAM+VRAM) to reach 10+ tokens/s. If you're below that, it'll still work but speed will drop (llama.cpp can still run via mmap/disk offload), possibly from ~10 tokens/s to <2 tokens/s. We recommend UD-Q2_K_XL (375GB) as a good size/quality balance. Best rule of thumb: RAM+VRAM ≈ the quant size; otherwise it'll still work, just slower due to offloading.
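
      A rough codification of that rule of thumb (the UD-Q2_K_XL and full-model sizes are from the quote; the ~245GB figure for the 1.8-bit quant is an assumption inferred from the ">240GB for 10+ tokens/s" guidance, and the speed bands are just the quote's two regimes, not measurements):

          # Sketch: which quants fit in RAM+VRAM, per the guidance quoted above.
          QUANT_SIZES_GB = {
              "UD-TQ1_0 (1.8-bit)": 245,       # assumption: rough size, not stated above
              "UD-Q2_K_XL (2-bit)": 375,       # from the quote
              "full native 4-bit":  630,       # from the quote
          }

          def quant_outlook(ram_gb: float, vram_gb: float) -> None:
              budget = ram_gb + vram_gb
              for name, size_gb in QUANT_SIZES_GB.items():
                  if size_gb <= budget:
                      regime = "fits -> ~10+ tokens/s ballpark"
                  else:
                      regime = "mmap/disk offload -> can drop below 2 tokens/s"
                  print(f"{name:22s} {size_gb:4d} GB  {regime}")

          quant_outlook(ram_gb=256, vram_gb=24)   # the single-24GB-GPU case above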

      • Gracana 2 days ago

        I'm running the Q4_K_M quant on a Xeon with 7x A4000s and I'm getting about 8 tok/s with a small context (16k). I need to do more tuning, and I think I can get more out of it, but it's never gonna be fast on this suboptimal machine.

  • indigodaddy a day ago

    Been using K2.5 Thinking via Nano-GPT subscription and `nanocode run` and it's working quite nicely. No issues with Tool Calling so far.

  • gigatexal 2 days ago

    Yeah, I too am curious. Claude Code is so good, and the ecosystem so "it just works," that I'm willing to pay them.

    • epolanski 2 days ago

      You can plug another model in place of Anthropic ones in Claude Code.

      • zeroxfe 2 days ago

        That tends to work quite poorly because Claude Code does not use standard completions APIs. I tried it with Kimi, using litellm[proxy], and it failed in too many places.

      • miroljub 2 days ago

        If you don't use Anthropic models, there's no reason to use Claude Code at all. OpenCode gives you so much more choice.

    • Imustaskforhelp 2 days ago

      I tried Kimi K2.5 and at first I didn't really like it. I was critical of it, but then I started liking it. The model has also kind of replaced how I use ChatGPT, and I really love Kimi K2.5 the most right now (although the Gemini models come close too).

      To be honest, I do feel like Kimi K2.5 is the best open source model. It's not the best model overall right now, but it's really price-performant and could be a good fit for many use cases.

      It might not be completely SOTA as some people say, but it comes pretty close, and it's open source. I trust the open source part because other providers can also run it, among a lot of other benefits (also considering that, IIRC, ChatGPT recently cut some old models).

      I really appreciate Kimi for still open-sourcing their full SOTA model and then releasing research papers on top of it, unlike Qwen, which has kept its full SOTA closed source.

      Thank you Kimi!