Comment by zeroxfe

Comment by zeroxfe 2 days ago

I'm not running it locally (it's gigantic!) I'm using the API at https://platform.moonshot.ai

BeetleB 2 days ago

Just curious - how does it compare to GLM 4.7? Ever since they gave the $28/year deal, I've been using it for personal projects and am very happy with it (via opencode).

https://z.ai/subscribe

Reply View 9 replies

InsideOutSanta 2 days ago

There's no comparison. GLM 4.7 is fine and reasonably competent at writing code, but K2.5 is right up there with something like Sonnet 4.5. it's the first time I can use an open-source model and not immediately tell the difference between it and top-end models from Anthropic and OpenAI.

Reply View | 1 reply
- [removed] a day ago
  
  [deleted]
  
  Reply View | 0 replies
Alifatisk a day ago

Kimi k2.5 is a beast, speaks very human like (k2 was also good at this) and completes whatever I throw at it. However, the glm quarterly coding plan is too good of a deal. The Christmas deal ends today, so I’d still suggest to stick to it. There will always come a better model.

Reply View | 0 replies
zeroxfe 2 days ago

It's waaay better than GLM 4.7 (which was the open model I was using earlier)! Kimi was able to quickly and smoothly finish some very complex tasks that GLM completely choked at.

Reply View | 0 replies
segmondy 2 days ago

The old Kimi K2 is better than GLM4.7

Reply View | 0 replies
cmrdporcupine 2 days ago

From what people say, it's better than GLM 4.7 (and I guess DeepSeek 3.2)
But it's also like... 10x the price per output token on any of the providers I've looked at.
I don't feel it's 10x the value. It's still much cheaper than paying by the token for Sonnet or Opus, but if you have a subscribed plan from the Big 3 (OpenAI, Anthropic, Google) it's much better value for $$.
Comes down to ethical or openness reasons to use it I guess.

Reply View | 1 reply
- esafak 2 days ago
  
  Exactly. For the price it has to beat Claude and GPT, unless you have budget for both. I just let GLM solve whatever it can and reserve my Claude budget for the rest.
  
  Reply View | 0 replies
akudha 2 days ago

Is the Lite plan enough for your projects?

Reply View | 1 reply
- BeetleB 2 days ago
  
  Very much so. I'm using it for small personal stuff on my home PC. Nothing grand. Not having to worry about token usage has been great (previously was paying per API use).
  I haven't stress tested it with anything large. Both at work and home, I don't give much free rein to the AI (e.g. I examine and approve all code changes).
  Lite plan doesn't have vision, so you cannot copy/paste an image there. But I can always switch models when I need to.
  
  Reply View | 0 replies

HarHarVeryFunny 21 hours ago

It is possible to run locally though ... I saw a video of someone running one of the heavily quantized versions on a Mac Studio, and performing pretty well in terms of speed.

I'm guessing a 256GB Mac Studio, costing $5-6K, but that wouldn't be an outrageous amount to spend for a professional tool if the model capability justified it.

Reply View 1 reply

tucnak 21 hours ago

> It is possible to run locally though
> running one of the heavily quantized versions
There is night and day difference in generation quality between even something like 8-bit and "heavily quantized" versions. Why not quantize to 1-bit anyway? Would that qualify as "running the model?" Food for thought. Don't get me wrong: there's plenty of stuff you can actually run on 96 GB Mac studio (let alone on 128/256 GB ones) but 1T-class models are not in that category, unfortunately. Unless you put four of them in a rack or something.

Reply View | 0 replies

rc1 2 days ago

How long until this can be run on consumer grade hardware or a domestic electricity supply I wonder.

Anyone have a projection?

Reply View 7 replies

johndough 2 days ago

You can run it on consumer grade hardware right now, but it will be rather slow. NVMe SSDs these days have a read speed of 7 GB/s (EDIT: or even faster than that! Thank you @hedgehog for the update), so it will give you one token roughly every three seconds while crunching through the 32 billion active parameters, which are natively quantized to 4 bit each. If you want to run it faster, you have to spend more money.
Some people in the localllama subreddit have built systems which run large models at more decent speeds: https://www.reddit.com/r/LocalLLaMA/

Reply View | 3 replies
- hedgehog 2 days ago
  
  High end consumer SSDs can do closer to 15 GB/s, though only with PCI-e gen 5. On a motherboard with two m.2 slots that's potentially around 30GB/s from disk. Edit: How fast everything is depends on how much data needs to get loaded from disk which is not always everything on MoE models.
  
  Reply View | 2 replies
  
  greenavocado a day ago
  
  Would RAID zero help here?
  
  Reply View | 1 reply
  
  hedgehog a day ago
  
  Yes, RAID 0 or 1 could both work in this case to combine the disks. You would want to check the bus topology for the specific motherboard to make sure the slots aren't on the other side of a hub or something like that.
  
  Reply View | 0 replies
segmondy 2 days ago

You can run it on a mac studio with 512gb ram, that's the easiest way. I run it at home on a multi rig GPU with partial offload to ram.

Reply View | 1 reply
- johndough 2 days ago
  
  I was wondering whether multiple GPUs make it go appreciably faster when limited by VRAM. Do you have some tokens/sec numbers for text generation?
  
  Reply View | 0 replies
heliumtera 2 days ago

You need 600gb of VRAM + MEMORY (+ DISK) to fit the model (full) or 240 for the 1b quantized model. Of course this will be slow.
Through moonshot api it is pretty fast (much much much faster than Gemini 3 pro and Claude sonnet, probably faster than Gemini flash), though. To get similar experience they say at least 4xH200.
If you don't mind running it super slow, you still need around 600gb of VRAM + fast RAM.
It's already possible to run 4xH200 in a domestic environment (it would be instantaneous for most tasks, unbelievable speed). It's just very very expensive and probably challenging for most users, manageable/easy for the average hacker news crowd.
Expensive AND hard to source high end GPUs, if you manage to source for the old prices around 200 thousand dollars to get maximum speed I guess, you could probably run decently on a bunch of high end machines, for let's say, 40k (slow).

Reply View | 0 replies

jgalt212 a day ago

What's the point of using an open source model if you're not self-hosting?

Reply View 5 replies

oefrha 20 hours ago

Open source models can be hosted by provider, in particular plenty of educational institutions host open source models. You get to choose whatever provider you trust. For instance I used DeepSeek R1 a fair bit last year but never on deepseek.com or through its API.

Reply View | 0 replies
dimava a day ago

Open source models costs are determined only by electricity usage, as anyone can rent a GPU qnd host them Closed source models cost x10 more just because they can A simple example is Claude Opus, which costs ~1/10 if not less in Claude Code that doesn't have that price multiplier

Reply View | 2 replies
- jgalt212 21 hours ago
  
  But Kimi seems so big that renting the necessary number of GPUs is a non trivial exercise.
  
  Reply View | 1 reply
  
  pstuart 17 hours ago
  
  Exactly! Electricity, hosting, and amortized cost of the GPUs would be the baseline costs.
  
  Reply View | 0 replies
elbear a day ago

* It's cheaper than proprietary models
* Maybe you don't want to have your conversations used for training. The providers listed on OpenRouter mention whether they do that or not.

Reply View | 0 replies