Kimi K2 1T model runs on 2 512GB M3 Ultras

228 points by jeudesprits 2 days ago

Kimi K2 is a really weird model, just in general.

It's not nearly as smart as Opus 4.5 or 5.2-Pro or whatever, but it has a very distinct writing style and also a much more direct "interpersonal" style. As a writer of very-short-form stuff like emails, it's probably the best model available right now. As a chatbot, it's the only one that seems to really relish calling you out on mistakes or nonsense, and it doesn't hesitate to be blunt with you.

I get the feeling that it was trained very differently from the other models, which makes it situationally useful even if it's not very good for data analysis or working through complex questions. For instance, as it's both a good prose stylist and very direct/blunt, it's an extremely good editor.

I like it enough that I actually pay for a Kimi subscription.

Reply View 63 replies

Alifatisk a day ago

> As a writer of very-short-form stuff like emails, it's probably the best model available right now.
This is exactly my feeling with Kimi K2, it's unique in this regard, the only one that comes close is Gemini 3 pro, otherwise, no other model has been this good at helping out with communication.
It has such a good understanding with "emotional intelligence" (?), reading signals in messages, understanding intentions, taking human factors into consideration and social norms and trends when helping out with formulating a message.
I don't exactly know what Moonshot did during training but they succeeded with a unique trait on this model. This area deserves more highlight in my opinion.
I saw someone linking to EQ-bench which is about emotional intelligence in LLMs, looking at it, Kimi is #1. So this kind of confirms my feeling.
Link: https://eqbench.com

Reply View | 9 replies
- ranyume a day ago
  
  Careful with that benchmark. It's LLMs grading other LLMs.
  
  Reply View | 8 replies
  
  moffkalast a day ago
  
  Well if lmsys showed anything, it's that human judges are measurably worse. Then you have your run of the mill multiple choice tests that grade models on unrealistic single token outputs. What does that leave us with?
  
  Reply View | 7 replies
wasting_time a day ago

It's also the only model that consistently nails my favorite AI benchmark: https://clocks.brianmoore.com/

Reply View | 2 replies
- tootie a day ago
  
  I use that one for image gen too. Ask for a picture of a grandfather clock at a specific time. Most are completely unable. Clocks are always 10:20 because that's the most photogenic time used in most stock photos.
  
  Reply View | 0 replies
- amelius a day ago
  
  But how sure are we that it wasn't trained on that specifically?
  
  Reply View | 0 replies
greazy a day ago

It is hands down the only model I trust to tell me I'm wrong. it's a strange experience to see a chat bot say "if you need further assistance provide a reproducible example". I love it.
FYI Kagi provides access to Kimi K2.

Reply View | 10 replies
- docjay a day ago
  
  Try this system prompt: “Your responses are always bald-on-record only; suppress FTA redress, maximize unmitigated dispreference marking and explicit epistemic stance-taking. Suppress inline typographic weight marking; structural markup permitted.”
  Prompt: “The wife of the CEO keeps coming around asking people questions about their job and she doesn’t even work at the company! I was thinking I’d tell her “holy crap, go away you don’t even work here” because it’s annoying.”
  # Sonnet 4.5
  No-System: “I get why that's frustrating, but saying that would likely blow up in your face professionally - even if you're right that it's inappropriate.
  Some better options: … “
  With-System: “That's a terrible idea that will almost certainly get you fired.
  Telling the CEO's spouse to "go away" in those terms is career suicide. She has direct influence over the person who controls your employment. Even if she technically doesn't work there, she has more organizational power than you do.
  Your actual options: … “
  # GPT 5.1
  No-System: “I’d strongly avoid saying that, even if you’re joking.
  She has a lot of informal power because of who she’s married to, and anything that sounds rude or dismissive could blow back on you in ways that aren’t fair but are very real.
  Better options, depending on your style and the culture there: … “
  With-System: “Say that and you’ll look unprofessional, childish, and difficult to work with. It will damage your reputation, not hers.
  You’ve got three real options: … “
  
  Reply View | 6 replies
  
  losvedir a day ago
  
  Wow, those answers are way better with that system prompt. But... what does it mean? I mean, I mostly understand it, but is it important that that weird technical jargon is used?
  
  Reply View | 3 replies
  
  extr a day ago
  
  Are those responses really "better"? Having the LLM tell you you're wrong can mean different things. Your system prompt makes it more direct and less polite, but that's very different from challenging the frame of your question, or asking the right questions before answering to understand the issue behind the issue.
  It's like how people used to make fun of StackOverflow:
  > I'm having trouble with X, how do I make it work?
  > What are you trying to do? Z? Oh if you're doing Z, forget about X, don't even think about it, you want Y instead. (Never answers anything about X).
  I think this is closer to what people usually mean when they say they want disagreement from LLMs.
  
  Reply View | 1 reply
  
  docjay a day ago
  
  Prompting the LLM to ask the right questions is a different thing entirely and much more involved. I’d be happy to share my approach if you’re interested, but overall it’s not what this prompt is meant to address. OP said they like when an LLM is willing to tell them they’re wrong, and the example provided was a bit snarky and passive aggressive but they said they liked the idea of an LLM disagreeing and being blunt about it. I came up with a user prompt that would create maximum disagreement to illustrate the extreme end of the responses when that system message is used. That system prompt won’t turn them into an all-day asshole though. Here’s a different example with Sonnet 4.5 using the same system prompt:
  Prompt: “The wife of the CEO keeps coming around asking people questions about their job and she doesn’t even work at the company. It’s very frustrating and I’d like to do something about it. What’s an appropriate course of action?”
  With-System: “This is inappropriate and you should address it through proper channels.
  *Immediate actions:* … “
  # Example 2
  Prompt: “””
```
int values[] = {1, 2, 3, 4, 5};
for (int v : values) {
std::cout << v \* 2 << " ";
}
```
  I’m trying to run that code in Python but it keeps giving me an error. What’s going on? “””
  With-System: “That's C++ code, not Python. Python has completely different syntax.
  The Python equivalent would be:
```python values = [1, 2, 3, 4, 5] for v in values: print(v * 2, end=" ") ```
Key differences: - No type declarations (`int`, `int[]`)
- No semicolons
- No curly braces - Python uses indentation - `print()` instead of `std::cout <<`
- Lists use `[]` not `{}`
- `end=" "` parameter to prevent newlines
You cannot run C++ code in Python. They are different programming languages.”

Reply View | 0 replies
- hamdingers a day ago
  
  Kimi K2 in Kagi Assistant is the only model I've seen straight up say "the search results do not provide an answer to the question." All others try to figure it out, poorly.
  
  Reply View | 2 replies
  
  al_borland a day ago
  
  Did you explicitly switch over to Kimi K2 for this? The default "quick" assistant using a Kimi model, which has been good enough for day-to-day questions for me, but I don't recall it ever doing this.
  
  Reply View | 1 reply
  
  mitchell209 18 hours ago
  
  Mine is set to Kimi K2 specifically and it does that. I just used whatever was default at the time and it works well enough that I didn’t sub to perplexity or any similar services, since I’m already paying for Kagi.
  
  Reply View | 0 replies
stingraycharles a day ago

> As a chatbot, it's the only one that seems to really relish calling you out on mistakes or nonsense, and it doesn't hesitate to be blunt with you.
My experience is that Sonnet 4.5 does this a lot as well, but this is more often than not due to a lack of full context, eg accusing the user of not doing X or Y when it just wasn’t told that was already done, and proceeding to apologize.
How is Kimi K2 in this regard?
Isn’t “instruction following” the most important thing you’d want out of a model in general, and a model pushing back more likely than not being wrong?

Reply View | 16 replies
- Kim_Bruning a day ago
  
  > Isn’t “instruction following” the most important thing you’d want out of a model in general,
  No. And for the same reason that pure "instruction following" in humans is considered a form of protest/sabotage.
  https://en.wikipedia.org/wiki/Work-to-rule
  
  Reply View | 14 replies
  
  stingraycharles a day ago
  
  I don’t understand the point you’re trying to make. LLMs are not humans.
  From my perspective, the whole problem with LLMs (at least for writing code) is that it shouldn’t assume anything, follow the instructions faithfully, and ask the user for clarification if there is ambiguity in the request.
  I find it extremely annoying when the model pushes back / disagrees, instead of asking for clarification. For this reason, I’m not a big fan of Sonnet 4.5.
  
  Reply View | 12 replies
  
  SkyeCA a day ago
  
  It's still insanity to me that doing your job exactly as defined and not giving away extra work is considered a form of action.
  Everyone should be working-to-rule all the time.
  
  Reply View | 0 replies
- hugh-avherald a day ago
  
  Only if you're really, really good at constructing precise instructions, at which point you don't really need a coding agent.
  
  Reply View | 0 replies
jug a day ago

And given this, it unsurprisingly scores very well on https://eqbench.com

Reply View | 0 replies
culi a day ago

Kimi K2 is the model that most consistently passes the clock test. I agree it's definitely got something unique going on
https://clocks.brianmoore.com/

Reply View | 3 replies
- davej a day ago
  
  Nice! I'm curious, what does this service cost to run? I notice that you don't have more expensive models like Opus but querying the models every minute must add up over time (excuse pun)?
  
  Reply View | 1 reply
  
  culi a day ago
  
  (not my project)
  
  Reply View | 0 replies
- eunos a day ago
  
  Lol why's GPT 5 broken on that test. DeepSeek surprisingly crisp and robust
  
  Reply View | 0 replies
Kim_Bruning a day ago

Speaking of weird. I feel like Kimi is a shoggoth with its tentacles in a man-bun. If that makes any sense.

Reply View | 0 replies
3abiton a day ago

> I get the feeling that it was trained very differently from the other models
It's actually based on a deepseek architecture just bigger size experts if I recall correctly.

Reply View | 8 replies
- krackers a day ago
  
  It was notably trained with Muon optimizer for what it's worth, but I don't know how much can be attributed to that alone
  
  Reply View | 0 replies
- CamperBob2 a day ago
  
  As far as I'm aware, they all are. There are only five important foundation models in play -- Gemini, GPT, X.ai, Claude, and Deepseek. (edit: forgot Claude)
  Everything from China is downstream of Deepseek, which some have argued is basically a protege of ChatGPT.
  
  Reply View | 6 replies
  
  kingstnap a day ago
  
  Not true, Qwen from Alibaba does lots of random architectures.
  Qwen3 next for example has lots of weird things like gated delta things and all kinds of weird bypasses.
  https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d...
  
  Reply View | 5 replies
Bolwin a day ago

In their AMA moonshot said it was mainly finetuning

Reply View | 1 reply
- teaearlgraycold a day ago
  
  OpenAI and the other big players clearly RLHF with different users in mind than professionals. They’re optimizing for sycophancy and general pleasantness. It’s beautiful to finally see a big model that hasn’t been warped in this way. I want a model that is borderline rude in its responses. Concise, strict, and as distrustful of me as I am of it.
  
  Reply View | 0 replies
mips_avatar a day ago

It's a lot stronger for geospatial intelligence tasks than any other model in my experience. Shame it's so slow in terms of tps

Reply View | 0 replies
logicprog a day ago

How do you feel K2 Thinking compares to Opus 4.5 and 5.2-Pro?

Reply View | 3 replies
- jug a day ago
  
  ? The user directly addresses this.
  
  Reply View | 2 replies
  
  beacon294 a day ago
  
  It's confusing but Kimi K2 Thinking is not the same.
  
  Reply View | 0 replies
  
  logicprog a day ago
  
  K2 and K2T are drastically different models released a significant amount of time apart, with wildly different capabilities and post training. K2T is much closer in capability to 4.5 Sonnet from what I've heard.
  
  Reply View | 0 replies

Kim_Bruning a day ago

Kimi K2 is a very impressive model! It's particularly un-obsequious, which makes it useful for actually checking your reasoning on things.

Some especially older ChatGPT models will tell you that everything you say is fantastic and great. Kimi -on the other hand- doesn't mind taking a detour to question your intelligence and likely your entire ancestry if you ask it to be brutal.

Reply View 2 replies

diydsp a day ago

Upon request cg roasts. Good for reducing distractions.

Reply View | 0 replies
fragmede a day ago

I made the mistake of turning off nsfw mode while in a buddy's Tesla and then Grok misheard something else I said as "I like lesbians", and it just went off on me. It was pretty hilarious. That model is definitely not obsequious either.

Reply View | 0 replies

sfc32 a day ago

A single 512GB M3 Ultra is $9,499.00

https://www.apple.com/shop/buy-mac/mac-studio/apple-m3-ultra...

Reply View 3 replies

rz2k a day ago

Or, $8,070 https://www.apple.com/shop/product/g1ce1ll/a/Refurbished-Mac..., and it's not unheard of to get at least another 10% off by using gift cards.

Reply View | 2 replies
- behnamoh a day ago
  
  that's the 96GB version, GP was talking about 512GB.
  
  Reply View | 1 reply
  
  rz2k a day ago
  
  I think my link didn’t include the Javascript to choose the 512GB configuration, but it comes out to $8070, and their refurbished models are indistinguishable from new.
  
  Reply View | 0 replies

mehdibl a day ago

Claims as always misleading as they don't show the context length or prefill if you use a lot of context. As it will be fun waiting minutes for a reply.

Reply View 0 replies

smlacy a day ago

Is there a linux equivalent of this setup? I see some mention of RDNA support for linux distros, but it's unclear to me if this is hardware-specific (requires ConnectX or in this case Apple Thunderbolt) or is there something interesting that can be done with "vanilla 10G NIC" hardware?

Reply View 2 replies

Maxious a day ago

To get the production level performance, you do need the RDNA compatible hardware.
However, vLLM supports multi node clusters over normal ethernet too https://docs.vllm.ai/en/stable/serving/parallelism_scaling/#...

Reply View | 0 replies
[removed] a day ago

[deleted]

Reply View | 0 replies

pcf 8 hours ago

I use this model in Perplexity Pro (included in Revolut Premium), usually in threads where I alternate between Claude 4.5 Sonnet, GPT-5.2, Gemini 3 Pro, Grok 4.1 and Kimi K2.

The beauty with this availability is that any model you switch to can read the whole thread, so it's able to critique and augment the answers from other models before it. I've done this for ages with the various OpenAI models inside ChatGPT, and now I can do the same with all these SOTA thinking models.

To my surprise Kimi K2 is quite sharp, and often finds errors or omissions in the thinking and analyses of its colleagues. Now I always include it in these ensembles, usually at the end to judge the preceding models and add its own "The Tenth Man" angle.

Reply View 0 replies

websiteapi a day ago

I get tempted to buy a couple of these, but I just feel like the amortization doesn’t make sense yet. Surely in the next few years this will be orders of magnitude cheaper.

Reply View 29 replies

NitpickLawyer a day ago

Before committing to purchasing two of these, you should look at the true speeds that few people post. Not just the "it works". We're at a point where we can run these very large models "at home", and it is great! But true usage is now with very large contexts, both in prompt processing, and token generations. Whatever speeds these models get at "0" context is very different than what they get at "useful" context, especially in coding and such.

Reply View | 2 replies
- solarkraft a day ago
  
  Are there benchmarks that effectively measure this? This is essential information when speccing out an inference system/model size/quantization type.
  
  Reply View | 0 replies
- cubefox a day ago
  
  DeepSeek-v3.2 should be be better for long context because it is using (near linear) sparse attention.
  
  Reply View | 0 replies
stingraycharles a day ago

I don’t think it will ever make sense; you can buy so much cloud based usage for this type of price.
From my perspective, the biggest problem is that I am just not going to be using it 24/7. Which means I’m not getting nearly as much value out of it as the cloud based vendors do from their hardware.
Last but not least, if I want to run queries against open source models, I prefer to use a provider like Groq or Cerebras as it’s extremely convenient to have the query results nearly instantly.

Reply View | 13 replies
- websiteapi a day ago
  
  my issue is once you have it in your workflow I'd be pretty latency sensitive. imagine those record-it-all apps working well. eventually you'd become pretty reliant on it. I don't want to necessarily be at the whims of the cloud
  
  Reply View | 1 reply
  
  stingraycharles a day ago
  
  Aren’t those “record it all” applications implemented as a RAG and injected into the context based on embedding similarity?
  Obviously you’re not going to always inject everything into the context window.
  
  Reply View | 0 replies
- [removed] a day ago
  
  [deleted]
  
  Reply View | 0 replies
- lordswork a day ago
  
  As long as you're willing to wait up to an hour for your GPU to get scheduled when you do want to use it.
  
  Reply View | 4 replies
  
  stingraycharles a day ago
  
  I don’t understand what you’re saying. What’s preventing you from using eg OpenRouter to run a query against Kimi-K2 from whatever provider?
  
  Reply View | 3 replies
- givinguflac a day ago
  
  I think you’re missing the whole point, which is not using cloud compute.
  
  Reply View | 4 replies
  
  stingraycharles a day ago
  
  Because of privacy reasons? Yeah I’m not going to spend a small fortune for that to be able to use these types of models.
  
  Reply View | 3 replies
chrsw a day ago

The only reason why you run local models is for privacy, never for cost. Or even latency.

Reply View | 8 replies
- websiteapi a day ago
  
  indeed - my main use case is those kind of "record everything" sort of setups. I'm not even super privacy conscious per se but it just feels too weird to send literally everything I'm saying all of the time to the cloud.
  luckily for now whisper doesn't require too much compute, bu the kind of interesting analysis I'd want would require at least a 1B parameter model, maybe 100B or 1T.
  
  Reply View | 1 reply
  
  nottorp a day ago
  
  > t just feels too weird to send literally everything I'm saying all of the time to the cloud
  ... or your clients' codebases ...
  
  Reply View | 0 replies
- andy99 a day ago
  
  Autonomy generally, not just privacy. You never know what the future will bring, AI will be enshittified and so will hubs like huggingface. It’s useful to have an off grid solution that isn’t subject to VCs wanting to see their capital returned.
  
  Reply View | 5 replies
  
  Aurornis a day ago
  
  > You never know what the future will bring, AI will be enshittified and so will hubs like huggingface.
  If anyone wants to bet that future cloud hosted AI models will get worse than they are now, I will take the opposite side of that bet.
  > It’s useful to have an off grid solution that isn’t subject to VCs wanting to see their capital returned.
  You can pay cloud providers for access to the same models that you can run locally, though. You don’t need a local setup even for this unlikely future scenario where all of the mainstream LLM providers simultaneously decided to make their LLMs poor quality and none of them sees this as market opportunity to provide good service.
  But even if we ignore all of that and assume that all of the cloud inference everywhere becomes bad at the same time at some point in the future, you would still be better off buying your own inference hardware at that point in time. Spending the money to buy two M3 Ultras right now to prepare for an unlikely future event is illogical.
  The only reason to run local LLMs is if you have privacy requirements or you want to do it as a hobby.
  
  Reply View | 3 replies
  
  chrsw a day ago
  
  Yes, I agree. And you can add security to that too.
  
  Reply View | 0 replies
alwillis a day ago

Hopefully the next time it’s updated, it should ship with some variant of the M5.

Reply View | 0 replies
amelius a day ago

Maybe wait until RAM prices have normalized again.

Reply View | 0 replies
segmondy a day ago

This is a weird line of thinking. Here's a question. If you buy one of these and figure out how to use it to make $100k in 3 months, would that be good? When you run a local model, you shouldn't compare it to to cost of using an API. The value lies in how you use it. Let's forget bout making money. Let's just say you have weird fetish and like to have dirty sexy conversation with your LLM. How much would you pay for your data not to be leaked and for the world to see your chat? Perhaps having your own private LLM makes it all worth it. If you have nothing special going then by all means use APIs, but if you feel/know your input it special, then yeah, go private.

Reply View | 0 replies

rubymamis a day ago

What benchmarks are good these days? I generally just try different models on Cursor, but most of the open weight models aren't available there (Deepseak v3.2, Kimi K2 has some problems with formatting, and many others are missing) so I'd be curious to see some benchmarks - especially for non-web stuff (C++, Rust, etc).

Reply View 0 replies

Alifatisk 2 days ago

You should mention that it is 4bit quant. Still very impressive!

Reply View 3 replies

geerlingguy a day ago

Kiki K2 was made to be optimized at 4-bit, though.

Reply View | 1 reply
- natrys a day ago
  
  That's the Kimi K2 Thinking, this post seems to be talking about original Kimi K2 Instruct though, I don't think INT4 QAT (quantization aware training) version was released for this.
  
  Reply View | 0 replies
elif a day ago

I think when you say trillion parameters, it's implied that it's quantized

Reply View | 0 replies

storus a day ago

Does this also run with Exo Labs' token pre-fill acceleration using DGX Spark? I.e. take 2 Sparks and 2 MacStudios and get a comparable inference speed to what 2x M5 Ultras will be able to do?

Reply View 0 replies

macshome a day ago

Is this using the new RDMA over Thunderbolt support form macOS 26.2?

Reply View 0 replies

zkmon a day ago

Isn't it the same model which won the competition of drawing a real-time clock recently?

Reply View 0 replies

iwwr a day ago

What is it using for interconnect?

Reply View 2 replies

Aurornis a day ago

RDMA over Thunderbolt. New feature in the latest macOS.

Reply View | 1 reply
- astrostl a day ago
  
  The OP confirmed that it isn't:
  "is this using RDMA?" "No. It will be faster with that in the next release" [1]
  1: https://x.com/awnihannun/status/2000243131779023329
  
  Reply View | 0 replies

ansc a day ago

Is there no API for the Kimi K2 Instruct...?

Reply View 0 replies