Kimi K2.5 Technical Report [pdf]

372 points by vinhnx 2 days ago

zeroxfe 2 days ago

I've been using this model (as a coding agent) for the past few days, and it's the first time I've felt that an open source model really competes with the big labs. So far it's been able to handle most things I've thrown at it. I'm almost hesitant to say that this is as good as Opus.

Reply View 63 replies

rubslopes a day ago

Also my experience. I've been going back and forth between Opus and Kimi for the last few days, and, at least for my CRUD webapps, I would say they are both on the same level.

Reply View | 0 replies
armcat 2 days ago

Out of curiosity, what kind of specs do you have (GPU / RAM)? I saw the requirements and it's a beyond my budget so I am "stuck" with smaller Qwen coders.

Reply View | 33 replies
- zeroxfe 2 days ago
  
  I'm not running it locally (it's gigantic!) I'm using the API at https://platform.moonshot.ai
  
  Reply View | 26 replies
  
  BeetleB 2 days ago
  
  Just curious - how does it compare to GLM 4.7? Ever since they gave the $28/year deal, I've been using it for personal projects and am very happy with it (via opencode).
  https://z.ai/subscribe
  
  Reply View | 9 replies
  
  HarHarVeryFunny 19 hours ago
  
  It is possible to run locally though ... I saw a video of someone running one of the heavily quantized versions on a Mac Studio, and performing pretty well in terms of speed.
  I'm guessing a 256GB Mac Studio, costing $5-6K, but that wouldn't be an outrageous amount to spend for a professional tool if the model capability justified it.
  
  Reply View | 1 reply
  
  tucnak 19 hours ago
  
  > It is possible to run locally though
  > running one of the heavily quantized versions
  There is night and day difference in generation quality between even something like 8-bit and "heavily quantized" versions. Why not quantize to 1-bit anyway? Would that qualify as "running the model?" Food for thought. Don't get me wrong: there's plenty of stuff you can actually run on 96 GB Mac studio (let alone on 128/256 GB ones) but 1T-class models are not in that category, unfortunately. Unless you put four of them in a rack or something.
  
  Reply View | 0 replies
  
  rc1 a day ago
  
  How long until this can be run on consumer grade hardware or a domestic electricity supply I wonder.
  Anyone have a projection?
  
  Reply View | 7 replies
  
  jgalt212 20 hours ago
  
  What's the point of using an open source model if you're not self-hosting?
  
  Reply View | 5 replies
- kristianp 8 hours ago
  
  Note that Kimi K2x is natively 4 bit int, which reduces the memory requirements somewhat.
  
  Reply View | 0 replies
- Carrok 2 days ago
  
  Not OP but OpenCode and DeepInfra seems like an easy way.
  
  Reply View | 0 replies
- observationist a day ago
  
  API costs on these big models over private hosts tend to be a lot less than API calls to the big 4 American platforms. You definitely get more bang for your buck.
  
  Reply View | 0 replies
- tgrowazay 2 days ago
  
  Just pick up any >240GB VRAM GPU off your local BestBuy to run a quantized version.
  > The full Kimi K2.5 model is 630GB and typically requires at least 4× H200 GPUs.
  
  Reply View | 2 replies
  
  CamperBob2 a day ago
  
  You could run the full, unquantized model at high speed with 8 RTX 6000 Blackwell boards.
  I don't see a way to put together a decent system of that scale for less than $100K, given RAM and SSD prices. A system with 4x H200s would cost more like $200K.
  
  Reply View | 1 reply
  
  ttul a day ago
  
  That would be quite the space heater, too!
  
  Reply View | 0 replies
timwheeler a day ago

Did you use Kimi Code or some other harness? I used it with OpenCode and it was bumbling around through some tasks that Claude handles with ease.

Reply View | 2 replies
- zedutchgandalf a day ago
  
  Are you on the latest version? They pushed an update yesterday that greatly improved Kimi K2.5’s performance. It’s also free for a week in OpenCode, sponsored by their inference provider
  
  Reply View | 1 reply
  
  ekabod a day ago
  
  But it may be a quantized model for the free version.
  
  Reply View | 0 replies
thesurlydev 2 days ago

Can you share how you're running it?

Reply View | 24 replies
- eknkc 2 days ago
  
  I've been using it with opencode. You can either use your kimi code subscription (flat fee), moonshot.ai api key (per token) or openrouter to access it. OpenCode works beautifully with the model.
  Edit: as a side note, I only installed opencode to try this model and I gotta say it is pretty good. Did not think it'd be as good as claude code but its just fine. Been using it with codex too.
  
  Reply View | 2 replies
  
  Imustaskforhelp 2 days ago
  
  I tried to use opencode for kimi k2.5 too but recently they changed their pricing from 200 tool requests/5 hour to token based pricing.
  I can only speak from the tool request based but for some reason anecdotally opencode took like 10 requests in like 3-4 minutes where Kimi cli took 2-3
  So I personally like/stick with the kimi cli for kimi coding. I haven't tested it out again with OpenAI with teh new token based pricing but I do think that opencode might add more token issue.
  Kimi Cli's pretty good too imo. You should check it out!
  https://github.com/MoonshotAI/kimi-cli
  
  Reply View | 1 reply
  
  nl a day ago
  
  I like Kimi-cli but it does leak memory.
  I was using it for multi-hour tasks scripted via an self-written orchestrator on a small VM and ended up switching away from it because it would run slower and slower over time.
  
  Reply View | 0 replies
- zeroxfe 2 days ago
  
  Running it via https://platform.moonshot.ai -- using OpenCode. They have super cheap monthly plans at kimi.com too, but I'm not using it because I already have codex and claude monthly plans.
  
  Reply View | 2 replies
  
  esafak 2 days ago
  
  Where? https://www.kimi.com/code starts at $19/month, which is same as the big boys.
  
  Reply View | 0 replies
  
  UncleOxidant 2 days ago
  
  so there's a free plan at moonshot.ai that gives you some number of tokens without paying?
  
  Reply View | 0 replies
- JumpCrisscross a day ago
  
  > Can you share how you're running it?
  Not OP, but I've been running it through Kagi [1]. Their AI offering is probably the best-kept secret in the market.
  [1] https://help.kagi.com/kagi/ai/assistant.html
  
  Reply View | 2 replies
  
  deaux a day ago
  
  Doesn't list Kimi 2.5 and seems to be chat-only, not API, correct?
  
  Reply View | 1 reply
  
  lejalv 20 hours ago
  
  > Doesn't list Kimi 2.5 and seems to be chat-only, not API, correct?
  Yes, it is chat only, but that list is out of date - Kimi 2.5 (with or without reasoning) is available, as are ChatGPT 5.2, Gemini 3 Pro (Preview), etc
  
  Reply View | 0 replies
- explorigin 2 days ago
  
  https://unsloth.ai/docs/models/kimi-k2.5
  Requirements are listed.
  
  Reply View | 5 replies
  
  KolmogorovComp 2 days ago
  
  To save everyone a click
  > The 1.8-bit (UD-TQ1_0) quant will run on a single 24GB GPU if you offload all MoE layers to system RAM (or a fast SSD). With ~256GB RAM, expect ~10 tokens/s. The full Kimi K2.5 model is 630GB and typically requires at least 4× H200 GPUs. If the model fits, you will get >40 tokens/s when using a B200. To run the model in near full precision, you can use the 4-bit or 5-bit quants. You can use any higher just to be safe. For strong performance, aim for >240GB of unified memory (or combined RAM+VRAM) to reach 10+ tokens/s. If you’re below that, it'll work but speed will drop (llama.cpp can still run via mmap/disk offload) and may fall from ~10 tokens/s to <2 token/s. We recommend UD-Q2_K_XL (375GB) as a good size/quality balance. Best rule of thumb: RAM+VRAM ≈ the quant size; otherwise it’ll still work, just slower due to offloading.
  
  Reply View | 4 replies
- indigodaddy a day ago
  
  Been using K2.5 Thinking via Nano-GPT subscription and `nanocode run` and it's working quite nicely. No issues with Tool Calling so far.
  
  Reply View | 0 replies
- gigatexal 2 days ago
  
  Yeah I too am curious. Because Claude code is so good and the ecosystem so just it works that I’m Willing to pay them.
  
  Reply View | 7 replies
  
  epolanski 2 days ago
  
  You can plug another model in place of Anthropic ones in Claude Code.
  
  Reply View | 5 replies
  
  Imustaskforhelp 2 days ago
  
  I tried kimi k2.5 and first I didn't really like it. I was critical of it but then I started liking it. Also, the model has kind of replaced how I use chatgpt too & I really love kimi 2.5 the most right now (although gemini models come close too)
  To be honest, I do feel like kimi k2.5 is the best open source model. It's not the best model itself right now tho but its really price performant and for many use cases might be nice depending on it.
  It might not be the completely SOTA that people say but it comes pretty close and its open source and I trust the open source part because I feel like other providers can also run it and just about a lot of other things too (also considering that iirc chatgpt recently slashed some old models)
  I really appreciate kimi for still open sourcing their complete SOTA and then releasing some research papers on top of them unlike Qwen which has closed source its complete SOTA.
  Thank you Kimi!
  
  Reply View | 0 replies

unleaded a day ago

Seems that K2.5 has lost a lot of the personality from K2 unfortunately, talks in more ChatGPT/Gemini/C-3PO style now. It's not explictly bad, I'm sure most people won't care but it was something that made it unique so it's a shame to see it go.

examples to illustrate

https://www.kimi.com/share/19c115d6-6402-87d5-8000-000062fec... (K2.5)

https://www.kimi.com/share/19c11615-8a92-89cb-8000-000063ee6... (K2)

Reply View 7 replies

zozbot234 a day ago

It's hard to judge from this particular question, but the K2.5 output looks at least marginally better AIUI, the only real problem with it is the snarky initial "That's very interesting" quip. Even then a British user would probably be fine with it.

Reply View | 0 replies
logicprog a day ago

I agree. K2 was blunt, straightforward, pretty... rational? K2.5 has a much stronger slop vibe.

Reply View | 0 replies
orbital-decay a day ago

K2 in your example is using the GPT reply template (tl;dr - terse details - conclusion, with contradictory tendencies), there's nothing unique about it. That's exactly how GPT-5.0 talked. The only model with a strong "personality" vibe was Claude 3 Opus.

Reply View | 2 replies
- user_7832 17 hours ago
  
  > The only model with a strong "personality" vibe was Claude 3 Opus.
  Did you have the chance to use 3.5 (or 3.6) Sonnet, and if yes, how did they compare?
  As a non-paying user, 3.5 era Claude was absolutely the best LLM I've ever used in terms of having a conversation. It felt like talking to a human and not a bot. Its replies were readable, even if they were several paragraphs long. I've unfortunately never found anything remotely as good.
  
  Reply View | 1 reply
  
  orbital-decay 17 hours ago
  
  Pretty poorly in that regard. In 3.5 they killed Claude 3's agency, pretty much reversing their previous training policy in favor of "safety", and tangentially mentioned that they didn't want to make the model too human-like. [1] Claude 3 was the last version of Claude, and one of the very few models in general, that had a character. That doesn't mean it wasn't writing slop though, falling into annoying stereotypes is still unsolved in LLMs.
  [1] https://www.anthropic.com/research/claude-character (see the last 2 paragraphs)
  
  Reply View | 0 replies
Grosvenor a day ago

[flagged]

Reply View | 1 reply
- Grimblewald a day ago
  
  Disagree, i've found kimi useful in solving creative coding problems gemini, claude, chatgpt etc failed at. Or, it is far better at verifying, augmenting and adding to human reviews of resumes for positions. It catches missed detials humans and other llm's routinley miss. There is something special to K2.
  
  Reply View | 0 replies

extr a day ago

I tried this today. It's good - but it was significantly less focused and reliable than Opus 4.5 at implementing some mostly-fleshed-out specs I had lying around for some needed modifications to an enterprise TS node/express service. I was a bit disappointed tbh, the speed via fireworks.ai is great, they're doing great work on the hosting side. But I found the model had to double-back to fix type issues, broken tests, etc, far more than Opus 4.5 which churned through the tasks with almost zero errors. In fact, I gave the resulting code to Opus, simply said it looked "sloppy" and Opus cleaned it up very quickly.

Reply View 0 replies

Imanari a day ago

I have been very impressed with this model and also with the Kimi CLI. I have been using it with the 'Moderato' plan (7 days free, then 19$). A true competitor to Claude Code with Opus.

Reply View 0 replies

tomaskafka 20 hours ago

It is amazing, but "open source model" means "model I can understand and modify" (= all the training data and processes).

Open weights is an equivalent of binary driver blobs everyone hates. "Here is an opaque thing, you have to put it on your computer and trust it, and you can't modify it."

Reply View 2 replies

jmiskovic 18 hours ago

That's unfair. Binary driver blobs are blackmail: "you bought the hardware, but parts of the laptop won't work unless you agree to run this mysterious bundle insecurely". Open weight is more like "here's a frozen brain you can thaw in a safe harness to do your bidding".

Reply View | 0 replies
pama 18 hours ago

Not equivalent to the binary driver: you can modify it yourself with post training on your own data. So it sits somewhere between NVIDIA userspace drivers and Emacs, or Clade Code and codex-cli. We don’t have good analogies from older generation software.

Reply View | 0 replies

zzleeper a day ago

Do any of these models do well with information retrieval and reasoning from text?

I'm reading newspaper articles through a MoE of gemini3flash and gpt5mini, and what made it hard to use open models (at the time) was a lack of support for pydantic.

Reply View 1 reply

jychang a day ago

That roughly correlates with tool calling capabilities. Kimi K2.5 is a lot better than previous open source models in that regard.
You should try out K2.5 for your use case, it might actually succeed where previous generation open source models failed.

Reply View | 0 replies

eager_learner a day ago

I tried Kimi 2.5 Swarm Agent version and it was way better than any AI model I've tried so far.

Reply View 0 replies

logicprog a day ago

Kimi K2T was good. This model is outstanding, based on the time I've had to test it (basically since it came out). It's so good at following my instructions, staying on task, and not getting context poisoned. I don't use Claude or GPT, so I can't say how good it is compared to them, but it's definitely head and shoulders above the open weight competitors

Reply View 0 replies

niyikiza 15 hours ago

The Agent Swarm section is fascinating. I'm working on authorization for multi-agent systems so this is relevant to my interests. Lots of interesting parallels to capability-based security models.

Reply View 0 replies

throwaway12345t 19 hours ago

Is there a reasonable place to run the unquantized version of this for less than Claude or OpenAI?

It seems to be priced the same and if it’s being hosted somewhere vs run locally it’s still a worse model, the only advantage would be it is not Anthropic or OpenAI.

Reply View 0 replies

derac 2 days ago

I really like the agent swarm thing, is it possible to use that functionality with OpenCode or is that a Kimi CLI specific thing? Does the agent need to be aware of the capability?

Reply View 3 replies

zeroxfe 2 days ago

It seems to work with OpenCode, but I can't tell exactly what's going on -- I was super impressed when OpenCode presented me with a UI to switch the view between different sub-agents. I don't know if OpenCode is aware of the capability, or the model is really good at telling the harness how to spawn sub-agents or execute parallel tool calls.

Reply View | 0 replies
esafak a day ago

Has anyone tried it and decided it's worth the cost; I've heard it's even more profligate with tokens?

Reply View | 1 reply
- swyx a day ago
  
  Yes. https://x.com/swyx/status/2016381014483075561?s=20 it's not crazy, they cap it to 3 credits, and also YSK agent swarm is a closed source product
  Would i use it a gain compared to Deep Research products elsewhere? Maybe, probably not but only bc it's hard to switch apps
  
  Reply View | 0 replies

epolanski 2 days ago

It's interesting to note that a model that can OpenAI is valued almost 400 times more than moonshotai, despite their models being surprisingly close.

Reply View 6 replies

famouswaffles a day ago

OpenAI is a household name with nearly a billion weekly active users. Not sure there's any reality where they wouldn't be valued much more than Kimi regardless of how close the models may be.

Reply View | 0 replies
m3kw9 a day ago

Unless they can beat their capabilities by a clear magical step up and has infrastructure to capture the users

Reply View | 0 replies
moffkalast 2 days ago

Well to be the devil's advocate: One is a household name that holds most of the world's silicon wafers for ransom, and the other sounds like a crypto scam. Also estimating valuation of Chinese companies is sort of nonsense when they're all effectively state owned.

Reply View | 3 replies
- epolanski a day ago
  
  There isn't a single % that is state owned in Moonshot AI.
  And don't start me with the "yeah but if the PRC" because it's gross when US can de facto ban and impose conditions even on European companies, let alone the control it has on US ones.
  
  Reply View | 2 replies
  
  moffkalast a day ago
  
  I'm not sure if that is accurate, most of the funding they've got is from Tencent and Alibaba, and we know what happened to Jack Ma the second he went against the party line. These two are defacto state owned enterprises. Moonshot is unlikely to be for sale in any meaningful way so its valuation is moot.
  [0] https://en.wikipedia.org/wiki/Moonshot_AI#Funding_and_invest...
  
  Reply View | 0 replies
  
  swyx a day ago
  
  Funny because that's how us Americans feel about your European cookie banner litter and unilateral demands on privacy
  
  Reply View | 0 replies

satyambnsal 16 hours ago

I've been using kimi 2.5 to write Rust code and plan out detailed features. so far its brillient.

Reply View 0 replies

tallesborges92 a day ago

I’ve added the api key support to kimi on my agentic coding: https://github.com/tallesborges/zdx

Reply View 1 reply

man4 20 hours ago

[dead]

Reply View | 0 replies

miroljub 2 days ago

I've been quite satisfied lately with MiniMax M-2.1 in opencode.

How does Kimi 2.5 compare to it in real world scenarios?

Reply View 3 replies

viraptor 2 days ago

A lot better in my experience. M2.1 to me feels between haiku and sonnet. K2.5 feels close to opus. That's based on my testing of removing some code and getting it to reimplement based on tests. Also the design/spec writing feels great. You can still test k2.5 for free in OpenCode today.

Reply View | 2 replies
- miroljub 2 days ago
  
  Well, Minimax was the equivalent of Sonnet in my testing. If Kimi approach Opus, that would be great.
  
  Reply View | 1 reply
  
  samtheprogram a day ago
  
  Kimi K2.5 approaches Sonnet as well from what I can tell, it's just slower to get to the result.
  
  Reply View | 0 replies

oxqbldpxo a day ago

This Kimi K2 is so far the best. Gemini is also great, but google is stock in the academic bias of Stanford and MIT and can't think outside the box. China definitely ahead on Ai. Wish somehow someone here in the US, would think different.

Reply View 2 replies

dfsegoat a day ago

> but google is stock in the academic bias of Stanford and MIT and can't think outside the box
Can you clarify what you mean? I am not sure I follow.

Reply View | 1 reply
- JSR_FDED a day ago
  
  s/stock/stuck/
  
  Reply View | 0 replies

sreekanth850 a day ago

Calude give 100% passmark for code generated by kimi and sometimes it say, its better than what claude proposed. Absolutely best os model.

Reply View 0 replies

threethirtytwo a day ago

When will hardware get cheap enough so people can run this locally? That’s the world I’m waiting for.

Reply View 1 reply

vanviegen a day ago

2042. But by then you won't want to run this model anymore.

Reply View | 0 replies

margorczynski 2 days ago

I wonder how K2.5 + OpenCode compares to Opus with CC. If it is close I would let go of my subscription, as probably a lot of people.

Reply View 7 replies

eknkc 2 days ago

It is not opus. It is good, works really fast and suprisingly through about its decisions. However I've seen it hallucinate things.
Just today I asked for a code review and it flagged a method that can be `static`. The problem is it was already static. That kind of stuff never happens with Opus 4.5 as far as I can tell.
Also, in an opencode Plan mode (read only). It generated a plan and instead of presenting it and stopping, decided to implement it. Could not use the edit and write tools because the harness was in read only mode. But it had bash and started using bash to edit stuff. Wouldn't just fucking stop even though the error messages it received from opencode stated why. Its plan and the resulting code was ok so I let it go crazy though...

Reply View | 2 replies
- [removed] 2 days ago
  
  [deleted]
  
  Reply View | 0 replies
- esafak a day ago
  
  Some models have a mind of their own. I keep them on a leash with `permission` blocks in OC -- especially for rm/mv/git.
  
  Reply View | 0 replies
naragon 2 days ago

I've been using K2.5 with OpenCode to do code assessments/fixes and Opus 4.5 with CC to check the work, and so far so good. Very impressed with it so far, but I don't feel comfortable canceling my Claude subscription just yet. Haven't tried it on large feature implementations.

Reply View | 0 replies
ithkuil 2 days ago

I also wonder if CC can be used with k2.5 with the appropriate API adapter

Reply View | 1 reply
- tjuene a day ago
  
  yes, just use the base url https://api.moonshot.ai/anthropic
  (https://platform.moonshot.ai/docs/guide/agent-support#config...)
  
  Reply View | 0 replies
jauntywundrkind a day ago

I've been drafting plans/specs in parallel with Opus and Kimi. Then asking them to review the others plan.
I still find Opus is "sharper" technically, tackles problems more completely & gets the nuance.
But man Kimi k2.5 can write. Even if I don't have a big problem description, just a bunch of specs, Kimi is there, writing good intro material, having good text that more than elaborates, that actually explains. Opus, GLM-4.7 have both complemented Kimi on it's writing.
Still mainly using my z.ai glm-4.7 subscription for the work, so I don't know how capable it really is. But I do tend to go for some Opus in sticky spots, and especially given the 9x price difference, I should try some Kimi. I wish I was set up for better parallel evaluation; feels like such a pain to get started.

Reply View | 0 replies

storus a day ago

Do I need to have two M3U 512GB MacStudios to run this?

Reply View 0 replies

syndacks a day ago

How do people evaluate creative writing and emotional intelligence in LLMs? Most benchmarks seem to focus on reasoning or correctness, which feels orthogonal. I’ve been playing with Kimmy K 2.5 and it feels much stronger on voice and emotional grounding, but I don’t know how to measure that beyond human judgment.

Reply View 2 replies

nolist_policy a day ago

https://eqbench.com/index.html

Reply View | 0 replies
mohsen1 a day ago

I am trying! https://mafia-arena.com
I just don't have enough funding to do a ton of tests

Reply View | 0 replies

christkv 18 hours ago

I wonder if there will be smaller version of the model that can run on Strix Halo in 128GB.

Reply View 0 replies

gedy a day ago

Sorry if this is an easy-answerable question - but by open we can download this and use totally offline if now or in the future if we have hardware capable? Seems like a great thing to archive if the world falls apart (said half-jokingly)

Reply View 7 replies

fancy_pantser a day ago

Sure. Someone on /r/LocalLLaMA was seeing 12.5 tokens/s on dual Strix Halo 128GB machines (run you $6-8K total?) with 1.8bits per parameter. It performs far below the unquantized model, so it would not be my personal pick for a one-local-LLM-forever, but it is compelling because it has image and video understanding. You lose those features if you choose, say, gpt-oss-120B.
Also, that's with no context, so it would be slower as it filled (I don't think K2.5 uses the Kimi-Linear KDA attention mechanism, so it's sub-quadratic but not their lowest).

Reply View | 0 replies
Tepix a day ago

You could buy five Strix Halo systems at $2000 each, network them and run it.
Rough estimage: 12.5:2.2 so you should get around 5.5 tokens/s.

Reply View | 2 replies
- j-bos a day ago
  
  Is the software/drivers for networking LLMs on Strix Halo there yet? I was under the impression a few weeks ago that it's veeeery early stages and terribly slow.
  
  Reply View | 1 reply
  
  Tepix a day ago
  
  Llama.cpp with its rpc-server
  
  Reply View | 0 replies
fragmede a day ago

Yes but the hardware to run it decently gonna cost you north of $100k, so hopefully you and your bunkermates allocated the right amount to this instead of guns or ammo.

Reply View | 0 replies
Carrok a day ago

Yes.

Reply View | 0 replies
cmrdporcupine a day ago

Yes, but you'll need some pretty massive hardware.

Reply View | 0 replies

llmslave 2 days ago

The benchmarks on all these models are meaningless

Reply View 2 replies

alchemist1e9 2 days ago

Why and what would a good benchmark look like?

Reply View | 1 reply
- moffkalast 2 days ago
  
  30 people trying out all models on the list for their use case for a week and then checking what they're still using a month after.
  
  Reply View | 0 replies

cmrdporcupine a day ago

DeepSeek is likely to release a new model soon, and judging from the past it's likely to be more cost effective and just as or more powerful than Kimi 2.5.

DeepSeek 3.2 was already quite compelling. I expect its successor will be competitive.

Reply View 0 replies

behnamoh 2 days ago

It's a decent model but works best with kimi CLI, not CC or others.

Reply View 4 replies

rubslopes a day ago

I haven't use Kimi CLI, but it works very well with OpenCode.

Reply View | 0 replies
alansaber 2 days ago

Why do you think that is?

Reply View | 2 replies
- chillacy 2 days ago
  
  I heard it's because the labs fine tune their models for their own harness. Same reason why claude does better in claude code than cursor.
  
  Reply View | 0 replies
- segmondy a day ago
  
  read the tech report
  
  Reply View | 0 replies

firasd a day ago

I'm not very familiar with this model series but tried it when prototyping a potential 'Vibesbench' (my eval) conversation I tried it. Here's Kimi on a Robocop scene

kimi k2 thinking turbo

"He’s not being poetic; he’s weaponizing nostalgia. Quoting Casablanca—the ultimate “beautiful sacrifice” romance—to taunt Morton about corporate backstabbing is pure executive sleaze. It’s the same move as his mahogany desk and flag pin: draping sentimental Americana over naked power. He’s essentially saying, “Isn’t it romantic, Bob? You fought the good fight, now you die.” "

grok 4.1

"So when Dick Jones smirks and says, “Time goes by, it’s the same old story… the fight for love and glory, huh, Bob?” that’s pure Ed Neumeier flexing his love of old Hollywood while simultaneously mocking how these psychopathic ’80s execs think they’re in some grand cinematic drama instead of just murdering a guy for market share."

chatgpt-4o

"The reason the Casablanca reference matters is RoboCop itself is partially a noir—and also a tragedy built inside a genre engine. [… ] So yeah, calling back to Casablanca is not just some meta-joke—it’s a knowing wink that RoboCop is a noir trapped in cybernetic skin. // Even Clarence Boddicker, for all his cartoonish sadism, plays like a twisted stand-in for a noir heavy—talking smooth, dropping nicknames, throwing out casual violence like it’s a punchline."

opus 4.5

"Jones is performing avuncular menace. He's the kind of executive who'd quote old movies while having you murdered, because it's all performance to him, all style. The substance is just power."

Reply View 0 replies

tonychang430 a day ago

Love to see Open source models doing better than SOTA

Reply View 0 replies