Comment by nullbio 19 hours ago

23 replies

Anyone else find that despite Gemini performing best on benches, it's actually still far worse than ChatGPT and Claude? It seems to hallucinate nonsense far more frequently than any of the others. Feels like Google just bench maxes all day every day. As for Mistral, hopefully OSS can eat all of their lunch soon enough.

apexalpha 19 hours ago

No, I've been using Gemini for help while learning / building my onprem k8s cluster and it has been almost spotless.

Granted, this is a subject that is very well present in the training data but still.

  • Synthetic7346 19 hours ago

    I found Gemini 3 to be pretty lackluster for setting up an onprem k8s cluster - Sonnet 4.5 was more accurate from the get-go and required less handholding

mvkel 19 hours ago

Open weight LLMs aren't supposed to "beat" closed models, and they never will. That isn’t their purpose. Their value is as a structural check on the power of proprietary systems; they guarantee a competitive floor. They’re essential to the ecosystem, but they’re not chasing SOTA.

  • barrell 19 hours ago

    I can attest to Mistral beating OpenAI in my use cases pretty definitively :)

  • pants2 18 hours ago

    > Their value is as a structural check on the power of proprietary systems

    Unfortunately that doesn't pay the electricity bill

    • array_key_first 11 hours ago

      It kind of does, because the proprietary systems are unacceptable for many use cases precisely because they are proprietary.

      There are a lot of businesses that do not want to hand over their sensitive data to hackers, employees of their competitors, and various world governments. There's inherent risk in choosing a proprietary option, and that doesn't just go for LLMs. You can get your feet swept out from under you.

  • cmrdporcupine 18 hours ago

    This may be the case, but DeepSeek 3.2 is "good enough" that it competes well with Sonnet 4 -- maybe 4.5 -- for about 80% of my use cases, at a fraction of the cost.

    I feel we're only a year or two away from hitting a plateau with the frontier closed models having diminishing returns vs what's "open"

    • troyvit 16 hours ago

      I think you're right, and I feel the same about Mistral. It's "good enough", super cheap, privacy-friendly, and doesn't burn coal by the shovelful. No need to pay through the nose for the SOTA models just to get wrapped into the same SaaS games that plague the rest of the industry.

  • re-thc 19 hours ago

    > Open weight LLMs aren't supposed to "beat" closed models, and they never will. That isn’t their purpose.

    Do things ever work that way? What if Google did open-source Gemini - would you say the same? You never know; there's no fixed "supposed to" or "purpose" like that.

    • lowkey_ 18 hours ago

      Not the above poster, but:

      OpenAI went closed (despite open literally being in the name) once they had the advantage. Meta also is going closed now that they've caught up.

      Open-source makes sense to accelerate to catch up, but once ahead, closed will come back to retain advantage.

      • mvkel 12 hours ago

        I continue to be surprised that the supposed bastion of "safe" AI, Anthropic, has a record of being the least-open AI company

gunalx 3 hours ago

Have used Gemini 3 to one-shot a few problems GPT-5 struggled on.

dchest 18 hours ago

Nope, Gemini 3 is hallucinating less than GPT-5.1 for my questions.

mrtksn 19 hours ago

Yep, Gemini is my least favorite and I’m convinced that the hype around it isn’t organic because I don’t see the claimed “superiority”, quite the opposite.

  • cmrdporcupine 18 hours ago

    I think a lot of the hype around Gemini comes down to people who aren't using it for coding but for other things maybe.

    Frankly, I don't actually care about or want "general intelligence" -- I want it to make good code, follow instructions, and find bugs. Gemini wasn't bad at the last bit, but wasn't great at the others.

    They're all trying to make general purpose AI, but I just want really smart augmentation / tools.

minimaxir 19 hours ago

For noncoding tasks, Gemini at least allows for easier grounding with Google Search.

tootie 18 hours ago

No? My recent experience with Gemini was terrific. The last big test I gave of Claude it spun an immaculate web of lies before I forced it to confess.

cmrdporcupine 19 hours ago

I also had bad luck when I finally tried Gemini 3 in the gemini CLI coding tool. I am unclear if it's the model or their bad tooling/prompting. It had, as you said, hallucination problems, and it also had memory issues where it seemed to drop context between prompts here and there.

It's also slower than both Opus 4.5 and Sonnet.

bluecalm 19 hours ago

My experience is the opposite, although I don't use it to write code but to explore/learn about algorithms and various programming ideas. It's amazing. I am close to cancelling my ChatGPT subscription (I would only use OpenRouter if it had a nicer GUI and dark mode anyway).

llm_nerd 19 hours ago

What does your comment have to do with the submission? What a weird non-sequitur. I even went looking at the linked article to see if it somehow compares with Gemini. It doesn't, and only relates to open models.

In prior posts you oddly attack "Palantir-partnered Anthropic" as well.

Are things that grim at OpenAI that this sort of FUD is necessary? I mean, I know they're doing the whole code red thing, but I guarantee that posting nonsense like this on HN isn't the way.

alfalfasprout 19 hours ago

If anything it's a testament to human intelligence that benchmarks haven't really been a good measure of a model's competence for some time now. They provide a relative sorting to some degree, within model families, but it feels like we've hit an AI winter.

moffkalast 18 hours ago

Yes, and likewise with Kimi K2. Despite being on the top of open source benches it makes up more batshit nonsense than even Llama 3.

Trust no one, test your use case yourself is pretty much the only approach, because people either don't run benchmarks correctly or have the incentive not to.