Comment by aoeusnth1

Comment by aoeusnth1 18 hours ago

Are you intentionally sandbagging the LLMs to prove a point, or do you really think 4o-mini is good enough for programming?

Even 2.5 flash easily gets this https://imgur.com/a/OfW30eL

alganet 18 hours ago

The point is that I can make them hallucinate quite easily. And they don't demonstrate knowing their own limitations.

For example, 2.5 Flash fails to explain the difference between the short ternary operator (null coalescing) and the Elvis operator.

https://imgur.com/a/xKjuoqV

Even when I specify a language (therefore clearing the confusion, supposedly), it still fails to even recognize the Elvis operator by its toupe, and mixes it up the explanation (it doesn't even understand what I asked).

https://imgur.com/a/itr87hM

So, the point I'm trying to make is that they're not any better for programming than they're for chemistry.

Reply View 6 replies

CamperBob2 17 hours ago

Flash is the wrong model for questions like that -- not that you care -- but if you'd like to share the actual prompt you gave it, I'll try it in 2.5 Pro.

Reply View | 5 replies
- alganet 17 hours ago
  
  "explain me the difference between the short ternary operator and the Elvis operator"
  When it failed, I replied: "in PHP".
  You don't seem to understand what I'm trying to say and instead is trying to defend LLMs for a fault that is a fact known in the industry at large.
  I'm sure that in short time, I could make 2.5 Pro hallucinate as well. If not on this question, on others.
  This behavior is inline with the paper conclusions:
  > many models are not able to reliably estimate their own limitations.
  (see Figure 3, they tested a variety of models of different qualities).
  This is the kind of question a junior developer can answer with simple google searches, or by reading the PHP manual, or just by testing it on a REPL. Why do we need a fancy model in order to answer such a simple inquiry? Would a beginner know that the answer is incorrect and he should use a different model?
  Also, from the paper:
  > For very relevant topics, the answers that models provide are wrong.
  > Given that the models outperformed the average human in our study, we need to rethink how we teach and examine chemistry.
  That's true for programming as well. It outperforms the average human, but then it makes silly mistakes that could confuse beginners. It displays confidence in being plain wrong.
  The study also used manually curated questions for evaluation, so my prompt is not some dirty trick. It's totally inline with the context of this discussion.
  
  Reply View | 4 replies
  
  CamperBob2 17 hours ago
  
  It's better than it was a year ago, as you'd have discovered for yourself if you used current models. Nothing else matters.
  See if this looks any better (I don't know PHP): https://g.co/gemini/share/7849517fdb89
  If it doesn't, what specifically is incorrect?
  
  Reply View | 3 replies

CamperBob2 18 hours ago

They aren't getting any better at programming, so they naturally assume the LLMs aren't, either.

Reply View 0 replies