Comment by CamperBob2 16 hours ago

Flash is the wrong model for questions like that -- not that you care -- but if you'd like to share the actual prompt you gave it, I'll try it in 2.5 Pro.

alganet 16 hours ago

"explain me the difference between the short ternary operator and the Elvis operator"

When it failed, I replied: "in PHP".

You don't seem to understand what I'm trying to say and instead are trying to defend LLMs for a flaw that is widely known in the industry.

I'm sure that in a short time I could make 2.5 Pro hallucinate as well. If not on this question, then on others.

This behavior is in line with the paper's conclusions:

> many models are not able to reliably estimate their own limitations.

(see Figure 3, they tested a variety of models of different qualities).

This is the kind of question a junior developer can answer with a simple Google search, by reading the PHP manual, or by testing it in a REPL. Why do we need a fancy model to answer such a simple inquiry? Would a beginner know that the answer is incorrect and that they should use a different model?

Also, from the paper:

> For very relevant topics, the answers that models provide are wrong.

> Given that the models outperformed the average human in our study, we need to rethink how we teach and examine chemistry.

That's true for programming as well. It outperforms the average human, but then it makes silly mistakes that could confuse beginners. It displays confidence while being plainly wrong.

The study also used manually curated questions for evaluation, so my prompt is not some dirty trick. It's entirely in line with the context of this discussion.

  • CamperBob2 15 hours ago

    It's better than it was a year ago, as you'd have discovered for yourself if you used current models. Nothing else matters.

    See if this looks any better (I don't know PHP): https://g.co/gemini/share/7849517fdb89

    If it doesn't, what specifically is incorrect?

    • alganet 13 hours ago

      What I expect from a human is to ask "in which language?", because it makes a difference. If no language was supplied, I expect a brief summary of null coalescing and shorthand ternary options with useful examples in the most popular languages.

      --

      The JavaScript example should have mentioned the use of `||` (the OR operator) to achieve the same effect as a shorthand ternary. It's common knowledge.
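
      A minimal JavaScript sketch of that equivalence:

```javascript
// `x || y` evaluates to `x` when `x` is truthy, otherwise `y` --
// the same result as the shorthand ternary `x ? x : y`.
const input = undefined;
const value = input || "default";        // "default"
const same = input ? input : "default";  // identical result

console.log(value === same); // true
```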

      In PHP specifically, `??` lets you null-coalesce array keys and other complex expressions. You don't need to write `isset($arr[1]) ? $arr[1] : "ipsum"`; you can just write `$arr[1] ?? "ipsum"`. TypeScript has it too, and I would expect anyone answering about JavaScript to mention that, since it's highly relevant for the ecosystem.
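
      A JavaScript sketch of the same pattern (in JS a missing key reads as undefined rather than raising a warning, so `??` alone suffices):

```javascript
// `??` falls back only when the left-hand side is null or undefined,
// so missing array indexes and object keys need no isset-style check.
const arr = ["lorem"];
const second = arr[1] ?? "ipsum";         // arr[1] is undefined -> "ipsum"

const config = {};
const host = config.host ?? "localhost";  // missing key -> "localhost"
```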

      Also in PHP, there is `?:`, which is similar to what `||` does in JavaScript in an assignment context; due to type juggling, it can act as a null-coalescing operator too (although not for arrays or complex types).
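
      A JavaScript sketch of that caveat -- truthiness-based coalescing handles null, but it also swallows other falsy values, which is where `??` differs:

```javascript
// `||` relies on truthiness, like PHP's `?:` with type juggling:
const fromNull = null || "fallback"; // "fallback" -- null is falsy
const fromZero = 0 || "fallback";    // "fallback" -- but 0 is lost!

// `??` checks only for null/undefined, so legitimate falsy values survive:
const keepZero = 0 ?? "fallback";    // 0
```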

      The PHP example they present, therefore, is plain wrong and would lead to a warning for trying to access an unset array key -- something the `??` operator (not mentioned in the response) would solve.

      I would go as far as explaining null-conditional accessors as well: `$foo?->bar` or `foo?.bar`. Those are often called Elvis operators colloquially and fall within the same overall problem-solving category.
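
      A JavaScript sketch of that accessor (PHP's `?->` behaves analogously):

```javascript
// `?.` short-circuits to undefined instead of throwing when the
// receiver is null or undefined; it pairs naturally with `??`.
const user = { profile: null };
const city = user.profile?.city ?? "unknown"; // "unknown"

// Without `?.`, `user.profile.city` would throw a TypeError here.
```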

      The LLM answer is a dangerous mix of incomplete and wrong. It could lead a beginner to adopt an old bad practice, or leave a beginner without a more thorough explanation. Worst of all, the LLM makes those mistakes with confidence.

      --

      What I think is going on is that null handling is such a basic task that programmers learn it in the first few years of their careers and almost never write about it. There's no need to. I'm sure a code-completion LLM can use those operators effectively, but LLMs cannot talk about them consistently. They'll only get better at it if we get better at it, and we often don't need to write about it.

      In this particular Elvis operator case, there has been no significant improvement in the correctness of the answer in __more than 2 whole years__. Samples from ChatGPT in 2023 (note my image dates): https://imgur.com/UztTTYQ https://imgur.com/nsqY2rH

      So, _for some things_, contrary to what you suggested before, LLMs are not getting that much better.

      • CamperBob2 10 hours ago

        Having read the reply in 2.5 Pro, I have to agree with you there. I'm surprised it whiffed on those details. They are fairly basic and rather important. It could have provided a better answer (I fed your reply back to it at https://g.co/gemini/share/7f87b5e9d699 ), but it did a crappy job deciding what to include in its initial response.

        I don't agree that you can pick one cherry-picked example and use it to illustrate anything about the progress of the models in general, though. There are far too many counterexamples to enumerate.

        (Actually I suspect what will happen is that we'll change the way we write documentation to make it easy for LLMs to assimilate. I know I'm already doing that myself.)

        • alganet 4 hours ago

          > I don't agree that you can pick one cherry example

          Benchmarks and evaluations are made of cherry-picked examples. What makes my example invalid, and benchmark prompts valid? (It's a rhetorical question; you don't need to answer.)

          > write documentation to make it easy for LLMs to assimilate.

          If we ever do that, it means LLMs failed at their job. They are supposed to help and understand us, not the other way around.