Comment by refulgentis 4 days ago

See peer reply re: yes, your self-chosen benchmark has been reached.

Generally, I've learned to warn myself off a take when I start writing emotionally charged stuff like [1] without any prompting (who mentioned apps? and why bring them up without checking?), and when I catch myself reading minds and assigning weak arguments to others, both now and in my imagined future. [2]

At the very least, [2] is a signal to let the keyboard have a rest, and ideally my mind as well.

Bailey: > "If [there were] new LLMs...consistently solving Erdos problems at rapidly increasing rates then they'd be showing...that"

Motte: > "I can['t] pop into ChatGPT and pop out Erdos proofs regularly"

No less than Terence Tao, a month ago, pointed out that your bailey was newly happening with the latest generation: https://mathstodon.xyz/@tao/115788262274999408. Not sure how you only saw one Erdos problem.

[1] "I'll wait with bated breath for the millions of amazing apps which couldn't be coded before to start showing up"

[2] "...or, more likely, be told in 6 months how these 2 benchmarks weren't the ones that should matter either"

zamadatix 3 days ago

I'm going to stick to the stuff around Tao, as even well-tempered discussion about the rest would be against the guidelines anyways.

I had a very different read of Tao's post last month. To me, he opens by noting that there have been many claims of novel solutions which turn out to be known solutions from publications buried for years, but he says nothing about a rapid increase in the rates, or even claims that mathematicians using LLMs are having most of the work done by the models yet.

He speculates, and I assume correctly, that contamination is not the only reason. Indeed, we've seen at least one novel solution which couldn't have come from a low-interest publication sitting in the training data alone. How many of the three examples at the top actually end up falling that way is not something anyone can know, but I agree it should be safe to assume the answer will not be zero, and even if it were, it would seem unreasonable to think it stayed that way. These solutions are coming out of systems of which the LLM is one part, very often with a mathematician still orchestrating.

None of these are just popping in a prompt and hoping for the answer, nor will you get an unknown solution out of an LLM by going to ChatGPT 5.2 Pro and asking it without the rest of the story (and even then, you still will not get such a solution regularly, consistently, or at a massively higher rate than several months ago). They are multi-shot efforts from experts with tools. Tao makes a very balanced note of this in a reply to his main message:

> The nature of these contributions is rather nuanced; individually and collectively, they do not meet the hyped up goal of AI autonomously solving major mathematical open problems, but they also cannot all be dismissed as inconsequential trickery.

It's exciting, and helpful, but it's slow, and he doesn't even think we're truly at "AI solves some Erdos problems" yet, let alone "AI solves Erdos problems regularly and at a rapidly increasing rate".

  • refulgentis 3 days ago

    "...as even well tempered discussion about the rest would be against the guidelines anyways."

    Didn't bother reading after that. I deeply respect that you have the self-awareness to notice and spare us; that's rare. But it also means we all have to have conversations purely on your terms, and because it's async, the rules constantly change post-hoc.

    And that's on top of the post-hoc motte/bailey instances, of which we have multiple. I was stunned (stunned!!) by the attempted retcon of the app claim once there were numbers.

    Anyways, all your bêtes noires aside, all your Red Team vs. Blue Team signalling aside, using LMArena alone as a benchmark is a bad idea.

    • zamadatix an hour ago

      The conversation is certainly not on "my terms," as I didn't write the guidelines (nor do they benefit me more than anyone else). If you are genuinely concerned about the conversation, please flag it and/or email hn@ycombinator.com and they will (genuinely) handle it appropriately. Otherwise there is not much that can be said around this.

      If not, continuing the conversation can only happen if we want to discuss the recent growth rate of AI. Similarly, an async conversation can be as clear and consistent as we want it to feel; we just have to take the time to ask for clarification before writing a response to something we feel is a shifting understanding.

      I also agree nobody should rely solely on LM Arena for benchmarks; starting the conversation with an example from it was not meant to imply people should. I'd love to keep chatting about how you see Tao's comments, as you seem to have walked away from reading them with a very different understanding than I did.