the_duke 3 days ago

LLMs, and especially coding-focused models, have come a very long way in the past year.

The difference when working on larger tasks that require reasoning is night and day.

In theory it would be very interesting to go back and retry the 2024 tasks, but those will likely have ended up in the training data by now...

  • crystal_revenge 3 days ago

    > LLMs, and especially coding focused models, have come a very long way in the past year.

    I see people assert this all over the place, but I've personally decreased my LLM usage over the last year. Over that same period I've also increasingly developed a reputation at my company as “the guy who can get things shipped”.

    I still use LLMs, and likely always will, but I no longer let them do the bulk of the work and have benefited from it.

  • mbac32768 3 days ago

    Last April I asked Claude Sonnet 3.7 to solve AoC 2024 day 3 in x86-64 assembler, and it one-shotted solutions for parts 1 and 2(!)

    It's true this was 4 months after AoC 2024 came out, so it may have been trained on the answer, but I think four months is too soon for that to have happened.

    Day 3 of 2024 isn't a Math Olympiad-tier problem or anything, but it seems novel enough, and my prior experience with LLMs was that they were absolutely atrocious at assembler.

    https://adventofcode.com/2024/day/3
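
    For context, the Day 3 task is to scan corrupted memory for valid mul(X,Y) instructions and sum the products; part 2 adds do()/don't() toggles. A minimal sketch of that logic in Python, just to show the puzzle's scope (the assembler version is obviously far more work):

        import re

        def solve(memory: str) -> tuple[int, int]:
            # Part 1: sum the products of every valid mul(X,Y),
            # where X and Y are 1-3 digit numbers.
            # Part 2: don't() disables and do() re-enables later muls.
            part1 = part2 = 0
            enabled = True
            pattern = r"mul\((\d{1,3}),(\d{1,3})\)|do\(\)|don't\(\)"
            for m in re.finditer(pattern, memory):
                if m.group(0) == "do()":
                    enabled = True
                elif m.group(0) == "don't()":
                    enabled = False
                else:
                    product = int(m.group(1)) * int(m.group(2))
                    part1 += product
                    if enabled:
                        part2 += product
            return part1, part2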

    • paulddraper 3 days ago

      Last year, I saw LLMs do well on the first week and accuracy drop off after that.

      But as others have said, it’s a night and day difference now, particularly with code execution.

randomifcpfan 3 days ago

Current frontier agents can one-shot all of the 2024 AoC puzzles, just by pasting in the puzzle description and the input data.

Watching them work, you see them read the spec, write the code, run it on the examples, refine the code until it passes, and so on.
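
In pseudocode, the loop looks roughly like this (a sketch only; the llm callable, the examples list, and the round limit are stand-ins for whatever the agent harness actually does):

    import subprocess

    def solve_puzzle(spec: str, examples: list[tuple[str, str]],
                     llm, max_rounds: int = 5) -> str:
        # llm(prompt) -> code is a hypothetical model call; the
        # write / run-on-examples / refine loop is the observable behavior.
        code = llm(f"Write a Python program that solves:\n{spec}")
        for _ in range(max_rounds):
            failures = []
            for puzzle_input, expected in examples:
                result = subprocess.run(
                    ["python3", "-c", code],
                    input=puzzle_input, capture_output=True, text=True,
                )
                if result.stdout.strip() != expected:
                    failures.append((expected, result.stdout, result.stderr))
            if not failures:
                break  # all examples pass; run on the real input
            code = llm(f"These examples failed: {failures}\nFix the program:\n{code}")
        return code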

But we can’t tell whether the puzzle solutions are in the training data.

I’m looking forward to seeing how well current agents perform on 2025’s puzzles.

  • suddenlybananas 2 days ago

    They obviously have the puzzles in their training data; why are you acting like this is uncertain?