Comment by Michelangelo11 21 hours ago

> Each replication task consists of a detailed specification and a reference implementation. The central idea is that AI models are trained to produce an implementation that precisely matches the reference behavior. This clear-cut approach significantly simplifies evaluation, as the grading criteria are objective and direct: either the generated implementation behaves identically to the reference, or it doesn’t.

OK, but then you have to produce the detailed specification, working backward from the reference implementation. This is extremely non-trivial, and it significantly weakens the TFA's parallels to pre-training, in which you don't really need inputs other than raw text corpora.

I'm not saying this eliminates the idea outright, but I do think it hobbles it badly.

vessenes 19 hours ago

I’d like to courteously disagree. I think existing models and existing tools are good enough to bootstrap this at least.

I’d propose the following architecture:

Step 1: Microsoft Phi style - read code and write specifications using a frontier model. You could use an ensemble here to nitpick the spec; it's only going to get written once. We also have, of course, many, many RFCs and codebases that conform to them; where they don't, we have an existing repository of bug reports, patches, forum complaints, etc.

Steps 2-4: implement multilayer evaluation: does it compile? Does an existing model think the code complies with the spec on inspection? When it's run on QEMU, are the key evals the same as for the original software?

I'd argue most of steps 2-4 are automatable, rely on existing tooling, and provide a framework that is, if not cheap, at least achievable. I'm also sure someone could improve this plan with a few more minutes of thought.
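The compile and behavioral-diff layers above can be sketched with plain subprocesses (the QEMU layer is elided here, and all names, the `gcc` invocation, and the binary paths are illustrative assumptions, not anything from the thread):

```python
# Hypothetical sketch of two of the evaluation layers in steps 2-4:
# layer 1 = does the candidate compile, layer 3 = does its observable
# behavior match the reference binary on a set of inputs.
import subprocess

def compiles(source_path: str) -> bool:
    # Layer 1: does it compile at all? (gcc is a stand-in toolchain.)
    result = subprocess.run(
        ["gcc", "-o", "/tmp/candidate", source_path],
        capture_output=True,
    )
    return result.returncode == 0

def behavior_matches(candidate_bin: str, reference_bin: str,
                     inputs: list[str]) -> bool:
    # Layer 3: run both binaries on identical inputs and diff
    # stdout and exit codes, the "key evals" of the simplest kind.
    for inp in inputs:
        cand = subprocess.run([candidate_bin], input=inp,
                              capture_output=True, text=True)
        ref = subprocess.run([reference_bin], input=inp,
                             capture_output=True, text=True)
        if cand.stdout != ref.stdout or cand.returncode != ref.returncode:
            return False
    return True
```

A real harness would also diff stderr, files written, and timing envelopes, but the pass/fail structure stays this simple.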

To me the interesting question is: will this add capabilities at current model sizes? My prior is yes, in that the current behemoth-size models feel only incrementally better than 1/10-size distills. I interpret that to mean we haven't gotten the most out of these larger scales. I will note Dario disagrees on this - he's publicly said we need at least 10x more scale than we have now.

dist-epoch 21 hours ago

The detailed specification is the output for a particular input.

And you can use a fuzzer to augment that.
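The idea of the spec being "the output for a particular input", augmented by fuzzing, can be sketched like this (a toy random-input fuzzer over a stand-in pure function; the names `reference`, `fuzz_spec`, and `matches_spec` are made up for illustration):

```python
# Sketch: fuzz a reference implementation with random inputs to build
# an input -> output table, which then serves as an executable spec
# that any candidate replication must reproduce exactly.
import random

def reference(x: int) -> int:
    # Placeholder reference implementation being replicated.
    return x * x + 1

def fuzz_spec(fn, n_cases: int = 100, seed: int = 0) -> dict[int, int]:
    # Each recorded (input, output) pair is one concrete clause of the spec.
    rng = random.Random(seed)
    inputs = [rng.randint(-10**6, 10**6) for _ in range(n_cases)]
    return {x: fn(x) for x in inputs}

def matches_spec(candidate, spec: dict[int, int]) -> bool:
    # A candidate passes iff it reproduces every recorded output.
    return all(candidate(x) == y for x, y in spec.items())
```

For stateful or I/O-heavy programs the "inputs" become event traces rather than integers, but the grading rule is the same: identical observable behavior or fail.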

YetAnotherNick 21 hours ago

When prompted correctly, models could generate a good specification in the form of pretty exhaustive tests. While all tests have weaknesses and are not a formal specification, they could get us 99% there.