Comment by robertk

The Apple paper does not look at its own data: the model outputs become short past certain thresholds because the models reflectively realize they do not have the context to write out the steps as requested, and suggest a Python program instead, just as a human would. One of the penalized environments is proven in the literature to be impossible to solve for n>6, which the authors seem unaware of. These points and more are covered in what I consider the definitive rebuttal of the paper and its sloppiness: https://www.alignmentforum.org/posts/5uw26uDdFbFQgKzih/bewar...