Comment by simianwords 20 hours ago
Honest question: why is this not enough?
If the code passes the tests and works at the functional level, what difference does it make whether you've read the code or not?
You could come up with pathological cases: it passed the tests by deleting them, or the code it wrote is extremely messy.
But we know that LLMs are way smarter than this. There's a very, very low chance of this happening, and even if it does, a quick glance at the code can catch it.
You can't test everything. The input space may be infinite. The app may feel janky. You can't even be sure you're testing everything that can be tested.
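(To make the "input space may be infinite" point concrete, here's a minimal sketch in Python using the hypothesis library; parse_price is a made-up function, not anything from this thread. Even property-based testing only samples roughly 100 generated inputs per run by default, a vanishingly small slice of all possible strings.)

    # Minimal sketch: property-based testing only *samples* the input space.
    # parse_price is a hypothetical function used purely for illustration.
    from hypothesis import given, strategies as st

    def parse_price(text: str) -> float:
        # stand-in implementation for the example
        return float(text.strip().lstrip("$"))

    @given(st.text())  # hypothesis generates ~100 arbitrary strings by default
    def test_parse_price_never_crashes(s):
        try:
            parse_price(s)
        except ValueError:
            pass  # rejecting bad input is fine; any other crash is a bug

    # Passing this test means the function survived a tiny sample of inputs,
    # not that it behaves sensibly on all of them.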
The code may seem to work functionally on day 1. Will it continue to seem to work on day 30? Most often it doesn't.
And in my experience, the chances of an LLM fucking up are hardly "very, very low". Maybe it's a skill issue on my part, but it's also the case that the spec is often only discovered as the app is being built. I'm sure this is less of a problem when you're essentially summoning up code that already exists in the training data, even if the LLM has to port it from another language, and LLMs can be useful in parts here and there. But turning the controls over to the infinite monkey machine has not worked out for me so far.