Comment by lighthouse1212 7 hours ago
I've been running an autonomous agent on a single codebase for about 5 weeks now, and my experience matches yours in some ways but diverges in others.
Where it matches:
- First passes are often decent; long sessions degrade
- The "fixing fixes" spiral is real
- Quality requires constant human oversight
Where it diverges (what worked for me):
1. Single-project specialization beats generalist use. The agent works on ONE codebase with accumulated context (a CLAUDE.md file, handoff notes, memory system). It's not trying to learn your codebase fresh each session - it reads what previous sessions wrote. This changes the dynamic significantly.
2. Structured handoffs over raw context injection. Instead of feeding thousands of lines of history, I have the agent write structured state to files at session end. The next session reads those files and "recognizes" the project state rather than trying to "remember" it. Much more reliable than context bloat (rough sketch of the idea after this list).
3. Autonomous runs work better for specific task types. Mechanical refactors, test generation, documentation, infrastructure scripts - these work well. Novel feature design still needs human involvement.
4. Code review is non-negotiable. I agree completely - unreviewed AI code is technical debt waiting to happen. The agent commits frequently in small chunks specifically so diffs are reviewable.
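To make point 2 concrete, here's a minimal sketch of the handoff pattern. It's an illustration, not my actual setup: the file name, fields, and helper functions are invented for the example.

```python
# Minimal sketch of a structured session handoff.
# The file name, fields, and helpers here are hypothetical placeholders.
import json
from datetime import datetime, timezone
from pathlib import Path

HANDOFF_PATH = Path("handoff.json")  # hypothetical location for session state


def write_handoff(completed: list, in_progress: list, notes: str) -> None:
    """At session end, persist a compact summary instead of raw transcript history."""
    state = {
        "updated_at": datetime.now(timezone.utc).isoformat(),
        "completed": completed,       # tasks finished this session
        "in_progress": in_progress,   # tasks to resume next session
        "notes": notes,               # decisions, gotchas, open questions
    }
    HANDOFF_PATH.write_text(json.dumps(state, indent=2))


def read_handoff() -> dict:
    """At session start, load the prior state so the agent 'recognizes' the project."""
    if not HANDOFF_PATH.exists():
        return {"completed": [], "in_progress": [], "notes": ""}
    return json.loads(HANDOFF_PATH.read_text())


if __name__ == "__main__":
    write_handoff(
        completed=["refactor auth module"],
        in_progress=["add integration tests for billing"],
        notes="billing tests need the sandbox API key",
    )
    print(read_handoff()["in_progress"])
```

The payoff is that the next session starts from a small, structured summary rather than replaying thousands of lines of transcript.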
My evidence: ~675 journal entries, ~300k words of documentation, working infrastructure, and a public site - all built primarily through autonomous agent sessions with review.
The key shift for me was treating it less like "a tool that writes code" and more like "a very persistent junior developer who needs structure but never forgets what you taught it."