Comment by CharlesW
> When I ask Claude to find bugs in my 20kloc C library it more or less just splits the file(s) into smaller chunks and greps for specific code patterns and in the end just gives me a list of my own FIXME comments (lol), which tbh is quite underwhelming - a simple bash script could do that too.
Here's a technique that often works well for me: When you get unexpectedly poor results, ask the LLM what it thinks an effective prompt would look like, e.g. "How would you prompt Claude Code to create a plan to effectively review code for logic bugs, ignoring things like FIXME and TODO comments?"
The resulting prompt is too long to quote, but you can see the raw result here: https://gist.github.com/CharlesWiltgen/ef21b97fd4ffc2f08560f...
From there, you can make any needed improvements, turn it into an agent, etc.
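If you'd rather script the idea than type it into Claude Code interactively, here's a minimal sketch of the same two-stage approach using the Anthropic Python SDK — note the model name, file name, and max_tokens values are just placeholders I picked for illustration, not anything from the workflow above:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Stage 1: ask the model what a good prompt for the task would look like.
meta_prompt = (
    "How would you prompt Claude Code to create a plan to effectively review "
    "code for logic bugs, ignoring things like FIXME and TODO comments?"
)
meta = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model name
    max_tokens=2048,
    messages=[{"role": "user", "content": meta_prompt}],
)
generated_prompt = meta.content[0].text

# Stage 2: feed the generated prompt (plus the code to review) back in
# as the actual review request.
with open("mylib.c") as f:  # hypothetical source file
    source = f.read()

review = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    messages=[{"role": "user", "content": f"{generated_prompt}\n\n{source}"}],
)
print(review.content[0].text)
```

In practice you'd review and tweak the stage-1 output before reusing it, which is the "make any needed improvements" step.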
I've found this a really useful strategy in many situations when working with LLMs. It seems odd that it works, since one would think that its ability to give a good reply to such a question means it already "understands" your intent in the first place, but that's just projecting human ability onto LLMs. I would guess this technique is similar to how reasoning modes seem to improve output quality, though I may misunderstand how reasoning modes work.