Comment by sheepscreek 3 days ago
It seems their tests rely on Claude alone. It’s not safe to assume that Codex or Gemini will behave the same way as Claude. I use all three and each has its own idiosyncrasies.
I've done very similar things with my custom agent that uses Gemini and have gotten very similar results. I'm working on the evals to back that claim up.