Comment by marv1nnnnn 7 months ago
I totally agree with your critique. To be honest, it's hard even for me to evaluate. What I do is select several packages that current LLMs fail to handle, which are in the sample folder (`crawl4ai`, `google-genai`, and `svelte`), and try some tricky prompts to see if it works. But even that evaluation is hard. LLMs can hallucinate. I would say it works most of the time, but there are always a few runs that fail to deliver. I actually prepared a comparison: cursor vs cursor + internet vs cursor + context7 vs cursor + llm-min.txt. But the results felt stochastic, so I didn't put them here. Will consider adding them to the repo as well.
> But even that evaluation is hard. LLMs can hallucinate. I would say it works most of the time, but there are always a few runs that fail to deliver
You can use success rate (%) over N runs on a fixed set of problems, which gives you a number you can compare across systems. A separate model does the evaluation. There are existing frameworks like DeepEval that facilitate this.
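A minimal sketch of that loop, assuming a hypothetical `generate_answer` for the system under test and a hypothetical `judge_passes` backed by a separate evaluator model (e.g. a DeepEval metric could fill that role): run each system N times per problem and report the pass percentage.

```python
# Sketch: success rate % over N runs per problem, judged by a separate model.
# `generate_answer` and `judge_passes` are placeholders, not real APIs; plug in
# your own agent setup (cursor, cursor + context7, cursor + llm-min.txt, ...)
# and your own judge.

from dataclasses import dataclass


@dataclass
class Problem:
    prompt: str    # tricky prompt targeting e.g. crawl4ai / google-genai / svelte
    expected: str  # reference answer or rubric the judge checks against


def generate_answer(system: str, prompt: str) -> str:
    """Placeholder: call the system under test and return its answer."""
    raise NotImplementedError


def judge_passes(answer: str, expected: str) -> bool:
    """Placeholder: ask a separate evaluator model whether the answer is acceptable."""
    raise NotImplementedError


def success_rate(system: str, problems: list[Problem], n_runs: int = 5) -> float:
    """Success rate % over n_runs per problem, averaging out stochastic failures."""
    passes = total = 0
    for problem in problems:
        for _ in range(n_runs):
            answer = generate_answer(system, problem.prompt)
            passes += judge_passes(answer, problem.expected)
            total += 1
    return 100.0 * passes / total


# Usage: compare the configurations on the same problem set.
# for system in ["cursor", "cursor+internet", "cursor+context7", "cursor+llm-min.txt"]:
#     print(system, success_rate(system, problems, n_runs=5))
```

Running each configuration several times per problem turns the "sometimes it fails" observation into a number you can actually report in the repo.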