Comment by timhigins
> LLM could hallucinate
The job of any context retrieval system is to retrieve the relevant info for the task so the LLM doesn't hallucinate. Maybe build a benchmark based on less-known external libraries, with test cases that check whether the output is correct (or with a mocking layer to verify that the LLM-generated code calls roughly the correct functions).
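To make the mocking-layer idea concrete, here is a minimal sketch in Python. The library name `obscurelib` and the function names `connect`/`fetch_records` are hypothetical placeholders; a real benchmark would swap in actual less-known libraries and the API calls a reference solution is expected to make.

```python
# Minimal sketch of the mocking-layer check: run the LLM-generated snippet
# against a mocked library and assert that roughly the right functions were
# called. "obscurelib", "connect", and "fetch_records" are made-up names.
from unittest.mock import MagicMock


def check_generated_code(generated_code: str, expected_calls: set[str]) -> bool:
    # Stand in for the real library so the snippet runs without the
    # dependency installed and without side effects.
    fake_lib = MagicMock(name="obscurelib")
    sandbox = {"obscurelib": fake_lib}

    try:
        exec(generated_code, sandbox)  # execute the LLM-generated snippet
    except Exception:
        return False  # code that doesn't even run fails the benchmark case

    # method_calls records named calls like obscurelib.connect(...)
    actual_calls = {name for name, _args, _kwargs in fake_lib.method_calls}
    return expected_calls <= actual_calls


# Example: a generated snippet that uses the (mocked) library as expected.
snippet = """
client = obscurelib.connect("https://example.com")
rows = obscurelib.fetch_records(client, limit=10)
"""
print(check_generated_code(snippet, {"connect", "fetch_records"}))  # True
```

This only checks that the expected API surface was touched, not that the arguments or results are right, so it would complement (not replace) test cases that execute against the real library.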
Thanks for the feedback. This will be my next step. Personally, I feel it would be hard to design those test cases by myself.