Comment by iandanforth 7 months ago

I applaud this effort; however, the "Does it work?" section answers the wrong question. Anyone can write a trivial doc compressor and show a graph saying "The compressed version is smaller!"

For this to "work" you need to have a metric that shows that AIs perform as well, or nearly as well, as with the uncompressed documentation on a wide range of tasks.

marv1nnnnn 7 months ago

I totally agree with your critique. To be honest, it's hard even for me to evaluate. What I do is select several packages that current LLMs fail to handle, which are in the sample folder: `crawl4ai`, `google-genai` and `svelte`. Then I try some tricky prompts to see if it works. But even that evaluation is hard. LLM could hallucinate. I would say most time it works, but there are always few runs that failed to deliver. I actually prepared a comparison: cursor vs cursor + internet vs cursor + context7 vs cursor + llm-min.txt. But I thought it was too stochastic, so I didn't put it here. I will consider adding it to the repo as well.

ricardobeat 7 months ago

> But even that evaluation is hard. LLM could hallucinate. I would say most time it works, but there are always few runs that failed to deliver
You can use success rate % over N runs for a set of problems, which is something you can compare to other systems. A separate model does the evaluation. There are existing frameworks like DeepEval that facilitate this.
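A minimal sketch of the "success rate over N runs" idea, assuming each system under test is wrapped as a plain function from prompt to answer and that a separate `judge` callable decides pass or fail. The names `success_rates`, `systems`, and `judge` are hypothetical placeholders for illustration, not the API of DeepEval or any other framework.

```python
from typing import Callable, Dict, List


def success_rates(
    systems: Dict[str, Callable[[str], str]],
    problems: List[str],
    judge: Callable[[str, str], bool],
    n_runs: int = 10,
) -> Dict[str, float]:
    """Return, per system, the fraction of runs the judge marked as passing."""
    rates = {}
    for name, generate in systems.items():
        passed = 0
        for problem in problems:
            # Repeat each problem to average out run-to-run randomness.
            for _ in range(n_runs):
                answer = generate(problem)
                if judge(problem, answer):
                    passed += 1
        rates[name] = passed / (len(problems) * n_runs)
    return rates
```

In practice `systems` would hold one entry per setup being compared (for example, with and without llm-min.txt in context), and `judge` would call a separate model or run test cases.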

willvarfar 7 months ago

Dual run.
Run the same questions against a model with the unminified docs and with the minified docs, show the results side by side, and see how, in your subjective opinion, they hold up.
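A dual run like this could be sketched as follows, assuming a hypothetical `ask(docs, question)` function that sends the documentation plus the question to whatever model is being tested. Nothing here is tied to a specific model API; the report is only meant for manual, subjective review.

```python
from typing import Callable, List, NamedTuple


class Comparison(NamedTuple):
    question: str
    full_answer: str
    min_answer: str


def dual_run(
    questions: List[str],
    ask: Callable[[str, str], str],  # hypothetical: ask(docs, question) -> answer
    full_docs: str,
    min_docs: str,
) -> List[Comparison]:
    """Ask every question once per documentation variant."""
    return [Comparison(q, ask(full_docs, q), ask(min_docs, q)) for q in questions]


def format_report(rows: List[Comparison]) -> str:
    """Render the two answers one under the other for manual review."""
    blocks = []
    for row in rows:
        blocks.append(
            f"Q: {row.question}\n"
            f"  full docs: {row.full_answer}\n"
            f"  min docs:  {row.min_answer}"
        )
    return "\n\n".join(blocks)
```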

eden-u4 7 months ago

Why don't you ask the model about the shrunk system prompt and the original system prompt? That way you can infer whether the same relevant information is "stored" in the hidden state of the model.
Or better yet, directly check the hidden-state difference between a model fed the original prompt and one fed the shrunk prompt.
This should remove the randomness from the results.

rybosome 7 months ago

To be honest with you, it being stochastic is exactly why you should post it.
Having data is how we learn and build intuition. If your experiments showed that modern LLMs were able to succeed more often when given the llm-min file, then that’s an interesting result even if all that was measured was “did the LLM do the task”.
Such a result would raise a lot of interesting questions and ideas, like about the possibility of SKF increasing the model’s ability to apply new information.

timhigins 7 months ago

> LLM could hallucinate
The job of any context retrieval system is to retrieve the relevant info for the task so the LLM doesn't hallucinate. Maybe build a benchmark based on less-known external libraries with test cases that can check the output is correct (or with a mocking layer to know that the LLM-generated code calls roughly the correct functions).
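One way the mocking-layer idea might look, as a rough sketch: run the LLM-generated snippet against a `MagicMock` standing in for the library and record which functions it called. The helper name `called_functions` and the convention that the generated code refers to the library as `lib` are assumptions made for illustration. `exec` should only ever see trusted or sandboxed code.

```python
from unittest.mock import MagicMock


def called_functions(generated_code: str, lib_name: str = "lib") -> set:
    """Run generated code against a mock library and return the names it called."""
    mock_lib = MagicMock()
    # The mock is injected under `lib_name`, so no real library code runs.
    exec(generated_code, {lib_name: mock_lib})
    # Each recorded call unpacks to (dotted_name, args, kwargs).
    return {name for name, _args, _kwargs in mock_lib.mock_calls}


def calls_expected(generated_code: str, expected: set) -> bool:
    """True if every expected function was called at least once."""
    return expected <= called_functions(generated_code)
```

A benchmark case would then pair a prompt with the set of functions a correct answer has to call, which checks "roughly the correct functions" without needing the real library or network access.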

- marv1nnnnn 7 months ago
  
  Thanks for the feedback. This will be my next step. Personally I feel it's hard to design those test cases (by myself)
  

SparkyMcUnicorn 7 months ago

It's also missing the documentation part. Without additional context, method/type definitions with a short description will only go so far.

Cherry-picking a tiny example, this wouldn't capture the fact that Cloudflare Durable Objects can only have one alarm at a time and each set overwrites the old one. The model will happily architect something with a single object, expecting to be able to set a bunch of alarms on it. Maybe I'm wrong and this tool would document it correctly into a description. But this is just a small example.

For much of a framework or library, maybe this works. But I feel like (in order for this to be most effective) the proposed spec possibly needs an update to include a little more context.

I hope this matures and works well. And there's nothing stopping me from filling in gaps with additional docs, so I'll be giving it a shot.

klntsky 7 months ago

Shameless plug: I'm working on a public contest website for prompts compressing other prompts.

It will include evaluations and a public scoreboard.

It's not usable rn, but feel free to follow: https://github.com/klntsky/prompt-compression-contest/

enjoylife 7 months ago

Was going to point this out too. One suggestion would be to try this on libraries that have had recent major semver bumps. See if the compressed docs do better on the backwards-incompatible changes.

rco8786 7 months ago

Yea I was disappointed to see that they just punted (or opted not to show?) on benchmarks.
