Comment by potatolicious 3 days ago
What you're getting at is the heart of the problem with the LLM hype train though, isn't it?
"We should have rigorous evaluations of whether or not [thing] works." seems like an incredibly obvious thought.
But in the realm of LLM-enabled use cases, such evaluations are also expensive. You'd need to recruit dozens, perhaps even hundreds, of developers, with extensive observation and rating of the results.
So rather than actually trying to measure the efficacy, we just get blog posts with cherry-picked examples of "LLM does something cool". Everything is just anecdata.
This is also the biggest barrier to actual LLM adoption for many, many applications. The gap between "it does something REALLY IMPRESSIVE 40% of the time and shits the bed otherwise" and "production system" is a yawning chasm.
It's the heart of the problem with all software engineering research. That's why we have so little reliable knowledge.
It applies to using LLMs too. I guess the biggest difference here is that LLMs are being pushed by a handful of companies with enough money that running a test like this would be trivial for them. So the fact that they aren't doing it also says a lot.