Comment by potatolicious 3 days ago
What you're getting at is the heart of the problem with the LLM hype train though, isn't it?
"We should have rigorous evaluations of whether or not [thing] works." seems like an incredibly obvious thought.
But in the realm of LLM-enabled use cases, such evaluations are also expensive. You'd need to recruit dozens, perhaps even hundreds, of developers, with extensive observation and rating of the results.
So rather than actually trying to measure the efficacy, we just get blog posts with cherry-picked examples of "LLM does something cool". Everything is just anecdata.
This is also the biggest barrier to actual LLM adoption for many, many applications. The gap between "it does something REALLY IMPRESSIVE 40% of the time and shits the bed otherwise" and "production system" is a yawning chasm.
It's the heart of the problem with all software engineering research. That's why we have so little reliable knowledge.
It applies to using LLMs too. I guess the biggest difference here is that LLMs are being pushed by a handful of companies with enough money that running a test like this would be trivial for them. So the fact that they aren't doing it also says a lot.