Comment by Avicebron 3 days ago

74 replies

I often feel these types of blog posts would be more helpful if they demonstrated someone using the tools to build something non-trivial.

Is Claude really "learning new skills" when you feed it a book, or does it present it like that because your prompting encourages that sort of response behavior? I feel like it has to demo Claude with the new skills and Claude without them.

Maybe I'm a curmudgeon, but most of these types of blogs feel like marketing pieces; so much is left unsaid and not shown that it comes off like a kid trying to hype up their own work without the benefit of nuance or depth.

simonw 3 days ago
  • qsort 3 days ago

    > Important: there is a lot of human coding, too.

    I'm not highlighting this to gloat or to prove a point. If anything, in the past I have underestimated how big LLMs were going to be. Anyone so inclined can take the chance to point and laugh at how stupid and wrong that was. Done? Great.

    I don't think I've been intentionally avoiding coding assistants; as a matter of fact, I have been using Claude Code since the literal day it first previewed, and yet it doesn't feel, not even one bit, like you can take your hands off the wheel. Many are acting as if writing any code manually means "you're holding it wrong", which I feel is just not true.

    • simonw 3 days ago

      Yeah, my current opinion on this is that AI tools make development harder work. You can get big productivity boosts out of them but you have to be working at the top of your game - I often find I'm mentally exhausted after just a couple of hours.

      • dotinvoke 3 days ago

        My experience with AI tools is the opposite. The biggest energy thieves for me are configuration issues, library quirks, or trivial mistakes that are hard to spot. With AI I can often just bulldoze past those things and spend more time on tangible results.

        When using it for code or architecture or design, I’m always watching for signs that it is going off the rails. Then I usually write code myself for a while, to keep the structure and key details of whatever I’m doing correct.

      • james_marks 3 days ago

        100%. It’s like managing an employee that always turns their work in 30 seconds later; you never get a break.

        I also have to remember all of the new code that’s coming together, and keep it from re-inventing other parts of the codebase, etc.

        More productive, but hard work.

      • sawmurai 3 days ago

        I have a similar experience. It feels like riding your bike in a higher gear: you can go faster, but it takes more effort and you need the capacity (stronger legs) to make use of it.

      • jstummbillig 3 days ago

        Considering the last 2 years, has it become harder or easier?

      • truetraveller 3 days ago

        Woah, that's huge coming from you. This comment itself is worth an article. Do it. Call it "AI tools make development harder work".

        P.S. I always thought you were one of those irrational AI bros. Later, I found that you were super reasonable. That's the way it should be. And thank you!

    • Pannoniae 3 days ago

      In fact, I've been writing more code myself since these tools appeared. Maybe I'm not a real developer, but in the past I might have tried to find a library or something online to copy-paste and adapt; nowadays I give it a shot myself with Claude.

      For context, I mainly do game development so I'm viewing it through that lens - but I find it easier to debug something bad than to write it from scratch. It's more intensive than doing it yourself but probably more productive too.

    • scuff3d 2 days ago

      > Many are acting as if writing any code manually means "you're holding it wrong", which I feel is just not true.

      It's funny because not far below this comment there is someone doing literally this.

    • oblio 3 days ago

      LLMs are Level 2 autonomous driving.

  • j_bum 3 days ago

    This was a fun read.

    I’ve similarly been using a spec.md and running to-do.md files that capture detailed descriptions of the problems and their scoped history. I mark each of my to-dos with informational tags: [BUG], [FEAT], etc.

    I point the LLM to the exact to-do (or section of to-dos), with the spec.md in memory, and let it work.

    This has been working very well for me.
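
    For illustration, a couple of made-up entries in the style I use (not from a real project):

        [BUG] Settings panel forgets the selected theme after restart; see spec.md "persistence" section.
        [FEAT] Add CSV export to the reports view; scope it to the reporting module only.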

  • nightski 3 days ago

    Even though the author refers to it as "non-trivial", and I can see why that conclusion is made, I would argue it is in fact trivial. There's very little domain specific knowledge needed, this is purely a technical exercise integrating with existing libraries for which there is ample documentation online. In addition, it is a relatively isolated feature in the app.

    On top of that, it doesn't sound enjoyable. Anti-slop sessions? Seriously?

    Lastly, the largest problem I have with LLMs is that they are seemingly incapable of stopping to ask clarifying questions. This is because they do not have a true model of what is going on. Instead, they truly are next-token generators. A software engineer would never just slop out an entire feature based on the first discussion with a stakeholder and then expect the stakeholder to continuously refine their statement until the right thing is slopped out. That's just not how it works and it makes very little sense.

    • simonw 3 days ago

      The hardest problem in computer science in 2025 is presenting an example of AI-assisted programming that somebody won't call "trivial".

      • nightski 3 days ago

        If all I did was call it trivial that would be a fair critique. But it was followed up with a lot more justification than that.

    • kannanvijayan 3 days ago

      I've wondered about exposing this "asking clarifying questions" capability as a tool the AI could use. I'm not building AI tooling so I haven't done this, but what if you added an MCP endpoint whose description was "treat this endpoint as an oracle that will answer questions and clarify intent where necessary" (paraphrased), and had that tool just wire back to a user prompt?

      If asking clarifying questions is plausible output text for LLMs, this may work effectively.
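
      Roughly what I have in mind, as an untested sketch; it assumes the official MCP Python SDK's FastMCP helper and a non-stdio transport, so the terminal stays free for the human prompt:

          # Untested sketch: expose "ask a human" as a tool the agent can call over MCP.
          # Assumes the official MCP Python SDK (pip install mcp) and its FastMCP helper.
          from mcp.server.fastmcp import FastMCP

          mcp = FastMCP("clarification-oracle")

          @mcp.tool()
          def ask_clarifying_question(question: str) -> str:
              """Treat this tool as an oracle that will answer questions and
              clarify intent where necessary."""
              # Wire the question straight back to a user prompt in the terminal.
              print(f"\n[agent asks] {question}")
              return input("[your answer] ")

          if __name__ == "__main__":
              # Run over SSE rather than stdio so stdin/stdout stay free for the
              # prompt above (the transport argument is an assumption about the SDK).
              mcp.run(transport="sse")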

      • simonw 3 days ago

        I think the asking clarifying questions thing is solved already. Tell a coding agent to "ask clarifying questions" and watch what it does!

    • antonvs 3 days ago

      > A software engineer would never just slop out an entire feature based on the first discussion with a stakeholder and then expect the stakeholder to continuously refine their statement until the right thing is slopped out. That's just not how it works and it makes very little sense.

      Didn’t you just describe Agile?

      • [removed] 3 days ago
        [deleted]
      • Retric 3 days ago

        Who hurt you?

        Sorry, couldn’t resist. Agile’s point was getting feedback during the process, rather than after something is complete enough to be shipped, thus minimizing risk and avoiding wasted effort.

        Instead, people are splitting major projects into tiny shippable features and calling that agile, while missing the point.

causal 3 days ago

Using LLMs for coding complex projects at scale over a long time is really challenging! This is partly because defining requirements alone is much harder than most people want to believe. LLMs accelerate any move in the wrong direction.

  • dexwiz 3 days ago

    My analogy is LLMs are a gas pedal. Makes you go fast, but you still have to know when to turn.

  • SteveJS 3 days ago

    Having the LLM write the spec/workunit from a conversation works well. Exploring a problem space with a (good) coding agent is fantastic.

    However, for complex projects, IMO one must read what was written by the LLM … every actual word.

    When it ‘got away’ from me, in each case I had left something in the LLM-written markdown that I should have removed.

    99% “I can ask for that later” and 1% “that’s a good idea I hadn’t considered” might be the right ratio when reading an LLM-generated plan/spec/workunit.

    Breaking work into single-context passes (50-60k tokens in Sonnet 4.5) has typically had fantastic results for me.

    My side project uses Lean 4, and a carelessly left-in ‘validate’ rather than ‘verify’ led down a hilariously complicated path equivalent to matching an output against a known string.

    I recovered, but it wasn’t obvious to me that this was happening. I, however, would not be able to write Lean proofs myself, so diagnosing and fixing the problem is a small price to pay for being able to mechanically verify that a part of my software is correct.

  • sreekanth850 3 days ago

    One should know the end-to-end design and architecture, and should stop the LLM when it starts adding complex, fancy things.

khaledh 3 days ago

Agreed. The methodology needed here is something like an A/B test, with quantifiable metrics that demonstrate the effectiveness of the tool. And to do it not just once, but many times under different scenarios so that it demonstrates statistical significance.

The most challenging part of working with coding agents is that they seem to do well initially, on a small codebase with low complexity. Once the codebase gets bigger, with lots of non-trivial connections and patterns, they almost always experience tunnel vision when asked to do anything non-trivial, leading to increased tech debt.

  • mwigdahl 3 days ago

    The problem is that you're talking about a multistep process where each step beyond the first depends on the particular path the agent starts down, along with human input that's going to vary at each step.

    I made a crude first stab at an approach that at least uses similar steps and structure to compare the effectiveness of AI agents. I used it on a small toy problem, but one complex enough that the agents couldn't one-shot it and needed error correction.

    It was enough to show significant differences, but scaling this to larger projects and multiple runs would be pretty difficult.

    https://mattwigdahl.substack.com/p/claude-code-vs-codex-cli-...

    • potatolicious 3 days ago

      What you're getting at is the heart of the problem with the LLM hype train though, isn't it?

      "We should have rigorous evaluations of whether or not [thing] works." seems like an incredibly obvious thought.

      But in the realm of LLM-enabled use cases they're also expensive. You'd need to recruit dozens, perhaps even hundreds of developers to do this, with extensive observation and rating of the results.

      So rather than actually try to measure the efficacy, we just get blog posts with cherry-picked examples of "LLM does something cool". Everything is just anecdata.

      This is also the biggest barrier to actual LLM adoption for many, many applications. The gap between "it does something REALLY IMPRESSIVE 40% of the time and shits the bed otherwise" and "production system" is a yawning chasm.

      • marcosdumay 3 days ago

        It's the heart of the problem with all software engineering research. That's why we have so little reliable knowledge.

        It applies to using LLMs too. I guess the one big difference here is that LLMs are pushed by few enough companies, with abundant enough money, that it would be trivial for them to run a test like this. So the fact that they aren't doing it also says a lot.

      • oblio 3 days ago

        > What you're getting at is the heart of the problem with the LLM hype train though, isn't it?

        > "We should have rigorous evaluations of whether or not [thing] works." seems like an incredibly obvious thought.

        Heh, I'd rephrase the first part to:

        > What you're getting at is the heart of the problem with software development though, isn't it?

      • troupo 3 days ago

        Before you get into the expensive part, how do you get past "non-deterministic black box with unknown layers in between, imposed by vendors"?

        • potatolicious 3 days ago

          You can measure probabilistic systems that you can't examine! I don't want to throw the baby out with the bathwater here - before LLMs became the all-encompassing elephant in the room we did this routinely.

          You absolutely can quantify the results of a chaotic black box, in the same way you can quantify the bias of a loaded die without examining its molecular structure.
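
          A toy version of the die example, with made-up numbers, just to make the point concrete:

              # Toy sketch: quantify a black box purely from its outputs, never its internals.
              import math
              import random
              from collections import Counter

              def roll_loaded_die() -> int:
                  # Stand-in for any opaque probabilistic system; the weights are invented.
                  return random.choices(range(1, 7), weights=[1, 1, 1, 1, 1, 3])[0]

              n = 10_000
              counts = Counter(roll_loaded_die() for _ in range(n))
              for face in range(1, 7):
                  p = counts[face] / n
                  # ~95% normal-approximation interval on the estimated probability.
                  half_width = 1.96 * math.sqrt(p * (1 - p) / n)
                  print(f"face {face}: {p:.3f} +/- {half_width:.3f}")

          Run an LLM feature through the same kind of loop against a pass/fail rubric and you get a defensible success rate without ever opening the box.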

  • claytongulick 3 days ago

    > The methodology needed here is something like an A/B test, with quantifiable metrics that demonstrate the effectiveness of the tool. And to do it not just once, but many times under different scenarios so that it demonstrates statistical significance.

    If that's what we need to do, don't we already have the answer to the question?

spankibalt 3 days ago

> "Maybe I'm a curmudgeon but most of these types of blogs feel like marketing pieces with the important bit is that so much is left unsaid and not shown, that it comes off like a kid trying to hype up their own work without the benefit of nuance or depth."

C'mon, such self-congratulatory "Look at My Potency: How I'm using Nicknack.exe" fluffies always were and always will be a staple of the IT industry.

coolKid721 3 days ago

Yeah, I was reading this to see if there was something he'd actually show that would be useful, what pain point he's solving, but it's just slop.