Comment by overgard 2 days ago

I asked Codex to write some unit tests for Redux today. At first glance it looked fine, and I continued on. I then went back to add a test by hand, and after looking more closely at the output there were like 50 wtf worthy things scattered in there. Sure they ran, but it was bad in all sorts of ways. And this was just writing something very basic.

This has been my experience almost every time I use AI: superficially it seems fine, once I go to extend the code I realize it's a disaster and I have to clean it up.

The problem with "code is cheap" is that it's not. GENERATING code is now cheap (while the LLMs are subsidized by endless VC dollars, anyway), but the cost of owning that code is not. Every line of code is a liability, and generating thousands of lines a day is like running up a few thousand dollars of debt on a credit card thinking you're getting free stuff, and then being surprised when it gets declined.

acedTrex 2 days ago

I've always said every line is a liability; it's our job to limit liabilities. That has largely gone out the window these days.

  • ky3 a day ago

    EWD 1036: On the cruelty of really teaching computing science (1988)

    “My point today is that, if we wish to count lines of code, we should not regard them as ‘lines produced’ but as ‘lines spent’: the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger.”

  • elgenie a day ago

    No code is as easy to maintain as no code.

    No code runs as fast as no code.

  • doug_durham a day ago

    A better formulation is "every feature is a liability". Taking it to the line of code level is too prescriptive. Occasionally writing more verbose code is preferable if it makes it easier to understand.

    • catdog a day ago

      > A better formulation is "every feature is a liability". Taking it to the line of code level is too prescriptive.

      Amount of code is a huge factor but maybe not the best wording here. It's more a matter of complexity, where amount of code is a contributing metric but not the only one. You can very easily have a feature implemented in too complex a way and with too much code (especially if an LLM generated the code, but also with human developers). Also, not every feature is equal.

      > Occasionally writing more verbose code is preferable if it makes it easier to understand.

      I think this is more a classic case of "if the metric becomes a goal it ceases to be a good metric" than it being a bad metric per se.

    • phi-go a day ago

      This sounds wrong; features are supposed to be the value of your code. The required maintenance and the slowdown in building more features (technical debt) are the liability, which is how I understood the relationship to "lines of code" anyway.

      • TeMPOraL a day ago

        Wrong or not, the industry embraced it.

        I can sort of understand it if I squint: every feature is a maintenance burden, and a risk of looking bad in front of users when you break or remove it, even if those users didn't use this feature. It's really a burden to be avoided when the point of your product is to grow its user base, not to actually be useful. Which explains why even Fisher-Price toys look more feature-ful and ergonomic than most new software products.

  • ramraj07 a day ago

    Agreed, every line you ship, whether you wrote it or not, you are responsible for. In that regard, while I write a lot of code completely with AI, I still endeavor to keep the lines as minimal as possible. This means you never write both the main code and the tests using AI. I'd rather have no tests than AI tests (we have a QA team writing those). This kinda works.

  • cs_sorcerer 2 days ago

    Followed by: even better than code is no code, and best of all is deleting code.

    It's one of those things that has always struck me as funny about programming: how less usually really is more.

  • chr15m 2 days ago

    > every line is a liability, its our job to limit liabilities.

    Hard agree!

    • ares623 a day ago

      But more code from AI means stocks go up. Stocks are assets. If you generate enough code the assets will outnumber the liabilities. It’s accounting 101. /s

  • nomel 2 days ago

    The only people I've known that share this perspective are those that hate abstraction. Going back to their code, to extend it in some way, almost always requires a rewrite, because they wrote it with the goal of minimum viable complexity rather than understanding the realities of the real-world problem they're solving, like "we all know we need these other features, but we have a deadline!"

    For one-offs, this is fine. For anything maintainable that needs to survive the realities of time, this is truly terrible.

    Related: my friend works in a performance-critical space. He can't use abstractions, because the direct, bare-metal, "exact fit" implementation will perform best. They can't really add features, because it'll throw the timing of other things off too much, so they usually have to re-architect. But that's the reality of their problem space.

    • johnmwilkinson 2 days ago

      I believe this is conflating abstraction with encapsulation. The former is about semantic levels, the latter about information hiding.
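
      A contrived TypeScript sketch of that distinction (names are made up, purely to illustrate):

        // Encapsulation: the representation is hidden, but callers still work
        // at the same semantic level (temperatures in, temperatures out).
        class Temperature {
          private kelvin: number;
          constructor(kelvin: number) { this.kelvin = kelvin; }
          celsius(): number { return this.kelvin - 273.15; }
        }

        // Abstraction: a step up in semantic level -- callers now think in
        // terms of a "policy", not temperatures or their representation.
        interface ThermostatPolicy {
          shouldHeat(current: Temperature): boolean;
        }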

      • nomel 2 days ago

        Maybe I am? How is it possible to abstract without encapsulation? And also, how is it possible to encapsulate without abstracting some concept (intentionally or not) contained in that encapsulation? I can't really differentiate them, in the context of naming/referencing some list of CPU operations.

    • jahsome a day ago

      I don't see how the two are related, personally. I'm regularly accused of over-abstraction specifically because I aspire to make each abstraction do as little as possible, i.e. fewest lines possible.

      • galaxyLogic a day ago

        "Abstracting" means extracting the common parts of multiple instances, and making everything else a parameter. The difficulty for software is that developers often start by writing the abstraction, rather than having multiple existing instances and then writing code that collects the common parts of those instances into a single abstraction. I guess that is what "refactoring" is about.

        In the sciences and humanities, abstraction is applied the proper way: studying the instances first, then describing the multitude of existing phenomena by giving names to their common, repeating descriptions.
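
        A tiny made-up TypeScript example of that direction of travel: start from two existing instances,

          function priceWithVat(net: number): number { return net * 1.20; }
          function priceWithSalesTax(net: number): number { return net * 1.07; }

        then abstract by keeping the common part and turning what varies into a parameter:

          function priceWithTax(net: number, rate: number): number { return net * (1 + rate); }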

      • nomel a day ago

        I call that lasagna code! From what I've seen, developers start with spaghetti, overcompensate with lasagna, then end up with some organization more optimized for the human, that minimizes cognitive load while reading.

        To me, abstraction is an encapsulation of some concept. I can't understand how they're practically different, unless you encapsulate true nonsense, without purpose or resulting meaning, which I can't think of an example of, since humans tend to categorize/name everything. I'm dumb.

acemarke a day ago

Hi, I'm the primary Redux maintainer. I'd love to see some examples of what got generated! (Doubt there's anything we could do to _influence_ this, but curious what happened here.)

FWIW we do have our docs on testing approaches here, and have recommended a more integrated-style approach to testing for a while:

- https://redux.js.org/usage/writing-tests
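
As a rough sketch of what that looks like (hypothetical counterSlice and Counter component; not the docs' exact helper, and it assumes Jest/Vitest with jest-dom set up): render the real component with a real store and assert on behavior rather than mocking the store.

  import { configureStore } from "@reduxjs/toolkit";
  import { Provider } from "react-redux";
  import { render, screen, fireEvent } from "@testing-library/react";
  import counterReducer from "./counterSlice"; // hypothetical slice
  import { Counter } from "./Counter";         // hypothetical component

  test("clicking increment updates the visible count", () => {
    // A real store, freshly created for this test -- no mocks, no shared state.
    const store = configureStore({ reducer: { counter: counterReducer } });

    render(
      <Provider store={store}>
        <Counter />
      </Provider>
    );

    fireEvent.click(screen.getByRole("button", { name: /increment/i }));
    expect(screen.getByText(/count: 1/i)).toBeInTheDocument();
  });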

visarga a day ago

> "write some unit tests for Redux today"

The equivalent of "draw me a dog" -> not a masterpiece!? Who would have thought? You need to come up with a testing methodology, write it down, and then ask the model to go through it. It likes to make assumptions about unspecified things, so you've got to be careful.

More fundamentally I think testing is becoming the core component we need to think about. We should not vibe-check AI code, we should code-check it. Of course it will write the actual test code, but your main priority is to think about "how do I test this?"

You can only know the value of code up to the level of its testing. You can't commit your eyes into the repo, so don't do "LGTM" vibe-testing of AI code; that's like walking a motorcycle.

akst 2 days ago

ATM I feel like LLMs writing tests can be a bit dangerous at times; there are cases where it's fine and there are cases where it's not. I don't really think I could articulate a systemised basis for identifying either case, but I know it when I see it, I guess.

Like the other day, I gave it a bunch of use cases to write tests for; the use cases were correct, the code was not, and when it saw one of the tests break it sought to rewrite the test. You risk suboptimal results when an agent is dictating its own success criteria.

At one point I did try using separate Claude instances: one to write tests, then I'd get the other instance to write the implementation unaware of the tests. But it's a bit too much setup.

  • icedchai 19 hours ago

    I work with individuals who attempt to use LLMs to write tests. More than once, it's added nonsensical, useless test cases. Admittedly, humans do this, too, to a lesser extent.

    Additionally, if their code has broken existing tests, it "fixes" them not by fixing the code under test, but by changing the tests... (assert status == 200 becomes 500, plus deleting code.)

    Tests "pass." PR is opened. Reviewers wade through slop...

    • sigotirandolas 3 hours ago

      The most annoying thing is that even after cleaning up all the nonsense, the tests still contain all sorts of fanfare, and it's essentially impossible to get the submitter to trim them because it's death by a thousand cuts (and you'd better not say "do it as if you didn't use AI" in the current climate..)

sjsizjhaha 2 days ago

Generating code was always cheap. That’s part of the reason this tech has to be forced on teams. Similar to the move to cloud, it’s the kind of cost that’s only gonna show up later - faster than the cloud move, I think. Though, in some cases it will be the correct choice.

cheema33 2 days ago

This is how you do things if you are new to this game.

Get two other, different, LLMs to thoroughly review the code. If you don’t have an automated way to do all of this, you will struggle and eventually put yourself out of a job.

If you do use this approach, you will get code that is better than what most software devs put out. And that gives you a good base to work with if you need to add polish to it.
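
A minimal sketch of the kind of automation meant here (model names, file paths, and the prompt are placeholders; a real setup would run against the PR diff in CI):

  import fs from "node:fs";
  import OpenAI from "openai";
  import Anthropic from "@anthropic-ai/sdk";

  const diff = fs.readFileSync("changes.diff", "utf8"); // placeholder input
  const prompt = `Review this diff for bugs, missing tests, and risky patterns:\n\n${diff}`;

  async function main() {
    const openai = new OpenAI();       // reads OPENAI_API_KEY
    const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY

    // Reviewer 1 (placeholder model name).
    const r1 = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
    });

    // Reviewer 2, from a different model family (placeholder model name).
    const r2 = await anthropic.messages.create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 2048,
      messages: [{ role: "user", content: prompt }],
    });

    console.log("Reviewer 1:\n" + r1.choices[0].message.content);
    console.log("Reviewer 2:\n" + r2.content.map(b => (b.type === "text" ? b.text : "")).join("\n"));
  }

  main();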

  • overgard a day ago

    I actually have used other LLMs to review the code in the past (not today, but in the past). It's fine, but it doesn't tend to catch things like "this technically works but it's loading a footgun." For example, in the Redux tests I mentioned in my original post, the tests were reusing a single global store variable. It technically worked, the tests ran, and since these were the first tests I introduced in the code base there weren't any issues, even though this made the tests non-deterministic... but it was a pattern that was easily going to break down the line.
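
    For illustration, the usual fix is a fresh store per test instead of one shared module-level variable (a sketch assuming a Redux Toolkit setup with a hypothetical counterSlice):

      import { configureStore } from "@reduxjs/toolkit";
      import counterReducer, { increment } from "./counterSlice"; // hypothetical slice

      // Bad: a single module-level store shared by every test -- state leaks
      // between tests, so they only pass when run in a particular order.
      // const store = configureStore({ reducer: { counter: counterReducer } });

      // Better: each test builds its own store, so tests stay independent.
      function makeStore() {
        return configureStore({ reducer: { counter: counterReducer } });
      }

      test("increment bumps the counter", () => {
        const store = makeStore();
        store.dispatch(increment());
        expect(store.getState().counter.value).toBe(1);
      });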

    To me, the solution isn't "more AI", it's "how do I use AI in a way that doesn't screw me over a few weeks/months down the line", and for me that's by making sure I understand the code it generated and trim out the things that are bad/excessive. If it's generating things I don't understand, then I need to understand them, because I have to debug it at some point.

    Also, in this case it was just some unit tests, so who cares, but if this was a service that was publicly exposed on the web? I would definitely want to make sure I had a human in the loop for anything security related, and I would ABSOLUTELY want to make sure I understood it if it were handling user data.

    • cstejerean a day ago

      How long ago was this? In my experience, a review with the latest models should absolutely catch the issue you describe.

      • t_mahmood a day ago

        Ah, the "it works on my computer" edition of LLMs.

      • overgard 16 hours ago

        December. My previous job had Cursor and Copilot automatically reviewing PRs.

  • timcobb a day ago

    > you will struggle and eventually put yourself out of a job.

    We can have a discussion without the stakes being so high.

  • summerlight a day ago

    The quality of generated code does not matter. The problem is when it breaks at 2 AM and you're burning thousands of dollars every minute. You don't own the code that you don't understand, but unfortunately that does not mean you don't own the responsibility as well. Good luck writing the postmortem; your boss will have lots of questions for you.

    • icedchai 19 hours ago

      Frequently the boss is encouraging use of AI for efficiency without understanding the implications.

      And we'll just have the AI write the postmortem, so no big deal there. ;)

    • charcircuit a day ago

      AI can help you understand code faster than you could without it. It allows me to investigate problems where I have little context and write fixes effectively.

  • lelanthran a day ago

    > If you do use this approach, you will get code that is better than what most software devs put out. And that gives you a good base to work with if you need to add polish to it.

    If you do use this approach, you'll find that it will descend into a recursive madness. Due to the way these models are trained, they are never going to look at the output of two other models and go "Yeah, this is fine as it is; don't change a thing".

    Before you know it you're going to have change amplification, where a tiny change by one model triggers other models (or even itself) to make other changes, which triggers further changes, and so on ad nauseam.

    The easy part is getting the models to spit out working code. The hard part is getting it to stop.

  • 3kkdd a day ago

    I'm sick and tired of these empty posts.

    SHOW AN EXAMPLE OF YOU ACTUALLY DOING WHAT YOU SAY!

    • alt187 a day ago

      There's no example because OP has never done this, and never will. People lie on the internet.

      • timcobb a day ago

        I've never done this because I haven't felt compelled to, because I want to review my own code, but I imagine this works okay and isn't hard to set up by asking Claude to set it up for you...

      • senordevnyc 12 hours ago

        What? People do this all the time. Sometimes manually, by invoking another agent with a different model and asking it to review the changes against the original spec. I just set up some reviewer/verifier subagents in Cursor that I can invoke with a slash command. I use Opus 4.5 as my daily driver, but I have reviewer subagents running Gemini 3 Pro and GPT-5.2-codex, and they each review the plan as well, and then the final implementation against the plan. Both sometimes identify issues, and Opus then integrates that feedback.

        It’s not perfect so I still review the code myself, but it helps decrease the number of defects I have to then have the AI correct.

    • Foreignborn a day ago

      these two posts (the parent and then the OP) seem equally empty?

      by level of compute spend, it might look like:

      - ask an LLM in the same query/thread to write code AND tests (not good)

      - ask the LLM in different threads (meh)

      - ask the LLM in a separate thread to critique said tests (too brittle, not following testing guidelines, testing implementation rather than behavior, etc.). fix those. (decent)

      - ask the LLM to spawn multiple agents to review the code and tests. Fix those. Spawn agents to critique again. Fix again.

      - Do the same as above, but spawn agents from different families (so Claude calls Gemini and Codex).

      —-

      these are usually set up as /slash commands like /tests or /review so you aren’t doing this manually. since this can take some time, people might work on multiple features at once.

zamalek 2 days ago

The main issue I've seen is it writing passing tests; the code being correct is a big (and often incorrect) assumption.

pmarreck a day ago

nobody is going to care about your bespoke codework if there is no downstream measurable difference

hahahahhaah a day ago

Living off only deep fried potatoes and large cola bottles is cheap...