Comment by zug_zug 10 months ago

I cannot for the life of me understand why SRE, of all roles, would be the one to attempt to use agents for. IMO it's one of the last roles it would apply to, long after core development.

I mean, is the AI going to read your source code, read all your Slack messages for context, log in to all your observability tools, run repeated queries, come up with a hypothesis, and test it against prod? Then run a blameless retrospective, institute new logging, modify the relevant processes with PRs, and create new alerts to proactively catch the problem?

As an aside - this is a garbage attempt at an article, kinda saying nothing.

sgarland 10 months ago

Because the people creating and selling these solutions are charlatans who have probably never debugged a gnarly Linux issue; more importantly, they’re selling it to people who also don’t know any better.

SRE as a term is effectively meaningless at this point. I can count on one hand the number of SRE coworkers I’ve had who were worth a damn. Most only know how to make Grafana dashboards and glue TF modules together.

  • tayo42 10 months ago

    SRE shouldn't be a job. If you look at what an SRE aspires to be, it's just good software engineering.

    It's glorified sysadmin work, and tbh the role should have stayed that way in the industry.

    If your senior SWEs can't monitor, add metrics, design fault-tolerant systems, or debug the system and environment their stuff runs on, and need to be babysat, there's a problem.

    • sgarland 10 months ago

      What you’re describing is mostly in the application side of things, though. You still need Ops to run all the infra, which is what most places’ SRE role really is.

      Most devs I’ve worked with didn’t have any interest in dealing with infra. I have no problem with that; while I think anyone working with computers should learn at least some basics about them, I completely understand and agree with specialization.

      • tayo42 10 months ago

        I haven't seen an ops team run infra in a while. They're all software engineers writing abstractions over cloud. Infra is like "platform" now, and it's SWEs running it.

        • stackskipton 10 months ago

          I mean, I'm SRE/Ops, but at every company I've joined, I've been brought in to manage the infrastructure from the ball of wax some dev got operational enough to limp into prod. And when I say limp, they are truly limping and couldn't run to save their life.

          You need someone who is disconnected from getting the birthday to display on the settings page and can make sure Bingo, Papaya, Raccoon, and Omega Star are just up and running. Especially since Omega Star can't get their shit together.

blackjack_ 10 months ago

Exactly, because the major thing every company needs is an opaque agent messing with all the production settings and releases that impact revenue and uptime? This is one of the silliest pitches I've heard in my life.

I'm all for a k8s / Terraform / etc focused GPT trained on my workspace to help me remember that one weird custom Terraform module we wrote three years ago, but I don't want it implementing system wide changes without feedback.

bradly 10 months ago

Noticing trends in logs, especially across different systems, could be useful for identifying a lot of problems before they break.
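
A minimal sketch of what that kind of trend spotting could look like: flag a time window whose error count deviates sharply from the recent baseline. The rolling z-score approach, window size, and threshold here are all illustrative assumptions, not taken from any particular tool.

```python
from collections import deque
from statistics import mean, stdev

def make_detector(window=30, threshold=3.0):
    """Return a closure that flags anomalous per-window error counts."""
    history = deque(maxlen=window)

    def check(error_count):
        # Only judge once a small baseline of past windows exists.
        if len(history) >= 5:
            mu = mean(history)
            sigma = stdev(history) or 1.0  # guard against zero variance
            anomalous = (error_count - mu) / sigma > threshold
        else:
            anomalous = False
        history.append(error_count)
        return anomalous

    return check

check = make_detector()
counts = [4, 5, 3, 6, 4, 5, 4, 60]  # last window spikes hard
flags = [check(c) for c in counts]
```

Feeding the same detector counts from several systems side by side is what would make the cross-system version of this useful.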

  • clvx 10 months ago

    I think this is the ideal spot. I wouldn't trust an agent doing remediation, but I would gladly take input from one about services that are acting up or trending toward disaster. That would let me spend my time on what matters instead of cleaning data to figure out what matters. In a perfect world I'd provide my thresholds, or what success means, to the agent and get information back without tweaking dashboards, configuring alerts, etc.

    • bradly 10 months ago

      I've worked on large FAANG systems that used ML models to identify all sorts of things, from hair in pictures to fraud and account takeovers. Tasks are then queued up for human review. AI wasn't a buzzword at the time, so we didn't call it that, but I'm guessing this would be similar.

  • sgarland 10 months ago

    That already exists, though? Log anomaly detection is a thing.

    For all the grief people give Datadog, it is an incredibly good product. Unfortunately, they know it, and charge accordingly.

  • threeseed 10 months ago

    Splunk has had this for close to two decades.

    And I’ve worked on some of the world’s largest systems and in most cases simply looking for the words: error, exception etc is enough for parsing through the logs.

    For everything else you need systems like Datadog to visually show you issues e.g. service connection failures.
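
The "just look for the obvious words" approach reduces to a keyword scan. A sketch, with the keyword list and log format invented for illustration:

```python
import re

# Common failure keywords; deliberately no word boundaries so that
# names like "IOException" still match "exception".
PATTERN = re.compile(r"(error|exception|fatal|panic)", re.IGNORECASE)

def suspicious_lines(lines):
    """Return only the log lines containing a failure keyword."""
    return [line for line in lines if PATTERN.search(line)]

logs = [
    "2024-01-01T00:00:01 INFO request served in 12ms",
    "2024-01-01T00:00:02 ERROR upstream timed out",
    "2024-01-01T00:00:03 WARN retrying",
    "2024-01-01T00:00:04 java.io.IOException: connection reset",
]
hits = suspicious_lines(logs)
```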

    • stackskipton 10 months ago

      Even then, you generally need someone smart enough to figure out what is causing the error.

      Node is throwing EADDRINFO. Why? Well, it's likely DNS, but the cause of that DNS failure varies a lot. I've seen the DNS server offline, a firewall blocking TCP port 53, a bad hostname in the config, a team taking the service offline, and so forth.
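
Some of that narrowing-down can be done mechanically. A hypothetical sketch: try to resolve the name, and if that fails, check whether the DNS server is reachable on TCP 53 at all. The server address and return strings are placeholders.

```python
import socket

def diagnose(hostname, dns_server="8.8.8.8"):
    """Roughly distinguish 'bad name' from 'unreachable DNS server'."""
    try:
        socket.getaddrinfo(hostname, 443)
        return "resolves fine"
    except socket.gaierror:
        pass
    # Resolution failed: is the DNS server even reachable on TCP 53?
    try:
        with socket.create_connection((dns_server, 53), timeout=2):
            return "DNS server reachable; likely a bad hostname or missing record"
    except OSError:
        return "cannot reach DNS server; server down or port 53 firewalled"
```

This doesn't cover every cause listed above (e.g. a team deliberately taking a service offline), which is exactly why a human still has to interpret the result.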

  • phillipcarter 10 months ago

    FWIW this is part of the promise of AIOps of the 2010s and it's still only barely starting to happen. I'm glad it has, but there's so, so far to go here.

ramesh31 10 months ago

>I mean, is the AI going to read your source code, read all your Slack messages for context, log in to all your observability tools, run repeated queries, come up with a hypothesis, and test it against prod? Then run a blameless retrospective, institute new logging, modify the relevant processes with PRs, and create new alerts to proactively catch the problem?

Yes. That's exactly where this stuff is going right now.

wilson090 10 months ago

Agents do not necessarily need to do the entire job of an SRE. They can deliver tremendous value by doing portions of it, e.g. pulling in relevant data for a given alert or drafting post-mortems. There are aspects of the role that are ripe for LLM-powered tools.

  • threeseed 10 months ago

    I really can’t think of anything more counter-productive than AI post-mortems.

    The whole point of them is for the team to understand what went wrong and how processes can be improved to prevent issues from happening again. Throwing away all that detail and nuance for an AI-generated summary will only make the SRE field worse.

    I also really don't understand the benefit of an LLM for bringing in relevant data. I'd much prefer that be statically defined, e.g. when a database query takes 10x longer, bring me the dashboards for the OS and hardware as well.
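
The statically defined version of that is tiny. A sketch, with the baseline, threshold multiplier, and dashboard names all invented for illustration:

```python
BASELINE_MS = 20  # assumed normal query latency
RELATED_DASHBOARDS = ["os-metrics", "hardware-health"]

def dashboards_for(query_latency_ms):
    """Fire only on the fixed rule: latency exceeds 10x the baseline."""
    if query_latency_ms > 10 * BASELINE_MS:
        return RELATED_DASHBOARDS
    return []
```

The appeal of this over an LLM is that the rule is inspectable and its behavior never drifts.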

mullingitover 10 months ago

> As an aside - this is a garbage attempt at an article, kinda saying nothing.

Disagree. It was a good survey of the current AI SRE agents landscape. I don't have my head in the sand, but there are new startups coming up that I hadn't heard about.

>I mean is the AI going to...

Yes. Not one AI but multiple AI agents, all working on different aspects of the problem autonomously and then coming together where needed to collaborate. Single LLMs doing things by themselves are like plain concrete; swarms of agents are like adding rebar to the mix.

gmuslera 10 months ago

As with Asimov's "The Last Question", there will be insufficient data for a meaningful course of action. But we tend to cut corners, be convinced by a few synthetic use cases, and take the risk without being aware that we are. The mistakes will be dismissed as misuse or weird installations, and the successes will fuel the hype.

moandcompany 10 months ago

They're going to need SRE for the AI :)

  • Cyph0n 10 months ago

    Nah it’s just AI SRE agents all the way down.

    • wruza 10 months ago

      Except you don’t need down, you can simply loop them. The quality of such a system will be measured by how much it can do within a particular compute budget, and theoretically it can even self-stabilize via ROI feedback.

      • 6510 10 months ago

        Even the most novice coder with a poor brain can write all the software if you give them a million years.

more_corn 10 months ago

I mean, doesn’t all that sound like something fine-tuned LLMs would be great at? Well-constrained tasks, if/else statements, runbook -> automation.
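
The "runbook -> automation" idea, reduced to its skeleton: a runbook is an ordered list of condition/action steps, which is exactly the if/else structure described above. The steps and state fields here are hypothetical.

```python
def run_runbook(steps, state):
    """Execute the first runbook step whose condition matches the state."""
    for condition, action in steps:
        if condition(state):
            return action(state)
    # No step matched: fall back to a human, as any sane runbook should.
    return "escalate to a human"

steps = [
    (lambda s: s["disk_pct"] > 90, lambda s: "rotate logs and clear /tmp"),
    (lambda s: s["service"] == "down", lambda s: "restart the service"),
]
outcome = run_runbook(steps, {"disk_pct": 95, "service": "up"})
```

Whether an LLM fills in the conditions or a human writes them by hand is, of course, the entire debate in this thread.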