Comment by zug_zug 10 months ago

I cannot for the life of me understand why SRE, of all roles, would be the one to attempt to use agents for. IMO it's one of the last roles it would apply to, long after core development.

I mean, is the AI going to read your source code, read all your Slack messages for context, log in to all your observability tools, run repeated queries, come up with a hypothesis, and test it against prod? Then run a blameless retrospective, institute new logging, modify the relevant processes with PRs, and create new alerts to proactively catch the problem?

As an aside - this is a garbage attempt at an article, kinda saying nothing.

sgarland 10 months ago

Because the people creating and selling these solutions are charlatans who have probably never debugged a gnarly Linux issue; more importantly, they’re selling it to people who also don’t know any better.

SRE as a term is effectively meaningless at this point. I can count on one hand the number of SRE coworkers I’ve had who were worth a damn. Most only know how to make Grafana dashboards and glue TF modules together.

  • tayo42 10 months ago

    SRE shouldn't be a job. If you look at what an SRE aspires to be, it's just good software engineering.

    It's glorified sysadmin work, and tbh the role should have stayed that way in the industry.

    If your senior SWEs can't monitor, add metrics, design fault-tolerant systems, or debug the system and environment their stuff runs on, and need to be babysat, there's a problem.

    • sgarland 10 months ago

      What you’re describing is mostly in the application side of things, though. You still need Ops to run all the infra, which is what most places’ SRE role really is.

      Most devs I’ve worked with didn’t have any interest in dealing with infra. I have no problem with that; while I think anyone working with computers should learn at least some basics about them, I completely understand and agree with specialization.

      • tayo42 10 months ago

        I haven't seen an ops team run infra in a while. They're all software engineers writing abstractions over cloud. Infra is like "platform" now, and it's SWEs running it.

        • stackskipton 10 months ago

          I mean, I'm SRE/Ops, but at every company I've joined, I've been brought in to manage the infrastructure from the ball of wax some dev got operational enough to limp into prod. And when I say limp, they are truly limping and couldn't run to save their life.

          You need someone who is disconnected from getting the birthday to display on the settings page and can make sure Bingo, Papaya, Raccoon, and Omega Star are just up and running. Especially since Omega Star can't get their shit together.

blackjack_ 10 months ago

Exactly, because the major thing every company needs is an opaque agent messing with all the production settings and releases that impact revenue and uptime? This is one of the silliest pitches I've heard in my life.

I'm all for a k8s / Terraform / etc focused GPT trained on my workspace to help me remember that one weird custom Terraform module we wrote three years ago, but I don't want it implementing system wide changes without feedback.

bradly 10 months ago

Noticing trends in logs, especially across different systems, could be useful for identifying a lot of problems before they break.
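
A minimal sketch of what that kind of trend spotting could look like: flag a time window whose error count deviates sharply from the recent baseline. The rolling z-score approach, window size, and threshold here are all illustrative assumptions, not taken from any particular tool.

```python
from collections import deque
from statistics import mean, stdev

def make_detector(window=30, threshold=3.0):
    """Return a closure that flags anomalous per-window error counts."""
    history = deque(maxlen=window)

    def check(error_count):
        # Only judge once a small baseline of past windows exists.
        if len(history) >= 5:
            mu = mean(history)
            sigma = stdev(history) or 1.0  # guard against zero variance
            anomalous = (error_count - mu) / sigma > threshold
        else:
            anomalous = False
        history.append(error_count)
        return anomalous

    return check

check = make_detector()
counts = [4, 5, 3, 6, 4, 5, 4, 60]  # last window spikes hard
flags = [check(c) for c in counts]
```

Feeding the same detector counts from several systems side by side is what would make the cross-system version of this useful.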

  • clvx 10 months ago

    I think this is the ideal spot. I wouldn't trust an agent doing remediation, but I would gladly take input from one about services that are acting up or trending toward disaster. That would let me spend my time on what matters instead of cleaning data to figure out what matters. In a perfect world I'd provide my thresholds, or what success means, to the agent and get information back without tweaking dashboards, configuring alerts, etc.

    • bradly 10 months ago

      I've worked on large FAANG systems that used ML models to identify all sorts of things, from hair in pictures to fraud and account takeovers. Tasks are then queued up for human review. AI wasn't a buzzword at the time, so we didn't call it that, but I'm guessing this would be similar.

  • sgarland 10 months ago

    That already exists, though? Log anomaly detection is a thing.

    For all the grief people give Datadog, it is an incredibly good product. Unfortunately, they know it, and charge accordingly.

  • threeseed 10 months ago

    Splunk has had this for close to two decades.

    And I’ve worked on some of the world’s largest systems and in most cases simply looking for the words: error, exception etc is enough for parsing through the logs.

    For everything else you need systems like Datadog to visually show you issues e.g. service connection failures.
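
The "just look for the obvious words" approach reduces to a keyword scan. A sketch, with the keyword list and log format invented for illustration:

```python
import re

# Common failure keywords; deliberately no word boundaries so that
# names like "IOException" still match "exception".
PATTERN = re.compile(r"(error|exception|fatal|panic)", re.IGNORECASE)

def suspicious_lines(lines):
    """Return only the log lines containing a failure keyword."""
    return [line for line in lines if PATTERN.search(line)]

logs = [
    "2024-01-01T00:00:01 INFO request served in 12ms",
    "2024-01-01T00:00:02 ERROR upstream timed out",
    "2024-01-01T00:00:03 WARN retrying",
    "2024-01-01T00:00:04 java.io.IOException: connection reset",
]
hits = suspicious_lines(logs)
```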

    • stackskipton 10 months ago

      Even then, you generally need someone smart enough to figure out what is causing the error.

      Node is throwing EADDRINFO. Why? Well, it's likely DNS, but the cause of that DNS failure varies a lot. I've seen the DNS server offline, a firewall blocking TCP port 53, a bad hostname in the config, a team taking the service offline, and so forth.
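
Some of that narrowing-down can be done mechanically. A hypothetical sketch: try to resolve the name, and if that fails, check whether the DNS server is reachable on TCP 53 at all. The server address and return strings are placeholders.

```python
import socket

def diagnose(hostname, dns_server="8.8.8.8"):
    """Roughly distinguish 'bad name' from 'unreachable DNS server'."""
    try:
        socket.getaddrinfo(hostname, 443)
        return "resolves fine"
    except socket.gaierror:
        pass
    # Resolution failed: is the DNS server even reachable on TCP 53?
    try:
        with socket.create_connection((dns_server, 53), timeout=2):
            return "DNS server reachable; likely a bad hostname or missing record"
    except OSError:
        return "cannot reach DNS server; server down or port 53 firewalled"
```

This doesn't cover every cause listed above (e.g. a team deliberately taking a service offline), which is exactly why a human still has to interpret the result.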

  • phillipcarter 10 months ago

    FWIW this is part of the promise of AIOps of the 2010s and it's still only barely starting to happen. I'm glad it has, but there's so, so far to go here.

ramesh31 10 months ago

>I mean, is the AI going to read your source code, read all your Slack messages for context, log in to all your observability tools, run repeated queries, come up with a hypothesis, and test it against prod? Then run a blameless retrospective, institute new logging, modify the relevant processes with PRs, and create new alerts to proactively catch the problem?

Yes. That's exactly where this stuff is going right now.

wilson090 10 months ago

Agents do not necessarily need to do the entire job of an SRE. They can deliver tremendous value by doing portions of it, e.g. pulling in relevant data for a given alert or drafting post-mortems. There are aspects of the role that are ripe for LLM-powered tools.

  • threeseed 10 months ago

    I really can’t think of anything more counter-productive than AI post-mortems.

    The whole point of them is for the team to understand what went wrong and how processes can be improved to prevent issues from happening again. Throwing away all that detail and nuance for an AI-generated summary will only make the SRE field worse.

    I also really don't understand the benefit of an LLM for bringing in relevant data. I'd much prefer that be statically defined, e.g. when a database query takes 10x longer, bring me the dashboards for the OS and hardware as well.
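
The statically defined version of that is tiny. A sketch, with the baseline, threshold multiplier, and dashboard names all invented for illustration:

```python
BASELINE_MS = 20  # assumed normal query latency
RELATED_DASHBOARDS = ["os-metrics", "hardware-health"]

def dashboards_for(query_latency_ms):
    """Fire only on the fixed rule: latency exceeds 10x the baseline."""
    if query_latency_ms > 10 * BASELINE_MS:
        return RELATED_DASHBOARDS
    return []
```

The appeal of this over an LLM is that the rule is inspectable and its behavior never drifts.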

mullingitover 10 months ago

> As an aside - this is a garbage attempt at an article, kinda saying nothing.

Disagree. It was a good survey of the current AI SRE agents landscape. I don't have my head in the sand, but there are new startups coming up that I hadn't heard about.

>I mean is the AI going to...

Yes. Not one AI but multiple AI agents, all working on different aspects of the problem autonomously and then coming together where needed to collaborate. Single LLMs doing things by themselves are like plain concrete; swarms of agents are like adding rebar to the mix.

gmuslera 10 months ago

As with Asimov's "The Last Question", there will be insufficient data for a meaningful course of action. But we tend to cut corners, be convinced by a few synthetic use cases, and take the risk without being aware that we are. The mistakes will be dismissed as misuse or weird installations, and the successes will fuel the hype.

moandcompany 10 months ago

They're going to need SRE for the AI :)

  • Cyph0n 10 months ago

    Nah it’s just AI SRE agents all the way down.

    • wruza 10 months ago

      Except you don’t need down, you can simply loop them. The quality of such a system will be measured by how much it can do within a particular compute budget, and theoretically it can even self-stabilize via ROI feedback.

      • 6510 10 months ago

        Even the most novice coder with a poor brain can write all the software if you give them a million years.

more_corn 10 months ago

I mean, doesn’t all that sound like something fine-tuned LLMs would be great at? Well-constrained tasks, if/else statements, runbook -> automation.
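
The "runbook -> automation" idea, reduced to its skeleton: a runbook is an ordered list of condition/action steps, which is exactly the if/else structure described above. The steps and state fields here are hypothetical.

```python
def run_runbook(steps, state):
    """Execute the first runbook step whose condition matches the state."""
    for condition, action in steps:
        if condition(state):
            return action(state)
    # No step matched: fall back to a human, as any sane runbook should.
    return "escalate to a human"

steps = [
    (lambda s: s["disk_pct"] > 90, lambda s: "rotate logs and clear /tmp"),
    (lambda s: s["service"] == "down", lambda s: "restart the service"),
]
outcome = run_runbook(steps, {"disk_pct": 95, "service": "up"})
```

Whether an LLM fills in the conditions or a human writes them by hand is, of course, the entire debate in this thread.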