Comment by bradly 2 days ago

Noticing trends in logs, especially across different systems, could be useful for identifying a lot of problems before things break.

I think this is the ideal spot. I wouldn’t trust an agent to do remediation, but I would gladly take input from one about services that are acting up or trending toward disaster. That would let me spend my time on what matters instead of cleaning data to work out what matters. In a perfect world I’d give the agent my thresholds, or a definition of what success means, and get information back without tweaking dashboards, configuring alerts, etc.
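As a rough illustration of "give it a threshold and have it flag what is trending toward disaster", here is a minimal sketch in Python. It fits a straight line to recent per-interval error counts and projects it a few intervals ahead. The service names, numbers, threshold, and horizon are all made up for the example, and a real system would likely need something more robust than a linear fit.

```python
def fit_line(values):
    """Least-squares slope and intercept of values against index 0, 1, 2, ..."""
    n = len(values)
    mean_y = sum(values) / n
    if n < 2:
        return 0.0, mean_y
    mean_x = (n - 1) / 2
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    slope = num / den
    return slope, mean_y - slope * mean_x


def trending_to_breach(values, threshold, horizon):
    """True if the fitted line reaches `threshold` within `horizon` more intervals."""
    if not values:
        return False
    slope, intercept = fit_line(values)
    projected = intercept + slope * (len(values) - 1 + horizon)
    return projected >= threshold


# Hypothetical error counts per interval, oldest first.
error_counts = {
    "checkout": [1, 2, 3, 4, 5],  # steadily climbing
    "search": [3, 3, 3, 3, 3],    # flat
}

flagged = [
    name
    for name, counts in error_counts.items()
    if trending_to_breach(counts, threshold=7.5, horizon=3)
]
```

Here only "checkout" ends up in `flagged`: its fitted line projects to 8 errors three intervals out, while "search" stays at 3.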

    I've worked on large FAANG systems that used ML models to identify all sorts of things, from hair in pictures to fraud and account takeovers. Tasks were then queued up for human review. AI wasn't a buzzword at the time so we didn't call it that, but I'm guessing this would be similar.

That already exists, though? Log anomaly detection is a thing.

For all the grief people give Datadog, it is an incredibly good product. Unfortunately, they know it, and charge accordingly.

Splunk has had this for close to two decades.

And I’ve worked on some of the world’s largest systems, and in most cases simply looking for words like "error", "exception", etc. is enough when parsing through the logs.
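For what it's worth, the keyword approach really is only a few lines. A minimal sketch in Python, assuming plain-text log lines; the sample lines are invented for illustration, and the keyword list is just the two words named above:

```python
import re
from collections import Counter

# Case-insensitive substring match, so "IllegalStateException" counts too.
# The trade-off is that unrelated words containing "error" would match as well.
KEYWORDS = re.compile(r"error|exception", re.IGNORECASE)


def scan(lines):
    """Return the lines that mention a keyword, plus a count per keyword."""
    hits = []
    counts = Counter()
    for line in lines:
        found = KEYWORDS.findall(line)
        if found:
            hits.append(line.rstrip("\n"))
            counts.update(word.lower() for word in found)
    return hits, counts


# Made-up sample log lines.
sample = [
    "2024-01-01 12:00:00 INFO service started",
    "2024-01-01 12:00:01 ERROR connection refused",
    "2024-01-01 12:00:02 WARN retrying",
    "2024-01-01 12:00:03 ERROR java.lang.IllegalStateException: bad state",
]

hits, counts = scan(sample)
```

To run it over a real file, pass the open file object instead of `sample`, since `scan` only needs an iterable of lines.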

For everything else you need systems like Datadog to show you issues visually, e.g. service connection failures.

    Even then, you generally need someone smart enough to figure out what is causing the error.

    Node is throwing EADDRINFO. Why? Well, it's likely DNS, but the cause of that DNS failure is pretty varied. I've seen the DNS server being offline, a firewall blocking TCP port 53, a bad hostname in config, a team taking the service offline, and so forth.
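To make the point concrete, here is a minimal Python sketch of how little the resolver error itself tells you. It uses the standard `socket.getaddrinfo` call and maps two of its failure codes to the candidate causes listed above. The mapping is a heuristic assumed for illustration, not a diagnosis: each code is consistent with several of those causes, which is why someone still has to investigate.

```python
import socket

# Heuristic hints only (an assumption for this sketch); each code fits several causes.
HINTS = {
    socket.EAI_NONAME: "name does not resolve: bad hostname in config, or the record was removed",
    socket.EAI_AGAIN: "temporary failure: DNS server offline, or port 53 blocked by a firewall",
}


def classify(code):
    """Map a getaddrinfo failure code to a first guess at the cause."""
    return HINTS.get(code, "unclassified resolver failure; needs a human")


def check(host, port=443):
    """Try to resolve `host`; return 'resolved' or a hint about the failure."""
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror as exc:
        return classify(exc.errno)
    return "resolved"
```

Calling `check("some-internal-host")` (a placeholder name) returns either "resolved" or one of the hints. Telling a firewall rule apart from a dead DNS server still takes checks from outside the process.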

FWIW this is part of the promise of AIOps of the 2010s and it's still only barely starting to happen. I'm glad it has, but there's so, so far to go here.