Comment by avtar
> anything involving alerting/monitoring from logs is bad idea for future you
Why is issuing alerts for log events a bad idea?
Same with metrics.
If you need an artifact from your system, it should be tested. We test our logs and many types of metrics. We've had too many incidents from logs or metrics changing and no longer triggering alerts. I never got around to building out my alert test bed that exercises all known alerts in prod, verifying they continue to work.
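For Prometheus-style alerts, promtool's rule unit tests get you part of the way there. A minimal sketch, assuming a hypothetical alerts.yml defining a HighCriticalErrorRate rule (all names and values here are made up):

```yaml
# alerts_test.yml -- run with: promtool test rules alerts_test.yml
# Assumes a hypothetical alerts.yml containing a HighCriticalErrorRate rule.
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Synthetic counter climbing by 700/min, i.e. ~11.7/s -- above a 10/s threshold.
    input_series:
      - series: 'critical_errors{job="api"}'
        values: '0+700x15'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighCriticalErrorRate
        exp_alerts:
          - exp_labels:
              job: api
              severity: page
```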
Couple of reasons.
The biggest one: the sample rate is much higher (every log line), which can cause problems if a service goes haywire and starts spewing logs everywhere. Logging pipelines also tend to be very rigid, for various reasons. Metrics are easier to handle: you can step down the sample rate, drop certain metrics, or spin up additional Prometheus instances.
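Stepping down the sample rate or dropping a noisy metric is just a per-job config change, roughly like this (job name, targets, and the dropped metric are made up for illustration):

```yaml
# prometheus.yml fragment -- job name, targets, and dropped metric are illustrative
scrape_configs:
  - job_name: api
    scrape_interval: 60s            # step the sample rate down from the global default
    static_configs:
      - targets: ['api-1:9100', 'api-2:9100']
    metric_relabel_configs:
      # drop an especially noisy metric before it is ingested
      - source_labels: [__name__]
        regex: 'http_request_duration_seconds_bucket'
        action: drop
```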
The logging format becomes very rigid, and if the company goes multi-language this can be problematic, since different languages can behave differently. Is this exception something we care about or not? So we throw more code at it in an attempt to get log-based alerting into a state that doesn't drive everyone crazy, when if we were just doing "rate(critical_errors[5m]) > 10" in Prometheus, we would be all set!
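Roughly what that looks like as an actual alerting rule, sketched with made-up rule and label names (and assuming critical_errors is a counter):

```yaml
# alerts.yml -- illustrative rule; assumes critical_errors is a counter metric
groups:
  - name: service-errors
    rules:
      - alert: HighCriticalErrorRate
        expr: rate(critical_errors[5m]) > 10
        for: 5m                       # must stay elevated before it pages
        labels:
          severity: page
        annotations:
          summary: "Critical error rate above 10/s for 5 minutes"
```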
It’s trivial to alter or remove log lines without realizing that it affects some alerting or monitoring somewhere. That’s why there are dedicated monitoring and alerting systems in the first place.