Comment by stackskipton
Comment by stackskipton 17 hours ago
Couple of reasons.
Biggest one, sample rate is much higher (every log) and this can cause problems if service goes haywire and starts spewing logs everywhere. Logging pipelines tend to be very rigid as well for various reasons. Metrics are easier to handle as you can step back sample rate, drop certain metrics or spin up additional Prometheus instances.
Logging format becomes very rigid and if the company goes multiple languages, this can be problematic as different languages can behave differently. Is this exception something we care about or not? So we throw more code in attempt to get logging alerting into state that does not drive everyone crazy where if we were just doing "rate(critical_errors[5m] > 10" in Prometheus, we would be all set!