Comment by phillipcarter 6 months ago

I disagree that it's a bad criterion. What sounds difficult is the case you describe: treating one error as part of normal operations and another as not. That should be considered its own kind of error or other form of response, and sampling decisions could take that into consideration (or not).
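
For example, if the application marks which failures are expected, say with a hypothetical error.category span attribute, the OpenTelemetry Collector's tail_sampling processor can key a policy off that while everything else falls through to a baseline rate. Rough sketch only; the attribute, policy names, and percentages are made up:

    processors:
      tail_sampling:
        decision_wait: 10s   # buffer spans until the whole trace has arrived
        policies:
          # keep every trace the service flagged as an unexpected failure
          - name: keep-unexpected-errors
            type: string_attribute
            string_attribute:
              key: error.category
              values: [unexpected]
          # expected errors (user not found, quota exceeded, ...) fall through
          # to the same baseline rate as ordinary traffic
          - name: baseline
            type: probabilistic
            probabilistic:
              sampling_percentage: 5

A trace is kept if any policy matches, so expected failures get sampled like normal traffic instead of being boosted just for carrying an error status.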

jeffbee 6 months ago

Another reason against inflating sampling rates on errors: for system stability, you never want to do more work during errors than you would normally do. Doing something more expensive during an error can cause your whole system, or elements of it, to latch into an unplanned operating point where they only have the capacity to do the expensive error path, and all of the traffic is throwing errors because of the resource starvation.

  • hamandcheese 6 months ago

    It can also be expensive as in money. Especially if you are a Datadog customer.

  • phillipcarter 6 months ago

    I mean, this is why you offload data elsewhere to handle things like sampling and filtering and aggregation.

amir_jak 6 months ago

You can use the OTel Collector to make sampling decisions on traces; it can also be used to reduce log cost before data is sent to Datadog. There's a whole category of telemetry pipeline tooling now for fully managing that (full disclosure: I work for https://www.sawmills.ai, which is a smart telemetry management platform).
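
To make that concrete, a Collector setup for this usually pairs a tail_sampling processor on the traces pipeline with a filter processor on the logs pipeline, so only sampled traces and INFO-and-above log records reach the Datadog exporter. Sketch only; the 10% rate, the severity cutoff, and the pipeline wiring are placeholders, not a recommendation:

    receivers:
      otlp:
        protocols:
          grpc:
          http:

    processors:
      tail_sampling:
        decision_wait: 10s
        policies:
          # plug in whatever policy mix you settle on; a bare baseline shown here
          - name: baseline
            type: probabilistic
            probabilistic:
              sampling_percentage: 10
      filter/drop-noisy-logs:
        logs:
          log_record:
            # drop TRACE and DEBUG records before they are exported
            - 'severity_number < SEVERITY_NUMBER_INFO'

    exporters:
      datadog:
        api:
          key: ${env:DD_API_KEY}

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [tail_sampling]
          exporters: [datadog]
        logs:
          receivers: [otlp]
          processors: [filter/drop-noisy-logs]
          exporters: [datadog]

The usual caveat: tail sampling needs all spans of a trace to land on the same Collector instance, so a gateway or load-balancing tier is typically part of this setup.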