Comment by jeffbee 3 days ago

I feel like this is a lesson that unfortunately did not escape Google, even though a lot of these open systems came from Google or ex-Googlers. The overhead of tracing, logs, and metrics needs to be ultra-low. But the (mis)feature whereby a trace span can be sampled post hoc means that you cannot have a nil tracer that does nothing on unsampled traces, because it could become sampled later. And the idea that if a metric exists it must be centrally collected is totally preposterous; it makes everything far too expensive when all a developer wants is a metric that costs nothing in the steady state but can be collected when needed.
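
For a concrete sense of the cost gap, here is a minimal Go sketch (hand-rolled types, not the OpenTelemetry API) of why a post-hoc sampling decision rules out a cheap no-op path: with a head-based decision an unsampled request gets a span that does nothing, but if the keep/drop decision can come later, every span has to be fully recorded.

```go
package main

import (
    "fmt"
    "math/rand"
    "time"
)

// Span is a deliberately tiny span abstraction; real tracers carry far more state.
type Span interface {
    Annotate(key, value string)
    End()
}

// noopSpan is the cheap path: no clock reads, no attribute storage, no buffering.
type noopSpan struct{}

func (noopSpan) Annotate(string, string) {}
func (noopSpan) End()                    {}

// recordingSpan pays for timestamps and attribute storage on every request.
type recordingSpan struct {
    name  string
    start time.Time
    attrs map[string]string
}

func (s *recordingSpan) Annotate(k, v string) { s.attrs[k] = v }
func (s *recordingSpan) End()                 { _ = time.Since(s.start) /* hand off to an export buffer */ }

// startSpanHeadSampled decides up front, so unsampled requests take the free path.
func startSpanHeadSampled(name string, sampleRate float64) Span {
    if rand.Float64() >= sampleRate {
        return noopSpan{}
    }
    return &recordingSpan{name: name, start: time.Now(), attrs: map[string]string{}}
}

// startSpanTailSampled must always record, because the keep/drop decision
// happens after the request finishes (e.g. "keep if the trace contained an error").
func startSpanTailSampled(name string) Span {
    return &recordingSpan{name: name, start: time.Now(), attrs: map[string]string{}}
}

func main() {
    s1 := startSpanHeadSampled("GET /checkout", 0.01) // ~99% of requests pay nothing
    s1.Annotate("user", "alice")
    s1.End()

    s2 := startSpanTailSampled("GET /checkout") // every request pays the recording cost
    s2.Annotate("user", "bob")
    s2.End()
    fmt.Println("done")
}
```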

mamidon 3 days ago

How would you handle the case where you want to trace 100% of errors? Presumably you don't know a trace is an error until after you've executed the thing and paid the price.

  • phillipcarter 3 days ago

    This is correct. It's a seemingly simple desire -- "always capture whenever there's a request with an error!" -- but the machinery needed to set that up gets complex. And then you start heading down the path of "well THESE business conditions are more important than THOSE business conditions!" and before you know it, you've got a nice little tower of sampling cards assembled. It's still worth it, but it's a hefty tax at times, and often the right solution is to just pay for more compute and data so that your engineers spend less time on these meta-level concerns.
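
    To make that "tower of sampling cards" concrete, here is a hypothetical tail-sampling decision function in Go (every name is invented for illustration, not any real system's API); each new business priority becomes another branch someone has to own and tune.

    ```go
    package main

    import (
        "math/rand"
        "time"
    )

    // CompletedTrace is a hypothetical summary of a finished trace; real
    // tail-sampling processors work over the full span tree.
    type CompletedTrace struct {
        Route    string
        Duration time.Duration
        HasError bool
    }

    // keepTrace stacks up the business conditions: each one is another rule
    // to maintain and argue about.
    func keepTrace(t CompletedTrace) bool {
        if t.HasError && t.Route != "/healthz" {
            return true // keep "real" errors, but now you have to define "real"
        }
        if t.Route == "/checkout" && t.Duration > 2*time.Second {
            return true // slow checkouts are deemed business-critical
        }
        return rand.Float64() < 0.01 // 1% baseline for everything else
    }

    func main() {
        _ = keepTrace(CompletedTrace{Route: "/checkout", Duration: 3 * time.Second})
    }
    ```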

  • jeffbee 3 days ago

    I wouldn't. "Trace contains an error" is a hideously bad criterion for sampling. If you have some storage subsystem where you always hedge/race reads to two replicas then cancel the request of the losing replica, then all of your traces will contain an error. It is a genuinely terrible feature.

    Local logging of error conditions is the way to go. And I mean local, not to a central, indexed log search engine; that's also way too expensive.
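
    As a sketch of why that happens, here is a hand-rolled hedged read in Go (no real tracing library; the span behavior is only described in comments). The losing replica gets cancelled, so in a traced system nearly every successful request ends with one errored span.

    ```go
    package main

    import (
        "context"
        "fmt"
        "math/rand"
        "time"
    )

    // readReplica stands in for a storage RPC. In a traced system each call
    // opens its own span, and a cancelled call records ctx.Err() as an error.
    func readReplica(ctx context.Context, name string) (string, error) {
        delay := time.Duration(rand.Intn(50)) * time.Millisecond
        select {
        case <-time.After(delay):
            return "value from " + name, nil
        case <-ctx.Done():
            return "", ctx.Err() // the losing replica ends as context.Canceled
        }
    }

    // hedgedRead races two replicas and cancels the loser, so the trace of a
    // successful request still contains an "error" -- which is why "trace
    // contains an error" over-selects on this kind of workload.
    func hedgedRead(ctx context.Context) (string, error) {
        ctx, cancel := context.WithCancel(ctx)
        defer cancel() // returning with the winner cancels the slower replica

        type result struct {
            val string
            err error
        }
        results := make(chan result, 2) // buffered so the loser never blocks
        for _, name := range []string{"replica-a", "replica-b"} {
            go func(name string) {
                v, err := readReplica(ctx, name)
                results <- result{v, err}
            }(name)
        }

        var lastErr error
        for i := 0; i < 2; i++ {
            r := <-results
            if r.err == nil {
                return r.val, nil
            }
            lastErr = r.err
        }
        return "", lastErr
    }

    func main() {
        fmt.Println(hedgedRead(context.Background()))
    }
    ```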

    • phillipcarter 3 days ago

      I disagree that it's a bad criterion. What sounds difficult is the case you describe: treating one error as part of normal operations and another as not. That kind of expected error should be modeled as its own kind of error, or as a different form of response, and sampling decisions could take that into consideration (or not).

      • jeffbee 3 days ago

        Another reason against inflating sampling rates on errors: for system stability, you never want to do more work during errors than you would normally do. Doing something more expensive on the error path can cause your whole system, or elements of it, to latch into an unplanned operating point where it only has the capacity to run the expensive error path, and all of the traffic is throwing errors because of the resource starvation.

        • hamandcheese 3 days ago

          It can also be expensive as in money. Especially if you are a Datadog customer.

      • amir_jak 3 days ago

        You can use the OTel Collector to make sampling decisions over traces; it can also be used to reduce log cost before data is sent to Datadog. There's a whole category of telemetry pipeline tools now for fully managing that (full disclosure: I work for https://www.sawmills.ai, which is a smart telemetry management platform).