Comment by orochimaaru 6 months ago

Metrics usually add minimal overhead. Traces need to be sampled, and logs need to be sampled at error/critical levels. You also need to be able to change sampling rates and log levels dynamically.

100% traces are a mess. I didn’t see where he set up sampling.

phillipcarter 6 months ago

The post didn't cover sampling, which indeed significantly reduces overhead in OTel, because spans that aren't sampled are never created when you head-sample at the SDK level. Overhead is more of a concern when doing tail-based sampling only, since you'll want to trace every request and offload export to a sidecar so that export concerns are handled outside your app, which then routes to a sampler elsewhere in your infrastructure.

FWIW at my former employer we had some fairly loose guidelines for folks around sampling: https://docs.honeycomb.io/manage-data-volume/sample/guidelin...

There are outliers, but the general idea is that implementing sampling also carries a high cost (especially for nontrivial setups), and if your volume isn't terribly high, you'll probably spend more in engineering time than you'd pay for the extra data you may not actually need.

  • nikolay_sivko 6 months ago

    As suggested, I measured the overhead at various sampling rates:

    No instrumentation (otel is not initialized): CPU=2.0 cores

    SAMPLING 0% (otel initialized): CPU=2.2 cores

    SAMPLING 10%: CPU=2.5 cores

    SAMPLING 50%: CPU=2.6 cores

    SAMPLING 100%: CPU=2.9 cores

    Even with 0% sampling, OpenTelemetry still adds overhead due to context propagation, span creation, and instrumentation hooks.