Thaxll 3 days ago

Logging, metrics, and traces are not free, especially if you turn them on for every request.

Tracing every http 200 at 10k req/sec is not something you should be doing; at that rate you should sample the 200s (1% or so) and trace all the errors.

kiitos 3 days ago

> Tracing every http 200 at 10k req/sec is not something you should be doing

You don't know if a request is HTTP 200 or HTTP 500 until it ends, so you have to at least collect trace data for every request as it executes. You can decide whether or not to emit trace data for a request based on its ultimate response code, but emission is gonna be out-of-band of the request lifecycle, and (in any reasonable implementation) amortized such that you really shouldn't need to care about sampling based on outcome. That is, the cost of collection is >> the cost of emission.

If your tracing system can't handle 100% of your traffic, that's a problem in that system; it's definitely not any kind of universal truth... !
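
A minimal sketch of the collection/emission split described above, assuming the Go SDK and an OTLP/gRPC exporter (neither is specified in this thread): spans are recorded synchronously in the request path, while export is batched and flushed out-of-band by the processor's background goroutine.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Exporter endpoint/credentials come from the OTEL_EXPORTER_OTLP_* env vars.
	exporter, err := otlptracegrpc.New(ctx)
	if err != nil {
		log.Fatal(err)
	}

	tp := sdktrace.NewTracerProvider(
		// Span creation (collection) stays on the request path; export
		// (emission) is batched and flushed by a background goroutine.
		sdktrace.WithBatcher(exporter,
			sdktrace.WithBatchTimeout(5*time.Second),
			sdktrace.WithMaxExportBatchSize(512),
		),
	)
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)
}
```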

anonzzzies 3 days ago

A very small % of startups gets anywhere near that traffic, so why give them angst? Most people can just do this without any issues and learn from it; only a tiny fraction shouldn't.

  • williamdclt 3 days ago

    10k/s across multiple services is reached quickly even at startup scale.

    In my previous company (a startup), we used OTel everywhere and we definitely needed sampling for cost reasons (1/30, IIRC). And that was with a much cheaper provider than Datadog.

  • cogman10 3 days ago

    Having high req/s isn't as big a negative as it once was, especially if you are using HTTP/2 or HTTP/3.

    Designing APIs that produce a high number of requests, each returning a small amount of data, can be quite legitimate. It allows for better scaling and capacity planning vs. single calls that take a long time and return large amounts of data.

    In the old HTTP/1.x days this was a bad thing, because a single connection could only service one request at a time. Getting any sort of concurrency or high request rate required many connections (which carried a lot of overhead due to the way TCP works).

    We've moved past that.

orochimaaru 3 days ago

Metrics are usually minimal overhead. Traces need to be sampled. Logs need to be sampled at error/critical levels. You also need to be able to dynamically change sampling rates and log levels.

100% traces are a mess. I didn't see where he set up sampling.
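
A minimal sketch of the "dynamically change log levels" part, assuming Go's log/slog rather than any particular library from the thread: a LevelVar can be flipped at runtime (say, from an admin endpoint or a config watcher) without restarting the service.

```go
package main

import (
	"log/slog"
	"os"
)

// logLevel defaults to Info; the handler consults it on every record.
var logLevel = new(slog.LevelVar)

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: logLevel}))
	slog.SetDefault(logger)

	slog.Debug("dropped: below the current level")
	logLevel.Set(slog.LevelDebug) // flipped at runtime, no restart needed
	slog.Debug("now emitted")
}
```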

  • phillipcarter 3 days ago

    The post didn't cover sampling, which indeed significantly reduces overhead in OTel: when you head-sample at the SDK level, spans that aren't sampled are never created. Overhead is more of a concern when doing tail-based sampling only, where you still trace every request and offload export to a sidecar so that export concerns are handled outside your app; the sidecar then routes to a sampler elsewhere in your infrastructure.

    FWIW at my former employer we had some fairly loose guidelines for folks around sampling: https://docs.honeycomb.io/manage-data-volume/sample/guidelin...

    There are outliers, but the general idea is that there's also a high cost to implementing sampling (especially for nontrivial stuff), and if your volume isn't terribly high you'll probably spend more in engineering time than you would pay for the extra data you may not necessarily need.

    • nikolay_sivko 10 hours ago

      As suggested, I measured the overhead at various sampling rates:

      No instrumentation (otel is not initialized): CPU=2.0 cores

      SAMPLING 0% (otel initialized): CPU=2.2 cores

      SAMPLING 10%: CPU=2.5 cores

      SAMPLING 50%: CPU=2.6 cores

      SAMPLING 100%: CPU=2.9 cores

      Even with 0% sampling, OpenTelemetry still adds overhead due to context propagation, span creation, and instrumentation hooks.
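
For reference, a minimal sketch of how a fixed head-sampling rate like the ones measured above is typically configured, assuming the Go SDK (the benchmark code itself isn't shown here): the decision is made at span start, so unsampled spans are never recorded.

```go
package main

import (
	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func newTracerProvider(ratio float64) *sdktrace.TracerProvider {
	return sdktrace.NewTracerProvider(
		// ParentBased keeps a trace internally consistent: child spans follow
		// the root's decision, and roots are sampled at the given ratio (0.0-1.0).
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(ratio))),
	)
}

func main() {
	tp := newTracerProvider(0.10) // e.g. the "SAMPLING 10%" case above
	otel.SetTracerProvider(tp)
}
```

The spec also defines the OTEL_TRACES_SAMPLER and OTEL_TRACES_SAMPLER_ARG environment variables for setting the same thing without code changes.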

kubectl_h 3 days ago

You have to do the tracing anyway if you are going to tail-sample based on criteria that aren't available at the beginning of the trace (like an error that occurs later in the request). You can head-sample, of course, but that's the coarsest sampling you can do, and you can't sample based on anything but the initial conditions of the trace.

What we have started doing is still tracing every unit of work, but deciding at the root span the level of instrumentation fidelity we want for the trace based on the initial conditions. Spans are still generated in the lifecycle of the trace, but we discard them at the processor level (before they are batched and sent to the collector) unless they have errors on them or the trace has been marked as "full fidelity".
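
Not kubectl_h's actual code, but a sketch of what that processor-level discard might look like with the Go SDK: a SpanProcessor wrapping the batcher that forwards a finished span only if it recorded an error or carries a (hypothetical) full-fidelity marker. Propagating that marker from the root span down to its children is left out here.

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// fullFidelityKey is a made-up attribute name standing in for however the root
// span marks a trace as "full fidelity".
const fullFidelityKey = attribute.Key("full_fidelity")

// filteringProcessor sits in front of the BatchSpanProcessor and drops spans
// that are neither errors nor part of a full-fidelity trace.
type filteringProcessor struct {
	next sdktrace.SpanProcessor
}

func (p *filteringProcessor) OnStart(parent context.Context, s sdktrace.ReadWriteSpan) {
	p.next.OnStart(parent, s)
}

func (p *filteringProcessor) OnEnd(s sdktrace.ReadOnlySpan) {
	if s.Status().Code == codes.Error {
		p.next.OnEnd(s) // always keep error spans
		return
	}
	for _, kv := range s.Attributes() {
		if kv.Key == fullFidelityKey && kv.Value.AsBool() {
			p.next.OnEnd(s) // keep spans explicitly marked full fidelity
			return
		}
	}
	// Everything else is dropped before it is ever batched or exported.
}

func (p *filteringProcessor) Shutdown(ctx context.Context) error   { return p.next.Shutdown(ctx) }
func (p *filteringProcessor) ForceFlush(ctx context.Context) error { return p.next.ForceFlush(ctx) }
```

It would be registered with something like sdktrace.NewTracerProvider(sdktrace.WithSpanProcessor(&filteringProcessor{next: sdktrace.NewBatchSpanProcessor(exporter)})).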

jhoechtl 3 days ago

I am relatively new to the topic. In the OP's sample code there is no logging, right? It's metrics and traces but no logging.

How does logging work in OTel?

  • shanemhansen 3 days ago

    To me, traces (or maybe more specifically spans) are essentially structured log entries with a unique ID and a reference to a parent ID.

    Very open to having someone explain why I'm wrong or why they should be handled separately.

    • kiitos 3 days ago

      Traces have a very specific data model, and corresponding limitations, which don't really accommodate log events/messages of arbitrary size. The access model for traces is also fundamentally different vs. that of logs.

      • phillipcarter 3 days ago

        There are practical limitations mostly with backend analysis tools. OTel does not define a limit on how large a span is. It’s quite common in LLM Observability to capture full prompts and LLM responses as attributes on spans, for example.

        • kiitos 2 days ago

          > There are practical limitations mostly with backend analysis tools

          Not just end-of-the-line analysis tools, but also the initiating SDKs, system agents, and intermediate middle-boxes: really anything that needs to parse OTel.

          Spec > SDK > Trace > Span limits: https://opentelemetry.io/docs/specs/otel/trace/sdk/#span-lim...

          Spec > Common > Attribute limits: https://opentelemetry.io/docs/specs/otel/common/#attribute-l...

          I know the spec says the default AttributeValueLengthLimit = infinity, but...

          > It’s quite common in LLM Observability to capture full prompts and LLM responses as attributes on spans, for example.

          ...I'd love to learn about any OTel-compatible pipeline/system that supports attribute values of arbitrary size, because I've personally not seen anything that lets you get bigger than O(1MB).
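
To make those limits concrete, a minimal sketch, assuming the Go SDK, of capping attribute sizes explicitly instead of relying on the defaults (the 8192-byte figure is just an example):

```go
package main

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	limits := sdktrace.NewSpanLimits()      // SDK defaults (plus env-var overrides)
	limits.AttributeValueLengthLimit = 8192 // cap each attribute value at 8 KiB
	limits.AttributeCountLimit = 128
	limits.EventCountLimit = 128

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSpanLimits(limits),
	)
	_ = tp // wire up exporters, samplers, etc. as usual
}
```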

  • phillipcarter 3 days ago

    Logging in OTel is logging with your logging framework of choice. The SDK just requires you to initialize the wrapper, and it'll then wrap your existing logging calls and correlate them with the trace/span in active context, if one exists. There is no separate logging API to learn. Logs are exported in a separate pipeline from traces and metrics.

    Implementations for many languages are starting to mature, too.
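
A minimal sketch of that wiring, assuming Go, the contrib slog bridge (go.opentelemetry.io/contrib/bridges/otelslog), and an OTLP log exporter: application code keeps calling slog, and records flow out through a separate logs pipeline, correlated with whatever span is active in the context.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/contrib/bridges/otelslog"
	"go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploggrpc"
	sdklog "go.opentelemetry.io/otel/sdk/log"
)

func main() {
	ctx := context.Background()

	// Endpoint/credentials come from the OTEL_EXPORTER_OTLP_* env vars.
	exporter, err := otlploggrpc.New(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Logs get their own pipeline, separate from traces and metrics.
	lp := sdklog.NewLoggerProvider(
		sdklog.WithProcessor(sdklog.NewBatchProcessor(exporter)),
	)
	defer lp.Shutdown(ctx)

	// A regular *slog.Logger; passing ctx lets the bridge attach the active
	// trace/span IDs to each record.
	logger := otelslog.NewLogger("my-service", otelslog.WithLoggerProvider(lp))
	logger.InfoContext(ctx, "hello from the existing logging API")
}
```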