Comment by hrpnk a day ago

19 replies

Has anyone seen OTel being used well for long-running batch/async processes? Wonder how the suggestions stack up against monolith builds for apps that take about an hour.

makeavish a day ago

You can use SpanLinks to analyse your async processes. This guide might be a helpful introduction: https://dev.to/clericcoder/mastering-trace-analysis-with-spa...

Also, SigNoz supports rendering a practically unlimited number of spans in the trace detail UI and allows filtering them as well, which has been really useful in analyzing batch processes: https://signoz.io/blog/traces-without-limits/

You can further run aggregation on spans to monitor failures and latency.

PS: I am a SigNoz maintainer.
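
For illustration, here's a minimal sketch of the span-links approach described above, assuming the OpenTelemetry Python SDK; the span names and the console exporter are placeholders, not anything SigNoz-specific:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.trace import Link

# Minimal SDK setup; in practice you'd export to your backend instead of the console.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("batch-demo")

# Producer: remember the SpanContext of whatever enqueued the work.
with tracer.start_as_current_span("enqueue-batch-item") as producer_span:
    producer_ctx = producer_span.get_span_context()
    # ... in real code, serialize the trace/span IDs alongside the job payload ...

# Worker, possibly hours later: start a fresh root span and *link* back to the
# producer instead of parenting under it, so the long job isn't nested inside a
# request-scoped trace.
with tracer.start_as_current_span("process-batch-item", links=[Link(producer_ctx)]):
    pass  # the long-running work
```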

  • ai-christianson a day ago

    Is this better than Honeycomb?

    • mdaniel a day ago

      "Better" is always "for what metric" but if nothing else having the source code to the stack is always "better" IMHO even if one doesn't choose to self-host, and that goes double for SigNoz choosing a permissive license, so one doesn't have to get lawyers involved to run it

      ---

      While digging into Honeycomb's open source story, I did find these two awesome toys, one relevant to the OTel discussion and one just neato:

      https://github.com/honeycombio/refinery (Apache 2) -- Refinery is a tail-based sampling proxy and operates at the level of an entire trace. Refinery examines whole traces and intelligently applies sampling decisions to each trace. These decisions determine whether to keep or drop the trace data in the sampled data forwarded to Honeycomb.

      https://github.com/honeycombio/gritql (MIT) -- GritQL is a declarative query language for searching and modifying source code

zdc1 a day ago

I've tried and failed at tracing transactions that span multiple queues (with different backends). In the end I just published some custom metrics for the transaction's success count / failure count / duration and moved on with my life.
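
For reference, the "custom metrics and move on" approach is only a few lines with the OTel metrics API. A rough sketch assuming the Python SDK, with made-up instrument names and a console exporter standing in for a real backend:

```python
import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

# Minimal SDK setup.
metrics.set_meter_provider(
    MeterProvider(metric_readers=[PeriodicExportingMetricReader(ConsoleMetricExporter())])
)
meter = metrics.get_meter("txn-demo")

txn_success = meter.create_counter("txn.success")
txn_failure = meter.create_counter("txn.failure")
txn_duration = meter.create_histogram("txn.duration", unit="ms")

start = time.monotonic()
try:
    # ... run the multi-queue transaction ...
    txn_success.add(1, {"queue": "billing"})
except Exception:
    txn_failure.add(1, {"queue": "billing"})
    raise
finally:
    txn_duration.record((time.monotonic() - start) * 1000, {"queue": "billing"})
```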

sethammons 21 hours ago

We had a hell of a time attempting to roll out OTel for that kind of work. Our scale was also billions of requests per day.

We ended up taking tracing out of these jobs, and only using it on requests that finish in short order, like UI web requests. For our longer jobs and fanout work, we started passing a metadata object around that appended timing data related to that specific job; then, at egress, we would capture the timing metadata and flag abnormalities.
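
A hypothetical sketch of that kind of hand-rolled timing metadata, purely to illustrate the shape of the approach (none of these names come from the comment above):

```python
import time
from dataclasses import dataclass, field


@dataclass
class JobTimings:
    """Accumulates per-step timings for one job as it moves through the pipeline."""
    job_id: str
    steps: list = field(default_factory=list)

    def record(self, step: str, started: float) -> None:
        self.steps.append({"step": step, "ms": (time.monotonic() - started) * 1000})

    def abnormalities(self, budget_ms: float = 60_000) -> list:
        # At egress: flag any step that blew past its expected budget.
        return [s for s in self.steps if s["ms"] > budget_ms]


timings = JobTimings(job_id="job-123")
t0 = time.monotonic()
# ... do one stage of the fan-out work, passing `timings` along with the job ...
timings.record("resize-images", t0)
print(timings.abnormalities())
```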

dboreham a day ago

It doesn't matter how long things take. The best way to understand this is to realize that OTel tracing (and all other similar things) are really "fancy logging systems". Some agent code emits a log message every time something happens (e.g. batch job begins, batch job ends). Something aggregates those log messages into some place they can be coherently scanned. Then something scans those messages, generating some visualization you view. Everything could be done with text messages in text files and some awk script. A tracing system is just that with batteries included and a pretty UI.

Understood this way, it should now be clear why the duration of a monitored task is not relevant: once the "begin task" message has been generated, all that has to happen is the sampling agent remembers the span ID. Then when the "end task" message is emitted, it has the same span ID. That way the two can be correlated and rendered as a task with some duration.

There's always a way to propagate the span ID from place to place (e.g. in an HTTP header), so correlation can be done between processes/machines. This explains sibling comments about not being able to track tasks between workflows: the span ID wasn't propagated.
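
Since the propagation step is the part people tend to miss, here's a minimal sketch of it with the Python SDK; the dict `carrier` stands in for HTTP headers, a queue message's metadata, or wherever else you stash the context:

```python
from opentelemetry import propagate, trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Minimal SDK setup; the console exporter stands in for a real backend.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("propagation-demo")

# Process A: open a span and stamp the outgoing message with the trace context
# (the default propagator writes a W3C `traceparent` entry into the carrier).
carrier: dict = {}
with tracer.start_as_current_span("batch-job"):
    propagate.inject(carrier)
    # ... send `carrier` along with the work ...

# Process B, possibly much later: pull the context back out so its span joins
# the same trace. The span is only exported when it ends, carrying both
# timestamps, which is why the task's duration doesn't matter.
ctx = propagate.extract(carrier)
with tracer.start_as_current_span("batch-job-step", context=ctx):
    pass  # the actual work
```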

  • imiric a day ago

    That's a good way of looking at it, but it assumes that both start and end events will be emitted and will successfully reach the backend. What happens if one of them doesn't?

    • candiddevmike a day ago

      AIUI, there aren't really start or end messages, they're spans. A span is technically an "end" message and will have parent or child spans.

      • BoiledCabbage a day ago

        I don't know the details but does a span have a beginning?

        Is that beginning "logged" at a separate point in time from when the span end is logged?

        > AIUI, there aren't really start or end messages,

        Can you explain this sentence a bit more? How does it have a duration without a start and end?

    • lijok a day ago

      Depends on the visualization system. It can either not display the entire trace or communicate to the user that the start of the trace hasn’t been received or the trace hasn’t yet concluded. It really is just a bunch of structured log lines with a common attribute to tie them together.

    • hinkley a day ago

      Ugh. One of the reasons I never turned on the tracing code I painstakingly refactored into our stats code was discovering that OTEL makes no attempt to introduce a span to the collector prior to child calls talking about it. Is that really how you want to do event correlation? Time traveling seems like an expensive operation when you’re dealing with 50,000 trace events per second.

      The other turns out to be our Ops team’s problem more than OTEL’s. Well, a little of both. If a trace goes over a limit, then OTEL just silently drops the entire thing, and the default size on AWS is useful for toy problems, not for retrofitting onto live systems. It’s the silent-failure defaults of OTEL that are giant footguns. Give me a fucking error log on data destruction, you asshats.

      I’ll just use Prometheus next time, which is apparently what our Ops team recommended (except for the one individual I happened to talk to).

      • nijave a day ago

        You can usually turn logging on, but a lot of the OTEL stack defaults to best effort and silently drops data.

        We had Grafana Agent running, which wraps the reference-implementation OTEL collector written in Go, and it was pretty easy to see via logs when data was being dropped.

        I think some of the limitations are also on the storage backend. We were using Grafana Cloud Tempo, which imposes limits. I'd think using a backend that doesn't enforce recency would help.

        With the OTEL collector, I'd think you could use some processors/connectors, or write your own, to handle individual spans that get too big. Not sure about other backends, but my current company uses Datadog, and their proprietary solution handles >30k spans per trace pretty easily.

        I think the biggest issue is the low cohesion, high DIY nature of OTEL. You can build powerful solutions, but you really need to get low level and assemble everything yourself, tuning timeouts, limits, etc. for your use case (one SDK-side example of such a knob is sketched below, after this subthread).

        • hinkley a day ago

          > I think the biggest issue is the low cohesion, high DIY nature of OTEL

          OTEL is the Spring Boot of telemetry, and if you think those are fighting words, then I picked the right ones.

  • hinkley a day ago

    Every time people talk about OTel I discover half the people are talking about spans rather than stats. For stats it’s not a ‘fancy logger’ because it’s condensing the data at various steps.

    And if you’ve ever tried to trace a call tree using correlation IDs and Splunk queries and still say OTEL is ‘just a fancy logger’, then you’re in dangerous territory, even if it’s just by way of explanation. Don’t feed the masochists. When masochists derail attempts at pain reduction, they become sadists.
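
Circling back to the limit-tuning complaints above: this doesn't touch collector- or backend-side trace limits like the AWS one mentioned earlier, but the SDK's own per-span caps are at least explicit knobs. A sketch assuming the Python SDK, with hypothetical numbers:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import SpanLimits, TracerProvider

# Raising these per-span caps is one of the knobs you end up tuning per use case.
provider = TracerProvider(
    span_limits=SpanLimits(
        max_events=2048,      # span events kept per span before truncation
        max_attributes=512,   # attributes kept per span
        max_links=512,        # links kept per span
    )
)
trace.set_tracer_provider(provider)
```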

madduci a day ago

I use OTel running in a GKE cluster to track Jenkins jobs; its spans/traces can track long-running jobs pretty well.