imiric a day ago

That's a good way of looking at it, but it assumes that both start and end events will be emitted and will successfully reach the backend. What happens if one of them doesn't?

candiddevmike a day ago

AIUI, there aren't really start or end messages, just spans. A span is effectively an "end" message: it's emitted when the operation completes, and it can have parent or child spans.

  • BoiledCabbage a day ago

    I don't know the details, but does a span have a beginning?

    Is that beginning "logged" at a separate point in time from when the span end is logged?

    > AIUI, there aren't really start or end messages,

    Can you explain this sentence a bit more? How does it have a duration without a start and end?

    • nijave a day ago

      A span is a discrete event emitted on completion. It contains arbitrary metadata (plus a few mandatory fields if you're following the OTEL spec).

      As such, it doesn't really have separate begin and end events; the single record just has fields for its timestamps and duration.

      I'd check out the OTEL docs, since I think seeing the examples as JSON helps clarify things. It looks like they also support events attached to spans, which are optional: https://opentelemetry.io/docs/concepts/signals/traces/
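
      To make that concrete, here's a minimal sketch using the OTEL Go SDK (the tracer and span names are made up, and with no SDK configured the calls are no-ops, so this only shows the shape of the API):

        package main

        import (
            "context"
            "time"

            "go.opentelemetry.io/otel"
        )

        func handleRequest(ctx context.Context) {
            tracer := otel.Tracer("example")

            // Start only records a start timestamp and the parent/child link
            // in memory; nothing is sent anywhere at this point.
            ctx, parent := tracer.Start(ctx, "handle-request")

            _, child := tracer.Start(ctx, "load-user")
            time.Sleep(10 * time.Millisecond) // stand-in for real work

            // End is the moment the child becomes a complete record (IDs,
            // start timestamp, duration, attributes) and is handed to the
            // span processor for export.
            child.End()

            parent.End() // the parent is emitted last, after its children
        }

        func main() {
            handleRequest(context.Background())
        }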

    • hinkley a day ago

      It’s been a minute since I worked on this, but IIRC no, which means that if the request times out you have to be careful to end the span yourself (a sketch of that pattern is below), and it also means all of the dependent calls show up at the collector in reverse chronological order.

      The thing is that at scale you’d never be able to guarantee that the start of a span showed up at a collector in chronological order anyway, especially since the queuing intervals are distinct per collection sidecar. But what you could do with two events is discover spans that never get an orderly ending. You could easily truncate traces that go over the span limit instead of just dropping them on the floor (fuck you for this, OTEL, this is the biggest bullshit in the entire spec). And you could reduce the number of trace IDs in your parsing buffer that have no metadata associated with them, both in aggregate and in the number of messages left in a limbo state per thousand events processed.
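
      A minimal sketch of that defensive pattern with the OTEL Go SDK (callDownstream is a hypothetical stand-in for the real dependent call):

        package main

        import (
            "context"
            "time"

            "go.opentelemetry.io/otel"
            "go.opentelemetry.io/otel/codes"
        )

        // Hypothetical helper standing in for the real call; it just honors
        // the context deadline.
        func callDownstream(ctx context.Context) error {
            select {
            case <-time.After(200 * time.Millisecond):
                return nil
            case <-ctx.Done():
                return ctx.Err()
            }
        }

        func callWithSpan(ctx context.Context) {
            ctx, span := otel.Tracer("example").Start(ctx, "downstream-call")
            defer span.End() // always end the span, even when the call times out

            if err := callDownstream(ctx); err != nil {
                // Mark the span as failed so the timeout is visible in the trace.
                span.RecordError(err)
                span.SetStatus(codes.Error, err.Error())
            }
        }

        func main() {
            ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
            defer cancel()
            callWithSpan(ctx)
        }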

lijok a day ago

Depends on the visualization system. It can either not display the entire trace or communicate to the user that the start of the trace hasn’t been received or the trace hasn’t yet concluded. It really is just a bunch of structured log lines with a common attribute to tie them together.
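
A minimal sketch of that framing (the struct and field names are illustrative, not the exact OTLP schema):

  package main

  import "fmt"

  // Each span is an independent record; a backend stitches a trace together
  // purely by grouping on TraceID and linking ParentSpanID to SpanID.
  type SpanRecord struct {
      TraceID      string // shared by every span in the trace
      SpanID       string
      ParentSpanID string // empty for the root span
      Name         string
      StartNano    int64
      EndNano      int64
  }

  func main() {
      records := []SpanRecord{
          {TraceID: "abc123", SpanID: "s1", Name: "GET /checkout", StartNano: 100, EndNano: 900},
          {TraceID: "abc123", SpanID: "s2", ParentSpanID: "s1", Name: "SELECT orders", StartNano: 200, EndNano: 400},
      }
      for _, r := range records {
          fmt.Printf("%+v\n", r)
      }
  }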

hinkley a day ago

Ugh. One of the reasons I never turned on the tracing code I painstakingly refactored into our stats code was discovering that OTEL makes no attempt to introduce a span to the collector before the child calls start talking about it. Is that really how you want to do event correlation? Time traveling seems like an expensive operation when you’re dealing with 50,000 trace events per second.

The other reason turns out to be our ops team’s problem more than OTEL’s. Well, a little of both. If a trace goes over a limit, OTEL just silently drops the entire thing, and the default size on AWS is useful for toy problems, not for retrofitting onto live systems. It’s the silent-failure defaults of OTEL that are giant footguns. Give me a fucking error log on data destruction, you asshats.

I’ll just use Prometheus next time, which is apparently what our ops team recommended anyway (except for the one individual I happened to talk to).

  • nijave a day ago

    You can usually turn logging on, but a lot of the OTEL stack defaults to best-effort delivery and silently drops data.

    We had Grafana Agent running, which wraps the reference OTEL collector (written in Go), and it was pretty easy to see in the logs when data was being dropped.

    Some of the limitations also come from the storage backend. We were using Grafana Cloud Tempo, which imposes its own limits. I'd think using a backend that doesn't enforce recency would help.

    With the OTEL collector, I'd think you could use some of the existing processors/connectors, or write your own, to handle individual spans that get too big. Not sure about other backends, but my current company uses Datadog, and their proprietary solution handles >30k spans per trace pretty easily.

    I think the biggest issue is the low cohesion, high DIY nature of OTEL. You can build powerful solutions, but you really need to get low level and assemble everything yourself, tuning timeouts, limits, etc. for your use case.
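
    For example, even in the Go SDK the batching behaviour is assembled by hand; a minimal sketch (the stdout exporter and the specific numbers are illustrative, not recommendations):

      package main

      import (
          "context"
          "log"
          "time"

          "go.opentelemetry.io/otel"
          "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
          sdktrace "go.opentelemetry.io/otel/sdk/trace"
      )

      func main() {
          ctx := context.Background()

          // The stdout exporter keeps the sketch self-contained.
          exporter, err := stdouttrace.New()
          if err != nil {
              log.Fatal(err)
          }

          // The knobs live on the batch processor: queue size, batch size,
          // and flush interval all have defaults you may need to retune.
          bsp := sdktrace.NewBatchSpanProcessor(exporter,
              sdktrace.WithMaxQueueSize(4096),
              sdktrace.WithMaxExportBatchSize(512),
              sdktrace.WithBatchTimeout(5*time.Second),
          )

          tp := sdktrace.NewTracerProvider(sdktrace.WithSpanProcessor(bsp))
          defer func() { _ = tp.Shutdown(ctx) }()
          otel.SetTracerProvider(tp)

          _, span := otel.Tracer("example").Start(ctx, "assembled-by-hand")
          span.End()
      }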

    • hinkley a day ago

      > I think the biggest issue is the low cohesion, high DIY nature of OTEL

      OTEL is the SpringBoot of telemetry, and if you think those are fighting words, then I picked the right ones.