Comment by hinkley
Ugh. One of the reasons I never turned on the tracing code I painstakingly refactored into our stats code was discovering that OTEL makes no attempt to introduce a span to the collector before its child calls start talking about it. Is that really how you want to do event correlation? Time traveling seems like an expensive operation when you’re dealing with 50,000 trace events per second.
The other turns out to be our Ops team’s problem more than OTEL’s. Well, a little of both. If a trace goes over a limit, OTEL just silently drops the entire thing, and the default size on AWS is useful for toy problems, not for retrofitting onto live systems. It’s the silent-failure defaults of OTEL that are giant footguns. Give me a fucking error log on data destruction, you asshats.
I’ll just use Prometheus next time, which is apparently what our Ops team recommended (except for one individual, who happened to be the one I talked to).
You can usually turn logging on, but a lot of the OTEL stack defaults to best-effort delivery and silently drops data.
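On the app side (Go SDK) you can at least make the drops visible yourself. A minimal sketch, assuming the OTLP gRPC exporter and the default batch processor; the endpoint and numbers are placeholders, not a recommended config:

    package main

    import (
        "context"
        "log"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    func main() {
        ctx := context.Background()

        // Route OTEL-internal errors (export failures and the like) through
        // our own logger so they are at least visible.
        otel.SetErrorHandler(otel.ErrorHandlerFunc(func(err error) {
            log.Printf("otel error (data may have been dropped): %v", err)
        }))

        exp, err := otlptracegrpc.New(ctx,
            otlptracegrpc.WithEndpoint("collector:4317"), // placeholder endpoint
            otlptracegrpc.WithInsecure(),
        )
        if err != nil {
            log.Fatal(err)
        }

        tp := sdktrace.NewTracerProvider(
            // Bigger queue, and block producers instead of silently dropping
            // spans when the queue fills up.
            sdktrace.WithBatcher(exp,
                sdktrace.WithMaxQueueSize(8192),
                sdktrace.WithBlocking(),
            ),
        )
        defer func() { _ = tp.Shutdown(ctx) }()
        otel.SetTracerProvider(tp)
    }

Whether blocking is acceptable at 50k events/sec is its own question, but at least the drops stop being invisible.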
We had Grafana Agent running, which wraps the reference OTEL Collector implementation written in Go, and it was pretty easy to see from its logs when data was being dropped.
I think some of the limits also come from the storage backend. We were using Grafana Cloud Tempo, which imposes its own limits; I'd think using a backend that doesn't enforce recency would help.
With the OTEL Collector I'd think you could use some of the existing processors/connectors, or write your own, to handle individual spans that get too big. Not sure about other backends, but my current company uses Datadog, and their proprietary solution handles >30k spans per trace pretty easily.
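A lower-effort alternative to a custom collector processor is doing it in the SDK. A sketch (Go SDK) of a custom SpanProcessor that just logs suspiciously large spans; the type name and thresholds are made up:

    package tracing

    import (
        "context"
        "log"

        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    // oversizeLogger flags spans carrying an unusually large number of
    // attributes or events, so the offenders can be found before a
    // collector/backend limit silently eats the whole trace.
    type oversizeLogger struct {
        maxAttrs, maxEvents int
    }

    func (p oversizeLogger) OnStart(ctx context.Context, s sdktrace.ReadWriteSpan) {}

    func (p oversizeLogger) OnEnd(s sdktrace.ReadOnlySpan) {
        if len(s.Attributes()) > p.maxAttrs || len(s.Events()) > p.maxEvents {
            log.Printf("oversized span %q: %d attrs, %d events (trace %s)",
                s.Name(), len(s.Attributes()), len(s.Events()),
                s.SpanContext().TraceID())
        }
    }

    func (p oversizeLogger) Shutdown(ctx context.Context) error   { return nil }
    func (p oversizeLogger) ForceFlush(ctx context.Context) error { return nil }

    // newGuardedProvider wires the logger in next to the normal batcher.
    func newGuardedProvider(exp sdktrace.SpanExporter) *sdktrace.TracerProvider {
        return sdktrace.NewTracerProvider(
            sdktrace.WithSpanProcessor(oversizeLogger{maxAttrs: 128, maxEvents: 256}),
            sdktrace.WithBatcher(exp),
        )
    }

The collector-side equivalent typically means writing a processor and building a custom collector distribution, which is more plumbing but keeps the logic out of every application.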
I think the biggest issue is the low-cohesion, high-DIY nature of OTEL. You can build powerful solutions, but you really need to get low-level and assemble everything yourself, tuning timeouts, limits, etc. for your use case.
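Concretely (Go SDK again; the numbers are placeholders): span limits decide how much a single span may carry before the SDK truncates it, and the batch processor options control flush interval, export deadline, and batch size, all things you end up tuning per workload.

    package tracing

    import (
        "time"

        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    // newTunedProvider shows the kind of knobs you end up turning by hand.
    // The numbers are placeholders, not recommendations.
    func newTunedProvider(exp sdktrace.SpanExporter) *sdktrace.TracerProvider {
        limits := sdktrace.NewSpanLimits() // SDK defaults, overridable via env vars
        limits.AttributeCountLimit = 64
        limits.EventCountLimit = 256
        limits.AttributeValueLengthLimit = 4096

        return sdktrace.NewTracerProvider(
            sdktrace.WithRawSpanLimits(limits),
            sdktrace.WithBatcher(exp,
                sdktrace.WithBatchTimeout(2*time.Second),   // flush interval
                sdktrace.WithExportTimeout(10*time.Second), // per-export deadline
                sdktrace.WithMaxExportBatchSize(512),       // spans per OTLP request
            ),
        )
    }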