Comment by sethammons

We had a hell of a time attempting to roll out OTel for that kind of work. Our scale was also billions of requests per day.

We ended up taking tracing out of these jobs and only using it on requests that finish in short order, like UI web requests. For our longer jobs and fanout work, we started passing around a metadata object that accumulated timing data for that specific job; at egress, we would capture the timing metadata and flag abnormalities.
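
To make the metadata-object idea concrete, here is a minimal Python sketch. It is not the actual implementation: `JobTimings`, `timed`, `flag_abnormal`, the stage names, and the per-stage time budgets are all hypothetical, and "abnormal" is assumed to mean "a stage took longer than its budget".

```python
import time
from dataclasses import dataclass, field


@dataclass
class JobTimings:
    """Hypothetical metadata object that travels with one job."""
    job_id: str
    stages: list = field(default_factory=list)  # (stage name, seconds) pairs

    def record(self, stage, seconds):
        # Append a timing entry; each worker that touches the job adds its own.
        self.stages.append((stage, seconds))

    def timed(self, stage):
        # Context manager that measures a block and records it under `stage`.
        return _StageTimer(self, stage)


class _StageTimer:
    def __init__(self, timings, stage):
        self.timings = timings
        self.stage = stage

    def __enter__(self):
        self.start = time.monotonic()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.timings.record(self.stage, time.monotonic() - self.start)
        return False  # never swallow exceptions from the timed block


def flag_abnormal(timings, budgets):
    """At egress: return the stages that ran longer than their budget (assumed rule)."""
    return [
        (stage, seconds)
        for stage, seconds in timings.stages
        if stage in budgets and seconds > budgets[stage]
    ]


# Usage: the object is passed along with the job and inspected once at the end.
meta = JobTimings(job_id="job-123")
with meta.timed("enqueue"):
    pass  # real work would happen here
meta.record("fanout", 42.0)  # e.g. a duration reported back by a downstream worker
slow_stages = flag_abnormal(meta, {"enqueue": 5.0, "fanout": 30.0})
```

In a real system the object would need to be serialized into the job payload so it survives queue hops; that part is left out here.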