Comment by jcgrillo 10 months ago

Why do all these things use such damnably inefficient wire formats?

For metrics, we're shipping a bunch of numbers over the wire, with some string tags. So why not something like:

  message Measurements {
    uint32 metric_id = 1;
    uint64 t0_seconds = 2;
    uint32 t0_nanoseconds = 3;
    repeated uint64 delta_nanoseconds = 4 [packed = true];
    repeated int64 values = 5 [packed = true];
  }

Where delta_nanoseconds represents a series of deltas from timestamp t0 and values has the same length as delta_nanoseconds. Tags could be sent separately:
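As a rough sketch of the timestamp layout described above, here is one way to split absolute nanosecond timestamps into t0 plus deltas and back. The function names are hypothetical, and it assumes each delta is measured from the previous sample (with the first delta being zero, relative to t0), which is one reading of "deltas from timestamp t0"; offsets measured directly from t0 would work similarly.

```python
from typing import List, Tuple

NANOS_PER_SECOND = 1_000_000_000


def to_deltas(timestamps_ns: List[int]) -> Tuple[int, int, List[int]]:
    """Split non-decreasing absolute timestamps into (t0_seconds, t0_nanoseconds, deltas).

    Assumption: each delta is relative to the previous sample, so the first
    delta is always 0 and the list has one entry per timestamp.
    """
    if not timestamps_ns:
        raise ValueError("need at least one timestamp")
    t0 = timestamps_ns[0]
    t0_seconds, t0_nanoseconds = divmod(t0, NANOS_PER_SECOND)
    deltas = []
    previous = t0
    for ts in timestamps_ns:
        if ts < previous:
            raise ValueError("timestamps must be non-decreasing")
        deltas.append(ts - previous)  # 0 is fine: two events in the same nanosecond
        previous = ts
    return t0_seconds, t0_nanoseconds, deltas


def from_deltas(t0_seconds: int, t0_nanoseconds: int, deltas: List[int]) -> List[int]:
    """Rebuild absolute nanosecond timestamps from t0 and the delta series."""
    current = t0_seconds * NANOS_PER_SECOND + t0_nanoseconds
    timestamps = []
    for delta in deltas:
        current += delta
        timestamps.append(current)
    return timestamps
```

Because the deltas stay small when samples are close together, they end up as short varints on the wire.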

  message Tags {
    uint32 metric_id = 1;
    repeated string tags = 2;
  }

That way you only have to send the tags when they change, and the values are encoded efficiently. I bet you could have really nice granular monitoring, e.g. sub-millisecond precision, quite cheaply this way.
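The "only send tags when they change" part could be handled on the sender with a small cache. This is a minimal sketch, not anything OpenTelemetry provides; `TagTracker` and `tags_to_send` are made-up names for illustration.

```python
from typing import Dict, List, Optional, Tuple


class TagTracker:
    """Remembers the last tag set sent for each metric_id (hypothetical helper)."""

    def __init__(self) -> None:
        self._last_sent: Dict[int, Tuple[str, ...]] = {}

    def tags_to_send(self, metric_id: int, tags: List[str]) -> Optional[List[str]]:
        """Return the tags if a Tags message is needed, or None if nothing changed."""
        current = tuple(tags)
        if self._last_sent.get(metric_id) == current:
            return None  # receiver already has these tags for this metric
        self._last_sent[metric_id] = current
        return list(current)
```

A sender would call `tags_to_send` before each batch and emit a Tags message only when it returns a list.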

Obviously there are further optimizations we can make if e.g. we know the values will respond nicely to delta encoding.
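To make the delta-encoding idea for values concrete, here is a small sketch. One caveat: protobuf encodes a negative int64 as a ten-byte varint, so signed deltas would likely want a zigzag-encoded type such as sint64 rather than the int64 in the schema above; that swap is an assumption of this example, and the helper names are made up. It also assumes the deltas themselves fit in 64 bits.

```python
from typing import List

MASK_64 = 0xFFFFFFFFFFFFFFFF


def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one so small magnitudes stay small."""
    return ((n << 1) ^ (n >> 63)) & MASK_64


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def delta_encode(values: List[int]) -> List[int]:
    """Replace each value with its difference from the previous one (first is vs. 0)."""
    deltas = []
    previous = 0
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the original values by running a cumulative sum over the deltas."""
    values = []
    current = 0
    for delta in deltas:
        current += delta
        values.append(current)
    return values
```

For a slowly varying series such as 100, 103, 101, 101, 98, the deltas after the first are all single-digit numbers, which zigzag maps to small unsigned integers.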

trask 10 months ago

You may be interested in this: https://github.com/open-telemetry/otel-arrow#benchmark-summa...

jcgrillo 10 months ago

Those are interesting results! I'm not surprised it works a lot better for metrics than logs and traces. Something I'd really love to have for logs/traces processing is the ability to query clp[1][2] with a dataframe interface (e.g. datafusion [3]). While I'm on that subject, I'd also prefer that interface for metrics processing. I don't need real-time streaming metrics graphs, it's perfectly fine to compute one on-demand.
I suspect something like clp is the way to go for logs-like data, that is, low entropy text with a lot of numerical content.
[1] https://www.uber.com/blog/reducing-logging-cost-by-two-order...
[2] https://www.uber.com/blog/modernizing-logging-with-clp-ii/
[3] https://github.com/apache/datafusion

antonyt 10 months ago

Do you know for sure that otel doesn't do this? Most collector pipelines I've seen use the batch processor, which may include exactly what you're describing. Not being obtuse, I've never looked at the source to see what it does.

jcgrillo 10 months ago

Not as far as I can tell from the schema definitions[1].
[1] https://github.com/open-telemetry/opentelemetry-proto/tree/v...

PeterCorless 10 months ago

Most modern developers (those who got started after 2000) never had to worry about hyperefficiency, i.e., bitpacking. To them bandwidth, like disk space, is near infinite and free. Who uses a single reserved bit to store a 0 or a 1 these days when you can use a whole int32 (or int64)?

Yet I applaud your desire to make things more wire-efficient.

jcgrillo 10 months ago

In the cloud, where data transfer fees dominate, it's really important. Although nobody seems to realize this and they just pay the Amazon tax lol.

- PeterCorless 10 months ago
  
  The board gathers around the monthly cloud vendor bill and wonders why they need to raise a new round just to pay it off.
  
bboreham 10 months ago

Generally you only have one point to send per series. You send all the points for ‘now’, then in N seconds you send them all again.

jcgrillo 10 months ago

Can you expand upon this? Why would I have more than one point per timestamp? Or am I misunderstanding?
Let's say I'm measuring some quantity, maybe the execution time of an http request handler, and that handler is firing roughly every millisecond and taking a certain amount of time to complete. I'd have about a thousand measurements per second, each with their own timestamp--which to be clear can be aliased if something happens in the same nanosecond! It's totally fine to have a delta of zero. But the point is this value is scalar--it's represented by a single point.
But it seems like you're suggesting vector-valued measurements are a common thing as well--e.g. I should expect to send multiple points per measurement? I'm struggling to think of an application where I'd want this.. I guess it would be easy enough to add more columns.. e.g. values0, values1, ...
EDIT: oh, I see, I think you're saying I should locally aggregate the measurements with some aggregation function and publish the aggregated values.. Yeah that's something I'd really prefer to avoid if possible. By aggressively aggregating to a coarse timestamp we throw away lots of interesting frequency information. But either way, I don't think that really affects this format much. You could totally use it for an aggregated measurement as well. And yeah each of these Measurements objects represents a timeseries of measurements--we'd simultaneously append to a bunch of them, one for each timeseries. I probably should have called it "Timeseries" instead.
EDIT2: It might be worth spelling out a bit why this format is efficient. It has to do with the details of how Google Protocol Buffers are encoded[1]. In particular, the timestamps are very cheap so long as the deltas are small, and the values can also be cheap if they're small numbers (e.g. also deltas), which is usually the case for continuously and sufficiently slowly varying phenomena. Moreover, packed repeated fields[2] (packing is already the default for scalar numeric fields in proto3, but I spelled it out here to be explicit about what I mean) are more efficient still because they omit the per-element tags and are just a single tag and a length followed by a bunch of variable-width encoded values. So this is leaning on packing, varints, and delta-encoding to be as compact as possible.
[1] https://protobuf.dev/programming-guides/encoding/
[2] https://protobuf.dev/programming-guides/encoding/#packed
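To put rough numbers on that, here is a hand-rolled sketch of the varint and packed-field encoding described in the protobuf encoding guide. It is for counting bytes only, not a replacement for a real protobuf library, and the function names are my own.

```python
from typing import Iterable


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint, least significant group first."""
    if n < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more groups follow
        else:
            out.append(byte)
            return bytes(out)


def encode_packed(field_number: int, values: Iterable[int]) -> bytes:
    """Encode a packed repeated varint field: one tag, one length, then the raw varints."""
    payload = b"".join(encode_varint(v) for v in values)
    tag = encode_varint((field_number << 3) | 2)  # wire type 2 = length-delimited
    return tag + encode_varint(len(payload)) + payload
```

With nanosecond deltas of 0, 1000000, 0 and 1500000 in field 4, `encode_packed` produces 10 bytes. Sending the same four samples as absolute nanosecond timestamps around 1.7e18 costs 9 bytes per value, or 38 bytes for the field.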

- bboreham 10 months ago
  
  Take a step back. You wrote:
  repeated int64 values
  I’m saying that in most cases there will only be one value, hence ‘repeated’ is unnecessary.
  I didn’t say anything about aggregation, but yes one counts things going at a thousand per second rather than sending all the detail. The Otel signal if you want detail is traces, not metrics.
  
  jcgrillo 10 months ago
  
  > Take a step back.
  Excuse me? Modify your tone, read what I wrote again, and this time make an effort to understand it. I'd be happy to answer any questions you might have.
  I'm sorry if this sounds harsh but I truly cannot tell if you're trolling or what.. I think I made a serious effort to understand what you were talking about, and it seems like you haven't done the same.
  