embedding-shape 3 days ago

Small, subtle errors that only show up on certain execution paths could be one. You might place things differently onto the GPU depending on how large the batch is, say if you've found one approach to be faster when batch_size < 1024 but another when batch_size > 1024. As the number of concurrent incoming requests goes up, you increase batch_size, and suddenly you're running the other path. That's just one possibility; I'd guess there could be a multitude of reasons, as it's really hard to reason about until you sit with the data in front of you. vLLM has had bugs with this sort of thing too, so it wouldn't surprise me.
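A minimal sketch of what that dispatch pattern looks like; the threshold, function names, and the dropped-scaling bug are all hypothetical, not vLLM's actual code:

    import numpy as np

    def attention_ref(q, k, v):
        # reference path: softmax(q @ k.T / sqrt(d)) @ v
        scores = q @ k.T / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return (w / w.sum(axis=-1, keepdims=True)) @ v

    def attention_fast(q, k, v):
        # hypothetical "optimized" path with a subtle bug:
        # the 1/sqrt(d) scaling was dropped during the rewrite
        scores = q @ k.T
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return (w / w.sum(axis=-1, keepdims=True)) @ v

    def attention(q, k, v, batch_size):
        # dispatch on batch size: the buggy branch only runs once the
        # scheduler packs enough concurrent requests into one batch
        if batch_size < 1024:
            return attention_ref(q, k, v)
        return attention_fast(q, k, v)

Under light load the buggy branch never executes, so the bug only shows up in benchmarks run at high concurrency.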

  • chrisjj 3 days ago

    Wouldn't you think that was as likely to increase intelligence as to decrease it, so it would average out to nil in the benchmarks?

    • embedding-shape 3 days ago

      No, I'm not sure how that'd make sense. Either you're making the correct (expected) calculations, or you're getting it wrong. Depending on the type of wrong, or how wrong, it could go from "used token #2 in attention instead of #1", so "blue" instead of "Blue" or whatever, to completely incoherent text and garbled output.
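
      A toy illustration of the "blue" vs "Blue" case at the logits level; the vocabulary and numbers here are made up:

          import numpy as np

          vocab = ["Blue", "blue", "sky"]
          logits = np.array([4.01, 4.00, 1.00])    # "Blue" and "blue" nearly tied
          print(vocab[int(np.argmax(logits))])     # correct path -> "Blue"

          # a small numerical error on the buggy path nudges the logits
          perturbed = logits + np.array([-0.02, 0.0, 0.0])
          print(vocab[int(np.argmax(perturbed))])  # buggy path -> "blue"

      When the error is bigger than a near-tie like this, every token choice can go wrong, and the output degrades into incoherence rather than a one-character difference.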

      • chrisjj 3 days ago

        I accept errors are more likely to decrease "intelligence". But I don't see how increased load, through batching, is any more likely to increase than decrease errors.