Comment by adefa 5 hours ago

Benchmarks using DGX Spark on vLLM 0.15.1.dev0+gf17644344

  FP8: https://huggingface.co/Qwen/Qwen3-Coder-Next-FP8

  Sequential (single request)

    Prompt     Gen     Prompt Processing    Token Gen
    Tokens     Tokens  (tokens/sec)         (tokens/sec)
    ------     ------  -----------------    -----------
       521        49            3,157            44.2
     1,033        83            3,917            43.7
     2,057        77            3,937            43.6
     4,105        77            4,453            43.2
     8,201        77            4,710            42.2
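
For a rough sense of what these rates mean per request, the sketch below converts each row into approximate prefill and decode times. It assumes the two phases simply add up and ignores scheduling and network overhead, so treat the totals as ballpark figures rather than measured latencies. The function name is illustrative only.

```python
# Rows copied from the sequential table above:
# (prompt tokens, generated tokens, prompt tok/s, generation tok/s)
ROWS = [
    (521, 49, 3157, 44.2),
    (1033, 83, 3917, 43.7),
    (2057, 77, 3937, 43.6),
    (4105, 77, 4453, 43.2),
    (8201, 77, 4710, 42.2),
]


def estimate_seconds(prompt_tokens, gen_tokens, pp_rate, tg_rate):
    """Return (prefill_seconds, decode_seconds).

    Assumption: prefill and decode run back to back with no other overhead.
    """
    return prompt_tokens / pp_rate, gen_tokens / tg_rate


for prompt_tokens, gen_tokens, pp_rate, tg_rate in ROWS:
    prefill, decode = estimate_seconds(prompt_tokens, gen_tokens, pp_rate, tg_rate)
    print(
        f"{prompt_tokens:>6} prompt tokens: "
        f"prefill ~{prefill:.2f}s, decode ~{decode:.2f}s, "
        f"total ~{prefill + decode:.2f}s"
    )
```

Even at 8K prompt tokens, prefill and the 77-token decode take a comparable amount of time under this estimate, so longer generations would be dominated by the decode rate.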

  Parallel (concurrent requests)

    pp4096+tg128 (4K context, 128 gen):

     n    t/s
    --    ----
     1    28.5
     2    39.0
     4    50.4
     8    57.5
    16    61.4
    32    62.0

    pp8192+tg128 (8K context, 128 gen):

     n    t/s
    --    ----
     1    21.6
     2    27.1
     4    31.9
     8    32.7
    16    33.7
    32    31.7
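
To make the scaling in the two parallel tables easier to read, here is a small sketch that computes the speedup over n=1 and a per-request share. The per-request share assumes the t/s column is aggregate throughput across all concurrent requests, which the tables do not state explicitly, so that column is a guess at interpretation rather than a measurement.

```python
# Throughput (t/s) by number of concurrent requests, copied from the tables above.
PP4096 = {1: 28.5, 2: 39.0, 4: 50.4, 8: 57.5, 16: 61.4, 32: 62.0}
PP8192 = {1: 21.6, 2: 27.1, 4: 31.9, 8: 32.7, 16: 33.7, 32: 31.7}


def scaling(table):
    """Map n -> (speedup over n=1, per-request share).

    Assumption: t/s is the aggregate rate across all n requests,
    so the per-request share is simply t/s divided by n.
    """
    base = table[1]
    return {n: (tps / base, tps / n) for n, tps in table.items()}


for name, table in (("pp4096+tg128", PP4096), ("pp8192+tg128", PP8192)):
    print(name)
    for n, (speedup, share) in scaling(table).items():
        print(f"  n={n:>2}: {speedup:.2f}x over n=1, ~{share:.1f} t/s per request")
```

Read this way, the 4K case gains little beyond n=16, and the 8K case peaks at n=16 and drops slightly at n=32.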

cmrdporcupine 3 hours ago

I tried the FP8 in vLLM on my Spark. Although it fit in memory, it started swapping as soon as I ran any queries, and I couldn't get a context larger than 8k.

I figured out later that this is apparently because vLLM dequantizes to BF16 at runtime, so is it pointless to run the FP8?

I get about 30-35 tok/sec using llama.cpp and a 4-bit quant, with a 200k+ context, using only 50GB of RAM.

  • justaboutanyone 2 hours ago

    With llama.cpp rather than vLLM, it's happy enough to run the FP8 variant with 200k+ context, using about 90GB of VRAM.

    • cmrdporcupine an hour ago

      Yeah, what did you get for tok/sec there though? Memory bandwidth is the limitation with these devices. With 4-bit I didn't get over 35-39 tok/sec, and averaged more like 30 when doing actual tool use with opencode. I can't imagine FP8 being faster.
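
The memory-bandwidth argument can be sketched as a back-of-the-envelope ceiling: each generated token has to read the active weights from memory at least once, so decode speed cannot exceed bandwidth divided by bytes read per token. The numbers below are assumptions, not figures from this thread: roughly 273 GB/s of memory bandwidth for the Spark and roughly 3B active parameters per token for this model. The estimate also ignores KV cache reads, activations, and kernel overhead, so real throughput will land well below it.

```python
def decode_ceiling_tok_per_sec(bandwidth_bytes_per_sec, active_params, bytes_per_weight):
    """Upper bound on decode tokens/sec if every active weight is read once per token.

    Ignores KV cache traffic, activations, and compute, so this is only a ceiling.
    """
    bytes_per_token = active_params * bytes_per_weight
    return bandwidth_bytes_per_sec / bytes_per_token


# Assumed values (not stated in the thread): adjust for your own hardware and model.
BANDWIDTH = 273e9      # bytes/sec, assumed memory bandwidth
ACTIVE_PARAMS = 3e9    # assumed active parameters per token (MoE)

for label, bytes_per_weight in (("4-bit", 0.5), ("FP8", 1.0), ("BF16", 2.0)):
    ceiling = decode_ceiling_tok_per_sec(BANDWIDTH, ACTIVE_PARAMS, bytes_per_weight)
    print(f"{label:>5}: at most ~{ceiling:.0f} tok/sec")
```

Under these assumptions the ceiling halves each time the bytes per weight double, which is the intuition behind expecting FP8 to decode no faster than a 4-bit quant on the same runtime. It says nothing about differences between runtimes such as vLLM and llama.cpp.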