redrove 2 days ago

I wouldn’t say runs. More of a gentle stroll.

  • storus 2 days ago

    I run it all the time; token generation is pretty good. Only large contexts are slow, but you can hook up a DGX Spark via the Exo Labs stack and offload token prefill to it. The upcoming M5 Ultra should be faster than the Spark at token prefill as well.
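
    Roughly, the split looks like this (a minimal sketch; the names here are made up for illustration, not Exo's actual API):

        # Hypothetical prefill/decode split. The point is the shape of the
        # handoff: the Spark does the compute-bound prefill and ships back a
        # KV cache; the Mac runs the bandwidth-bound decode loop against it.

        from dataclasses import dataclass

        @dataclass
        class KVCache:
            tokens_seen: int  # stand-in for the per-layer key/value tensors

        def prefill_on_spark(prompt_tokens: list[int]) -> KVCache:
            # Would be a network call to the Spark in a real setup.
            return KVCache(tokens_seen=len(prompt_tokens))

        def decode_locally(cache: KVCache, max_new_tokens: int) -> list[int]:
            out = []
            for _ in range(max_new_tokens):
                out.append(0)           # placeholder for the sampled token id
                cache.tokens_seen += 1  # cache grows as generation proceeds
            return out

        cache = prefill_on_spark(list(range(4096)))  # long prompt: slow locally
        completion = decode_locally(cache, 256)      # local decode loop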

    • embedding-shape 2 days ago

      > I run it all the time, token generation is pretty good.

      Because you didn't actually give numbers for prompt processing or generation speed, you aren't really giving the whole picture here. What are the prompt processing tok/s and the generation tok/s actually like?

      • storus 2 days ago

        I addressed both points: I mentioned you can offload token prefill (the slow part, 9 t/s locally) to the DGX Spark. Token generation is at 6 t/s, which is acceptable.
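
        For scale (the prompt and output lengths below are made-up examples, only the two rates come from this thread):

            # Back-of-the-envelope latency from the numbers above.
            prefill_tps = 9   # local prompt processing, tokens/s
            decode_tps = 6    # local generation, tokens/s

            prompt_tokens = 4000
            output_tokens = 300

            prefill_s = prompt_tokens / prefill_tps  # ~444 s: dominates
            decode_s = output_tokens / decode_tps    # 50 s

            print(f"prefill: {prefill_s:.0f}s, decode: {decode_s:.0f}s")
            # prefill: 444s, decode: 50s -- which is why offloading
            # prefill to faster hardware is the part worth fixing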

  • hasperdi 2 days ago

    With quantization, converting it to an MoE model... it can be a fast walk.
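
    Back-of-the-envelope, local decoding is roughly memory-bandwidth-bound, so shrinking the weights raises the speed ceiling (all numbers below are illustrative assumptions, not measurements from this thread):

        # Bandwidth-bound estimate of why quantization speeds up decoding.
        # Assumed: a 70B-parameter dense model and ~800 GB/s of memory
        # bandwidth (roughly M-series Ultra territory).
        params = 70e9
        bandwidth = 800e9  # bytes/s

        for name, bytes_per_param in [("fp16", 2.0), ("q8", 1.0), ("q4", 0.5)]:
            bytes_per_token = params * bytes_per_param  # weights read per token
            print(f"{name}: ~{bandwidth / bytes_per_token:.1f} t/s upper bound")
        # fp16: ~5.7 t/s, q8: ~11.4 t/s, q4: ~22.9 t/s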