eldenring 2 days ago

The only real benefit is privacy, which 99.9% of people don't care about. Almost all serving metrics (cost, throughput, TTFT) are better with large GPU clusters. Latency is usually hidden by prefill cost.
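
A rough back-of-envelope sketch of that last point, with made-up numbers (the prompt length, prefill throughput, and 100 ms round trip are assumptions, not benchmarks): the extra network hop to a remote cluster is small next to the prefill time, so serving locally buys little on time-to-first-token.

```python
# Toy TTFT estimate: network round trip + prompt prefill time.
# All figures below are assumptions for illustration, not measurements.

def ttft_seconds(prompt_tokens, prefill_tok_per_s, network_rtt_s=0.0):
    return network_rtt_s + prompt_tokens / prefill_tok_per_s

local  = ttft_seconds(8000, prefill_tok_per_s=2000)                      # 4.0 s
remote = ttft_seconds(8000, prefill_tok_per_s=2000, network_rtt_s=0.1)   # 4.1 s
print(f"local {local:.1f}s vs remote {remote:.1f}s")  # the 100 ms RTT is noise
```

And in practice a large cluster prefills much faster than a single local box, which only widens the gap.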

  • cowpig 2 days ago

    More and more people I talk to care about privacy, but not in SF

  • mistercheph a day ago

    and sovereignty. I can go into the woods with a fuzzy approximation of all internet text in my backpack

jameslk 2 days ago

128 GB should be enough for anybody (just kidding). I hope the M5 Max will have higher RAM limits

  • aryonoco 2 days ago

    M5 Max probably won’t, but M5 Ultra probably will

ainch a day ago

As LLMs are productionised/commodified, they're incorporating changes that are enthusiast-unfriendly. Small dense models are great for enthusiasts running inference locally, but for parallel batched inference MoE models are much more efficient.
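
A toy illustration of that trade-off, using made-up model shapes (the expert count, per-token routing, and parameter split are assumptions): an MoE model activates only a small fraction of its weights per token, so a provider batching many concurrent requests keeps every expert busy, while a single local user still has to hold all the experts in RAM.

```python
# Fraction of weights touched per token for a hypothetical MoE model:
# all non-expert layers plus the routed experts.
def moe_active_fraction(n_experts, experts_per_token, expert_share_of_params=0.9):
    return (1 - expert_share_of_params) + expert_share_of_params * experts_per_token / n_experts

dense_active = 1.0                                        # dense: every weight, every token
moe_active = moe_active_fraction(n_experts=64, experts_per_token=4)

print(f"dense: {dense_active:.0%} of weights active per token")   # 100%
print(f"MoE:   {moe_active:.0%} of weights active per token")     # ~16%
# A local user pays RAM for 100% of the MoE weights but compute for only ~16%;
# a batched server amortizes that memory across many concurrent requests.
```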