Comment by pixelsynth 5 days ago
Yes, Spark does instanced rendering of quads, one covering each Gaussian splat. The sorting is done by 1) calculating sort distance for every splat on the GPU, 2) reading it back to the CPU as float16s, 3) doing a 1-pass bucket sort to get an ordering of all the splats from back to front.
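To make step 3 concrete, here is a minimal sketch of a one-pass bucket (counting) sort over 16-bit keys. This is not Spark's actual code: the function name `sortBackToFront` and the key layout are assumptions for illustration. It assumes each key holds the raw bits of a non-negative float16 distance, in which case the integer order of the bits matches the numeric order of the distances.

```typescript
// Hypothetical sketch, not Spark's implementation.
// keys[i] = raw float16 bits of splat i's (non-negative) sort distance.
// Returns splat indices ordered back to front (largest distance first).
function sortBackToFront(keys: Uint16Array): Uint32Array {
  const BUCKETS = 1 << 16; // one bucket per possible 16-bit key
  const counts = new Uint32Array(BUCKETS);

  // Histogram: count how many splats fall into each bucket.
  for (let i = 0; i < keys.length; i++) {
    counts[keys[i]]++;
  }

  // Exclusive prefix sum, walking from the farthest key to the nearest,
  // so counts[k] becomes the first output slot for bucket k.
  let offset = 0;
  for (let k = BUCKETS - 1; k >= 0; k--) {
    const c = counts[k];
    counts[k] = offset;
    offset += c;
  }

  // Scatter: place each splat index into its bucket's next free slot.
  const order = new Uint32Array(keys.length);
  for (let i = 0; i < keys.length; i++) {
    order[counts[keys[i]]++] = i;
  }
  return order;
}
```

With one bucket per distinct key, every splat in a bucket has an identical key, so in this sketch there is nothing left to sort inside a bucket; ties simply keep their input order.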
On most newer devices the sorting can happen pretty much every frame with approx 1 frame latency, and runs in parallel on a Web Worker. So the sorting itself has minimal performance impact, and because of that Spark can do fully dynamic 3DGS where every splat can move independently each frame!
On some older Android devices it can be a few frames' worth of latency, and in that case you could say it's amortized over a few frames. But since it all happens in parallel, there's no real impact on the overall rendering performance. I expect that for most devices the sorting in Spark is mostly a solved problem, especially with increasing memory bandwidth and shared CPU-GPU memory.
When you say 1-pass bucket sort, I assume you sort the buckets as well?
I've implemented a radix sort on the GPU to sort the splats (every frame), and I'm not quite happy with the performance yet. A radix sort (plus prefix scan) is quite involved, with lots of dedicated hierarchical compute shaders. I might have to get back to tuning it.
I might switch to float16s as well, but I'm a bit hesitant, as 1 million+ splats may exceed the precision of halfs.
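To put rough numbers on that worry: a float16 has only 65,536 bit patterns, so with 1 million+ splats many of them must share a key. The sketch below estimates the gap between adjacent float16 values near a given distance, based on the IEEE 754 half format (10 explicit significand bits, minimum normal exponent of -14). The function name `float16Spacing` is hypothetical, and it assumes the input is finite, non-zero, and within half range (magnitude up to 65504).

```typescript
// Hypothetical helper: spacing between neighbouring float16 values near x.
// Assumes x is finite, non-zero, and |x| <= 65504 (the largest finite half).
function float16Spacing(x: number): number {
  const exponent = Math.floor(Math.log2(Math.abs(x)));
  // Below 2^-14 the format is subnormal and the spacing stays at 2^-24.
  const clamped = Math.max(exponent, -14);
  // 10 explicit significand bits: spacing is 2^(exponent - 10).
  return Math.pow(2, clamped - 10);
}

console.log(float16Spacing(1.5));  // spacing near 1.5
console.log(float16Spacing(100));  // spacing near 100
console.log(float16Spacing(1000)); // spacing near 1000
```

Under these assumptions, two splats closer together in depth than this spacing can land on the same key and are then ordered arbitrarily relative to each other. Whether that is visible depends on the scene scale and on how the distances are mapped into the half range.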