Comment by buildbot
Thread counts per block are limited to 1024 (unless I’ve missed a change and Wikipedia is wrong), but total threads per kernel is 1024 * (2^32-1) * 65535 * 65535 ~= 2^74 threads
https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming...
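For anyone who wants to check the arithmetic, here is a rough sketch in Python. The limits are hard-coded from the comment rather than queried from a device, and the names (`max_threads`, `GRID_X_QUOTED`, and so on) are made up for illustration. One caveat: as far as I know, NVIDIA's compute capability tables list the grid x-dimension limit as 2^31-1 on current hardware, not 2^32-1, so the second figure below uses that value as an assumption and lands one power of two lower.

```python
import math

# Limits as quoted in the comment (assumed values, not read from a GPU).
MAX_THREADS_PER_BLOCK = 1024
GRID_X_QUOTED = 2**32 - 1       # figure used in the comment
GRID_X_DOCUMENTED = 2**31 - 1   # assumption: value from NVIDIA's compute capability tables
GRID_Y = 65535
GRID_Z = 65535


def max_threads(grid_x, grid_y=GRID_Y, grid_z=GRID_Z,
                per_block=MAX_THREADS_PER_BLOCK):
    """Upper bound on threads in one launch: threads per block times blocks per grid."""
    return per_block * grid_x * grid_y * grid_z


quoted_total = max_threads(GRID_X_QUOTED)
documented_total = max_threads(GRID_X_DOCUMENTED)

# Python integers are arbitrary precision, so the products are exact.
print(f"quoted limits:     about 2^{math.log2(quoted_total):.2f}")
print(f"documented limits: about 2^{math.log2(documented_total):.2f}")
```

Either way the per-block cap of 1024 is unchanged; only the per-launch total moves.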
Yeah I’m talking about the limit per-block.