Comment by alecco
Ignore everybody else. Start with CUDA Thrust. Study their examples carefully. See how other projects use Thrust. After a year or two, go deeper into CUB.
Do not implement algorithms by hand. On recent architectures it is extremely hard to reach decent occupancy and the like with hand-written kernels. Thrust and CUB solve 80% of the cases with reasonable trade-offs, and they do most of the work for you.
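To give a rough idea of what "let Thrust do the work" can look like, here is a minimal sketch that sorts and sums an array on the GPU without a hand-written kernel. The helper names `sort_on_device` and `sum_on_device` are made up for illustration, and the build line assumes a standard CUDA toolkit with `nvcc` on the path.

```cuda
// Build (assumption): nvcc -O2 -o thrust_demo thrust_demo.cu
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>

#include <cstdio>
#include <vector>

// Hypothetical helper: copy the input to the GPU, sort it there,
// and copy the sorted values back to the host.
std::vector<int> sort_on_device(const std::vector<int>& input) {
    thrust::device_vector<int> d(input.begin(), input.end());  // host -> device
    thrust::sort(d.begin(), d.end());                          // runs on the device
    std::vector<int> out(input.size());
    thrust::copy(d.begin(), d.end(), out.begin());             // device -> host
    return out;
}

// Hypothetical helper: sum the input on the GPU with a parallel reduction.
// The accumulator is long long so large inputs do not overflow int.
long long sum_on_device(const std::vector<int>& input) {
    thrust::device_vector<int> d(input.begin(), input.end());
    return thrust::reduce(d.begin(), d.end(), 0LL, thrust::plus<long long>());
}

int main() {
    const std::vector<int> data = {5, 3, 8, 1};

    const std::vector<int> sorted = sort_on_device(data);
    for (int v : sorted) std::printf("%d ", v);
    std::printf("\n");

    std::printf("sum = %lld\n", sum_on_device(data));
    return 0;
}
```

Thrust picks the launch configuration and the underlying algorithm here, which is the tuning you would otherwise have to do per architecture. For an input this small the host/device copies dominate, so treat it as an API illustration and not a benchmark.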
It looks quite nice just from skimming the link.
But I don’t understand the comparison to TBB. Do they have a version of TBB that runs natively on the GPU? If the TBB implementation runs on the CPU… that’s just comparing two different pieces of hardware, which would be confusing, bordering on dishonest.