Comment by sparkie
We can do better than one branch per byte: we can get it down to one branch per 8 bytes at least.
We'd use `vpmovb2m`[1] on a ZMM register (64 bytes at a time), which fills a 64-bit mask register with the MSB of each byte in the vector.
Then process the mask register 1 byte at a time, using each byte as an index into a 256-entry jump table. Each entry would be specialized to process the next 8 bytes without branching, and would finish with a conditional branch to the next entry in the jump table or on to the next 64 bytes. Any trailing ones in a mask byte would simply be added to a carry, which would be consumed up to the most significant zero in the next eight bytes.
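To make the shape of this concrete, here is a rough portable C sketch, not a tuned implementation. It assumes the bytes are LEB128-style varints (a set MSB means "more bytes follow"), which the comment above does not state outright. `msb_mask64` is a scalar stand-in for what `vpmovb2m` (the `_mm512_movepi8_mask` intrinsic, given AVX-512BW) would do in one instruction, and the generic loop in `decode_group` stands in for the 256 specialized jump-table entries. The names `carry_t`, `decode_group` and `decode_block64` are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar stand-in for vpmovb2m: bit i of the result is the MSB of p[i].
   With AVX-512BW this would be a single mask extraction on a 64-byte load. */
static uint64_t msb_mask64(const uint8_t *p) {
    uint64_t m = 0;
    for (int i = 0; i < 64; i++)
        m |= (uint64_t)(p[i] >> 7) << i;
    return m;
}

/* Hypothetical carry: a value that is still open at a group boundary. */
typedef struct {
    uint64_t value;  /* payload bits accumulated so far */
    unsigned shift;  /* bit position of the next 7-bit payload */
} carry_t;

/* Decode one 8-byte group, described by one byte of the mask.
   A jump table would hold 256 unrolled, branch-free versions of this
   routine (one per mask value); this generic loop stands in for them. */
static size_t decode_group(const uint8_t *p, uint8_t mask,
                           carry_t *c, uint64_t *out) {
    size_t n = 0;
    for (int i = 0; i < 8; i++) {
        if (c->shift < 64)  /* ignore payload bits beyond 64 (malformed input) */
            c->value |= (uint64_t)(p[i] & 0x7F) << c->shift;
        if ((mask >> i) & 1) {
            c->shift += 7;          /* continuation: value stays open */
        } else {
            out[n++] = c->value;    /* terminator: emit and reset the carry */
            c->value = 0;
            c->shift = 0;
        }
    }
    return n;
}

/* Decode one 64-byte block; returns the number of completed values.
   out must have room for 64 entries. Anything still open at the end
   remains in *c for the next block. */
static size_t decode_block64(const uint8_t *p, carry_t *c, uint64_t *out) {
    uint64_t mask = msb_mask64(p);
    size_t n = 0;
    for (int g = 0; g < 8; g++)
        n += decode_group(p + 8 * g, (uint8_t)(mask >> (8 * g)), c, out + n);
    return n;
}
```

Here bit i of the mask corresponds to byte i, so a value that runs off the end of one group shows up as set high bits in that mask byte and is finished by the first clear bit in the next one. In the jump-table version the per-byte `if` disappears, because each entry already knows where its terminators are.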
[1]:https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
Sure, but with the above algorithm you could do it in zero branches, and in parallel if you like.