[RFC] Vectorization support for histogram count operations

Hi, just coming back to this now. I’ve been implementing your suggestion, but I’ve also been looking at vpconflict to see whether the original approach would work.

Since vpconflict produces, in each lane, a bitvector of the lower-numbered lanes holding the same value, you can take a popcnt of that result and add a splat of one to match the behaviour of histcnt. So we can do it with the original approach.
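To make the equivalence concrete, here is a small scalar model in Python. It is only a sketch: `vpconflict`, `histcnt` and `histcnt_via_vpconflict` are illustrative helper names rather than real intrinsics, predication/masking is ignored, and histcnt is assumed to be given the same vector for both of its inputs.

```python
def vpconflict(lanes):
    """Scalar model: bit j of result[i] is set when j < i and lanes[j] == lanes[i]."""
    result = []
    for i, value in enumerate(lanes):
        bits = 0
        for j in range(i):
            if lanes[j] == value:
                bits |= 1 << j
        result.append(bits)
    return result


def histcnt(lanes, match):
    """Scalar model: for lane i, count the elements match[0..i] equal to lanes[i]."""
    return [sum(1 for j in range(i + 1) if match[j] == lanes[i])
            for i in range(len(lanes))]


def histcnt_via_vpconflict(lanes):
    # popcount of each conflict mask, plus a splat of one for the lane itself
    return [bin(mask).count("1") + 1 for mask in vpconflict(lanes)]


indices = [3, 7, 3, 3, 7, 1, 0, 3]
print(histcnt_via_vpconflict(indices))  # [1, 1, 2, 3, 2, 1, 1, 4]
print(histcnt(indices, indices))        # same result
```

The extra one accounts for the lane matching itself, which histcnt counts and vpconflict does not.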

That also gives us more flexibility, so if the programmer implemented something like saturating adds for the buckets we would easily be able to handle that.
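As a rough illustration of the saturating case, the per-lane counts can be fed into a clamped add instead of a plain add. This sketch assumes 8-bit unsigned buckets (`BUCKET_MAX` is a made-up constant), ignores masking, and assumes the scatter commits lanes in ascending order so that the highest matching lane, which carries the full count, is the value that lands in memory.

```python
BUCKET_MAX = 255  # assumption: 8-bit unsigned buckets


def lane_counts(idx):
    """Per-lane count of matching lanes at equal or lower index (histcnt-style)."""
    return [sum(1 for j in range(i + 1) if idx[j] == idx[i])
            for i in range(len(idx))]


def saturating_step(buckets, idx):
    """One vector iteration: gather, saturating add of the counts, scatter."""
    old = [buckets[i] for i in idx]                      # gather
    new = [min(o + c, BUCKET_MAX) for o, c in zip(old, lane_counts(idx))]
    for i, value in zip(idx, new):                       # scatter, ascending lanes
        buckets[i] = value
    return buckets


def scalar_reference(buckets, idx):
    """The loop the programmer wrote: one saturating increment per element."""
    for i in idx:
        buckets[i] = min(buckets[i] + 1, BUCKET_MAX)
    return buckets


print(saturating_step([254, 0, 10], [0, 0, 0, 2]))   # [255, 0, 11]
print(scalar_reference([254, 0, 10], [0, 0, 0, 2]))  # [255, 0, 11]
```

Clamping after adding the count gives the same result as repeated saturating increments, since the bucket value only ever moves upward toward the limit.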

The drawback appears to be supporting interleaving. I’m not sure we want to interleave at all, since these are relatively expensive operations, but I think we’d at least want to be able to if needed. The histcnt instruction itself would be fine, since it takes a separate input register to match against, but vpconflict only considers one input register (plus mask). So we would need to change how interleaving works slightly in order to get correct results: you can’t load the bucket values for the second (or subsequent) histogram operation before storing the result of the previous operation(s).

If you bundle the memory operations together with the histogram operation, then that ordering effectively just happens without extra work.
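A toy model of the ordering problem, assuming a vector width of 4 and an interleave count of 2. The function names are invented for illustration, masking is ignored, and the scatter is assumed to commit lanes in ascending order. `bundled` keeps each load/count/store group together, while `hoisted_loads` mimics the usual interleaving pattern of issuing all loads up front.

```python
def lane_counts(idx):
    """Per-lane count of matching lanes at equal or lower index (popcnt + 1)."""
    return [sum(1 for j in range(i + 1) if idx[j] == idx[i])
            for i in range(len(idx))]


def gather(buckets, idx):
    return [buckets[i] for i in idx]


def scatter(buckets, idx, values):
    for i, value in zip(idx, values):  # ascending lanes, highest lane wins
        buckets[i] = value


def bundled(buckets, parts):
    """Load, update and store each interleaved part before touching the next."""
    for idx in parts:
        old = gather(buckets, idx)
        scatter(buckets, idx, [o + c for o, c in zip(old, lane_counts(idx))])
    return buckets


def hoisted_loads(buckets, parts):
    """All gathers issued first: later parts see stale bucket values."""
    loaded = [gather(buckets, idx) for idx in parts]
    for idx, old in zip(parts, loaded):
        scatter(buckets, idx, [o + c for o, c in zip(old, lane_counts(idx))])
    return buckets


parts = [[1, 2, 1, 3], [1, 0, 2, 2]]
print(bundled([0, 0, 0, 0], parts))        # [1, 3, 3, 1], matches a scalar loop
print(hoisted_loads([0, 0, 0, 0], parts))  # [1, 1, 2, 1], updates from part 0 lost
```

The second part shares buckets 1 and 2 with the first, so any load of those buckets that happens before the first part's store drops the earlier increments.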

I’ll continue to prototype this, but it would be good to discuss the tradeoffs a bit.