[RFC] Vectorization support for histogram count operations

Hello community,

We would like to propose a mechanism that detects and safely vectorizes histogram computations. This is a work in progress, and a Proof-of-Concept Pull Request will be opened in the coming days.

1. Background:

What is a histogram, and where is it used

A histogram is typically used in image processing, where it ‘counts’ the actual values occurring in an image. For a colored input image, a histogram would hold counters for the occurrences of each ‘red’, ‘green’, and ‘blue’ value. In a typical RGB model, each color channel takes values in the range 0-255.

In our context, the input is just an integer array, and we want to count the number of occurrences of each element and store the counts in a separate array (ie the histogram). The input values fall in a consecutive range starting from zero, so there is no need for any advanced hash function to decide where to store each counter. Instead, we can use the input values directly as indices into the histogram array (see the motivating example code).


2. Motivating example:

The snippet below shows a simple, typical histogram computation. The indices array is the input we want to compute the histogram for, and the buckets array is where we store the computed histogram (ie all of those counters).

for (int i = 0; i < N; ++i) {
    buckets[indices[i]]++;
}

Example input, output:

In this example, the computed histogram indicates that there are:

  • 1 zero, 1 one, 2 twos, 1 three, and no fours.
# INPUT ARRAY:
    index: 0 1 2 3 4
  indices: 3 2 2 0 1

# COMPUTED HISTOGRAM:
    index: 0 1 2 3 4
  buckets: 1 1 2 1 0
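
For reference, a self-contained, runnable version of this example with the input above; the array sizes, the assert, and the printing are only there to make the snippet standalone:

#include <assert.h>
#include <stdio.h>

int main(void) {
  enum { N = 5, NUM_BUCKETS = 5 };
  int indices[N] = {3, 2, 2, 0, 1};
  int buckets[NUM_BUCKETS] = {0};

  // The histogram loop from the motivating example.
  for (int i = 0; i < N; ++i)
    buckets[indices[i]]++;

  // Expected histogram for this input: 1 1 2 1 0
  int expected[NUM_BUCKETS] = {1, 1, 2, 1, 0};
  for (int i = 0; i < NUM_BUCKETS; ++i)
    assert(buckets[i] == expected[i]);

  for (int i = 0; i < NUM_BUCKETS; ++i)
    printf("%d ", buckets[i]);
  printf("\n");
  return 0;
}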

Problem with histogram and vectorization

The example code uses the values of indices (eg, indices[i]) as a hash function to find where to store the counts within the buckets array.

Because the compiler cannot determine at compile time the ranges of the pointers involved, or whether the loads and stores through buckets[indices[i]] overlap across iterations, these memory operations are treated as potentially aliasing, which disables vectorization.

Assuming vectorization went ahead, a pseudo-code IR would be something like:

%values = wide load from 'indices'
%prevCount = wide gather-load vector from 'buckets' using %values as indices
%newCount = add %prevCount, 1;
scatter the result (%newCount) back to relevant places (using %values as indices) in `buckets`

There is a correctness issue with the vectorized addition: each lane is not ‘aware’ of any counts of the same value performed by other lanes in the same iteration.

This is illustrated with a dry run on the example input, focusing on the value ‘2’, which is encountered twice.
The value 2 appears in lane 1 and lane 2, and assuming a vector length of 5 lanes, the entire input fits in one iteration.
Running the code:

  1. All values, including the two 2s, are wide loaded from indices.
  2. Then, the relevant elements are gathered from buckets, which are initially zero.
  3. The addition computes ‘0 + 1’ for both lanes holding a 2.
  4. Finally, the results are scattered back to the relevant places in buckets.
  • for counting the twos, there are two writes to the same index (ie buckets[2]).
  • according to the architecture manual, scatter stores write elements from least to most significant.
  • in this case it does not matter which lane is written first: the value 1 is written, and then overwritten with 1 again. This is wrong; the final result should have been 2 (see the scalar sketch below).
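
The wrong result can be reproduced with a scalar model of the naive vectorized sequence; the code below is a sketch (lane count and variable names are illustrative), not actual generated code:

#include <assert.h>

int main(void) {
  enum { VL = 5 };                        // assume a 5-lane vector, as above
  int indices[VL] = {3, 2, 2, 0, 1};
  int buckets[VL] = {0};
  int values[VL], prevCount[VL], newCount[VL];

  for (int lane = 0; lane < VL; ++lane)   // wide load from indices
    values[lane] = indices[lane];
  for (int lane = 0; lane < VL; ++lane)   // gather from buckets
    prevCount[lane] = buckets[values[lane]];
  for (int lane = 0; lane < VL; ++lane)   // per-lane add, unaware of other lanes
    newCount[lane] = prevCount[lane] + 1;
  for (int lane = 0; lane < VL; ++lane)   // scatter, least to most significant lane
    buckets[values[lane]] = newCount[lane];

  // Both lanes holding the value 2 gathered 0 and wrote back 1, so the
  // second write merely overwrites the first: buckets[2] is 1, not 2.
  assert(buckets[2] == 1);
  return 0;
}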

Instruction needed to support histogram counts

What is needed is an instruction that considers results from previous lanes when computing the count of the current lane.

The AArch64 SVE2 histcnt instruction does precisely that. With such an instruction, the pseudo-IR now looks like:

%values = wide load from 'indices'
%prevCount = wide gather-load vector from 'buckets' using %values as indices
%count = call @histogram.count(%values) ; <- CONSIDERS PREVIOUS LANES
%newCount = add %prevCount, %count;
scatter the result (%newCount) back to relevant places (using %values as indices) in `buckets`
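
To make the intended behaviour concrete, here is a scalar model of the per-lane semantics assumed for histogram.count: each lane receives the number of lanes with the same value up to and including itself. This is a sketch of the behaviour assumed in this RFC, not the exact histcnt semantics:

// Assumed semantics of histogram.count, modelled per lane in scalar C.
// For lane i, count[i] = number of lanes j <= i with values[j] == values[i].
static void histogram_count(const int *values, int *count, int vl) {
  for (int lane = 0; lane < vl; ++lane) {
    count[lane] = 0;
    for (int prev = 0; prev <= lane; ++prev)
      if (values[prev] == values[lane])
        count[lane]++;
  }
}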

Doing a dry run of just the final scatter store when using histogram.count:

  • write 1 to the relevant place in buckets.
  • then overwrite it with a 2, which is the correct result.
  • (Remember, scatter stores from least to most significant lanes)

3. Design

We propose the addition of an IR-level intrinsic that can then be lowered to target-specific instructions.

The intrinsic takes the input vector and a mask (to support predicated execution).

Targets can use TTI to inform the LoopVectorizer whether they support such an instruction; when they do, a target-specific instruction can be emitted in place of the generic IR-level intrinsic.

Initially, pattern matching can detect histograms and generate the relevant statistics. This information is useful for understanding input code, so it will be generated for any target regardless of hardware support.

A pattern should match the motivating example, which in essence consists of:

  • a binary operation that does the counting (it can be an add or even a sub) with a constant integer operand (it does not have to be +1 as in the example).
  • this counting updates previous counts within an array that is indexed through another array (eg, buckets[indices[i]]); a small variant is sketched right after this list.
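
For example, the following variant, which decrements by a constant other than 1, should also be recognized (N is assumed to be the input length, as in the motivating example):

// A variant the pattern should also match: a sub with a constant other
// than 1, still updating buckets through indices[i].
for (int i = 0; i < N; ++i)
  buckets[indices[i]] -= 2;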

For detected histograms, different VPlans will be generated as usual.
During legality, LoopAccessAnalysis can safely ignore any pair of pointers that relate to histogram computations, as those are taken care of by the intrinsic.

Loop interleaving will also be disabled programmatically for the VPlans that emit a histogram intrinsic, as there would otherwise be unhandled dependences between the interleaved copies.

To emit the intrinsic, a new VPInstruction will be used to generate the histogram intrinsic and the relevant add/sub binary operation when the plan is executed.

To support predication, the mask of the vectorized block will be re-used; otherwise, the histogram will use an all-ones mask to operate on all lanes.

The best VPlan recipe will be selected for execution according to the cost model.

Best regards,
Paschalis Mpeis, Graham Hunter


There are multiple ways to resolve the potential index conflicts and “count” is only one of them. Would this proposal work with the AVX512CD equivalents?

I would suggest considering a different approach that instead encompasses the entire histogram operation… e.g.

call @histogram(input/outputvector, rhs, mask)
or 
call @histogram(base, index, rhs, mask)

This has the advantage of being easily legalized on targets without count or even without gather/scatter. It also gives you a lot of flexibility in the backend in deciding what algorithm you want to choose for a given target even IF you have count or gather/scatter. To me, count is too specific for this unless there’s another practical use for it.

Hi Scott,

Thanks for your suggestion. Indeed, a more generalized approach would be better. Can you please elaborate on what rhs represents in your snippet?

EDIT: you must mean the amount, as in:

buckets[indices[i]]+=amount;

Regards,
Paschalis

Yes, I meant the “right hand side” of the expression that’s not the gather/scatter. You would probably need to carry this value in the intrinsic itself for the lowering sequence.


Hi, just coming back to this now. I’ve been implementing your suggestion, but have also been looking at vpconflict to see whether the original works.

Since vpconflict produces, in each lane, a bitvector of the matching (lower-numbered) lanes, you can do a popcnt of that result and add a splat of one to match the behaviour of histcnt. So we can do it with the originally proposed intrinsic.
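
A scalar sketch of that equivalence, assuming a vpconflict-like primitive that returns, per lane, a bitmask of equal lower-numbered lanes (the helper names are illustrative, and __builtin_popcount stands in for a vector popcount):

// Per-lane bitmask of equal lower-numbered lanes (a vpconflict-like model).
static unsigned conflict_mask(const int *values, int lane) {
  unsigned mask = 0;
  for (int prev = 0; prev < lane; ++prev)
    if (values[prev] == values[lane])
      mask |= 1u << prev;
  return mask;
}

// popcount of the conflict bits plus a splat of one gives, per lane, the
// number of equal lanes up to and including the current one, matching the
// histcnt-style count assumed earlier.
static int count_like_histcnt(const int *values, int lane) {
  return __builtin_popcount(conflict_mask(values, lane)) + 1;
}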

That also gives us more flexibility, so if the programmer implemented something like saturating adds for the buckets we would easily be able to handle that.

The drawback appears to be supporting interleaving. I’m not sure that we want to do that since these are relatively expensive operations, but I think we’d at least want to be able to if needed. The histcnt instruction itself would be fine since it takes a separate input register to match against, but vpconflict only considers one input register (plus mask). So we would need to change how interleaving works slightly in order to get the correct results – you can’t load the bucket values for the second (or subsequent) histogram operation before storing the result of the previous operation(s).

If you bundle the memory operations together with the histogram then that effectively just happens without extra work.

I’ll continue to prototype this, but it would be good to discuss the tradeoffs a bit.

Posted a proof of concept patch for the counter-proposed intrinsic for discussion: Add an all-in-one histogram intrinsic, along with lowering for AArch64 by huntergr-arm · Pull Request #88106 · llvm/llvm-project · GitHub

Added a full proof-of-concept patch using the originally proposed intrinsic to compare approaches: Add a standalone histogram intrinsic, use it to vectorize simple histograms by huntergr-arm · Pull Request #89366 · llvm/llvm-project · GitHub

Some of my thoughts about the tradeoffs between the two:

Vectorization flexibility

The combined intrinsic performs the memory operations as part of the intrinsic, so intermediate results can’t be used easily. We could potentially return a vector with the resulting counts, or the initial bucket values, or both. We could provide extra variants of intrinsic to do things like saturating adds. But those all increase the amount of code needed to handle the intrinsics and may still fall short. We have a downstream C example that features an early loop exit if the value in any bucket exceeds a certain threshold. I don’t know how realistic that example is, but the combined intrinsic is certainly less flexible in terms of the loops that could be vectorized. I think the standalone intrinsic wins on this point.
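
For reference, a hedged sketch of the kind of early-exit loop described above; the threshold and the exact exit condition are illustrative, not the actual downstream example:

// Histogram loop with an early exit once any bucket exceeds a threshold.
for (int i = 0; i < N; ++i) {
  if (++buckets[indices[i]] > THRESHOLD)
    break;
}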

LoopVectorize changes

The combined intrinsic will require some extra work in the vectorizer, since it needs to avoid emitting the load and store in their original positions and instead use the single intrinsic to represent multiple operations. The standalone intrinsic just takes the place of a normal widening recipe for the add or sub used when calculating the histogram.

However, if we want to support interleaving when vectorizing histograms, we would need extra work for both. For the standalone intrinsic, the normal interleave pattern of:

Gather1
Gather2
Histogram1
Histogram2
Scatter1
Scatter2

wouldn’t work, since Gather2 may have overlaps with Gather1 and would be using bucket values before the first update. We would need to change VPlan to emit the following for histograms instead:

Gather1
Histogram1
Scatter1
Gather2
Histogram2
Scatter2

If we don’t care about supporting interleaving with histograms (quite possible, since gathers and scatters are usually quite expensive) then the standalone intrinsic has less impact on LoopVectorize.

If we do want to support interleaving, then they both require some extra work and there’s no real advantage to either approach.

Default lowering

Both of these can be lowered for targets without an appropriate instruction, though the combined intrinsic is more straightforward (a scalar loop) than repeated extracts/shuffles/compares/inserts for the standalone intrinsic.
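
A sketch of what the scalar-loop fallback for the combined intrinsic could look like, assuming it carries the buckets pointer, the index vector, a per-lane increment, and a mask (the name and signature are assumptions for illustration only):

// Possible scalar expansion of a combined histogram intrinsic: for each
// active lane, add the increment to the addressed bucket, in lane order.
static void histogram_scalar_fallback(int *buckets, const int *indices,
                                      int inc, const _Bool *mask, int vl) {
  for (int lane = 0; lane < vl; ++lane)
    if (mask[lane])
      buckets[indices[lane]] += inc;
}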

We would be able to do the scalar loop in LoopVectorize if we insist on allowing interleaving for the standalone intrinsic.

I don’t think there’s a clear winner here.

Legalization

Legalizing the intrinsics via operand promotion is straightforward enough, but if we need to split the operands then the standalone intrinsic breaks – we can’t represent it easily with the instructions. The combined intrinsic will split the memory ops along with the histogram operation, and you can use the chain operand to enforce ordering of the lowered gathers and scatters once the intrinsic has been legalized.

The combined intrinsic wins here.

Personally, I’m leaning towards the standalone intrinsic due to the flexibility, but if the other points (especially the last) are more important then I can implement the combined intrinsic instead.