Hello community,

We would like to propose a mechanism that detects and safely vectorizes histogram computations. This is a work in progress, and a Proof-of-Concept Pull Request will be opened in the coming days.

# 1. Background:

### What is a histogram, and where is it used

Histogram is typically used in image processing and it ‘counts’ actual values of an image. Assuming a colored input image, a histogram would compute counters for occurrences of ‘red’, ‘green’, and ‘blue’ image values. In a typical RGB model, each color accepts values in the range of 0-255.

In our context, it’s just an input integer array and we want to count the number of occurrences of each element and store them in a separate array (ie the histogram). The input array has a consecutive number range, starting from zero, so there is no need for any advanced hash functions to find where to store each counter. Instead we can use the input values as indices for the histogram array (see motivating example code).

# 2. Motivating example:

The snippet below shows a simple, typical histogram computation. The `indices`

array is the input we want to compute the histogram for, and the `buckets`

is where we store the computed histogram (ie all those counters).

```
for (int i = 0; i < N; ++i) {
buckets[indices[i]]++;
}
```

### Example input, output:

In this example, the computed histogram indicates that there are:

- 1 zero, 1 one, 2 twos, 1 three, and no fours.

```
# INPUT ARRAY:
index: 0 1 2 3 4
indices: 3 2 2 0 1
# COMPUTED HISTOGRAM:
index: 0 1 2 3 4
buckets: 1 1 2 1 0
```

### Problem with histogram and vectorization

The example code is using the values of indices (eg, `indices[i]`

) as a hash function to find where to store the counts within the buckets array.

As the compiler cannot identify at static time what are the ranges of the pointers involved, these memory operations are found to be *aliasing*, which disables vectorization.

Assuming vectorization went ahead, a pseudo-code IR would be something like:

```
%values = wide load from 'indices'
%prevCount = wide gather-load vector from 'buckets' using %values as indices
%newCount = add %prevCount, 1;
scatter the result (%newCount) back to relevant places (using %values as indices) in `buckets`
```

There is a correctness issue with the vectorized addition: each lane is not ‘aware’ of any previous counts that happened on the same iteration.

This is illustrated with a dry run on the Example Input focusing on the value ‘2’ , which is encountered twice.

Value 2 is on lane 1, and lane 2, and assuming a vector length of 5 lanes, the entire vector fits in one iteration.

Running the Code:

- The 2 twos will be wide loaded initially from indices.
- Then, the relevant addresses will be gathered from buckets, which initially are zero.
- The addition, computes ‘0+1’ for both.
- Finally, the results will be scattered back to the relevant places in buckets.

- for counting the twos, there will be two writes on the same index (ie buckets[2]).
- according to the manual, the order of those writes will be from least to most significant element.
- in this case, it doesn’t really matter what is written first. It will write the value of 1, and then it will overwrite it again with 1. This is wrong. The final result should have been 2.

### Instruction needed to support histogram counts

What is needed is an instruction that considers results from previous lanes when computing the count of the current lane.

The AArch64 SVE2 histcnt Instruction does precisely that. With such an instruction, the pseudo-IR will now look like:

```
%values = wide load from 'indices'
%prevCount = wide gather-load vector from 'buckets' using %values as indices
%count = call @histogram.count(%values) ; <- CONSIDERS PREVIOUS LANES
%newCount = add %prevCount, %count;
scatter the result (%newCount) back to relevant places (using %values as indices) in `buckets`
```

Doing a dry run just for the final scatter store, using histogram.count would be:

- write 1 to the relevant place in buckets.
- then overwrite it with a 2, which is the correct result.
- (Remember, scatter stores from least to most significant lanes)

# 3. Design

We propose the addition of an IR-level Intrinsic, that could then be lowered to target specific ones.

That intrinsic can take the input vector and a mask (for supporting predicated executions).

Targets can use `TTI`

to inform `LoopVectorizer`

whether they support such an Instruction, and when they do they can emit a target-specific Instruction in place of the generic, IR-Level intrinsic.

Initially, with pattern matching, can detect histograms and generate relevant `STATISTICS.`

This information can be useful for understanding input codes, so it will be generated for any target regardless of hardware support.

A pattern should match the motivating example, which in essence consists of:

- a binary operation that does the counting (can be an add or even a sub) on constants integers (does not have to be a +1 like in the example).
- this counting updates previous counts within an array that uses another array to find relevant indices (eg,
`buckets[indices[i]]`

)

For detected histograms, different `VPlans`

will be generated as usual.

During legality, `LoopAccessAnalysis`

can safely ignore any pair of pointers that relate to histogram computations, as those are taken care by the intrinsic.

Loop interleaving will also be disabled programmatically for the VPlans that will emit a histogram Intrinsic, as there will be unmet dependencies between the interleaved copies.

To emit the Intrinsic, a new `VPInstruction`

will be used, to generate a Histogram Intrinsic and the relevant add/sub `BinOp`

upon plan’s execution.

To support predication, the `Mask`

for the vectorized block will be re-used, otherwise Histogram will use an All-Ones Mask to perform on all lanes.

The best VPlan Recipe will be selected for execution according to a `CostModel`

.

Best regards,

Paschalis Mpeis, Graham Hunter