Difference in generated code between variadic parameter pack and manual version

Hi there folks,

I wonder if anybody can shed some light on this. I’m looking at a function with a parameter pack argument and one without, that should do the exact same thing.

https://godbolt.org/z/Keqzcj

However, the version with the parameter pack expands (at -O3 -march=broadwell, on clang 10.0.1, on godbolt) into a loop per 128 bytes, plus a loop per 64 bytes, plus nonvectorized instructions to process the remaining <=63 bytes. The manual version expands to just a loop per 128 bytes (256-bit vectors, unrolled 4x), and nonvectorized instructions to process the remaining <=127 bytes.

It’s not about the fold expression. I replaced the inner loop of the first function by:

auto tuple = std::make_tuple(input[i]...);
out[i] = std::get<0>(tuple) | std::get<1>(tuple) | std::get<2>(tuple);

And it generates the same code AFAICT.

It may be about restrict expansion for parameter pack arguments. But I don’t see how restrict could lead to these differences.

FWIW, my benchmarks seem to indicate that the variadic version is about 50% slower. I have no idea why. The instruction order in the inner loop is different, which may make a difference?

Any clues would be appreciated!

It’s about the fold expression.
https://godbolt.org/z/EPETj9

With C++17 fold-expressions, (args | ...) doesn’t mean (arg1 | arg2 | arg3); it means (arg1 | (arg2 | arg3)). So with the right-fold you wrote, you’re telling the compiler to OR the values together “right-to-left”, whereas the non-template version does it “left-to-right”: ((arg1 | arg2) | arg3). And apparently this makes a huge difference to the codegen (why it does is still a mystery to me, and out of my depth).

Switch the right-fold to a left-fold and the codegen becomes identical, at least to my eyes. (In the above Godbolt, put -DVARIADIC in one compiler frame and nothing in the other.)
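To make the grouping concrete, here is a minimal sketch. All function names are made up for illustration and are not taken from the Godbolt links; it assumes C++17. For | both folds give the same value, so subtraction is used to make the different grouping visible.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers, not the code behind the Godbolt links.

// Unary right fold: a | (b | c)
template <class... Ts>
constexpr std::uint64_t or_right(Ts... args) { return (args | ...); }

// Unary left fold: (a | b) | c, the same grouping as writing a | b | c by hand
template <class... Ts>
constexpr std::uint64_t or_left(Ts... args) { return (... | args); }

// | is associative, so both folds compute the same value.
static_assert(or_right(1u, 2u, 4u) == 7);
static_assert(or_left(1u, 2u, 4u) == 7);

// Subtraction is not associative, so the grouping becomes visible.
template <class... Ts>
constexpr int sub_right(Ts... args) { return (args - ...); }
template <class... Ts>
constexpr int sub_left(Ts... args) { return (... - args); }

static_assert(sub_right(10, 3, 2) == 9);  // 10 - (3 - 2)
static_assert(sub_left(10, 3, 2) == 5);   // (10 - 3) - 2

// Loop in the spirit of the thread, written with a left fold.
template <class... In>
void or_all_left(std::uint8_t* out, std::size_t n, const In*... in) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<std::uint8_t>((... | in[i]));
}
```

The only source-level change between the two variants is which side of the operator the ... sits on.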

–Arthur

That is just plain weird, and probably interesting for the codegen folks to look at. :) Thanks a lot for figuring this out!

We should probably discuss this on llvm-dev.
Arthur already pointed out the differences caused by Clang; judging from the generated IR, the two versions do not look substantially different at all (https://godbolt.org/z/6onnfh).

As was reported, this is a vectorizer "issue", probably some pattern matching gone wrong but maybe more.
Vectorizer remarks might already be useful here.
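For anyone who wants to try this, a rough sketch of how the remarks could be requested. The function below is a hypothetical stand-in with the right-fold grouping written out by hand, not the code from the Godbolt links; the file name is made up, and __restrict is a compiler extension (accepted by Clang and GCC), not standard C++.

```cpp
/* Compile with, for example:
 *   clang++ -std=c++17 -O3 -march=broadwell -c or3.cpp
 *       -Rpass=loop-vectorize
 *       -Rpass-missed=loop-vectorize
 *       -Rpass-analysis=loop-vectorize
 * (all on one command line). Clang then prints remarks about which loops
 * the loop vectorizer transformed, which it skipped, and why.
 */
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the variadic version: a | (b | c).
void or3(std::uint8_t* __restrict out,
         const std::uint8_t* __restrict a,
         const std::uint8_t* __restrict b,
         const std::uint8_t* __restrict c,
         std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<std::uint8_t>(a[i] | (b[i] | c[i]));
}
```

Comparing the remarks for this grouping against (a[i] | b[i]) | c[i] might show where the vectorizer's decisions diverge.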

~ Johannes