[AARCH64][NEON] Do we need extra builtin for vmull_high_p64?

Folks, we encountered a problem: for the vmull_high_p64 intrinsic the PMULL2 instruction was not generated.
This happens because vmull_high_p64 is implemented through vmull_p64:

__ai poly128_t vmull_high_p64(poly64x2_t __p0, poly64x2_t __p1) {
  poly128_t __ret;
  __ret = vmull_p64((poly64_t)(vget_high_p64(__p0)), (poly64_t)(vget_high_p64(__p1)));
  return __ret;
}

__ai poly128_t vmull_p64(poly64_t __p0, poly64_t __p1) {
  poly128_t __ret;
  __ret = (poly128_t) __builtin_neon_vmull_p64(__p0, __p1);
  return __ret;
}

There is also a pattern to convert this into PMULL2:

def : Pat<(int_aarch64_neon_pmull64 (extractelt (v2i64 V128:$Rn), (i64 1)),
(extractelt (v2i64 V128:$Rm), (i64 1))),
(PMULLv2i64 V128:$Rn, V128:$Rm)>;

The problem is that ISel applies that pattern only when the corresponding IR is all within one basic block.
Some optimizations (e.g. loop-invariant code motion) can hoist the extract operations out of the current basic block.
As a result, PMULL2 is not used.
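As an illustration, here is a hand-written, simplified IR sketch (not taken from an actual reproducer) of what the code can look like after LICM: the extracts end up in the loop preheader while the pmull64 call stays in the loop body, so the single-block pattern above no longer sees its extractelt operands:

```
; Hypothetical IR after LICM. The extractelement instructions have been
; hoisted into the preheader, so when ISel processes the loop body it
; only sees the i64 values %hi.a and %hi.b, not the extracts, and the
; PMULLv2i64 pattern cannot match.
preheader:
  %hi.a = extractelement <2 x i64> %a, i64 1
  %hi.b = extractelement <2 x i64> %b, i64 1
  br label %loop

loop:
  ...
  %prod = call i128 @llvm.aarch64.neon.pmull64(i64 %hi.a, i64 %hi.b)
  ...
```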

GlobalISel could resolve that problem, but it does not handle this pattern yet and is switched on by default only at -O0.
Another way to get PMULL2 is to create a dedicated builtin for the vmull_high_p64 intrinsic.

Would it be OK to add an extra builtin for the vmull_high_p64 intrinsic to resolve this problem
(__builtin_neon_vmull_high_p64 / llvm.aarch64.neon.pmull_high_64)?

Thank you, Alexey.

For this specific sort of issue, we have some code in CodeGenPrepare::tryToSinkFreeOperands to try to rearrange the IR so the necessary instructions are in the same basic block.
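Roughly (a hand-written sketch of the idea, not the actual pass output): if the target reports the extract operands as free to sink, CodeGenPrepare duplicates them into the block of their user, restoring the single-block shape the ISel pattern expects:

```
; Sketch after operand sinking: copies of the extracts are placed next
; to the intrinsic call, so the PMULLv2i64 pattern can match again.
loop:
  %hi.a.sunk = extractelement <2 x i64> %a, i64 1
  %hi.b.sunk = extractelement <2 x i64> %b, i64 1
  %prod = call i128 @llvm.aarch64.neon.pmull64(i64 %hi.a.sunk, i64 %hi.b.sunk)
  ...
```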

If we can’t make that work, we could consider adding a separate intrinsic.