Writing built-ins for instructions returning multiple operands

I have written many builtins for our SHAVE processor which bind directly to our instructions, and making instructions that are not easily selectable by the compiler available to the programmer.

The majority of these are straight-forward enough, taking a small number pf input operands and returning a single result; for example ‘int __builtin_shave_mul(int, int)’ might map onto a simple multiple instruction that takes two input integer operands in registers, and returns a single integer result in another register.

However, I have a small number instructions that have two output operands, each in a separate register. I would like to provide access to these instructions using the builtins approach. This is easy enough to express in LLVM IR, but I have not been able to figure out how this can be presented in a C or C++ binding.

Is there a pattern for doing this kind of thing that I haven’t discovered, or is it simply something that cannot be expressed using the C binding to the builtins?

Thanks,

MartinO (Movidius Ltd.)

Kind of depends on what you actually get back.

For example, if I take the RDTSC instruction on x86, it returns a 64-bit value, but it’s in two registers, EAX and EDX as 32-bit values. The natural thing in that case is to let the builtin form a 64-bit value [either in EAX and EDX in the 32-bit compiler, as that happens to be the way 64-bit values are returned, or as a single 64-bit value by shifting/oring the two register values into one 64-bit return value].

However, if we have, say, an instruction that returns two distinct values (div that also gives the remainder, as a simple example), you will either have to return a (small) struct, or pass in a pointer to be filled in by the function [the latter is not ideal from an optimisation perspective, as the optimiser has a harder time knowing if the output is aliased with something else.

It’s important to differentiate the C builtin from the LLVM intrinsic. It’s generally more useable (and idiomatic) in C to have additional return values become arguments returned by pointer. It’s generally more useful in LLVM IR to have multiple return values as a struct. For an example, consider the overflow-checked builtins.

The following C for a function that multiplies two numbers and returns either the result or 0 on overflow:

unsigned int mul(unsigned int x, unsigned int y)
{
  unsigned int result;
  return __builtin_umul_overflow(x, y, &result) == 0 ? 0 : result;
}

This becomes some fairly complex IR, with the key part being:

  %result = alloca i32, align 4
...
  %5 = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 %3, i32 %4)
...
  %7 = extractvalue { i32, i1 } %5, 0
  store i32 %7, i32* %result, align 4

The SROA happily turns this entire function into:

define i32 @mul(i32 %x, i32 %y) #0 {
  %1 = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 %x, i32 %y)
  %2 = extractvalue { i32, i1 } %1, 1
  %3 = extractvalue { i32, i1 } %1, 0
  %4 = zext i1 %2 to i32
  %5 = icmp eq i32 %4, 0
  br i1 %5, label %6, label %7

; <label>:6 ; preds = %0
  br label %8

; <label>:7 ; preds = %0
  br label %8

; <label>:8 ; preds = %7, %6
  %9 = phi i32 [ 0, %6 ], [ %3, %7 ]
  ret i32 %9
}

SimplifyCFG then turns the branches into a single select:

define i32 @mul(i32 %x, i32 %y) #0 {
  %1 = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 %x, i32 %y)
  %2 = extractvalue { i32, i1 } %1, 1
  %3 = extractvalue { i32, i1 } %1, 0
  %4 = zext i1 %2 to i32
  %5 = icmp eq i32 %4, 0
  %. = select i1 %5, i32 0, i32 %3
  ret i32 %.
}

And instcombine gets rid of the redundant zext / icmp:

define i32 @mul(i32 %x, i32 %y) #0 {
  %1 = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 %x, i32 %y)
  %2 = extractvalue { i32, i1 } %1, 1
  %3 = extractvalue { i32, i1 } %1, 0
  %. = select i1 %2, i32 %3, i32 0
  ret i32 %.
}

TL;DR version: Just because you expose a builtin to C as something that takes a pointer doesn’t mean that the optimisers will struggle with it if you expose a sensible LLVM IR intrinsic.

David

Thanks for this feedback. Although my example (contrived) used an 'int', the actual instructions involved use vector operands so it’s a bit more tricky, but the approach you have outlined looks workable. I had been avoiding the notion of "pass-by-reference", but the transformations you have outlined should allow me to represent this using pointers or references in C/C++, but lower to the intended instruction and eliminate the implied indirection.

All the best,

  MartinO