Optimizing math code

Hello LLVM-dev,

I’m writing some crypto math code in C, and trying to make it as simple and portable as possible, so that there won’t be any unforeseen platform bugs. Ideally, I’d like the compiler to as much work as possible for me, and I’m wondering if you know how to make clang (or gcc, for that matter) do some of this. So I have a couple of questions. If this is the wrong list to ask for LLVM+Clang questions, please point me to a better one :slight_smile:

First, addition. I have multiprecision integer objects, and I’d like to add them component-wise (likewise, subtract, negate, mask…). For example:

struct mp {
  int limb[8];
} __attribute__((aligned(32))) ;

void add(struct mp *a, const struct mp *b, const struct mp *c) {
    for (int i=0; i<8; i++) a->limb[i] = b->limb[i] + c->limb[i];
}

When compiling this code on an architecture with a vector unit, I’d ideally like to see it get vectorized. With AVX2, the whole function can be done in 3 instructions (load, add, store), or as little as 1 instruction if it’s inlined and the inputs are already in registers. But clang doesn’t vectorize it. Is there a simple way to get it to vectorize portably? I’d like to port to ARM NEON as well as SSE and AVX, and I’d like to be compatible with GCC, so I’d rather not use ext_vector_length if I can avoid it. If I can’t avoid it, I can cobble something together with a big vector_intrinsics.h file or something.

You might think that vectorizing a 4-long loop wouldn’t matter, but of course field addition is inlined and called all the time.

GCC 4.7.2 vectorizes the above code, but bails to scalar code if a==b or a==c, which isn’t necessary. It doesn’t check this if you declare the arrays restrict, but in fact the function is called with a=b=c, so restrict would be wrong.

A second issue is multiplication. Even in a trivial case, I can’t convince either clang-3.3 or gcc-4.7.2 to unroll a nested loop with a fixed trip count:

void mul(struct foo *a, const struct foo *b, const struct foo *c) {
    for (int i=0; i<2; i++) {
      for (int j=0; j<=i; j++) {
        a->limb[i] += b->limb[j] * c->limb[j-i];
      }
    }
}

On GCC, -funroll-loops makes the code much worse by duplicating the inner loop body many times. Clang doesn’t do that, but it still doesn’t unroll the inner loop. Is there a way to automatically unroll small, fixed, nested loops?

Thanks for your time,
— Mike

If the struct represents an integer, shouldn't the add take the carry into account?

Yes. However, I’m using a redundant representation, with 56-bit digits in a 64-bit limb. (The real code uses uint64_t’s instead of ints.) That way up to 8 bits of carries can accumulate, and I can propagate them all at a more opportune time.

Cheers,
— Mike

Clang3.3 is really old. Please try the latest LLVM.

Same result on clang version 3.5 (trunk 201542).

-- Mike

Two things to note here:

1) Your representation makes it very hard to vectorise. If you were doing the accumulate on every cycle, then you wouldn't even need to treat them as a vector and they could simply be folded to the largest integer type that the target supports, so if the vector registers can contain 128-bit integers that would work too.

2) By deferring the carry calculation, there is a significant chance that you are inserting a timing attack into your crypto code. I'm finding it quite difficult to think of ways that it is possible to compile code using this representation that doesn't leak information about the key to sufficiently patient attackers. You might like to look up some of the problems with 64-bit multiply on older ARM systems for an idea why - many of those theoretical attacks are now practical for a remote attacker in a couple of hours.

David
(Who is glad he doesn't have to write crypto code)

Yes. However, I’m using a redundant representation, with 56-bit digits in a 64-bit limb. (The real code uses uint64_t’s instead of ints.) That way up to 8 bits of carries can accumulate, and I can propagate them all at a more opportune time.

Two things to note here:

1) Your representation makes it very hard to vectorise. If you were doing the accumulate on every cycle, then you wouldn't even need to treat them as a vector and they could simply be folded to the largest integer type that the target supports, so if the vector registers can contain 128-bit integers that would work too.

On the contrary, if I were doing the accumulate on every cycle, it would be nearly impossible to vectorize. As it is, the code is easy to vectorize: it's running fine with __attribute__((ext_vector_length(...))). I just want something more portable. Like a "for" loop which is equivalent to a vector op.

I'm not aware of any platform which supports 128-bit integer ops in its vector registers... what target were you thinking of?

2) By deferring the carry calculation, there is a significant chance that you are inserting a timing attack into your crypto code. I'm finding it quite difficult to think of ways that it is possible to compile code using this representation that doesn't leak information about the key to sufficiently patient attackers.

Redundant encodings are especially popular with implementors who are worried about side-channel attacks. This is because with a standard encoding, you can't check if your carry chain is done without exposing yourself to a timing attack. So you have to assume that the carries propagate for as long as possible. Even if you're just adding 1, you have to propagate the carry all the way. But with a redundant encoding, it doesn't propagate at all.

Likewise, the reductions in my code aren't done by testing to see if overflow is about to occur. Instead, they're done by a static analysis (in an external program) which computes when the numbers might overflow, and where reductions should be placed to prevent this. So I'm hoping that I can avoid screwing this property up over a couple thousand lines.

The reductions themselves also take constant time. They basically just shuffle 8 bits up to the next word, and do an add. The resulting number still isn't perfectly reduced -- that's much more expensive -- but the carry on each limb is now at most 1. This can also be vectorized, but it's more annoying because of the shuffles involved.

You might like to look up some of the problems with 64-bit multiply on older ARM systems for an idea why - many of those theoretical attacks are now practical for a remote attacker in a couple of hours.

Yeah, I'll have to be careful about how things get reduced on ARM. In particular, I'm using PCMPEQD in some places, which is constant-time, but the most obvious translation of it to a vectorless platform is not constant-time. And I'll have to make sure that if there are any 64-bit ops left, they end up constant-time. Likewise, I need to either make asm intrinsics for any true 128-bit ops that I want to do on a 64-bit platform, or else find a way to assure myself that they consistently emit ADC and not a jump.

On the other hand, I'm not targeting ARM platforms with a variable-time multiply instruction, because that's a pain.

Cheers,
-- Mike