Modular arithmetic processors

I’ve been playing around with LLVM to write a backend for a rather “simple” (co-)processor. Assume that only three arithmetic instructions exist: ADD mod N, SUB mod N and MUL mod N. The modulus N is programmable and stored in a register. No ordinary arithmetic instructions are available. The word size is 256-bit.

In other words, the following function, b + c mod N, corresponds to only one instruction on my target machine, given the modulus N set by another instruction to an auxiliary register.

this should translate to one instruction like: r3 = add r1, r2

n is extended to 257-bit for simplicity

define i256 @add(i256 %b, i256 %c, i257 %n) {
%1 = zext i256 %b to i257
%2 = zext i256 %c to i257
%3 = add i257 %2, %1
%4 = icmp uge i257 %3, %n
%5 = select i1 %4, i257 %n, i257 0
%6 = sub i257 %3, %5
%7 = trunc i257 %6 to i256
ret i256 %7
}

The ultimate goal is to use LLVM to optimize and emit machine code for programs full of modular addition, subtraction and multiplication. For example, I might want off-the-shelf LLVM passes to optimize (a+b+c)-(a+c) mod p = b mod p for me. For a large syntax tree, say more than 10,000 nodes, I might even need common subexpression elimination. Of course, we can target the same program to ARM or X86 for performance comparison.

The questions is: We have to recognize the patterns for modular arithmetic from optimized IR and translate into corresponding instructions. After investigation, I think LLVM can not do a good job on such a simple but unordinary instruction set. Is it possible that the patterns still recognizable after optimization? On the whole, is this viable using LLVM or is it better to write a compiler for this particular DSL from scratch?

Thanks,
Shang-Yi Yang

Hi,

My personal opinion: Just to be sure I understand what you’re considering: you want to write a backend that will produce optimized machine code for a device with modular arithmetic instructions (not simulate such a device on a standard CPU)? In which case, won’t the same assumptions that are embodied in the transformations for the case of unsigned 2’s complement arithmetic (in your case with 256 bit unsigned words) with wraparound result in correct code? I suspect two issues will come up:

  1. The optimizations in LLVM which can reason about “useful for optimization” features of wrapping unsigned values are probably not particularly strong at the moment, since most interest tends to be in the case of optimizations that are valid in the case it’s known there won’t be a wraparound (eg, monotonically increasing of the result of adding elements). This is going to be just a lack of implementation rather than a fundamental issue.

  2. It’s not clear from your description whether changing the value in the “N register” is a very infrequent event or not. If it happens so infrequently you can get away with just putting compiler barriers before and after the change that would fit reasonably. On the other hand if it’s changed frequently getting performance might require allowing the optimizer to move instructions around significantly enough that it’s necessary to keep track of what the N register should be for each instruction. LLVM doesn’t really have any built-in notion of this, so you’re going to have to make significant changes to the compiler to do that. Whether this means writing your own compiler would be less work I don’t know.

Hope this helps,

Thanks for your insightful suggestions.

Yes, I am programming for a real device that does modular arithmetic (and
only modular arithmetic). The modulus N is fixed during a single launch of
a program. One way I could also come up with is to simply use add i256 %a,
%b to represent a + b mod n, and let LLVM passes to reason about possible
optimizations. However these are not semantically identical and it may
produce a wrong result, though in a rather low possibility.

Here is another issue I have been thinking these days. As far as I know,
legalizing a DAG node may expand the node to fit into the ISA. What happens
if a node is not representable by any combination of the instructions from
the ISA? An extremely opposite situation is that, two nodes must have some
particular semantic in the DAG so it can be translate to a single
instruction. For example, if MUL a, b of the ISA actually does a * b *
12345. So only

%1 = mul i256 %a, %b
%2 = mul i256 %1, i256 12345

can be translated into a legal instruction (and they may be scattered
around the DAG after some passes), but I think

%1 = mul i256 %a, %b

might fail to match. Is there any workaround?

Shang-Yi

Thanks for your insightful suggestions.

Yes, I am programming for a real device that does modular arithmetic (and
only modular arithmetic). The modulus N is fixed during a single launch of
a program. One way I could also come up with is to simply use add i256 %a,
%b to represent a + b mod n, and let LLVM passes to reason about possible
optimizations. However these are not semantically identical and it may
produce a wrong result, though in a rather low possibility.

Here is another issue I have been thinking these days. As far as I know,
legalizing a DAG node may expand the node to fit into the ISA. What happens
if a node is not representable by any combination of the instructions from
the ISA? An extremely opposite situation is that, two nodes must have some
particular semantic in the DAG so it can be translate to a single
instruction. For example, if MUL a, b of the ISA actually does a * b *
12345. So only

%1 = mul i256 %a, %b
%2 = mul i256 %1, i256 12345

can be translated into a legal instruction (and they may be scattered
around the DAG after some passes), but I think

%1 = mul i256 %a, %b

might fail to match. Is there any workaround?

I'm afraid you've rather lost me here: is your instruction (which we'll

call I for convenience) something where the "a * b * 12345" is all visible
at the user level, or is it something that should be conceptualised as
"multiplication of a and b within a finite field" and the 12345 is just an
implementation detail that the compiler just has to orchestrate?

In any case, having machine instructions which come from more than one IR
instruction is possible in LLVM, see eg how the use of fma (fused
multiply-add) instructions (when allowed by floating point operations) can
occur:

http://llvm.org/docs/CodeGenerator.html#id44

This is in the context of an optimization, I don't know how things would
work out in practice if there's no simpler instructions around as a
fallback if matching doesn't occur.

Cheers,
Dave