Efficient Pattern matching in Instruction Combine

Hi,

All, Duncan, Rafael, David, Nick.

This is regarding pattern matching in InstructionCombine pass.

We use the ‘match’ functions in many places, but they do not do pattern matching as effectively as they could.

e.g., let's take the patterns:

(A ^ B) | ((B ^ C) ^ A) → (A ^ B) | C

(B ^ A) | ((B ^ C) ^ A) → (A ^ B) | C

Both patterns above are the same, since ‘^’ is commutative in Op0.

But a ‘match’ pattern has to be written for each of them separately, since ‘match’ identifies a pattern as written (like a regular expression) and doesn't have the logic to determine that ‘A ^ B’ and ‘B ^ A’ are the same.

I propose to improve the ‘match’ functions to include the logic for determining whether an operation is commutative. If it is commutative, then a single ‘match’ call should be able to detect both ‘A ^ B’ and ‘B ^ A’. So, basically, we will modify the commutative ‘match’ functions; a rough sketch of the idea follows.
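
Something like the following is what I have in mind (a rough sketch only; XorCommutative_match and m_XorC are made-up names modeled on the matcher style of llvm/IR/PatternMatch.h, not existing API):

#include "llvm/IR/InstrTypes.h"

using namespace llvm;

// Hypothetical commutative xor matcher: tries the operands as written,
// then the swapped order, so one call covers 'A ^ B' and 'B ^ A'.
template <typename LHS_t, typename RHS_t> struct XorCommutative_match {
  LHS_t L;
  RHS_t R;
  XorCommutative_match(const LHS_t &L, const RHS_t &R) : L(L), R(R) {}

  template <typename OpTy> bool match(OpTy *V) {
    auto *I = dyn_cast<BinaryOperator>(V);
    if (!I || I->getOpcode() != Instruction::Xor)
      return false;
    // Accept both operand orders with a single matcher.
    return (L.match(I->getOperand(0)) && R.match(I->getOperand(1))) ||
           (L.match(I->getOperand(1)) && R.match(I->getOperand(0)));
  }
};

template <typename LHS_t, typename RHS_t>
XorCommutative_match<LHS_t, RHS_t> m_XorC(const LHS_t &L, const RHS_t &R) {
  return XorCommutative_match<LHS_t, RHS_t>(L, R);
}

One caveat: binding sub-matchers such as m_Value can leave stale bindings behind when the first operand order partially matches, so a production version would need to handle re-binding carefully.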

There will be various permutations of this where one of the operands might be a constant (I guess this is handled already, as constants are re-associated to the RHS).

I will try to dig more on this.

Inputs/suggestions/comments on improving the match functions are most welcome. :slight_smile:

Regards,

Suyog

It'd be interesting if you could find a design that also treated these
the same:

    (B ^ A) | ((A ^ B) ^ C) -> (A ^ B) | C
    (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
    (B ^ A) | ((C ^ A) ^ B) -> (A ^ B) | C

I.e., `^` is also associative.
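
One possible direction (purely a sketch; collectXorLeaves is a made-up helper, not existing API, while m_Xor and m_Value are the real PatternMatch.h matchers) would be to flatten a reassociable `^` chain into its leaves, so that all of the above forms look identical:

#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/PatternMatch.h"

using namespace llvm;
using namespace llvm::PatternMatch;

// Hypothetical helper: collect the leaves of a chain of xors, so that
// (A ^ B) ^ C, (B ^ C) ^ A, and (C ^ A) ^ B all flatten to {A, B, C}.
// A real version would also need hasOneUse checks before looking
// through intermediate xors.
static void collectXorLeaves(Value *V, SmallVectorImpl<Value *> &Leaves) {
  Value *L, *R;
  if (match(V, m_Xor(m_Value(L), m_Value(R)))) {
    collectXorLeaves(L, Leaves);
    collectXorLeaves(R, Leaves);
    return;
  }
  Leaves.push_back(V);
}

Comparing the leaf multisets (after imposing some canonical order on them) would make commutativity and associativity collapse into one representation.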

Can we handle this by just having a canonical ordering? Or is that too difficult to maintain through various instcombines?

It seems to me that we could rejigger Reassociate to reduce the number of
possibilities that InstCombine could expect.

Hi Duncan, David, Sean.

Thanks for your replies.

> It'd be interesting if you could find a design that also treated these
> the same:
>
> (B ^ A) | ((A ^ B) ^ C) -> (A ^ B) | C

> (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
> (B ^ A) | ((C ^ A) ^ B) -> (A ^ B) | C
>
> I.e., `^` is also associative.

I agree with Duncan on including associative operations too.

> Can we handle this by just having a canonical ordering? Or is that too
> difficult to maintain through various instcombines?

Yes, it's the easiest way to do that. If I am not wrong, what Sean is suggesting is that if we get something like

(B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C

and we have written the pass for the pattern

(A ^ B) | ((B ^ C) ^ A) -> (A ^ B) | C

we could just swap 'B' and 'A' before matching the pattern, i.e., we do a canonical re-ordering.
(Sean, correct me if I am wrong in my understanding of what you meant by canonical ordering.)

You understood correctly.

I have seen in the "instcombine" module that std::swap is used many times for exactly this.

But it becomes very repetitive and somewhat irritating to keep doing this for every pattern.
Shouldn't we do this in a more generic way, where we do not have to worry about swapping operands?
(How to do it, I don't know for now. I looked at the matcher implementations for modification/improvement, but didn't find any clue.)
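
To illustrate the repetition, here is a sketch of matching (A ^ B) | ((B ^ C) ^ A) by hand. matchXorChain is a made-up helper; m_Xor, m_Value, and m_Specific are the real PatternMatch.h matchers:

#include "llvm/IR/Instructions.h"
#include "llvm/IR/PatternMatch.h"

using namespace llvm;
using namespace llvm::PatternMatch;

// Hypothetical helper: match Op0 | Op1 as (A ^ B) | ((B ^ C) ^ A) and
// return C on success. Because m_Specific pins already-bound values,
// the second match call is sensitive to operand order, so each swapped
// form needs its own match call (or a std::swap and a retry).
static Value *matchXorChain(Value *Op0, Value *Op1) {
  Value *A, *B, *C;
  if (match(Op0, m_Xor(m_Value(A), m_Value(B)))) {
    // As written: ((B ^ C) ^ A).
    if (match(Op1, m_Xor(m_Xor(m_Specific(B), m_Value(C)), m_Specific(A))))
      return C;
    // Swapped inner order: ((C ^ B) ^ A) -- must be spelled out again.
    if (match(Op1, m_Xor(m_Xor(m_Value(C), m_Specific(B)), m_Specific(A))))
      return C;
  }
  return nullptr;
}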

> It seems to me that we could rejigger Reassociate to reduce the number
> of possibilities that InstCombine could expect.

I agree with David that, in fact, it should be the Reassociate pass that handles this.

But the following is something interesting and rather worrying that I have found (I have omitted unimportant code in the example below):

e.g.

((~A & B) | A) --> (A | B)    ; code for this transform is implemented already

C code:

suyog@suyog-Inspiron-N5010:~$ cat 1.c
#include <stdio.h>
int cal(int a, int b) {
    return ((~a & b) | a);
}

int main() {
    int a, b;
    scanf("%d %d", &a, &b);
    return cal(a, b);
}

LLVM IR at O0:

suyog@suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O0 -emit-llvm 1.c

; Function Attrs: nounwind
define i32 @cal(i32 %a, i32 %b) #0 {
  %1 = xor i32 %a, -1
  %2 = and i32 %1, %b
  %3 = or i32 %2, %a
  ret i32 %3
}

; Function Attrs: nounwind
define i32 @main() #0 {
  ..
  ..
  %4 = call i32 @cal(i32 %2, i32 %3)
  ret i32 %4
}

LLVM IR after instcombine:

suyog@suyog-Inspiron-N5010:~$ Open/rbuild/bin/opt -S -instcombine 1.ll

; Function Attrs: nounwind
define i32 @cal(i32 %a, i32 %b) #0 {
  %1 = or i32 %a, %b
  ret i32 %1
}
..
..

which is OK, as per the expected transform.

LLVM IR after reassociate and instcombine:

suyog@suyog-Inspiron-N5010:~$ Open/rbuild/bin/opt -S -reassociate -instcombine 2.ll

; Function Attrs: nounwind
define i32 @cal(i32 %a, i32 %b) #0 {
  %1 = xor i32 %a, -1
  %2 = and i32 %b, %1
  %3 = or i32 %2, %a
  ret i32 %3
}
..
..

Reassociate converted (~a & b) into (b & ~a), and instcombine then failed to optimize.

As is evident, the reassociate pass sometimes actually harms the optimization opportunity!

Some more interesting findings on how this can affect the final machine code:

Test case 1.c:

#include <stdio.h>
int cal(int a, int b) {
    return ((~a & b) | a);
}

int main() {
    int a, b;
    scanf("%d %d", &a, &b);
    return cal(a, b);
}

X86 .s file with clang at O2 for the above program:

suyog@suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c

main: # @main
# BB#0:
    subl $28, %esp
    leal 20(%esp), %eax
    movl %eax, 8(%esp)
    leal 24(%esp), %eax
    movl %eax, 4(%esp)
    movl $.L.str, (%esp)
    calll __isoc99_scanf
    movl 20(%esp), %eax
    orl 24(%esp), %eax
    addl $28, %esp
    retl

As seen, the optimization happened at the IR level itself and is reflected in the .s file.

GCC output for the same:

suyog@suyog-Inspiron-N5010:~$ gcc -S -O2 1.c

main:
.LFB23:
    .cfi_startproc
    pushl %ebp
    .cfi_def_cfa_offset 8
    .cfi_offset 5, -8
    movl %esp, %ebp
    .cfi_def_cfa_register 5
    andl $-16, %esp
    subl $32, %esp
    leal 28(%esp), %eax
    movl %eax, 8(%esp)
    leal 24(%esp), %eax
    movl %eax, 4(%esp)
    movl $.LC0, (%esp)
    call __isoc99_scanf
    movl 24(%esp), %eax
    orl 28(%esp), %eax
    leave
    .cfi_restore 5
    .cfi_def_cfa 4, 4
    ret
    .cfi_endproc

GCC also did the optimization.

Now we just slightly flip the test case:

1.c test case:

#include <stdio.h>
int cal(int a, int b) {
    return ((b & ~a) | a);
}

int main() {
    int a, b;
    scanf("%d %d", &a, &b);
    return cal(a, b);
}

X86 .s file with clang at O2 for the above program:
suyog@suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c

main: # @main
# BB#0:
    subl $28, %esp
    leal 20(%esp), %eax
    movl %eax, 8(%esp)
    leal 24(%esp), %eax
    movl %eax, 4(%esp)
    movl $.L.str, (%esp)
    calll __isoc99_scanf
    movl 24(%esp), %ecx
    movl %ecx, %eax
    notl %eax
    andl 20(%esp), %eax
    orl %ecx, %eax
    addl $28, %esp
    retl

We lost the optimization opportunity here!

GCC output for the same:

suyog@suyog-Inspiron-N5010:~$ gcc -S -O2 1.c

main:
.LFB23:
    .cfi_startproc
    pushl %ebp
    .cfi_def_cfa_offset 8
    .cfi_offset 5, -8
    movl %esp, %ebp
    .cfi_def_cfa_register 5
    andl $-16, %esp
    subl $32, %esp
    leal 28(%esp), %eax
    movl %eax, 8(%esp)
    leal 24(%esp), %eax
    movl %eax, 4(%esp)
    movl $.LC0, (%esp)
    call __isoc99_scanf
    movl 24(%esp), %eax
    orl 28(%esp), %eax
    leave
    .cfi_restore 5
    .cfi_def_cfa 4, 4
    ret
    .cfi_endproc

GCC still optimized it!

It is clearly evident that LLVM is losing out because of its naive approach to pattern matching.

How can we improve this so that LLVM also recognizes commutative/associative patterns efficiently with a single 'matchers' call? I do not have any ideas so far. I will think more on it. Any help would be really helpful.

Any suggestions/comments/rectifications are most welcome :slight_smile:

I'm actually sort of wondering if instead of hardcoding all these bitwise
simplifications, we can just use a general boolean minimization algorithm.

Is there a reason we don't? Maybe it's just compile time? But eventually (maybe already?) adding enough special-case combines will be slower than using the general optimization.

-- Sean

Re-adding the mailing list (remember to hit “reply all”)

Thanks Sean for the reference.

I will go through it and see if i can implement it for generic boolean expression minimization.

Regards,
Suyog

Hi Suyog,

One thing to keep in mind though is what you are trying to get out of all
this work, as my suggestion might be a lot of work and there could be
simpler solutions that meet your immediate needs.

What do you see as the end-result for instcombine's capability with boolean expression simplification? Is there a particular goal that you are working towards? The same question goes for some of the other folks (coworkers of yours?) who have been contributing similar instcombines lately.

Also remember that basically anything that can be called a boolean
minimization algorithm is going to have a worst-case exponential run time
in the number of variables. An easy way to see this is that if your
algorithm can decide "this expression simplifies to just False", then you
can use it to solve SAT: if the expression simplifies to just False, then
it is unsatisfiable, otherwise it is satisfiable.

-- Sean Silva

Even if you can't implement such an algorithm sanely, ISTM that auto-generating this code from a table (or whatever), and choosing canonical results (to avoid a fixpoint issue), is still a better solution than what seems to be hand-written additions of every possible minimization on three variables, no?

At least then you wouldn't have human errors, and a growing file that
makes any of the non-trivial transforms easy to miss in the noise.

Even I realize now that it might take some time. Actually, I am trying to get accustomed to the whole LLVM infrastructure by visiting every module and submitting patches wherever I can find a good opportunity to improve the code.

While visiting the instcombine pass and submitting patches for a few boolean patterns, I somehow found it inconvenient to manually code for each and every permutation of a pattern. Hence, my original discussion was about how we can improve the matchers to reduce the code size (which might improve compile time as well). The 'Reassociate' pass should do this, but as demonstrated earlier, it might complicate things, restricting 'instcombine' from performing efficiently.

One way would be to write instcombine patches keeping in mind the various optimization opportunities created by the reassociation pass.

e.g., reassociate re-orders '~' to the RHS, so ~a & b will be converted to b & ~a.

Hence, whenever we write instcombine code, we should write
match(V, m_And(m_Value(B), m_Not(m_Value(A))))
instead of
match(V, m_And(m_Not(m_Value(A)), m_Value(B))),
as in the sketch below.
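
For example, a small sketch of coding against the reassociated form (foldAndNotOr is a made-up name; the matchers and BinaryOperator::CreateOr are real LLVM API):

#include "llvm/IR/Instructions.h"
#include "llvm/IR/PatternMatch.h"

using namespace llvm;
using namespace llvm::PatternMatch;

// Hypothetical combine: ((B & ~A) | A) --> (A | B). Note the Not is
// matched on the RHS of the 'and', which is where reassociate moves it.
static Instruction *foldAndNotOr(BinaryOperator &Or) {
  Value *Op0 = Or.getOperand(0), *Op1 = Or.getOperand(1);
  Value *A, *B;
  if (match(Op0, m_And(m_Value(B), m_Not(m_Value(A)))) && A == Op1)
    return BinaryOperator::CreateOr(A, B);  // caller inserts the new 'or'
  return nullptr;
}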

A second approach might be to improve the matchers themselves so that a single call is sufficient to identify both a & b and b & a (not sure at this time how we can do this).

Thinking along similar lines, the discussion got extended to general boolean minimization :slight_smile:

My immediate goal is to improve the matching mechanism.

Regards,
Suyog

That's exactly what I thought: basically, to auto-generate the minimized expression instead of hand-writing code for every pattern.

A few links I found:

http://sontrak.com/

http://sourceforge.net/projects/truthtablesolve/

This might take much more time to implement; my immediate goal is to improve the matching capabilities.

Regards,
Suyog

Daniel's suggestion could actually be done pretty easily. (Assume for now that we are most interested in boolean expressions over 3 variables.) There are two parts:

1. Add some code that evaluates a given 3-arg boolean expression at each of
the 2^3 == 8 values, creating an 8-bit index.

2. Create a table of 2^8 == 256 entries (not *too* much work to do by hand)
which has the "preferred" representation for each boolean function of 3
variables.

Then, you just use the 8-bit index to look in the table for the "preferred"
form of the expression. This approach could get a lot of "bang for your
buck" without much code complexity or implementation difficulty.

The same could be done for 2-variable expressions easily (might be a good starting point). For 4-variable expressions, it might be more difficult since the table would need 2^(2^4) == 64k entries.

-- Sean Silva

Ok. Sounds good. I will try to come up with a patch. Thanks for the
suggestions.