memcmp code fragment

Hi,

Look at the following code:

Look at the following C code seqence:

unsigned char mainGtU ( unsigned int i1,
unsigned int i2,
unsigned char* block)
{
unsigned char c1, c2;
c1 = block[i1]; c2 = block[i2];

if (c1 != c2) return (c1 > c2);
i1++; i2++;

c1 = block[i1]; c2 = block[i2];
if (c1 != c2) return (c1 > c2);
i1++; i2++;


<repeat 12 times>

In LLVM IR it will be following:

define i8 @mainGtU(i32 %i1, i32 %i2, i8* readonly %block, i16* nocapture readnone %quadrant, i32 %nblock, i32* nocapture readnone %budget) local_unnamed_addr #0 {
entry:
%idxprom = zext i32 %i1 to i64
%arrayidx = getelementptr inbounds i8, i8* %block, i64 %idxprom
%0 = load i8, i8* %arrayidx, align 1
%idxprom1 = zext i32 %i2 to i64
%arrayidx2 = getelementptr inbounds i8, i8* %block, i64 %idxprom1
%1 = load i8, i8* %arrayidx2, align 1
%cmp = icmp eq i8 %0, %1
br i1 %cmp, label %if.end, label %if.then

if.then: ; preds = %entry
%cmp7 = icmp ugt i8 %0, %1
br label %return

if.end: ; preds = %entry
%inc = add i32 %i1, 1
%inc10 = add i32 %i2, 1
%idxprom11 = zext i32 %inc to i64
%arrayidx12 = getelementptr inbounds i8, i8* %block, i64 %idxprom11
%2 = load i8, i8* %arrayidx12, align 1
%idxprom13 = zext i32 %inc10 to i64
%arrayidx14 = getelementptr inbounds i8, i8* %block, i64 %idxprom13
%3 = load i8, i8* %arrayidx14, align 1
%cmp17 = icmp eq i8 %2, %3
br i1 %cmp17, label %if.end25, label %if.then19

if.then19: ; preds = %if.end
%cmp22 = icmp ugt i8 %2, %3
br label %return



<repeats 12 times>

This code sequence can be collapsed into call to memcmp and we can get rid of basic blocks. I have written a small peephole optimization for squenece of instructions that identifies
branch termiantor, compare, load, gep etc and converts them to a call to memcmp. This small pass gave me improvement of 67% on SPEC2000 bzip2 on X86.

Is there a better idea, other than small peephole pass on IR to optimize this code?

Sirish

There is LoopIdiomRecognize which does transformations like this, but
only for loops, not unrolled code like your example.

It would be very cool if we could somehow make that pass also
recognize unrolled patterns, both for memcmp, and other operations.

I don't have any specific ideas for how to do that, but the
improvement you saw suggests it might be very worthwhile :slight_smile:

Thanks Hans.

I did look at LoopIdiomRecognize and also Loop rerolling - but like you said, they only work on loop. And having a control flow makes it a bit complicated. I have test cases that help in X86 and Aarch64. I will put up my patch for review and you guys can have a look.

Another thing, I noticed is that there is no __builtin_memcmp in llvm. Maybe that will be good to have as well.

Sirish

i think you’d have to write some idiom recognizer here. The ones we have won’t do it.

I guess my other question would be how commonly this happens. If it’s common and matters a lot, awesome.
I wouldn’t do it just to fix SPEC :stuck_out_tongue:

(people who care about bzip2 speed probably use any of the faster bzip2 impls :P)

Danny, do you know of forks of bzip2 that are more efficient than bzip.org?
I haven't seen one.

Sirish is going to send a patch to Julian Seward and try to get the change
in a new release of bzip2, and from there we may need to ask the AOSP
folks to update bzip2. AOSP has the last released bzip2 from Sept 2010.

For SPEC benchmarketing, doing the right thing wouldn't help.
Maybe we should put this idiom and others under a -fspec flag :wink:

Sebastian

> i think you'd have to write some idiom recognizer here. The ones we have
> won't do it.
>
> I guess my other question would be how commonly this happens. If it's
common
> and matters a lot, awesome.
> I wouldn't do it just to fix SPEC :stuck_out_tongue:
>
> (people who care about bzip2 speed probably use any of the faster bzip2
> impls :P)
>

Danny, do you know of forks of bzip2 that are more efficient than bzip.org
?
I haven't seen one.

They aren't forks so much as replacements.

An example: lbzip2

I know that there are a number of bzip2 impls that use faster versions of
the block sorting algorithm (or alternatives) etc.

The bzip2 (and how common this pattern is) discussion aside, I’m wondering why this transform is actually correct.

For example, in any of the blocks above, if c2 > c1 the function will return 0 - which is clearly different from what memcmp does. Or am I just missing something super obvious when reading the code.

I would do this, but might require heroic effort to recognize it. legality is already questionable allowing full array access (even if memcmp is lowered to conditional early exit just like original code). I don’t know if you can just turn original code into memcmp based on http://en.cppreference.com/w/c/string/byte/memcmp

auto a = memcmp(&block[i1], block[i2], 14);if (a == 0)
return ;
else
return a > 0;

Kevin