On Loop Distribution pass

Dear community,

Our team at IITH have been experimenting with loop-distribution pass in LLVM. We see the following results on few benchmarks.

clang -O3 -mllvm -enable-loop-distribute -Rpass=loop-distribute file.c

clang -O3 -mllvm -enable-loop-distribute -Rpass-analysis=loop-distribute file.c

TORCH:

There are nearly 488 loops in this benchmark. LLVM was not able to distribute any loop.

TSVC:

There are 151 loops coded in plain ‘C’, none of them got distributed. TSVC has particularly candidates valid for distribution like the one below. The inner loop in this example can be distributed.

for (int nl = 0; nl < ntimes/2; nl++) {

for (int i = 1; i < LEN; i++) {

a[i] += c[i] * d[i];

b[i] = b[i - 1] + a[i] + d[i];

}

dummy(a, b, c, d, e, aa, bb, cc, 0.);

}

MiBench:

There are 6539 loops in MiBench. None of the loops were distributed by the loop-distribution pass of LLVM.

CoMD:

CoMD is a reference implementation of typical classical molecular dynamics algorithms and workloads.

Out of 112 loops none of them are distributed.

For the specific loop I have shown above which is straight forward a good distribution candidate, the remark is “loop not distributed: memory operations are safe for vectorization [-Rpass-analysis=loop-distribute]”.

Can someone reason these results and this remark?

I humbly request the community to correct me, if am missing something in my analysis.

From: "Dangeti Tharun kumar via llvm-dev" <llvm-dev@lists.llvm.org>
To: llvm-dev@lists.llvm.org
Cc: "Santanu Das" <cs15mtech11018@iith.ac.in>
Sent: Sunday, October 9, 2016 12:09:01 PM
Subject: [llvm-dev] On Loop Distribution pass

Dear community,

Our team at IITH have been experimenting with loop-distribution pass
in LLVM. We see the following results on few benchmarks.

clang -O3 -mllvm -enable-loop-distribute -Rpass=loop-distribute
file.c
clang -O3 -mllvm -enable-loop-distribute
-Rpass-analysis=loop-distribute file.c

TORCH :
There are nearly 488 loops in this benchmark. LLVM was not able to
distribute any loop.

TSVC :
There are 151 loops coded in plain ‘C’, none of them got distributed.
TSVC has particularly candidates valid for distribution like the one
below. The inner loop in this example can be distributed.

for (int nl = 0; nl < ntimes/2; nl++) {
for (int i = 1; i < LEN; i++) {
a[i] += c[i] * d[i];
b[i] = b[i - 1] + a[i] + d[i];
}
dummy(a, b, c, d, e, aa, bb, cc, 0.);
}

MiBench :
There are 6539 loops in MiBench. None of the loops were distributed
by the loop-distribution pass of LLVM.

CoMD :
CoMD is a reference implementation of typical classical molecular
dynamics algorithms and workloads.
Out of 112 loops none of them are distributed.

For the specific loop I have shown above which is straight forward a
good distribution candidate, the remark is " loop not distributed:
memory operations are safe for vectorization
[-Rpass-analysis=loop-distribute] ".

Can someone reason these results and this remark?

The current loop distribution implementation is focused on enabling vectorization. The comment at the top of the file says:

// This file implements the Loop Distribution Pass. Its main focus is to
// distribute loops that cannot be vectorized due to dependence cycles. It
// tries to isolate the offending dependences into a new loop allowing
// vectorization of the remaining parts.

Hopefully, this helps explain the diagnostic. Regarding why the loop you highlight is not distributed to enable the vectorization of the a[i] update, I don't know. Adam?

-Hal


From: “Dangeti Tharun kumar via llvm-dev” <llvm-dev@lists.llvm.org>
To: llvm-dev@lists.llvm.org
Cc: “Santanu Das” <cs15mtech11018@iith.ac.in>
Sent: Sunday, October 9, 2016 12:09:01 PM
Subject: [llvm-dev] On Loop Distribution pass

Dear community,

Our team at IITH have been experimenting with loop-distribution pass in LLVM. We see the following results on few benchmarks.

clang -O3 -mllvm -enable-loop-distribute -Rpass=loop-distribute file.c
clang -O3 -mllvm -enable-loop-distribute -Rpass-analysis=loop-distribute file.c

TORCH:
There are nearly 488 loops in this benchmark. LLVM was not able to distribute any loop.

TSVC:
There are 151 loops coded in plain ‘C’, none of them got distributed. TSVC has particularly candidates valid for distribution like the one below. The inner loop in this example can be distributed.

for (int nl = 0; nl < ntimes/2; nl++) {
for (int i = 1; i < LEN; i++) {
a[i] += c[i] * d[i];
b[i] = b[i - 1] + a[i] + d[i];
}
dummy(a, b, c, d, e, aa, bb, cc, 0.);
}

MiBench:
There are 6539 loops in MiBench. None of the loops were distributed by the loop-distribution pass of LLVM.

CoMD:
CoMD is a reference implementation of typical classical molecular dynamics algorithms and workloads.
Out of 112 loops none of them are distributed.

For the specific loop I have shown above which is straight forward a good distribution candidate, the remark is “loop not distributed: memory operations are safe for vectorization [-Rpass-analysis=loop-distribute]”.

Can someone reason these results and this remark?

The current loop distribution implementation is focused on enabling vectorization. The comment at the top of the file says:

// This file implements the Loop Distribution Pass. Its main focus is to
// distribute loops that cannot be vectorized due to dependence cycles. It
// tries to isolate the offending dependences into a new loop allowing
// vectorization of the remaining parts.

Hopefully, this helps explain the diagnostic. Regarding why the loop you highlight is not distributed to enable the vectorization of the a[i] update, I don’t know. Adam?

Yeah, it’s not immediate clear to me why we’re not distributing that. I think it’s s221() from TSVC. I will take a look later.

Adam

From: "Adam Nemet" <anemet@apple.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Dangeti Tharun kumar" <cs15mtech11002@iith.ac.in>, "Santanu Das"
<cs15mtech11018@iith.ac.in>, llvm-dev@lists.llvm.org
Sent: Monday, October 10, 2016 4:52:18 PM
Subject: Re: [llvm-dev] On Loop Distribution pass

> > From: "Dangeti Tharun kumar via llvm-dev" <
> > llvm-dev@lists.llvm.org
> > >
>

> > To: llvm-dev@lists.llvm.org
>

> > Cc: "Santanu Das" < cs15mtech11018@iith.ac.in >
>

> > Sent: Sunday, October 9, 2016 12:09:01 PM
>

> > Subject: [llvm-dev] On Loop Distribution pass
>

> > Dear community,
>

> > Our team at IITH have been experimenting with loop-distribution
> > pass
> > in LLVM. We see the following results on few benchmarks.
>

> > clang -O3 -mllvm -enable-loop-distribute -Rpass=loop-distribute
> > file.c
>

> > clang -O3 -mllvm -enable-loop-distribute
> > -Rpass-analysis=loop-distribute file.c
>

> > TORCH :
>

> > There are nearly 488 loops in this benchmark. LLVM was not able
> > to
> > distribute any loop.
>

> > TSVC :
>

> > There are 151 loops coded in plain ‘C’, none of them got
> > distributed.
> > TSVC has particularly candidates valid for distribution like the
> > one
> > below. The inner loop in this example can be distributed.
>

> > for (int nl = 0; nl < ntimes/2; nl++) {
>

> > for (int i = 1; i < LEN; i++) {
>

> > a[i] += c[i] * d[i];
>

> > b[i] = b[i - 1] + a[i] + d[i];
>

> > }
>

> > dummy(a, b, c, d, e, aa, bb, cc, 0.);
>

> > }
>

> > MiBench :
>

> > There are 6539 loops in MiBench. None of the loops were
> > distributed
> > by the loop-distribution pass of LLVM.
>

> > CoMD :
>

> > CoMD is a reference implementation of typical classical molecular
> > dynamics algorithms and workloads.
>

> > Out of 112 loops none of them are distributed.
>

> > For the specific loop I have shown above which is straight
> > forward
> > a
> > good distribution candidate, the remark is " loop not
> > distributed:
> > memory operations are safe for vectorization
> > [-Rpass-analysis=loop-distribute] ".
>

> > Can someone reason these results and this remark?
>

> The current loop distribution implementation is focused on enabling
> vectorization. The comment at the top of the file says:

> // This file implements the Loop Distribution Pass. Its main focus
> is
> to

> // distribute loops that cannot be vectorized due to dependence
> cycles. It

> // tries to isolate the offending dependences into a new loop
> allowing

> // vectorization of the remaining parts.

> Hopefully, this helps explain the diagnostic. Regarding why the
> loop
> you highlight is not distributed to enable the vectorization of the
> a[i] update, I don't know. Adam?

Yeah, it’s not immediate clear to me why we’re not distributing that.
I think it’s s221() from TSVC. I will take a look later.

Yes, it looks like s221.

-Hal


From: “Adam Nemet” <anemet@apple.com>
To: “Hal Finkel” <hfinkel@anl.gov>
Cc: “Dangeti Tharun kumar” <cs15mtech11002@iith.ac.in>, “Santanu Das” <cs15mtech11018@iith.ac.in>, llvm-dev@lists.llvm.org
Sent: Monday, October 10, 2016 4:52:18 PM
Subject: Re: [llvm-dev] On Loop Distribution pass


From: “Dangeti Tharun kumar via llvm-dev” <llvm-dev@lists.llvm.org>
To: llvm-dev@lists.llvm.org
Cc: “Santanu Das” <cs15mtech11018@iith.ac.in>
Sent: Sunday, October 9, 2016 12:09:01 PM
Subject: [llvm-dev] On Loop Distribution pass

Dear community,

Our team at IITH have been experimenting with loop-distribution pass in LLVM. We see the following results on few benchmarks.

clang -O3 -mllvm -enable-loop-distribute -Rpass=loop-distribute file.c
clang -O3 -mllvm -enable-loop-distribute -Rpass-analysis=loop-distribute file.c

TORCH:
There are nearly 488 loops in this benchmark. LLVM was not able to distribute any loop.

TSVC:
There are 151 loops coded in plain ‘C’, none of them got distributed. TSVC has particularly candidates valid for distribution like the one below. The inner loop in this example can be distributed.

for (int nl = 0; nl < ntimes/2; nl++) {
for (int i = 1; i < LEN; i++) {
a[i] += c[i] * d[i];
b[i] = b[i - 1] + a[i] + d[i];
}

There are two problems why we can’t currently distribute this:

  1. Load-PRE in GVN replaces the b[i-1] with the value from the previous iterations.

This is the reason that we no longer see the backward dependence and report the above message. The plan is to disable loop Load-PRE in GVN and have LoopLoadElimination handle this which is run after LoopDist. Unfortunately even with disabling load-pre in GVN we have this:

  1. We can’t currently duplicate load operations across the resulting loops (a[I] and d[I] in the loop above).

This constraint had to do with the limitation in the dependence analysis (LAA) at the time. This is no longer a problem because we now represent all dependences in the loop thus we don’t need to preserve the original program order of the memory accesses.

Adam

Hi,

This is interesting data. I’m wondering if you could figure out (in at least some of the instances) when and why loop distribution would speed-up a benchmark by how much? Apart from some technicalities the major difficulty why LD is off by default is the cost model. Your analysis could shed more light on what it should be.

Thanks,
Gerolf