Determination of statements that contain only matrix multiplication

Hi Tobias,

could we use information about memory accesses of a SCoP statement and
def-use chains to determine statements which don’t contain matrix
multiplication of the following form?

for (int i = 0; i < UpperBound1; i++)
  for (int j = 0; j < UpperBound2; j++)
    for (int k = 0; k < UpperBound3; k++)
      C[i][j] += A[i][k] * B[k][j];

We could probably check that memory access relations have the following form:

"accesses" : [
  {
    "kind" : "read",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_1[i0, i2] }"
  },
  {
    "kind" : "read",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_2[i2, i1] }"
  },
  {
    "kind" : "read",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_3[i0, i1] }"
  },
  {
    "kind" : "write",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_4[i0, i1] }"
  }
]

and that there are the following relations between the access
instructions of the memory accesses:

AccessInst1 --- def-use ---\
AccessInst2 --- def-use --- fmul
                              |
                           def-use
                              v
AccessInst3 --- def-use --- fadd
                              |
                           def-use
                              v
                    store (AccessInst4)

(fmul is a user of AccessInst1 and AccessInst2. fadd is a user of the
fmul and AccessInst3. store (AccessInst4) is a user of fadd)
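
If it helps, here is a minimal sketch of how that chain could be walked
with plain LLVM API calls (the helper name and the way the four access
instructions are passed in are illustrative, not existing Polly code):

#include "llvm/IR/InstrTypes.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Illustrative check: do the four access instructions form the
// load/load -> fmul -> fadd -> store chain drawn above, with each
// intermediate value having exactly one user?
static bool isMulAddStoreChain(LoadInst *LoadA, LoadInst *LoadB,
                               LoadInst *LoadC, StoreInst *Store) {
  // fmul must be the unique user of the loads of A[i][k] and B[k][j].
  if (!LoadA->hasOneUse() || !LoadB->hasOneUse())
    return false;
  auto *Mul = dyn_cast<BinaryOperator>(LoadA->user_back());
  if (!Mul || Mul->getOpcode() != Instruction::FMul ||
      LoadB->user_back() != Mul)
    return false;

  // fadd must combine the product with the load of C[i][j].
  if (!Mul->hasOneUse() || !LoadC->hasOneUse())
    return false;
  auto *Add = dyn_cast<BinaryOperator>(Mul->user_back());
  if (!Add || Add->getOpcode() != Instruction::FAdd ||
      LoadC->user_back() != Add)
    return false;

  // The sum must only feed the store back into C[i][j].
  return Add->hasOneUse() && Add->user_back() == Store &&
         Store->getValueOperand() == Add;
}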

Maybe it could be a temporary solution. I think that if the checks pass
and the basic block of the statement contains exactly 14 instructions,
the statement contains matrix multiplication and can safely be optimized
by generating specific code that takes the usage of SIMD registers into
account.
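
For the relation part of the check, something like the following rough
sketch against the isl C API might work (matchesPattern and the
tuple-renaming trick are just for illustration):

#include <isl/map.h>

// Compare an access relation against an expected pattern, ignoring the
// statement and array names, which differ from scop to scop.
static bool matchesPattern(__isl_keep isl_map *Access, const char *Pattern) {
  isl_ctx *Ctx = isl_map_get_ctx(Access);
  isl_map *Expected = isl_map_read_from_str(Ctx, Pattern);
  isl_map *Actual = isl_map_copy(Access);
  // Normalize tuple names so "Stmt_14"/"MemRef_1" need not match literally.
  Actual = isl_map_set_tuple_name(Actual, isl_dim_in, "S");
  Actual = isl_map_set_tuple_name(Actual, isl_dim_out, "M");
  bool Equal = isl_map_is_equal(Actual, Expected) == 1;
  isl_map_free(Actual);
  isl_map_free(Expected);
  return Equal;
}

// E.g. the first read access of the statement should satisfy
//   matchesPattern(AccRel, "{ S[i0, i1, i2] -> M[i0, i2] }").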

I have one more question. Are the memory accesses in MemAccs from the
ScopStmt class ordered by their execution order?

Hi Tobias,

could we use information about memory accesses of a SCoP statement and
def-use chains to determine statements which don’t contain matrix
multiplication of the following form?

Assuming s/don't/do: you want to pattern-match gemm kernels inside larger scops?

for (int i = 0; i < UpperBound1; i++)
  for (int j = 0; j < UpperBound2; j++)
    for (int k = 0; k < UpperBound3; k++)
      C[i][j] += A[i][k] * B[k][j];

We could probably check that memory access relations have the following form:

"accesses" : [
  {
    "kind" : "read",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_1[i0, i2] }"
  },
  {
    "kind" : "read",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_2[i2, i1] }"
  },
  {
    "kind" : "read",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_3[i0, i1] }"
  },
  {
    "kind" : "write",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_4[i0, i1] }"
  }
]

and that there are the following relations between the access
instructions of the memory accesses:

AccessInst1 --- def-use ---\
AccessInst2 --- def-use --- fmul
                              |
                           def-use
                              v
AccessInst3 --- def-use --- fadd
                              |
                           def-use
                              v
                    store (AccessInst4)

(fmul is a user of AccessInst1 and AccessInst2. fadd is a user of the
fmul and AccessInst3. store (AccessInst4) is a user of fadd)

You'd also have to check which accesses are not there. E.g. the value of
C[i][j] might be used in the inner loop for some other purpose as
well, which would make some transformations illegal.

Also notice that LICM will hoist the C[i][j] reduction out of the
k-loop, i.e. the pattern is different. DeLICM (WIP) is intended to
undo this again.
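
For illustration, after LICM the innermost loop looks roughly like
this, so the access pattern above would no longer match:

double sum = C[i][j];               // load hoisted out of the k-loop
for (int k = 0; k < UpperBound3; k++)
  sum += A[i][k] * B[k][j];         // the statement no longer touches C
C[i][j] = sum;                      // store sunk after the k-loop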

Maybe it could be a temporary solution. I think that if the checks pass
and the basic block of the statement contains exactly 14 instructions,
the statement contains matrix multiplication and can safely be optimized
by generating specific code that takes the usage of SIMD registers into
account.

What kind of transformation do you have in mind that cannot be done in
a general way? If you want to take register pressure into account, why
not also for other scops?

Like the LoopIdiomPass, one could replace a detected gemm kernel by a
call to a hand-optimized/external [sdcz]gemm function. However, I
think the goal of Polly is to optimize such kernels by itself.

I have one more question. Are the memory accesses in MemAccs from the
ScopStmt class ordered by their execution order?

Only the MK_Array accesses should be in order relative to each other.
All scalar reads happen when entering a statement and all scalar
writes at the end of it.

Michael

Hi Tobias,

could we use information about memory accesses of a SCoP statement and
def-use chains to determine statements which don’t contain matrix
multiplication of the following form?

Assuming s/don't/do: you want to pattern-match gemm kernels inside larger scops?

for (int i = 0; i < UpperBound1; i++)
  for (int j = 0; j < UpperBound2; j++)
    for (int k = 0; k < UpperBound3; k++)
      C[i][j] += A[i][k] * B[k][j];

We could probably check that memory access relations have the following form:

"accesses" : [
  {
    "kind" : "read",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_1[i0, i2] }"
  },
  {
    "kind" : "read",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_2[i2, i1] }"
  },
  {
    "kind" : "read",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_3[i0, i1] }"
  },
  {
    "kind" : "write",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_4[i0, i1] }"
  }
]

and that there are the following relations between the access
instructions of the memory accesses:

AccessInst1 --- def-use ---\
AccessInst2 --- def-use --- fmul
                              |
                           def-use
                              v
AccessInst3 --- def-use --- fadd
                              |
                           def-use
                              v
                    store (AccessInst4)

(fmul is a user of AccessInst1 and AccessInst2. fadd is a user of the
fmul and AccessInst3. store (AccessInst4) is a user of fadd)

You'd also have to check which accesses are not there. E.g. the value of
C[i][j] might be used in the inner loop for some other purpose as
well, which would make some transformations illegal.

Also notice that LICM will hoist the C[i][j] reduction out of the
k-loop, i.e. the pattern is different. DeLICM (WIP) is intended to
undo this again.

Very right.

Maybe it could be a temporary solution. I think that if the checks pass
and the basic block of the statement contains exactly 14 instructions,
the statement contains matrix multiplication and can safely be optimized
by generating specific code that takes the usage of SIMD registers into
account.

What kind of transformation do you have in mind that cannot be done in
a general way? If you want to take register pressure into account, why
not also for other scops?

Good questions.

Like the LoopIdiomPass, one could replace a detected gemm kernel by a
call to a hand-optimized/external [sdcz]gemm function. However, I
think the goal of Polly is to optimize such kernels by itself.

Right. This does not seem to be too interesting.

I am personally mostly interested in getting the pieces in place that
allow us to get close-to-peak performance for gemm, but agree with
Michael that we clearly want to generalize this in the future.

However, going with a vertical implementation first might be a good idea
(assuming the pattern matching is simple). In this case we could detect
a gemm kernel and use this logic to get all important optimizations we
need in place.

Best,
Tobias

Thank you for the comments and answers!

Hi Tobias,

could we use information about memory accesses of a SCoP statement and
def-use chains to determine statements which don’t contain matrix
multiplication of the following form?

Assuming s/don't/do: you want to pattern-match gemm kernels inside larger scops?

for (int i = 0; i < UpperBound1; i++)
  for (int j = 0; j < UpperBound2; j++)
    for (int k = 0; k < UpperBound3; k++)
      C[i][j] += A[i][k] * B[k][j];

We could probably check that memory access relations have the following form:

"accesses" : [
  {
    "kind" : "read",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_1[i0, i2] }"
  },
  {
    "kind" : "read",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_2[i2, i1] }"
  },
  {
    "kind" : "read",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_3[i0, i1] }"
  },
  {
    "kind" : "write",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_4[i0, i1] }"
  }
]

and that there are the following relations between the access
instructions of the memory accesses:

AccessInst1 --- def-use ---\
AccessInst2 --- def-use --- fmul
                              |
                           def-use
                              v
AccessInst3 --- def-use --- fadd
                              |
                           def-use
                              v
                    store (AccessInst4)

(fmul is a user of AccessInst1 and AccessInst2. fadd is a user of the
fmul and AccessInst3. store (AccessInst4) is a user of fadd)

You'd also have to check which accesses are not there. E.g. the value of
C[i][j] might be used in the inner loop for some other purpose as
well, which would make some transformations illegal.

Also notice that LICM will hoist the C[i][j] reduction out of the
k-loop, i.e. the pattern is different. DeLICM (WIP) is intended to
undo this again.

Very right.

I agree with it. However, it could probably be used if we were able
to apply only transformations which can be done in a general way.

Maybe it could be a temporary solution. I think that if the checks pass
and the basic block of the statement contains exactly 14 instructions,
the statement contains matrix multiplication and can safely be optimized
by generating specific code that takes the usage of SIMD registers into
account.

What kind of transformation do you have in mind that cannot be done in
a general way? If you want to take register pressure into account, why
not also for other scops?

I wasn’t sure whether llvm is able to hoist loads into SIMD registers
and sink stores from them. However, yesterday I found out that in some
cases llvm can perform such transformations.

For example, let’s consider the following code:

Loop 1  for ic = 0, ..., m-1 in steps of mc
Loop 2    for pc = 0, ..., k-1 in steps of kc
            A(ic : ic + mc - 1, pc : pc + kc - 1) → Ac   // Pack into Ac
Loop 3      for jc = 0, ..., n-1 in steps of nc
              B(pc : pc + kc - 1, jc : jc + nc - 1) → Bc // Pack into Bc
Loop 4        for ir = 0, ..., mc - 1 in steps of mr
Loop 5          for jr = 0, ..., nc - 1 in steps of nr
Loop 6            for pr = 0, ..., kc - 1 in steps of 1
Loop 7              for ir' = 0, ..., mr - 1 in steps of 1
Loop 8                for jr' = 0, ..., nr - 1 in steps of 1
                        C[ic * mr + i * mc + ir'][jc * nr + j * nc + jr'] +=
                            Ac[mr * (pr + kc * ic) + ir'] *
                            Bc[(pr + jc * kc) * nr + jr'];
                      endfor
                    endfor
                  endfor
                endfor
              endfor
            endfor
          endfor
        endfor

To get closer to an implementation of the algorithm from [1] for
matrices stored in row-major order, we can unroll loop 7 and loop 8
and perform vectorization with llvm (the corresponding code is
attached). According to the attached IR, llvm sinks and hoists stores
and loads related to matrix C.
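
For illustration, with mr = 2 and nr = 4 the promoted loops 6-8
correspond roughly to the following hand-written sketch (all names are
made up):

// One 2x4 tile of C; loops 7 and 8 fully unrolled. Keeping the tile in
// locals for the whole pr-loop is what llvm's load/store promotion
// achieves here.
void microkernel(int kc, const double *Ac, const double *Bc,
                 double *C0, double *C1) { // two rows of the C tile
  double c00 = C0[0], c01 = C0[1], c02 = C0[2], c03 = C0[3]; // hoisted
  double c10 = C1[0], c11 = C1[1], c12 = C1[2], c13 = C1[3];
  for (int pr = 0; pr < kc; ++pr) {
    double a0 = Ac[2 * pr], a1 = Ac[2 * pr + 1];
    const double *b = &Bc[4 * pr];
    c00 += a0 * b[0]; c01 += a0 * b[1]; c02 += a0 * b[2]; c03 += a0 * b[3];
    c10 += a1 * b[0]; c11 += a1 * b[1]; c12 += a1 * b[2]; c13 += a1 * b[3];
  }
  C0[0] = c00; C0[1] = c01; C0[2] = c02; C0[3] = c03; // sunk stores
  C1[0] = c10; C1[1] = c11; C1[2] = c12; C1[3] = c13;
}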

Nevertheless, llvm can’t do it if, for example, we call prefetch
within loop 6 or apply packing transformations in a way that is
similar to the one mentioned in [2] (corresponding implementations are
attached to the email). I haven’t found the reason yet.

Refs:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[2] - http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm

Like the LoopIdiomPass, one could replace a detected gemm kernel by a
call to a hand-optimized/external [sdcz]gemm function. However, I
think the goal of Polly is to optimize such kernels by itself.

Right. This does not seem to be too interesting.

I am personally mostly interested in getting the pieces in place that
allow us to get close-to-peak performance for gemm, but agree with
Michael that we clearly want to generalize this in the future.

However, going with a vertical implementation first might be a good idea
(assuming the pattern matching is simple). In this case we could detect
a gemm kernel and use this logic to get all important optimizations we
need in place.

Yes, it is planned for the future. Sorry, I should probably have
mentioned it in the proposal.

gemm_SIMD.c (8.19 KB)

IR (17.2 KB)

gemm_SIMD_with_prefetch.c (8.29 KB)

gemm_C_SIMD_another_packing.c (7.71 KB)

Thank you for the elaborate explanations, although I don't have time to
go through all of them.

To get closer to an implementation of the algorithm from [1] for
matrices stored in row-major order, we can unroll loop 7 and loop 8
and perform vectorization with llvm (the corresponding code is
attached). According to the attached IR, llvm sinks and hoists stores
and loads related to matrix C.

Nevertheless, llvm can’t do it if, for example, we call prefetch
within loop 6 or apply packing transformations in a way that is
similar to the one mentioned in [2] (corresponding implementations are
attached to the email). I haven’t found the reason yet.

Refs:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[2] - http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm

It would be great if you find a general way to handle such concerns,
maybe even modifying LLVM's passes. Since Tobias is your advisor, you
might discuss the details with him, i.e. which cases should be handled
in more general ways vs. only in detected gemm kernels.

AFAIK prefetch instructions are not automatically inserted anywhere in
LLVM. It would be nice to insert them X instructions/accesses in
advance, where X is determined somehow. For gemm only, one could do
some measurements for different architectures, but we won't get around
having it derived from some command-line argument in either case.
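
Emitting the prefetch itself would be simple; here is a sketch with
IRBuilder (the exact intrinsic handling differs between LLVM versions,
so treat the details as assumptions):

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Emit an llvm.prefetch call for Addr at the current insertion point.
// The interesting part, choosing the point X accesses in advance, is
// left open.
static void emitPrefetch(IRBuilder<> &Builder, Value *Addr) {
  Module *M = Builder.GetInsertBlock()->getModule();
  Function *PrefetchFn = Intrinsic::getDeclaration(M, Intrinsic::prefetch);
  Builder.CreateCall(PrefetchFn,
                     {Builder.CreateBitCast(Addr, Builder.getInt8PtrTy()),
                      Builder.getInt32(0),   // rw: 0 = prefetch for read
                      Builder.getInt32(3),   // locality: 3 = high
                      Builder.getInt32(1)}); // cache type: 1 = data
}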

Michael

Thank you for the ideas! I’ll try to take them into account.

Maybe it could be a temporary solution. I think that if the checks pass
and the basic block of the statement contains exactly 14 instructions,
the statement contains matrix multiplication and can safely be optimized
by generating specific code that takes the usage of SIMD registers into
account.

What kind of transformation do you have in mind that cannot be done in
a general way? If you want to take register pressure into account, why
not also for other scops?

I wasn’t sure whether llvm is able to hoist loads into SIMD registers
and sink stores from them. However, yesterday I found out that in some
cases llvm can perform such transformations.

For example, let’s consider the following code:

Loop 1  for ic = 0, ..., m-1 in steps of mc
Loop 2    for pc = 0, ..., k-1 in steps of kc
            A(ic : ic + mc - 1, pc : pc + kc - 1) → Ac   // Pack into Ac
Loop 3      for jc = 0, ..., n-1 in steps of nc
              B(pc : pc + kc - 1, jc : jc + nc - 1) → Bc // Pack into Bc
Loop 4        for ir = 0, ..., mc - 1 in steps of mr
Loop 5          for jr = 0, ..., nc - 1 in steps of nr
Loop 6            for pr = 0, ..., kc - 1 in steps of 1
Loop 7              for ir' = 0, ..., mr - 1 in steps of 1
Loop 8                for jr' = 0, ..., nr - 1 in steps of 1
                        C[ic * mr + i * mc + ir'][jc * nr + j * nc + jr'] +=
                            Ac[mr * (pr + kc * ic) + ir'] *
                            Bc[(pr + jc * kc) * nr + jr'];
                      endfor
                    endfor
                  endfor
                endfor
              endfor
            endfor
          endfor
        endfor

To get closer to an implementation of the algorithm from [1] for
matrices stored in row-major order, we can unroll loop 7 and loop 8
and perform vectorization with llvm (the corresponding code is
attached). According to the attached IR, llvm sinks and hoists stores
and loads related to matrix C.

Nevertheless, llvm can’t do it if, for example, we call prefetch
within loop 6

This is likely because LLVM does not understand the side-effects
of the prefetch instruction.

or apply packing transformations in a way that is
similar to the one mentioned in [2] (corresponding implementations are
attached to the email). I haven’t found the reason yet.

I assume this is also due to the unknown side-effects of the packing
implementation.
(LLVM does not understand that the packing function-call does not modify
the C array).

Both can likely be addressed by improving aliasing-modref information,
adding relevant annotations or just hacking the code to ignore this
issue. Otherwise, we can probably also generate IR that makes it easier
for LLVM to perform this register promotion.
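
For the packing case, here is a source-level sketch of giving LLVM that
fact (in Polly-generated IR one would rather attach noalias/readonly
attributes or alias metadata to the call; the routine below is
invented):

// A packing routine that only touches memory through its arguments.
// With the definition visible, LLVM can infer that the call cannot
// modify C, as long as alias analysis can show C and Bc are disjoint;
// the restrict qualifiers also help vectorize the packing loop itself.
static void pack_B(int kc, int nc, const double *__restrict B, int ldb,
                   double *__restrict Bc) {
  for (int p = 0; p < kc; ++p)
    for (int j = 0; j < nc; ++j)
      Bc[p * nc + j] = B[p * ldb + j];
}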

I am not fully sure where you are heading at the moment. Have you
identified a set of individual tasks/changes that you could implement?

Best,
Tobias

Thank you very much for the advice! I could probably try to avoid
using nonhardware prefetching in the project, if Tobias doesn’t
disagree with it. My understanding is that prefetching isn’t used
explicitly in [1] and, according to [2], in some cases 90% of the
turbo boost peak of the processor can be attained without it.

I started to consider prefetching because it’s used in
implementations of gemm micro-kernels of the BLIS framework [3]. If I’m
not mistaken, it’s applied to try to make sure that micro-panel Br is
loaded after micro-panel Ar (as required in [1] p. 11). For example,
using it helps to reduce the execution time of the attached
implementation.

Refs:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[2] - http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm
[3] - https://github.com/flame/blis/blob/master/kernels/x86_64/sandybridge/3/bli_gemm_int_d8x4.c

gemm_C_SIMD.c (5.56 KB)

Maybe it could be a temporary solution. I think that if the checks pass
and the basic block of the statement contains exactly 14 instructions,
the statement contains matrix multiplication and can safely be optimized
by generating specific code that takes the usage of SIMD registers into
account.

What kind of transformation do you have in mind that cannot be done in
a general way? If you want to take register pressure into account, why
not also for other scops?

I wasn’t sure whether llvm is able to hoist loads into SIMD registers
and sink stores from them. However, yesterday I found out that in some
cases llvm can perform such transformations.

For example, let’s consider the following code:

Loop 1  for ic = 0, ..., m-1 in steps of mc
Loop 2    for pc = 0, ..., k-1 in steps of kc
            A(ic : ic + mc - 1, pc : pc + kc - 1) → Ac   // Pack into Ac
Loop 3      for jc = 0, ..., n-1 in steps of nc
              B(pc : pc + kc - 1, jc : jc + nc - 1) → Bc // Pack into Bc
Loop 4        for ir = 0, ..., mc - 1 in steps of mr
Loop 5          for jr = 0, ..., nc - 1 in steps of nr
Loop 6            for pr = 0, ..., kc - 1 in steps of 1
Loop 7              for ir' = 0, ..., mr - 1 in steps of 1
Loop 8                for jr' = 0, ..., nr - 1 in steps of 1
                        C[ic * mr + i * mc + ir'][jc * nr + j * nc + jr'] +=
                            Ac[mr * (pr + kc * ic) + ir'] *
                            Bc[(pr + jc * kc) * nr + jr'];
                      endfor
                    endfor
                  endfor
                endfor
              endfor
            endfor
          endfor
        endfor

To get closer to an implementation of the algorithm from [1] for
matrices stored in row-major order, we can unroll loop 7 and loop 8
and perform vectorization with llvm (the corresponding code is
attached). According to the attached IR, llvm sinks and hoists stores
and loads related to matrix C.

Nevertheless, llvm can’t do it if, for example, we call prefetch
within loop 6

This is likely because LLVM does not understand the side-effects
of the prefetch instruction.

or apply packing transformations in a way that is
similar to the one mentioned in [2] (corresponding implementations are
attached to the email). I haven’t found the reason yet.

I assume this is also due to the unknown side-effects of the packing
implementation.
(LLVM does not understand that the packing function-call does not modify
the C array).

Both can likely be addressed by improving aliasing-modref information,
adding relevant annotations or just hacking the code to ignore this
issue. Otherwise, we can probably also generate IR that makes it easier
for LLVM to perform this register promotion.

I think that this is right. Thank you for the explanation!

I am not fully sure where you are heading at the moment. Have you
identified a set of individual tasks/changes that you could implement?

If I’m not mistaken, we decided that the pattern matching could be a
good starting point. That’s why I’m planning to try to implement it in
the near future. Implementation of tiling, interchanging and unrolling
of specific loops based on the algorithm presented in [1] could be the
next step. I could start working on the packing transformation once the
previous tasks are implemented.

What do you think about it?

Refs:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf

Sounds great.

Best,
Tobias

Sorry for not responding earlier.

Thank you very much for the advice! I could probably try to avoid
using nonhardware prefetching in the project, if Tobias doesn’t
disagree with it. My understanding is that prefetching isn’t used
explicitly in [1] and, according to [2], in some cases 90% of the
turbo boost peak of the processor can be attained without it.

Too many negations :) I'm not sure I followed exactly what you wanted to say, but I understand that this is not the priority since you can get 90% of the performance without worrying about prefetching.

I started to consider prefetching because it’s used in
implementations of gemm micro-kernels of the BLIS framework [3]. If I’m
not mistaken, it’s applied to try to make sure that micro-panel Br is
loaded after micro-panel Ar (as required in [1] p. 11). For example,
using it helps to reduce the execution time of the attached
implementation.

Interesting. The BLIS implementation prefetches only the first cache line
before traversing a given interval of memory. This clearly confirms that
the implementation relies on hardware prefetching to fetch the subsequent
lines. This makes a lot of sense. Yet surprisingly, the BLIS implementation
does not attempt to anticipate the fetch. It schedules the prefetch
instruction right before the first load of a given interval.
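
I.e. the scheme is roughly the following (a sketch with the GCC/Clang
builtin; all names are invented):

// Prefetch only the first cache line of the next row of the micro-panel
// right before streaming it; the hardware prefetcher is expected to
// fetch the following lines.
static void stream_panel(const double *Bc, int nr, int kc, double *acc) {
  for (int pr = 0; pr < kc; ++pr) {
    __builtin_prefetch(&Bc[nr * pr], /*rw=*/0, /*locality=*/3);
    for (int j = 0; j < nr; ++j)
      acc[j] += Bc[nr * pr + j];
  }
}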

Sorry for not responding earlier.

Thank you very much for the advice! I could probably try to avoid
using nonhardware prefetching in the project, if Tobias doesn’t
disagree with it. My understanding is that prefetching isn’t used
explicitly in [1] and, according to [2], in some cases 90% of the
turbo boost peak of the processor can be attained without it.

Too many negations :) I'm not sure I followed exactly what you wanted to
say, but I understand that this is not the priority since you can get 90% of
the performance without worrying about prefetching.

Sorry for the misunderstanding. Yes, I think that, if nobody minds,
prefetching needn’t be the priority of this project, because for some
platforms we can get 90% of the performance without worrying about it.
Furthermore, as you mentioned before, hardware prefetchers can be good
at strided accesses in single-threaded code.

I started to consider prefetching because it’s used in
implementations of gemm micro-kernels of the BLIS framework [3]. If I’m
not mistaken, it’s applied to try to make sure that micro-panel Br is
loaded after micro-panel Ar (as required in [1] p. 11). For example,
using it helps to reduce the execution time of the attached
implementation.

Interesting. The BLIS implementation prefetches only the first cache line
before traversing a given interval of memory. This clearly confirms that
the implementation relies on hardware prefetching to fetch the subsequent
lines. This makes a lot of sense.

Thank you for the explanation!

Yet surprisingly, the BLIS implementation
does not attempt to anticipate the fetch. It schedules the prefetch
instruction right before the first load of a given interval.

Yes, I think that it’s interesting.