[GSoC 2016] Implementation of the packing transformation

1. Could we set MemoryAccess::NewAccessRelation to modify an access
relation of a memory access? As far as I understand,
MemoryAccess::NewAccessRelation is only used to import memory accesses
from JSCoPs

I already use it in Polly-ACC to modify access relations. If you just
want to modify the subscript expression, this should just work. Changing
the accessed array may also already work, but is a lot less tested and
the information is only partially updated such that further interface
changes would be required.

Thank you for the information!

I think that MemoryAccess::NewAccessRelation could be used in the
implementation and also help to perform, for example, array expansion.

Yes.

However, the modification would probably cause issues. For example, if
we apply tiling, we'll have to substitute old induction variables
before analyzing access functions.

I do not understand this issue. Tiling is a schedule-only transformation
that only considers data-dependences for legality. How do access
functions come into play here? Why do induction variables need to be
substituted.

If you did not see this yet, any modification of the polyhedral access
function will (except for non-affine accesses) directly be
translated to LLVM-IR. This means, we need (almost) no code generation
changes to make this work.

Yes, I agree, tiling is a schedule-only transformation that only
considers data-dependences for legality and any modification of the
polyhedral access function is directly translated to LLVM-IR.

I just wanted to say that if a schedule tree is modified in
ScheduleOptimizer and new induction variables are created (e.g. in
case of tiling), access relations and new access relations may stay
unchanged. It should probably be taken into account, if we were going
to modify an access relation according to its previous state.

For example, in case of matrix multiplication, we get the following isl ast

if (1 && (&MemRef_7[1023][1056] <= &MemRef_5[0][0] ||
&MemRef_5[1055][1056] <= &MemRef_7[0][0]) && (&MemRef_6[1055][1024] <=
&MemRef_5[0][0] || &MemRef_5[1055][1056] <= &MemRef_6[0][0]))

    {
      for (int c0 = 0; c0 <= 1055; c0 += 1)
        for (int c1 = 0; c1 <= 1055; c1 += 1)
          Stmt_12(c0, c1);
      for (int c0 = 0; c0 <= 1055; c0 += 1)
        for (int c1 = 0; c1 <= 1055; c1 += 1)
          for (int c2 = 0; c2 <= 1023; c2 += 1)
            Stmt_17(c0, c1, c2);
    }

else
    { /* original code */ }

and the following memory access is generated (the corresponding new
memory access is NULL):

{ Stmt_17[i0, i1, i2] -> MemRef_6[i0, i2] }

If we try to create BLIS micro and macro kernels, we get the following isl ast

  {
      // 1st level tiling - Tiles
      for (int c0 = 0; c0 <= 32; c0 += 1)
        for (int c1 = 0; c1 <= 32; c1 += 1) {
          // 1st level tiling - Points
          for (int c2 = 0; c2 <= 31; c2 += 1)
            for (int c3 = 0; c3 <= 31; c3 += 1)
              Stmt_12(32 * c0 + c2, 32 * c1 + c3);
        }
      // 1nd level tiling - Tiles
      for (int c0 = 0; c0 <= 65; c0 += 1)
        for (int c1 = 0; c1 <= 3; c1 += 1)
          for (int c2 = 0; c2 <= 10; c2 += 1) {
            // 1nd level tiling - Points
            // Register tiling - Tiles
            for (int c3 = 0; c3 <= 3; c3 += 1)
              for (int c4 = 0; c4 <= 11; c4 += 1)
                for (int c5 = 0; c5 <= 255; c5 += 1) {
                  // Register tiling - Points
                  // 1st level tiling - Tiles
                  // 1st level tiling - Points
                  {
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 1,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 2,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 3,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 4,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 5,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 6,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 7,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
1, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
2, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
3, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
4, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
5, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
6, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
7, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
1, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
2, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
3, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
4, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
5, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
6, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
7, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
1, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
2, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
3, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
4, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
5, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
6, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
7, 256 * c1 + c5);
                  }
                }
          }
    }

else
    { /* original code */ }

However, if we dump the memory access and the new memory access after
creation of the BLIS micro-kernel (e.g. after call of the
ScheduleTreeOptimizer::optimizeMatMulPattern), we'll see that they
remain the same.

In this case, we should probably substitute old induction variables,
if we wanted to determine that the memory access actually has the
following form

Stmt_17[c0, c1, c2, c3, c4, c5, c6, c7] -> MemRef_6[16 * c0 + 4 * c3 +
c6, 96 * c2 + 8 * c4, 256 * c1 + c5]

and replace it with, for example, the following memory access

Stmt_17[c0, c1, c2, c3, c4, c5, c6, c7] -> MemRef_x[4 * (c5 + 256 * c3) + c6]

I'm not sure whether it's possible
to do it in ScheduleOptimizer. Another possible issue could arise
during import from JSCoPs. In this case, we probably shouldn't modify
imported access relations.

You mean we should not overwrite the changes that have been imported? I
think overwriting them is OK. If the user prefers to not run schedule
optimizations after the input, he can always disable the schedule
optimizer when importing changes.

OK.

2. What should we do to change an array referenced by a memory access?

If I'm not mistaken, we can set the value of MemoryAccess::BasePtr to do
it.

To just make it work, it should be sufficient to set an access function
that has an isl_id that points to a different ScopArrayInfo object.

See: IslExprBuilder::createAccessAddress

  BaseExpr = isl_ast_expr_get_op_arg(Expr, 0);
  BaseId = isl_ast_expr_get_id(BaseExpr);
  isl_ast_expr_free(BaseExpr);

  const ScopArrayInfo *SAI = ScopArrayInfo::getFromId(BaseId);
  Base = SAI->getBasePtr();

Now, just resetting the access function leaves the MemoryAccess in a
rather inconsistent state. It would be good if we could think about what
is needed to provide a consistent interface for such kind of
functionality.

What do you mean by an inconsistent state? Is it a wrong value of, for
example, the following fields: MemoryAccess::BaseName,
MemoryAccess::Sizes, MemoryAccess::ElementType?

Also, having jscop test cases to test this functionality would be great.

Yes, I think it would be good.

3. Could we copy data to a new array in IslNodeBuilder::createMark?

We can probably create mark nodes, which contain references to memory
accesses that should be modified. Subsequently, using information
about original and new access relations, IslNodeBuilder::createMark
would generate functions that perform such copying.

Why mark nodes?

If I'm not mistaken, position of a mark node in a schedule tree
determines a place in an isl ast, where it should be generated. I
think that it's convenient, if we want to specify where to perform
copying.

An option I have been thinking of was to create new -
virtual ScopStmts that have specific semantics. E.g. ones that copy the
content from one array location to another. We could test these using
jscop and then use them in your schedule transformation to implement the
packing.

Yes, maybe it's a better option. What are 'new virtual ScopStmts'? Is
it a new class, which we should add? Would its objects be created
during the work of ScheduleOptimizer?

Michael probably has some opinion here. He played in his delicm work
with some changes that also require more complex modifications of memory
accesses. There is no specific commit (AFAIU), but some changes are
mixed in with his prototype work:

https://github.com/Meinersbur/polly/commits/delicm2

Thank you for the information. I'll look for such changes.