[GSoC 2016] Implementation of the packing transformation

[Resend on ML]

>> 1. Could we set MemoryAccess::NewAccessRelation to modify an access
>> relation of a memory access? As far as I understand,
>> MemoryAccess::NewAccessRelation is only used to import memory accesses
>> from JSCoPs
>
> I already use it in Polly-ACC to modify access relations. If you just
> want to modify the subscript expression, this should just work. Changing
> the accessed array may also already work, but is a lot less tested and
> the information is only partially updated such that further interface
> changes would be required.

Thank you for the information!

>> I think that MemoryAccess::NewAccessRelation could be used in the
>> implementation and also help to perform, for example, array expansion.
>
> Yes.
>
>> However, the modification would probably cause issues. For example, if
>> we apply tiling, we'll have to substitute old induction variables
>> before analyzing access functions.
>
> I do not understand this issue. Tiling is a schedule-only transformation
> that only considers data-dependences for legality. How do access
> functions come into play here? Why do induction variables need to be
> substituted.
>
> If you did not see this yet, any modification of the polyhedral access
> function will (except for non-affine accesses) directly be
> translated to LLVM-IR. This means, we need (almost) no code generation
> changes to make this work.

Yes, I agree, tiling is a schedule-only transformation that only
considers data-dependences for legality and any modification of the
polyhedral access function is directly translated to LLVM-IR.

I just wanted to say that if a schedule tree is modified in
ScheduleOptimizer and new induction variables are created (e.g. in
case of tiling), access relations and new access relations may stay
unchanged. It should probably be taken into account, if we were going
to modify an access relation according to its previous state.

For example, in case of matrix multiplication, we get the following isl
ast

if (1 && (&MemRef_7[1023][1056] <= &MemRef_5[0][0] ||
&MemRef_5[1055][1056] <= &MemRef_7[0][0]) && (&MemRef_6[1055][1024] <=
&MemRef_5[0][0] || &MemRef_5[1055][1056] <= &MemRef_6[0][0]))

    {
      for (int c0 = 0; c0 <= 1055; c0 += 1)
        for (int c1 = 0; c1 <= 1055; c1 += 1)
          Stmt_12(c0, c1);
      for (int c0 = 0; c0 <= 1055; c0 += 1)
        for (int c1 = 0; c1 <= 1055; c1 += 1)
          for (int c2 = 0; c2 <= 1023; c2 += 1)
            Stmt_17(c0, c1, c2);
    }

else
    { /* original code */ }

and the following memory access is generated (the corresponding new
memory access is NULL):

{ Stmt_17[i0, i1, i2] -> MemRef_6[i0, i2] }

If we try to create BLIS micro and macro kernels, we get the following
isl ast

  {
      // 1st level tiling - Tiles
      for (int c0 = 0; c0 <= 32; c0 += 1)
        for (int c1 = 0; c1 <= 32; c1 += 1) {
          // 1st level tiling - Points
          for (int c2 = 0; c2 <= 31; c2 += 1)
            for (int c3 = 0; c3 <= 31; c3 += 1)
              Stmt_12(32 * c0 + c2, 32 * c1 + c3);
        }
      // 1nd level tiling - Tiles
      for (int c0 = 0; c0 <= 65; c0 += 1)
        for (int c1 = 0; c1 <= 3; c1 += 1)
          for (int c2 = 0; c2 <= 10; c2 += 1) {
            // 1nd level tiling - Points
            // Register tiling - Tiles
            for (int c3 = 0; c3 <= 3; c3 += 1)
              for (int c4 = 0; c4 <= 11; c4 += 1)
                for (int c5 = 0; c5 <= 255; c5 += 1) {
                  // Register tiling - Points
                  // 1st level tiling - Tiles
                  // 1st level tiling - Points
                  {
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4, 256 * c1
                    + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 1,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 2,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 3,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 4,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 5,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 6,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 7,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
1, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
2, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
3, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
4, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
5, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
6, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
7, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
1, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
2, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
3, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
4, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
5, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
6, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
7, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
1, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
2, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
3, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
4, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
5, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
6, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
7, 256 * c1 + c5);
                  }
                }
          }
    }

else
    { /* original code */ }

However, if we dump the memory access and the new memory access after
creation of the BLIS micro-kernel (e.g. after call of the
ScheduleTreeOptimizer::optimizeMatMulPattern), we'll see that they
remain the same.

In this case, we should probably substitute old induction variables,
if we wanted to determine that the memory access actually has the
following form

Stmt_17[c0, c1, c2, c3, c4, c5, c6, c7] -> MemRef_6[16 * c0 + 4 * c3 +
c6, 96 * c2 + 8 * c4, 256 * c1 + c5]

and replace it with, for example, the following memory access

Stmt_17[c0, c1, c2, c3, c4, c5, c6, c7] -> MemRef_x[4 * (c5 + 256 * c3) +
c6]

Memory accesses are always described in function of the original
iteration space dimensions. When code-generating the original (or
modified) memory accesses, the original iteration space dimensions are
automatically translated to the indiction variables introduced by the
new schedule.

This means if you want to perform a data-layout transformation, there
are two ways. Either you define a relation (isl_map) that maps old array
locations to new array locations and apply it to the existing access
functions, or you derive from scratch
a new isl map that relates statement instances in the original iteration
space to memory locations in the new array.

What you probably have more easily available is the mapping between
statement instances scheduled in the already tiled loop nest and the
memory locations they access. Such a mapping needs to be translated into
one of the earlier representations.

>> 2. What should we do to change an array referenced by a memory access?
>>
>> If I'm not mistaken, we can set the value of MemoryAccess::BasePtr to do
>> it.
>
> To just make it work, it should be sufficient to set an access function
> that has an isl_id that points to a different ScopArrayInfo object.
>
> See: IslExprBuilder::createAccessAddress
>
> BaseExpr = isl_ast_expr_get_op_arg(Expr, 0);
> BaseId = isl_ast_expr_get_id(BaseExpr);
> isl_ast_expr_free(BaseExpr);
>
> const ScopArrayInfo *SAI = ScopArrayInfo::getFromId(BaseId);
> Base = SAI->getBasePtr();
>
> Now, just resetting the access function leaves the MemoryAccess in a
> rather inconsistent state. It would be good if we could think about what
> is needed to provide a consistent interface for such kind of
> functionality.

What do you mean by an inconsistent state? Is it a wrong value of, for
example, the following fields: MemoryAccess::BaseName,
MemoryAccess::Sizes, MemoryAccess::ElementType?

Right. getScopArrayInfo() might still return the original
getScopArrayInfo, whereas the access relation already refers to a
different array. For a first prototype, it is probably not necessary to
fix all this, but before upstreaming some of this, we
probably want to go over the interface and see if we improve the
situation.

>> 3. Could we copy data to a new array in IslNodeBuilder::createMark?
>>
>> We can probably create mark nodes, which contain references to memory
>> accesses that should be modified. Subsequently, using information
>> about original and new access relations, IslNodeBuilder::createMark
>> would generate functions that perform such copying.
>
> Why mark nodes?

If I'm not mistaken, position of a mark node in a schedule tree
determines a place in an isl ast, where it should be generated. I
think that it's convenient, if we want to specify where to perform
copying.

I understand that you want to mark a location in the schedule tree. One
option are mark nodes, the other option are extension nodes (see isl
documentation).

Please read through the documentation of extension nodes. I believe they
are better suited for what you are looking for, as they behave like any
other user statement.

> An option I have been thinking of was to create new -
> virtual ScopStmts that have specific semantics. E.g. ones that copy the
> content from one array location to another. We could test these using
> jscop and then use them in your schedule transformation to implement the
> packing.

Yes, maybe it's a better option. What are 'new virtual ScopStmts'? Is
it a new class, which we should add?

Currently we have normal ScopStmts and RegionStmts. We may now als
introduce CopyStmts, which implement the
mem-to-mem copying.

How to model this best is something we (you :wink: ) need to think about.
One option might indeed be to use classes to model these different kind
of statements.

Would its objects be created
during the work of ScheduleOptimizer?

Yes, this could make sense.

> Michael probably has some opinion here. He played in his delicm work
> with some changes that also require more complex modifications of memory
> accesses. There is no specific commit (AFAIU), but some changes are
> mixed in with his prototype work:
>
> https://github.com/Meinersbur/polly/commits/delicm2

Thank you for the information. I'll look for such changes.

Best,
Tobias

Thank you for the explanation and ideas!

>> However, the modification would probably cause issues. For example, if
>> we apply tiling, we'll have to substitute old induction variables
>> before analyzing access functions.
>
> I do not understand this issue. Tiling is a schedule-only transformation
> that only considers data-dependences for legality. How do access
> functions come into play here? Why do induction variables need to be
> substituted.
>
> If you did not see this yet, any modification of the polyhedral access
> function will (except for non-affine accesses) directly be
> translated to LLVM-IR. This means, we need (almost) no code generation
> changes to make this work.

Yes, I agree, tiling is a schedule-only transformation that only
considers data-dependences for legality and any modification of the
polyhedral access function is directly translated to LLVM-IR.

I just wanted to say that if a schedule tree is modified in
ScheduleOptimizer and new induction variables are created (e.g. in
case of tiling), access relations and new access relations may stay
unchanged. It should probably be taken into account, if we were going
to modify an access relation according to its previous state.

For example, in case of matrix multiplication, we get the following isl
ast

if (1 && (&MemRef_7[1023][1056] <= &MemRef_5[0][0] ||
&MemRef_5[1055][1056] <= &MemRef_7[0][0]) && (&MemRef_6[1055][1024] <=
&MemRef_5[0][0] || &MemRef_5[1055][1056] <= &MemRef_6[0][0]))

    {
      for (int c0 = 0; c0 <= 1055; c0 += 1)
        for (int c1 = 0; c1 <= 1055; c1 += 1)
          Stmt_12(c0, c1);
      for (int c0 = 0; c0 <= 1055; c0 += 1)
        for (int c1 = 0; c1 <= 1055; c1 += 1)
          for (int c2 = 0; c2 <= 1023; c2 += 1)
            Stmt_17(c0, c1, c2);
    }

else
    { /* original code */ }

and the following memory access is generated (the corresponding new
memory access is NULL):

{ Stmt_17[i0, i1, i2] -> MemRef_6[i0, i2] }

If we try to create BLIS micro and macro kernels, we get the following
isl ast

  {
      // 1st level tiling - Tiles
      for (int c0 = 0; c0 <= 32; c0 += 1)
        for (int c1 = 0; c1 <= 32; c1 += 1) {
          // 1st level tiling - Points
          for (int c2 = 0; c2 <= 31; c2 += 1)
            for (int c3 = 0; c3 <= 31; c3 += 1)
              Stmt_12(32 * c0 + c2, 32 * c1 + c3);
        }
      // 1nd level tiling - Tiles
      for (int c0 = 0; c0 <= 65; c0 += 1)
        for (int c1 = 0; c1 <= 3; c1 += 1)
          for (int c2 = 0; c2 <= 10; c2 += 1) {
            // 1nd level tiling - Points
            // Register tiling - Tiles
            for (int c3 = 0; c3 <= 3; c3 += 1)
              for (int c4 = 0; c4 <= 11; c4 += 1)
                for (int c5 = 0; c5 <= 255; c5 += 1) {
                  // Register tiling - Points
                  // 1st level tiling - Tiles
                  // 1st level tiling - Points
                  {
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4, 256 * c1
                    + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 1,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 2,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 3,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 4,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 5,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 6,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 7,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
1, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
2, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
3, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
4, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
5, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
6, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 +
7, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
1, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
2, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
3, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
4, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
5, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
6, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 +
7, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4,
256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
1, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
2, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
3, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
4, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
5, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
6, 256 * c1 + c5);
                    Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 +
7, 256 * c1 + c5);
                  }
                }
          }
    }

else
    { /* original code */ }

However, if we dump the memory access and the new memory access after
creation of the BLIS micro-kernel (e.g. after call of the
ScheduleTreeOptimizer::optimizeMatMulPattern), we'll see that they
remain the same.

In this case, we should probably substitute old induction variables,
if we wanted to determine that the memory access actually has the
following form

Stmt_17[c0, c1, c2, c3, c4, c5, c6, c7] -> MemRef_6[16 * c0 + 4 * c3 +
c6, 96 * c2 + 8 * c4, 256 * c1 + c5]

and replace it with, for example, the following memory access

Stmt_17[c0, c1, c2, c3, c4, c5, c6, c7] -> MemRef_x[4 * (c5 + 256 * c3) +
c6]

Memory accesses are always described in function of the original
iteration space dimensions. When code-generating the original (or
modified) memory accesses, the original iteration space dimensions are
automatically translated to the indiction variables introduced by the
new schedule.

This means if you want to perform a data-layout transformation, there
are two ways. Either you define a relation (isl_map) that maps old array
locations to new array locations and apply it to the existing access
functions, or you derive from scratch
a new isl map that relates statement instances in the original iteration
space to memory locations in the new array.

What you probably have more easily available is the mapping between
statement instances scheduled in the already tiled loop nest and the
memory locations they access. Such a mapping needs to be translated into
one of the earlier representations.

I'll try to implement it soon.

>> 2. What should we do to change an array referenced by a memory access?
>>
>> If I'm not mistaken, we can set the value of MemoryAccess::BasePtr to do
>> it.
>
> To just make it work, it should be sufficient to set an access function
> that has an isl_id that points to a different ScopArrayInfo object.
>
> See: IslExprBuilder::createAccessAddress
>
> BaseExpr = isl_ast_expr_get_op_arg(Expr, 0);
> BaseId = isl_ast_expr_get_id(BaseExpr);
> isl_ast_expr_free(BaseExpr);
>
> const ScopArrayInfo *SAI = ScopArrayInfo::getFromId(BaseId);
> Base = SAI->getBasePtr();
>
> Now, just resetting the access function leaves the MemoryAccess in a
> rather inconsistent state. It would be good if we could think about what
> is needed to provide a consistent interface for such kind of
> functionality.

What do you mean by an inconsistent state? Is it a wrong value of, for
example, the following fields: MemoryAccess::BaseName,
MemoryAccess::Sizes, MemoryAccess::ElementType?

Right. getScopArrayInfo() might still return the original
getScopArrayInfo, whereas the access relation already refers to a
different array. For a first prototype, it is probably not necessary to
fix all this, but before upstreaming some of this, we
probably want to go over the interface and see if we improve the
situation.

I think it would be good.

>> 3. Could we copy data to a new array in IslNodeBuilder::createMark?
>>
>> We can probably create mark nodes, which contain references to memory
>> accesses that should be modified. Subsequently, using information
>> about original and new access relations, IslNodeBuilder::createMark
>> would generate functions that perform such copying.
>
> Why mark nodes?

If I'm not mistaken, position of a mark node in a schedule tree
determines a place in an isl ast, where it should be generated. I
think that it's convenient, if we want to specify where to perform
copying.

I understand that you want to mark a location in the schedule tree. One
option are mark nodes, the other option are extension nodes (see isl
documentation).

Please read through the documentation of extension nodes. I believe they
are better suited for what you are looking for, as they behave like any
other user statement.

> An option I have been thinking of was to create new -
> virtual ScopStmts that have specific semantics. E.g. ones that copy the
> content from one array location to another. We could test these using
> jscop and then use them in your schedule transformation to implement the
> packing.

Yes, maybe it's a better option. What are 'new virtual ScopStmts'? Is
it a new class, which we should add?

Currently we have normal ScopStmts and RegionStmts. We may now als
introduce CopyStmts, which implement the
mem-to-mem copying.

How to model this best is something we (you :wink: ) need to think about.
One option might indeed be to use classes to model these different kind
of statements.

OK. I'll try to find the best option.