Intel AMX programming model discussion.

Hi,

Intel Advanced Matrix Extensions (Intel AMX) is a new programming paradigm consisting of two components: a set of 2-dimensional registers (tiles) representing sub-arrays from a larger 2-dimensional memory image, and accelerators able to operate on tiles. The capabilities of an Intel AMX implementation are enumerated by palettes. Two palettes are supported: palette 0 represents the initialized state, and palette 1 consists of 8 tile registers of up to 1 KB each, which are controlled by a tile control register.

The instruction manual is posted at https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html.

The AMX ABI proposal is posted at https://groups.google.com/g/x86-64-abi/c/NRejFm7pwb4.

This email is to discuss the programming model for AMX. Florian has introduced the matrix type and intrinsics in the LLVM community. We’d like to adopt some ideas from it.

Here is what we propose for the AMX programming model.

  1. Data type.

We’d like to have a fixed vector type for AMX. Since the shape of an AMX register is configurable, the vector size is the maximum size of an AMX register, that is, 1024 bytes.

The C code may look like this.

typedef int _tile_data __attribute__((__vector_size__(1024), __aligned__(64)));

_tile_data tile;

And the LLVM IR may look like this.

@tile = dso_local local_unnamed_addr global <256 x i32> zeroinitializer, align 64

For LLVM IR, it would be nice to have a new type, x86_amxtile, that can be mapped to AMX registers.
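
As a quick sanity check on the 1024-byte figure, here is a minimal sketch that derives it from the tile limits. It assumes the palette 1 limits of 16 rows and 64 bytes per row; the names AMX_MAX_ROWS, AMX_MAX_COLSB and tile_bytes are made up for illustration and are not part of the proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed palette 1 limits: at most 16 rows of at most 64 bytes each. */
enum { AMX_MAX_ROWS = 16, AMX_MAX_COLSB = 64 };

/* Same typedef as in the proposal: a fixed 1024-byte vector of int. */
typedef int _tile_data __attribute__((__vector_size__(1024), __aligned__(64)));

/* Bytes occupied by a tile of the given shape (hypothetical helper). */
static size_t tile_bytes(unsigned rows, unsigned colsb) {
  assert(rows <= AMX_MAX_ROWS && colsb <= AMX_MAX_COLSB);
  return (size_t)rows * colsb;
}
```

A tile configured with a smaller shape uses only part of the vector; the fixed type simply reserves room for the largest shape.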

  2. AMX Intrinsics.

The internal intrinsics are mapped 1:1 to AMX instructions. The parameters m, n, and k identify the shape of the tile. The shape can be variable, but it cannot exceed the size that the AMX hardware supports. The compiler can deduce the shape of a tile from the AMX intrinsics.

_tile_data _tile_loadd_internal(char m, short n, const void *base, int stride);

_tile_data _tile_dpbssd_internal(char m, short n, short k, _tile_data dst, _tile_data src1, _tile_data src2);

_tile_data _tile_dpbf16ps_internal(char m, short n, short k, _tile_data dst, _tile_data src1, _tile_data src2);

void _tile_stored_internal(char m, short n, void *base, int stride, _tile_data tile);
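
To make the roles of m, n, and k concrete, here is a scalar reference model of the signed int8 dot product, written as a sketch only. It assumes that n and k are byte counts (so a row of dst holds n / 4 32-bit accumulators) and that the hardware limits are 16 rows and 64 bytes per row; tile_dpbssd_ref and the array layout are hypothetical and not part of the proposed interface.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_ROWS = 16, MAX_COLSB = 64 };

/* Scalar model of a signed int8 tile dot product (hypothetical name).
 *   dst:  m rows of n / 4 int32 accumulators
 *   src1: m rows of k signed bytes
 *   src2: k / 4 rows of n signed bytes
 * n and k are assumed to be byte counts and multiples of 4. */
static void tile_dpbssd_ref(int m, int n, int k,
                            int32_t dst[MAX_ROWS][MAX_COLSB / 4],
                            int8_t src1[MAX_ROWS][MAX_COLSB],
                            int8_t src2[MAX_ROWS][MAX_COLSB]) {
  /* The shape may vary, but it must stay within the hardware limits. */
  assert(m > 0 && m <= MAX_ROWS);
  assert(n > 0 && n <= MAX_COLSB && n % 4 == 0);
  assert(k > 0 && k <= MAX_COLSB && k % 4 == 0);
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n / 4; ++j)
      for (int p = 0; p < k / 4; ++p)
        for (int b = 0; b < 4; ++b) /* 4 byte pairs per 32-bit accumulator */
          dst[i][j] += (int32_t)src1[i][4 * p + b] * (int32_t)src2[p][4 * j + b];
}
```

With m = 1 and n = k = 4, the call reduces to a single 4-element dot product accumulated into dst[0][0].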

  3. User interfaces.

The tile shape and tile data are combined into a struct in C. The shape of a tile may be initialized only once. The user interface looks like this.

#define __DEFAULT_FN_AMX \
  __attribute__((__always_inline__, __nodebug__, __target__("amx-int8")))

/* The shape is const, so it can only be set when the tile is initialized. */
typedef struct __tile_str {
  const char row;
  const short col;
  _tile_data tile;
} __tile;

/* Load a tile of shape dst->row x dst->col from memory. */
__DEFAULT_FN_AMX
void __tile_loadd(__tile *dst, const void *base, long stride) {
  dst->tile = _tile_loadd_internal(dst->row, dst->col, base, stride);
}

/* Accumulate the dot product of src1 and src2 into dst (m = src1.row, n = src2.col, k = src1.col). */
__DEFAULT_FN_AMX
void __tile_dpbsud(__tile *dst, __tile src1, __tile src2) {
  dst->tile = _tile_dpbssd_internal(src1.row, src2.col, src1.col, dst->tile, src1.tile, src2.tile);
}

/* Store a tile of shape src.row x src.col to memory. */
__DEFAULT_FN_AMX
void __tile_stored(void *base, long stride, __tile src) {
  _tile_stored_internal(src.row, src.col, base, stride, src.tile);
}

  4. Example code

The example shows how to use the user interface in a function.

void api(int cond, short row, short col) {
  __tile a = {row, col};
  __tile b = {row, col};
  __tile c = {row, col};

  if (cond) {
    __tile_loadd(&a, buf, STRIDE);
    __tile_loadd(&b, buf, STRIDE);
    __tile_loadd(&c, buf, STRIDE);
  } else {
    __tile_loadd(&a, buf2, STRIDE);
    __tile_loadd(&b, buf2, STRIDE);
    __tile_loadd(&c, buf2, STRIDE);
  }
  __tile_dpbsud(&c, a, b);
  __tile_stored(buf, STRIDE, c);
}

  5. LLVM IR

The LLVM IR intrinsics take the row and column information as input parameters, so that the compiler can deduce the shape of the tile data. The remaining parameters are what the AMX instructions require. This is the LLVM IR corresponding to the example code.

define dso_local void @api(i32 %cond, i16 signext %row, i16 signext %col) local_unnamed_addr #2 {
entry:
  %tobool = icmp eq i32 %cond, 0
  %sext = shl i16 %col, 8
  %conv.i31 = ashr exact i16 %sext, 8
  br i1 %tobool, label %if.else, label %if.then

if.then: ; preds = %entry
  %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
  %1 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
  %2 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
  br label %if.end

if.else: ; preds = %entry
  %3 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
  %4 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
  %5 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
  br label %if.end

if.end: ; preds = %if.else, %if.then
  %a.sroa.1186.0 = phi <256 x i32> [ %3, %if.else ], [ %0, %if.then ]
  %b.sroa.1068.0 = phi <256 x i32> [ %4, %if.else ], [ %1, %if.then ]
  %c.sroa.1149.0 = phi <256 x i32> [ %5, %if.else ], [ %2, %if.then ]
  %6 = tail call <256 x i32> @llvm.x86.tdpbssd(i16 %row, i16 %conv.i31, i16 %conv.i31, <256 x i32> %c.sroa.1149.0, <256 x i32> %a.sroa.1186.0, <256 x i32> %b.sroa.1068.0) #3
  tail call void @llvm.x86.tilestored64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32, <256 x i32> %6) #3
  ret void
}

  6. Shape propagation

In an -O0 build, the front end generates some general loads and stores for tile vectors. We need to start from the AMX intrinsics and propagate the shape information to the virtual tile registers. If an AMX intrinsic uses the result of a load instruction, the shape is propagated to the load and the load is transformed into a tile load intrinsic. If a store instruction uses the result of an AMX intrinsic, the shape is propagated to the store and the store is transformed into a tile store intrinsic.
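
The two rewrite rules can be illustrated on a toy instruction list. This is only a sketch of the idea, not the proposed LLVM pass: the opcode, shape and inst types and the propagate_shapes function are all hypothetical, the code is straight-line with every definition preceding its uses, and only the dot-product intrinsic is treated as a shape source.

```c
#include <assert.h>

typedef enum { OP_LOAD, OP_STORE, OP_TILE_LOAD, OP_TILE_STORE, OP_TILE_DP } opcode;

typedef struct { int rows, cols; } shape;

typedef struct {
  opcode op;
  int src[3];        /* src[j]: index of the instruction defining operand j, or -1 */
  shape operand[3];  /* OP_TILE_DP only: shape required of operand j */
  shape result;      /* shape of the value defined (or stored); 0x0 = unknown */
} inst;

/* Walk the list once, using each AMX intrinsic as the root of propagation. */
static void propagate_shapes(inst *code, int count) {
  for (int i = 0; i < count; ++i) {
    if (code[i].op == OP_TILE_DP) {
      /* Rule 1: a plain load feeding an AMX intrinsic becomes a tile load. */
      for (int j = 0; j < 3; ++j) {
        int def = code[i].src[j];
        assert(def < i); /* definitions come before uses */
        if (def >= 0 && code[def].op == OP_LOAD) {
          code[def].op = OP_TILE_LOAD;
          code[def].result = code[i].operand[j];
        }
      }
    } else if (code[i].op == OP_STORE) {
      /* Rule 2: a plain store of an AMX result becomes a tile store. */
      int def = code[i].src[0];
      assert(def < i);
      if (def >= 0 && code[def].op == OP_TILE_DP) {
        code[i].op = OP_TILE_STORE;
        code[i].result = code[def].result;
      }
    }
  }
}
```

A real pass would also have to handle phi nodes and control flow, which this sketch leaves out.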

  7. Machine IR

Since the AMX intrinsics take the row and column as input parameters, we can create a pseudo instruction corresponding to each of them. The AMX intrinsics are lowered to pseudo AMX instructions, which have extra row and column operands corresponding to those of the intrinsic. The real AMX instructions don’t need the row and column operands; the row and column information must be configured by ldtilecfg before any AMX instruction executes.

  8. Register allocation

AMX registers are special: they need to be configured before use, and the config instruction is expensive. To avoid unnecessary tile configuration, we collect as much tile shape information as possible and combine it into one ldtilecfg instruction. The ldtilecfg instruction should dominate every AMX instruction that accesses a tile register. Conversely, the ldtilecfg should post-dominate the instructions that define the tile shapes. A tile register spill should not force a re-configuration because of a different tile shape, so a spilled register should be reloaded into a register that shares the same tile shape. Since tile register allocation is special and may need to allocate general virtual registers to configure the tile registers, we can add a separate pass that runs before the general register allocation pass. After register allocation, the tile shape information is no longer needed, so we can transform the pseudo AMX instructions into real AMX instructions by removing the row and column operands.
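
To show what "combining the shapes into one ldtilecfg" amounts to, here is a sketch that packs the collected shapes into the 64-byte memory image that ldtilecfg reads. The field layout follows my reading of the instruction manual for palette 1; the tile_config and tile_shape types and the build_tile_config helper are hypothetical names, and the code only builds the image in memory without executing any AMX instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-byte memory image read by ldtilecfg (palette 1 layout). */
typedef struct {
  uint8_t palette;      /* byte 0: palette id */
  uint8_t start_row;    /* byte 1: restart row for an interrupted load/store */
  uint8_t reserved[14]; /* bytes 2-15: must be zero */
  uint16_t colsb[16];   /* bytes 16-47: bytes per row; entries 0-7 are used */
  uint8_t rows[16];     /* bytes 48-63: row count; entries 0-7 are used */
} tile_config;

/* Shape collected by the compiler for one tile register (hypothetical type). */
typedef struct {
  uint8_t rows;
  uint16_t colsb;
} tile_shape;

/* Fold up to 8 collected shapes into one image, so that a single
 * ldtilecfg can configure every tile register the function uses. */
static void build_tile_config(tile_config *cfg, const tile_shape *shapes, int count) {
  assert(count >= 0 && count <= 8);
  memset(cfg, 0, sizeof *cfg);
  cfg->palette = 1;
  for (int i = 0; i < count; ++i) {
    assert(shapes[i].rows <= 16 && shapes[i].colsb <= 64);
    cfg->rows[i] = shapes[i].rows;
    cfg->colsb[i] = shapes[i].colsb;
  }
}
```

Because one image describes all eight tile registers, changing the shape of any single tile means reloading the whole configuration, which is why the shapes are gathered up front.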

  9. Use recommendation

Because of the shape configuration issue, we recommend that users define the tile shape at function entry and inline functions as much as possible. The AMX instructions focus on computation rather than storage, so global variables for tile data are not recommended.

Thanks

Yuanke

Hi,

The __tile struct interface looks convenient, but what happens if one of these types appears on a function-call boundary? Does this force everything to be spilled to and restored from the stack? Maybe this type needs some additional attribute to give it a custom register-passing convention?

Can you take advantage of our IPRA (interprocedural register allocation) capability so that internal function calls can avoid tile reconfiguration when the necessary configuration is always done in the caller?

How will the implementation of __builtin_setjmp/longjmp be affected?

Thanks again,

Hal

Hi,

The register allocation scheme seems complicated.

Reading through the documentation, there appears to be a single global tile config for all tile registers at any time.

Why not simply model this tile config as a designated special register and the tile instructions as having an implicit use of this register? That would seem to ensure that the register allocator has all the constraints needed. You'd need to teach it how to spill the special registers with the appropriate instructions, but that seems a lot more straightforward.

Hi,

The __tile struct interface looks convenient, but what happens if one of these types appears on a function-call boundary? Does this force everything to be spilled to and restored from the stack? Maybe this type needs some additional attribute to give it a custom register-passing convention?

[Yuanke] We prefer that tile data be passed through memory across function calls, because passing it through registers is not as efficient as passing it through memory. The compiler allocates and configures the tile registers in the callee, and re-configuring the tile registers in the callee clears all the tile data registers to zero. So yes, this forces everything to be spilled to and restored from the stack.
14
15 __DEFAULT_FN_AMX
16 void __tile_loadd(__tile *dst, const void *base, long stride) {
17 dst->tile = _tile_loadd_internal(dst->row, dst->col, base, stride);
18 }
19
20 __DEFAULT_FN_AMX
21 void __tile_dpbsud(__tile *dst, __tile src1, __tile src2) {
22 dst->tile = _tile_dpbssd_internal(src1.row, src2.col, src1.col, dst->tile, src1.tile, src2.tile);
23 }
24
25 __DEFAULT_FN_AMX
26 void __tile_stored(void *base, long stride, __tile src) {
27 _tile_stored_internal(src.row, src.col, base, stride, src.tile);
28 }

4. Example code
The example shows how to use the user interface in a function.
51 void api(int cond, short row, short col) {
52 __tile a = {row, col};
53 __tile b = {row, col};
54 __tile c = {row, col};
55
56 if(cond) {
57 __tile_loadd(&a, buf, STRIDE);
58 __tile_loadd(&b, buf, STRIDE);
59 __tile_loadd(&c, buf, STRIDE);
60 } else {
61 __tile_loadd(&a, buf2, STRIDE);
62 __tile_loadd(&b, buf2, STRIDE);
63 __tile_loadd(&c, buf2, STRIDE);
64 }
65 __tile_dpbsud(&c, a, b);
66 __tile_stored(buf, STRIDE, c);
67 }

5. LLVM IR
The LLVM intrinsics IR take the row and column information as the input parameter, so that compiler can deduce the shape of tile data. The remaining parameters are what AMX instructions require. This is the LLVM IR corresponding to the example code.
define dso_local void @api(i32 %cond, i16 signext %row, i16 signext %col) local_unnamed_addr #2 {
entry:
  %tobool = icmp eq i32 %cond, 0
  %sext = shl i16 %col, 8
  %conv.i31 = ashr exact i16 %sext, 8
  br i1 %tobool, label %if.else, label %if.then

if.then: ; preds = %entry
  %0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
  %1 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
  %2 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32) #3
  br label %if.end

if.else: ; preds = %entry
  %3 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
  %4 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
  %5 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf2, i64 0, i64 0), i64 32) #3
  br label %if.end

if.end: ; preds = %if.else, %if.then
  %a.sroa.1186.0 = phi <256 x i32> [ %3, %if.else ], [ %0, %if.then ]
  %b.sroa.1068.0 = phi <256 x i32> [ %4, %if.else ], [ %1, %if.then ]
  %c.sroa.1149.0 = phi <256 x i32> [ %5, %if.else ], [ %2, %if.then ]
  %6 = tail call <256 x i32> @llvm.x86.tdpbssd(i16 %row, i16 %conv.i31, i16 %conv.i31, <256 x i32> %c.sroa.1149.0, <256 x i32> %a.sroa.1186.0, <256 x i32> %b.sroa.1068.0) #3
  tail call void @llvm.x86.tilestored64(i16 %row, i16 %conv.i31, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32, <256 x i32> %6) #3
  ret void
}

6. Shape propagation
In an -O0 build, the front end generates some general loads and stores for the tile vector. We need to start from the AMX intrinsics and propagate the shape information to the virtual tile registers. If an AMX intrinsic uses the result of a load instruction, the shape is propagated to the load, and the load is transformed into a tile load intrinsic. If a store instruction uses the result of an AMX intrinsic, the shape is propagated to the store, and the store is transformed into a tile store intrinsic.
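
As a rough illustration of the two propagation rules above, here is a minimal C++ sketch over a toy instruction list. It is not LLVM code: Op, Shape, Inst and propagateShapes are hypothetical names invented for this example, and a shape is represented by the ids of the values that hold the row and column counts so that run-time shapes are covered too.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy IR for illustration only; these are not LLVM's classes.
enum class Op { Load, Store, TileLoad, TileStore, TileDotProduct };

// A shape names the values that hold the row and column counts, so it works
// for run-time shapes as well as for constants.
struct Shape {
  int rowId;
  int colId;
  bool operator==(const Shape &o) const {
    return rowId == o.rowId && colId == o.colId;
  }
};

struct Inst {
  Op op;
  std::vector<int> tileOperands;     // indices of the defining instructions
  std::vector<Shape> operandShapes;  // AMX intrinsics: shape of each operand
  bool hasShape;                     // true once the result shape is known
  Shape shape;                       // result shape (or stored-value shape)
};

void propagateShapes(std::vector<Inst> &insts) {
  // Rule 1: an AMX intrinsic that uses a plain load gives the load its shape,
  // and the load becomes a tile load.
  for (const Inst &user : insts) {
    if (user.op != Op::TileDotProduct) continue;
    for (std::size_t k = 0; k < user.tileOperands.size(); ++k) {
      Inst &def = insts[user.tileOperands[k]];
      if (def.op != Op::Load) continue;
      def.op = Op::TileLoad;
      def.hasShape = true;
      def.shape = user.operandShapes[k];
    }
  }
  // Rule 2: a plain store of a value with a known tile shape inherits that
  // shape and becomes a tile store.
  for (Inst &st : insts) {
    if (st.op != Op::Store || st.tileOperands.empty()) continue;
    const Inst &def = insts[st.tileOperands[0]];
    if (!def.hasShape) continue;
    st.op = Op::TileStore;
    st.hasShape = true;
    st.shape = def.shape;
  }
}
```

For a dot product with operand shapes MxN, MxK and KxN, the three feeding loads become tile loads with exactly those shapes, and a store of the result becomes an MxN tile store.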

7. Machine IR
Since the AMX intrinsics take the row and column as input parameters, we can create a pseudo instruction for each of them. The AMX intrinsics are lowered to pseudo AMX instructions that carry extra row and column operands corresponding to those of the intrinsic. The real AMX instructions don't need the row and column operands: the row and column information must be configured by ldtilecfg before any AMX instruction executes.
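
The relationship between a pseudo instruction and the real one can be sketched as follows. This is only a toy model: MachineInst and pseudoToReal are hypothetical, and the "P" prefix and the operand order (shape operands first) are conventions assumed for the example rather than anything the proposal fixes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct MachineInst {
  std::string opcode;
  std::vector<std::string> operands;
};

// Assumed convention for this sketch: a pseudo opcode is the real mnemonic
// with a "P" prefix, and its shape (row/column) operands come first.
MachineInst pseudoToReal(const MachineInst &pseudo, std::size_t numShapeOps) {
  MachineInst real;
  // Drop the "P" prefix to recover the real mnemonic.
  real.opcode = pseudo.opcode.empty() ? std::string() : pseudo.opcode.substr(1);
  // Keep every operand except the leading shape operands.
  for (std::size_t i = numShapeOps; i < pseudo.operands.size(); ++i)
    real.operands.push_back(pseudo.operands[i]);
  return real;
}
```

For example, a pseudo tile load with operands (%row, %col, %tmm0, %base, %stride) turns into TILELOADD with (%tmm0, %base, %stride).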

8. Register allocation
The AMX registers are special. They need to be configured before use, and the config instruction is expensive. To avoid unnecessary tile configuration, we collect as much tile shape information as possible and combine it into one ldtilecfg instruction. The ldtilecfg instruction should dominate every AMX instruction that accesses a tile register. On the other hand, the ldtilecfg should post-dominate the instructions that define the tile shape. Tile register spilling should avoid a re-config caused by a different tile shape, so a spilled register should be reloaded into a register that has the same tile shape. Since tile register allocation is special and may allocate general virtual registers to configure the tile registers, we can add a separate pass that runs before the general register allocation pass. After register allocation, the tile shape information is no longer needed, so we can transform the pseudo AMX instructions into real AMX instructions by removing the row and column operands.
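
To make "combine them into one ldtilecfg" a little more concrete, here is a minimal sketch of filling the 64-byte memory image that ldtilecfg reads once every physical tile has a shape. TileShape and buildTileConfig are hypothetical names. The byte offsets (palette at byte 0, start_row at byte 1, 16-bit bytes-per-row fields from byte 16, one-byte row counts from byte 48) follow the palette 1 layout as I understand it from the Intel programming reference linked above, and should be checked against the manual.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

struct TileShape {
  std::uint8_t rows;    // number of rows; 0 leaves the tile unconfigured
  std::uint16_t colsb;  // bytes per row
};

// Fill the 64-byte image for palette 1 (8 tile registers). The rows and
// colsb values may be run-time values; the image is just ordinary memory.
std::array<std::uint8_t, 64> buildTileConfig(
    const std::array<TileShape, 8> &shapes) {
  std::array<std::uint8_t, 64> cfg{};  // reserved bytes stay zero
  cfg[0] = 1;                          // palette id
  cfg[1] = 0;                          // start_row
  for (std::size_t i = 0; i < shapes.size(); ++i) {
    // colsb is stored little-endian in two bytes per tile.
    cfg[16 + 2 * i] = static_cast<std::uint8_t>(shapes[i].colsb & 0xff);
    cfg[17 + 2 * i] = static_cast<std::uint8_t>(shapes[i].colsb >> 8);
    cfg[48 + i] = shapes[i].rows;
  }
  return cfg;
}
```

Because the image is plain memory, the shapes do not need to be compile-time constants: the compiler can emit ordinary stores of %row and %col into the buffer and then a single ldtilecfg that dominates all the tile instructions.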

Can you take advantage of our IPRA capability so that internal function calls might avoid this reconfiguration if the necessary configuration is always done in the caller?

[Yuanke] I am not familiar with the IPRA capability, and I am very interested in it. Could you post some links that introduce IPRA?

How will the implementation of __builtin_setjmp/longjmp be affected?

[Yuanke] That depends on the ABI. We propose that all tile registers be caller-saved, so I think setjmp/longjmp is not affected.

Thanks again,

Hal

9. Use recommendation
Because of the shape configuration issue, we recommend that users define the tile shape at function entry and inline functions as much as possible. The AMX instructions focus on computation rather than storage, so global variables for tile data are not recommended.

Thanks
Yuanke

[Philip]

This seems complicated.

Reading through the documentation, there appears to be a single global tile config for all tile registers at any time.

Why not simply model this tile config as a designated special register and the tile instructions as having an implicit use of this register? That would seem to ensure that the register allocator has all the constraints needed. You’d need to teach it how to spill the special registers with the appropriate instructions, but that seems a lot more straightforward?

[Yuanke] In that case users need to configure the tile registers themselves. Spilling the configuration register is very expensive, because it clears all the tile data registers to zero. In our proposal, the compiler is responsible for deducing the shape of each virtual tile data register, allocating physical registers for them, and then configuring those physical registers. We may build the dependency as you proposed, and it can be used as a machine IR check to ensure that each tile data register is configured before use.

*From:*Hal Finkel <hfinkel@anl.gov>
*Sent:* Friday, August 14, 2020 11:27 PM
*To:* Luo, Yuanke <yuanke.luo@intel.com>; llvm-dev@lists.llvm.org; florian_hahn@apple.com; Kaylor, Andrew <andrew.kaylor@intel.com>; Topper, Craig <craig.topper@intel.com>; Lu, Hongjiu <hongjiu.lu@intel.com>
*Subject:* Re: [llvm-dev] Intel AMX programming model discussion.

Interestingly, it looks like some documentation was written but never committed: https://reviews.llvm.org/D23980 - in general, if you search for IPRA in LLVM, you'll see the relevant pieces. The really short description is that functions are emitted in topological order, leaves of the call graph first, so that customized clobber register masks can be attached to call sites of relevant internal functions.

 -Hal
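
The "leaves of the call graph first" idea described above can be sketched roughly as below. This is a toy model, not LLVM's IPRA implementation: Function, computeClobberMasks, needsReconfigAfterCall and the choice of bit 8 to stand for the tile configuration are all assumptions made for the example, and the call graph is assumed to be acyclic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical register numbering for this sketch: bit 8 stands for the
// tile configuration, lower bits for ordinary registers.
const std::uint32_t kTileConfigBit = 1u << 8;

struct Function {
  std::uint32_t ownClobbers;  // registers written by the body itself
  std::vector<int> callees;   // internal callees; each index is smaller
};

// Functions are listed leaves first, so every callee mask is already final
// when its callers are processed.
std::vector<std::uint32_t> computeClobberMasks(
    const std::vector<Function> &fns) {
  std::vector<std::uint32_t> masks(fns.size(), 0);
  for (std::size_t i = 0; i < fns.size(); ++i) {
    std::uint32_t mask = fns[i].ownClobbers;
    for (int callee : fns[i].callees) mask |= masks[callee];
    masks[i] = mask;
  }
  return masks;
}

// A caller would have to re-run ldtilecfg after a call only if the callee's
// mask says the tile configuration may have changed.
bool needsReconfigAfterCall(const std::vector<std::uint32_t> &masks,
                            int callee) {
  return (masks[callee] & kTileConfigBit) != 0;
}
```

With such masks attached to call sites, a call to an internal function that never touches the tile configuration would not force the caller to reconfigure afterwards.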

I find your answer unconvincing. I’m not going to debate it as I don’t wish to take the time to build the appropriate context, but my initial response is skepticism.

Philip

[Yuanke] Thank you. I think IPRA should help reduce tile register re-config. I need more time to understand the details. I also noticed the explicit cc discussion at http://lists.llvm.org/pipermail/llvm-dev/2019-January/129195.html, but it seems it has not landed in LLVM.

Hi Philip,

Your idea made sense to me on first thought. Thank you for the idea. I will take more time to think it over and see whether it can help reduce the complexity of tile register allocation.

Yuanke

Has some thought gone into how to make the config instruction less expensive?
I have, for a long time, thought that we need cleverer RAM.
E.g. a single read request that would, for example, return 64 bytes,
with each byte having been spaced out, i.e. byte 1, skip 99 bytes,
byte 2, skip 99 bytes, byte 3.
Or, instead of "read the next instruction", "read the next basic block
in one operation" (a group of instructions).
This would massively reduce the number of transactions between the CPU
and the RAM chips.
It would be the RAM chip itself that would do the operation, and not the CPU.
It could also be expanded to have the RAM chip do some simple
computations. E.g. Atomic loads/saves/counters/xor/not/xchg, if they
were cheap to do.
Essentially making the RAM chip able to work better, more efficiently,
with larger chunks of data per transaction.

Kind Regards

James

Sorry. I don't have deep knowledge of the design of HW, so I'm not able to answer the question.

The AMX registers are complicated. The single configuration register (which is mostly used implicitly, similar to MXCSR for floating point) controls the shape of all the tile registers, and if you change the tile configuration every single tile register is cleared. In practice, if we have to change the configuration while any of the tile registers are live, performance is going to be terrible. We need to handle this case for correctness, but users of this programming interface will need to have enough awareness of the performance issues and the hardware details to prevent this. We’ll also want a diagnostic that lets the user know when this has happened.

When the tile configuration is set, the shape of each tile is locked in, so the individual tile registers aren’t interchangeable at that point. If a function needs 2x4 tiles, 4x2 tiles, and 4x4 tiles, the configuration needs to be set with this in mind. The shape isn’t explicit in every instruction and intrinsic. It must be deduced. And again, we’ll need a way to tell the user when efficient allocation can’t be done. In practice, I don’t expect any function to be using more than three tile shapes.

The implication of all this is that I don’t think the greedy register allocator is well suited to figure all of this out. We need a special pass to pre-allocate these registers. If the function is written in a way that makes good performance possible, it should be a relatively simple task to allocate everything with minimal spilling. If it isn’t possible to get good performance, we don’t need to do anything especially clever. We can just do something straightforward that is correct and let the user know that they aren’t going to be happy with the results.

-Andy
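
A diagnostic of the kind mentioned above could, in its simplest form, look for a configuration change while some tile is still live. The sketch below does that for a straight-line sequence of events; Kind, Event and countReconfigsWithLiveTiles are hypothetical names, and a real implementation would have to work on the control-flow graph rather than on a flat list.

```cpp
#include <bitset>
#include <cassert>
#include <vector>

enum class Kind { Config, Define, LastUse };

struct Event {
  Kind kind;
  int tile;  // physical tile number 0..7; ignored for Config
};

// Count the ldtilecfg events that happen while some tile still holds a live
// value. Each of them forces the live tiles to be spilled and reloaded,
// because changing the configuration clears every tile register.
int countReconfigsWithLiveTiles(const std::vector<Event> &events) {
  std::bitset<8> live;
  int count = 0;
  for (const Event &e : events) {
    switch (e.kind) {
    case Kind::Config:
      if (live.any()) ++count;
      break;
    case Kind::Define:
      live.set(e.tile);
      break;
    case Kind::LastUse:
      live.reset(e.tile);
      break;
    }
  }
  return count;
}
```

A non-zero result is the situation where the compiler still has to generate correct code but should tell the user that performance will suffer.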

Hi, Andy,

I don’t quite understand everything that’s going on here. Could we model this as:

  1. Define a collection of register classes, one for 2x4 tiles, one for 4x2 tiles, etc. each populated with a set of tile registers. Registers can have aliasing relationships (instead of worrying about any kind of subregister/superregister relationships – these won’t be useful anyway).

  2. Define the tile-configuration instructions so that they implicitly define all of the registers in all of the classes.

Then you would still need to pre-schedule the tile operations as you’ve described, and collect the configuration information in order to add the ldtilecfgs, but the regular register allocator can handle the allocation itself in the usual way. What do you think?

-Hal

Hi Hal,

There are 3 aspects to be solved.

  1. The HW supports a maximum shape of 16x16, so there are many register classes, from 1x1 to 16x16. We need 256 register classes.

  2. We want to support variable shapes, so the compiler doesn’t know which register class fits a tile shape, because the shape is only known at run time.

  3. The tile configuration configures physical tile registers, so we need to allocate registers first; only then do we know the shape of each physical tile register and can configure it.

I think your suggestion helps reduce the complexity if we only support fixed (constant) tile shapes.

-Yuanke
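
The first two points can be illustrated with a small sketch. It assumes, purely for the example, that a one-class-per-shape scheme numbers its classes row-major from 1x1 to 16x16 and that an unknown (run-time) dimension is encoded as 0; registerClassIndex, kUnknown and kMaxDim are hypothetical names.

```cpp
#include <cassert>

// A dimension is either a compile-time constant in 1..16 or unknown (0).
const int kUnknown = 0;
const int kMaxDim = 16;

// Map a constant shape to one of the 16 * 16 = 256 hypothetical register
// classes. Returns -1 when a dimension is only known at run time (or is out
// of range), i.e. when no class can be chosen at compile time.
int registerClassIndex(int rows, int cols) {
  if (rows == kUnknown || cols == kUnknown) return -1;
  if (rows < 1 || rows > kMaxDim || cols < 1 || cols > kMaxDim) return -1;
  return (rows - 1) * kMaxDim + (cols - 1);
}
```

A constant 16x16 tile maps to the last class, while a tile whose row count is %row has no class index at compile time, which is the difficulty described in point 2.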

Thanks, Yuanke.

It's not clear to me that having 256 register classes is, in itself, a problem. Is it?

What does it mean to support variable-shape tiles in this context? Do you do something other than conservatively assume that they are 16x16 for register-allocation purposes?

 -Hal

There is no problem with having 256 register classes. It just seems like a lot of register classes to me.

We don’t assume that the shape of each physical register is 16x16; it is defined by the user. By variable shape, I mean that the shape is known at run time and unknown at compile time. Take the code below as an example: %row and %col are variables rather than constants. The compiler recognizes llvm.x86.tileloadd64 and deduces that the shape of %0 is %row x %col.

%0 = tail call <256 x i32> @llvm.x86.tileloadd64(i16 %row, i16 %col, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 32)

When the tile shape is unknown at compile time, how do you plan to do the register allocation of the tiles? My question is: do you do the allocation for this case in the same way as you would if you knew the size was 16x16 (i.e., conservatively assume the largest size)?

Thanks again,

Hal

We can do some basic analysis to classify the virtual tile registers into several groups. Each group shares the same shape. So, the tile registers with the constant shape 16x16 are in one group. Within a group, the registers can be allocated by the general RA scheme. To answer your question, we would do the allocation for this case in the same way even if we knew the size was 16x16.
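
The grouping step might look roughly like the following sketch, in which a shape is identified by the ids of the values that hold its row and column counts, so constant and run-time shapes are handled the same way. ShapeKey and groupByShape are hypothetical names, and the value ids are assumed to come from some earlier analysis.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Symbolic shape: ids of the values that hold the row and column counts.
typedef std::pair<int, int> ShapeKey;

// Group virtual tile registers (identified by their index in the input) by
// symbolic shape. Registers in one group are interchangeable for allocation.
std::map<ShapeKey, std::vector<int> > groupByShape(
    const std::vector<ShapeKey> &vregShapes) {
  std::map<ShapeKey, std::vector<int> > groups;
  for (std::size_t v = 0; v < vregShapes.size(); ++v)
    groups[vregShapes[v]].push_back(static_cast<int>(v));
  return groups;
}
```

Note that this is conservative: two different values that happen to be equal at run time still end up in different groups, since the sketch only compares value ids.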

I think what will happen is that the registers are allocated based on a number of runtime values that are assumed to be different from one another but less than or equal to 16. So, for example, we’ll allocate registers for MxN tiles, NxM tiles and MxM tiles without knowing what M and N are. Then at runtime the values of these variables will be used to create the actual tile configuration. The instructions that need to know the shape take these runtime values as operands.

There may be some artifacts coming from the front end that conservatively assume a 16x16 tile, but I think those generally go away in SROA or later specialized passes. Yuanke can confirm or correct my understanding of this.

So you're going to multiversion the code?

In any case, my point is that you probably don't need a custom register allocator. If you just define the tile registers and make sure that the ldtilecfgs implicitly define them all, then the regular infrastructure likely works. You'll have a bunch of register classes, but that's not necessarily a problem. I recommend trying this, and letting us know what you discover, before we go down the road of a new, dedicated allocator just for these registers.

 -Hal