Hi everyone,
I have a few questions regarding computations that involve constant tensors in MLIR.
1)
Let’s consider this small fragment of MLIR:
module {
func.func @func() -> tensor<2x2xf32> {
%0 = tensor.empty() : tensor<2x2xf32>
%cst = arith.constant dense<[[1.0, 2.0], [3.0, 4.0]]> : tensor<2x2xf32>
%transposed = linalg.transpose ins(%cst : tensor<2x2xf32>) outs(%0 : tensor<2x2xf32>) permutation = [1, 0]
return %transposed : tensor<2x2xf32>
}
}
This simply performs a transpose() of a constant 2x2 tensor.
Lowering to the LLVM dialect (LLVM IR embedded in MLIR) leads to:
module attributes {llvm.target_triple = "hexagon"} {
llvm.mlir.global private constant @__constant_2x2xf32(dense<[[1.000000e+00, 3.000000e+00], [2.000000e+00, 4.000000e+00]]> : tensor<2x2xf32>) {addr_space = 0 : i32, alignment = 64 : i64} : !llvm.array<2 x array<2 x f32>>
llvm.func @func() -> !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)> attributes {llvm.emit_c_interface} {
%0 = llvm.mlir.constant(0 : index) : i64
%1 = llvm.mlir.undef : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
%2 = llvm.mlir.constant(3735928559 : index) : i64
%3 = llvm.mlir.addressof @__constant_2x2xf32 : !llvm.ptr
%4 = llvm.mlir.constant(2 : index) : i64
%5 = llvm.mlir.constant(1 : index) : i64
%6 = llvm.getelementptr %3[0, 0, 0] : (!llvm.ptr) -> !llvm.ptr, !llvm.array<2 x array<2 x f32>>
%7 = llvm.inttoptr %2 : i64 to !llvm.ptr
[...]
Here the dense tensor [[1.0, 2.0], [3.0, 4.0]] has been replaced by [[1.000000e+00, 3.000000e+00], [2.000000e+00, 4.000000e+00]], i.e. the transpose() has been evaluated at compile time. Great!
However, if the constant tensor is defined using the “dialect_resources” mechanism, as in:
module {
func.func @func() -> tensor<2x2xf32> {
%0 = tensor.empty() : tensor<2x2xf32>
%cst = arith.constant dense_resource<tensor_2_2.float32> : tensor<2x2xf32>
%transposed = linalg.transpose ins(%cst : tensor<2x2xf32>) outs(%0 : tensor<2x2xf32>) permutation = [1, 0]
return %transposed : tensor<2x2xf32>
}
}
{-#
dialect_resources: {
builtin: {
tensor_2_2.float32: "0x040000009FC167BEE27636BDADC405BE3115353D"
}
}
#-}
Then we obtain code that performs the transpose() at runtime, and the constant tensor (defined via dialect_resources) stays exactly as provided:
module attributes {llvm.target_triple = "hexagon"} {
llvm.func @malloc(i64) -> !llvm.ptr
llvm.mlir.global private constant @__constant_2x2xf32(dense_resource<tensor_2_2.float32> : tensor<2x2xf32>) {addr_space = 0 : i32, alignment = 64 : i64} : !llvm.array<2 x array<2 x f32>>
llvm.func @func() -> !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)> attributes {llvm.emit_c_interface} {
%0 = llvm.mlir.constant(64 : index) : i64
%1 = llvm.mlir.undef : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
%2 = llvm.mlir.addressof @__constant_2x2xf32 : !llvm.ptr
%3 = llvm.mlir.constant(1 : index) : i64
%4 = llvm.mlir.constant(2 : index) : i64
%5 = llvm.mlir.constant(0 : index) : i64
%6 = llvm.mlir.zero : !llvm.ptr
%7 = llvm.getelementptr %2[0, 0, 0] : (!llvm.ptr) -> !llvm.ptr, !llvm.array<2 x array<2 x f32>>
%8 = llvm.getelementptr %6[4] : (!llvm.ptr) -> !llvm.ptr, f32
%9 = llvm.ptrtoint %8 : !llvm.ptr to i64
%10 = llvm.add %9, %0 : i64
%11 = llvm.call @malloc(%10) : (i64) -> !llvm.ptr
%12 = llvm.ptrtoint %11 : !llvm.ptr to i64
%13 = llvm.sub %0, %3 : i64
%14 = llvm.add %12, %13 : i64
%15 = llvm.urem %14, %0 : i64
%16 = llvm.sub %14, %15 : i64
%17 = llvm.inttoptr %16 : i64 to !llvm.ptr
%18 = llvm.insertvalue %11, %1[0] : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
%19 = llvm.insertvalue %17, %18[1] : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
%20 = llvm.insertvalue %5, %19[2] : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
%21 = llvm.insertvalue %4, %20[3, 0] : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
%22 = llvm.insertvalue %4, %21[3, 1] : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
%23 = llvm.insertvalue %4, %22[4, 0] : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
%24 = llvm.insertvalue %3, %23[4, 1] : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
llvm.br ^bb1(%5 : i64)
^bb1(%25: i64): // 2 preds: ^bb0, ^bb5
%26 = llvm.icmp "slt" %25, %4 : i64
llvm.cond_br %26, ^bb2, ^bb6
^bb2: // pred: ^bb1
llvm.br ^bb3(%5 : i64)
^bb3(%27: i64): // 2 preds: ^bb2, ^bb4
%28 = llvm.icmp "slt" %27, %4 : i64
llvm.cond_br %28, ^bb4, ^bb5
^bb4: // pred: ^bb3
%29 = llvm.mul %27, %4 : i64
%30 = llvm.add %29, %25 : i64
%31 = llvm.getelementptr %7[%30] : (!llvm.ptr, i64) -> !llvm.ptr, f32
%32 = llvm.load %31 : !llvm.ptr -> f32
%33 = llvm.mul %25, %4 : i64
%34 = llvm.add %33, %27 : i64
%35 = llvm.getelementptr %17[%34] : (!llvm.ptr, i64) -> !llvm.ptr, f32
llvm.store %32, %35 : f32, !llvm.ptr
%36 = llvm.add %27, %3 : i64
llvm.br ^bb3(%36 : i64)
^bb5: // pred: ^bb3
%37 = llvm.add %25, %3 : i64
llvm.br ^bb1(%37 : i64)
^bb6: // pred: ^bb1
llvm.return %24 : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
}
llvm.func @_mlir_ciface_func(%arg0: !llvm.ptr) attributes {llvm.emit_c_interface} {
%0 = llvm.call @func() : () -> !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
llvm.store %0, %arg0 : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>, !llvm.ptr
llvm.return
}
}
{-#
dialect_resources: {
builtin: {
tensor_2_2.float32: "0x040000009FC167BEE27636BDADC405BE3115353D"
}
}
#-}
My questions here are:
- Why can't the passes that deal with constant folding / constant simplification handle constants defined with dialect_resources?
- Would it be reasonable to extend them to handle dialect_resources?
I don’t know much about this dialect_resources mechanism for constants. I only know that when dealing with Torch models that have been converted via torch-mlir, we end up with MLIR modules that contain constants defined with dialect_resources.
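My naive guess (please correct me) is that most folders read a constant’s values by casting its attribute to DenseElementsAttr, and a dense_resource constant is a different attribute class (DenseResourceElementsAttr) that fails the cast, so the fold silently bails out. A minimal sketch of the pattern I mean, not the actual linalg folding code:

#include "mlir/IR/BuiltinAttributes.h"

// Hypothetical illustration of my guess, not actual upstream code: many
// folds read constant values through DenseElementsAttr; a dense_resource
// constant is a DenseResourceElementsAttr, fails the dyn_cast, and the
// fold gives up, leaving the computation for runtime.
static mlir::Attribute foldTransposeOfConstant(mlir::Attribute input) {
  if (auto dense = mlir::dyn_cast<mlir::DenseElementsAttr>(input)) {
    // dense<[[1.0, 2.0], [3.0, 4.0]]> lands here: read
    // dense.getValues<float>(), apply the permutation, and return the
    // transposed DenseElementsAttr.
  }
  return {}; // dense_resource<...> falls through: "cannot fold".
}

Is that guess right?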
2) Where exactly is this dialect_resources mechanism defined? I’m asking both in terms of the documentation that describes it (in particular, the exact meaning of the first 4 bytes of the blob) and in terms of where it is implemented.
I did find the file llvm-project/mlir/include/mlir/IR/AsmState.h, but everything there is still pretty cryptic.
I also found the file llvm-project/mlir/unittests/Parser/ResourceTest.cpp, which implements some resource-renaming tests and lets me guess a bit of the encoding’s organization. In particular:
"test.use1"() {attr = #test.e1di64_elements<blob1> : tensor<3xi64> } : () -> ()
{-#
dialect_resources: {
test: {
blob1: "0x08000000010000000000000002000000000000000300000000000000"
}
}
#-}
This seems to suggest that there are 4 bytes of metadata (08000000 here) followed by 8 bytes (= 16 hex digits) for each int64:
0100000000000000 for the first int64, 0200000000000000 for the second int64, and 0300000000000000 for the third int64.
But how exactly are we supposed to interpret the metadata (0x08000000), which I assume somehow encodes the number of elements (3) and/or their type (int64)?
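For what it’s worth, if I decode the blob by hand, assuming (and this is only my guess) that the layout is a 4-byte little-endian word followed by a raw little-endian i64 payload, I do recover 1, 2, 3. The leading word decodes to 8, and the f32 blob above starts with 0x04000000 (= 4), which matches the element sizes, though I may be reading too much into that. A throwaway decoder:

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Hand-decode the blob from ResourceTest.cpp, assuming (my guess, not
// documented behavior) the layout [4-byte LE word][raw LE i64 payload].
// Also assumes a little-endian host, so memcpy reinterprets correctly.
int main() {
  const uint8_t blob[] = {
      0x08, 0x00, 0x00, 0x00,     // leading word: decodes to 8
      0x01, 0, 0, 0, 0, 0, 0, 0,  // first i64: 1
      0x02, 0, 0, 0, 0, 0, 0, 0,  // second i64: 2
      0x03, 0, 0, 0, 0, 0, 0, 0}; // third i64: 3
  uint32_t meta;
  std::memcpy(&meta, blob, sizeof(meta));
  std::printf("metadata word = %u\n", meta); // prints 8

  std::vector<int64_t> values((sizeof(blob) - 4) / 8);
  std::memcpy(values.data(), blob + 4, sizeof(blob) - 4);
  for (int64_t v : values)
    std::printf("%lld\n", (long long)v);     // prints 1, 2, 3
  return 0;
}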
3)
Imagine the more general situation where you have some code, in whatever form (probably some C++ code, compiled for a specific target, say Hexagon, and provided as part of the runtime), that performs some rearrangement of data.
I’m not sure of the mechanism for calling external functions, but I assume something like this exists in MLIR.
Now, when such a rearrangement is applied to a constant tensor, it would be reasonable to want the computation to happen at compile time.
What would be the best way to obtain that? Is there some infrastructure already in place in the MLIR ecosystem for this? The important point here is that these computations could be anything, and could potentially be external to the MLIR ecosystem (i.e., not implemented within any dialect).
If nothing exists for this currently, what would you suggest taking into account when designing it? I would like it to be general, so that people with various needs for computations on constants could use it. Many architectures, both GPUs and domain-specific processors, expect data to be arranged in a specific format for some operations, so I’m thinking something like this could be useful for various targets. Am I correct here? Is there interest from the community in this?
I was thinking of implementing something that would flag the constant terms, export their defining computations to a new module, compile that module for the host (x86), run it on the host, splice the results back into the IR as new constants, and continue with the normal flow of compiling for the target; a rough skeleton of the pass I have in mind is sketched below. Does that sound reasonable, or do you see better approaches?
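Everything in this skeleton is hypothetical: the pass name and the isTargetSpecificRearrangement() predicate are placeholders I made up, not existing MLIR APIs; only PassWrapper, the walk, and ExecutionEngine are real MLIR pieces.

#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/Pass.h"

namespace {
// Hypothetical pass: precompute target-specific data rearrangements of
// constants on the host, at compile time.
struct PrecomputeConstantRearrangements
    : mlir::PassWrapper<PrecomputeConstantRearrangements,
                        mlir::OperationPass<mlir::ModuleOp>> {
  // Placeholder predicate: however we decide to flag the rearrangement ops
  // whose results should be baked in at compile time.
  static bool isTargetSpecificRearrangement(mlir::Operation *op) {
    return false;
  }

  void runOnOperation() override {
    getOperation().walk([&](mlir::arith::ConstantOp cst) {
      for (mlir::Operation *user : cst->getUsers()) {
        if (!isTargetSpecificRearrangement(user))
          continue;
        // 1. Outline `user` and its constant operands into a fresh module.
        // 2. Compile and run that module on the host, e.g. via
        //    mlir::ExecutionEngine.
        // 3. Materialize the result as a new arith.constant, replace all
        //    uses of `user`'s result with it, and erase `user`.
      }
    });
  }
};
} // namespace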
Any suggestion is of course very welcome!
Many thanks.