Hi everyone,
I have a few questions regarding computations that involve constant tensors in MLIR.
1)
Let’s consider this small fragment of MLIR:
module {
func.func @func() -> tensor<2x2xf32> {
%0 = tensor.empty() : tensor<2x2xf32>
%cst = arith.constant dense<[[1.0, 2.0], [3.0, 4.0]]> : tensor<2x2xf32>
%transposed = linalg.transpose ins(%cst : tensor<2x2xf32>) outs(%0 : tensor<2x2xf32>) permutation = [1, 0]
return %transposed : tensor<2x2xf32>
}
}
This simply performs a transpose() of a constant 2x2 tensor.
Lowering to the LLVM dialect (LLVM IR embedded in MLIR) leads to:
module attributes {llvm.target_triple = "hexagon"} {
llvm.mlir.global private constant @__constant_2x2xf32(dense<[[1.000000e+00, 3.000000e+00], [2.000000e+00, 4.000000e+00]]> : tensor<2x2xf32>) {addr_space = 0 : i32, alignment = 64 : i64} : !llvm.array<2 x array<2 x f32>>
llvm.func @func() -> !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)> attributes {llvm.emit_c_interface} {
%0 = llvm.mlir.constant(0 : index) : i64
%1 = llvm.mlir.undef : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
%2 = llvm.mlir.constant(3735928559 : index) : i64
%3 = llvm.mlir.addressof @__constant_2x2xf32 : !llvm.ptr
%4 = llvm.mlir.constant(2 : index) : i64
%5 = llvm.mlir.constant(1 : index) : i64
%6 = llvm.getelementptr %3[0, 0, 0] : (!llvm.ptr) -> !llvm.ptr, !llvm.array<2 x array<2 x f32>>
%7 = llvm.inttoptr %2 : i64 to !llvm.ptr
[...]
Here the dense tensor [[1.0, 2.0], [3.0, 4.0]] has been replaced by [[1.000000e+00, 3.000000e+00], [2.000000e+00, 4.000000e+00]], i.e. the transpose() has been evaluated at compile time. Great!
However, if the constant tensor is defined using the “dialect_resources” mechanism, as in:
module {
func.func @func() -> tensor<2x2xf32> {
%0 = tensor.empty() : tensor<2x2xf32>
%cst = arith.constant dense_resource<tensor_2_2.float32> : tensor<2x2xf32>
%transposed = linalg.transpose ins(%cst : tensor<2x2xf32>) outs(%0 : tensor<2x2xf32>) permutation = [1, 0]
return %transposed : tensor<2x2xf32>
}
}
{-#
dialect_resources: {
builtin: {
tensor_2_2.float32: "0x040000009FC167BEE27636BDADC405BE3115353D"
}
}
#-}
Then we obtain code that performs the transpose() at runtime, and the constant tensor (defined via dialect_resources) stays exactly as provided:
module attributes {llvm.target_triple = "hexagon"} {
llvm.func @malloc(i64) -> !llvm.ptr
llvm.mlir.global private constant @__constant_2x2xf32(dense_resource<tensor_2_2.float32> : tensor<2x2xf32>) {addr_space = 0 : i32, alignment = 64 : i64} : !llvm.array<2 x array<2 x f32>>
llvm.func @func() -> !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)> attributes {llvm.emit_c_interface} {
%0 = llvm.mlir.constant(64 : index) : i64
%1 = llvm.mlir.undef : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
%2 = llvm.mlir.addressof @__constant_2x2xf32 : !llvm.ptr
%3 = llvm.mlir.constant(1 : index) : i64
%4 = llvm.mlir.constant(2 : index) : i64
%5 = llvm.mlir.constant(0 : index) : i64
%6 = llvm.mlir.zero : !llvm.ptr
%7 = llvm.getelementptr %2[0, 0, 0] : (!llvm.ptr) -> !llvm.ptr, !llvm.array<2 x array<2 x f32>>
%8 = llvm.getelementptr %6[4] : (!llvm.ptr) -> !llvm.ptr, f32
%9 = llvm.ptrtoint %8 : !llvm.ptr to i64
%10 = llvm.add %9, %0 : i64
%11 = llvm.call @malloc(%10) : (i64) -> !llvm.ptr
%12 = llvm.ptrtoint %11 : !llvm.ptr to i64
%13 = llvm.sub %0, %3 : i64
%14 = llvm.add %12, %13 : i64
%15 = llvm.urem %14, %0 : i64
%16 = llvm.sub %14, %15 : i64
%17 = llvm.inttoptr %16 : i64 to !llvm.ptr
%18 = llvm.insertvalue %11, %1[0] : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
%19 = llvm.insertvalue %17, %18[1] : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
%20 = llvm.insertvalue %5, %19[2] : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
%21 = llvm.insertvalue %4, %20[3, 0] : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
%22 = llvm.insertvalue %4, %21[3, 1] : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
%23 = llvm.insertvalue %4, %22[4, 0] : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
%24 = llvm.insertvalue %3, %23[4, 1] : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
llvm.br ^bb1(%5 : i64)
^bb1(%25: i64): // 2 preds: ^bb0, ^bb5
%26 = llvm.icmp "slt" %25, %4 : i64
llvm.cond_br %26, ^bb2, ^bb6
^bb2: // pred: ^bb1
llvm.br ^bb3(%5 : i64)
^bb3(%27: i64): // 2 preds: ^bb2, ^bb4
%28 = llvm.icmp "slt" %27, %4 : i64
llvm.cond_br %28, ^bb4, ^bb5
^bb4: // pred: ^bb3
%29 = llvm.mul %27, %4 : i64
%30 = llvm.add %29, %25 : i64
%31 = llvm.getelementptr %7[%30] : (!llvm.ptr, i64) -> !llvm.ptr, f32
%32 = llvm.load %31 : !llvm.ptr -> f32
%33 = llvm.mul %25, %4 : i64
%34 = llvm.add %33, %27 : i64
%35 = llvm.getelementptr %17[%34] : (!llvm.ptr, i64) -> !llvm.ptr, f32
llvm.store %32, %35 : f32, !llvm.ptr
%36 = llvm.add %27, %3 : i64
llvm.br ^bb3(%36 : i64)
^bb5: // pred: ^bb3
%37 = llvm.add %25, %3 : i64
llvm.br ^bb1(%37 : i64)
^bb6: // pred: ^bb1
llvm.return %24 : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
}
llvm.func @_mlir_ciface_func(%arg0: !llvm.ptr) attributes {llvm.emit_c_interface} {
%0 = llvm.call @func() : () -> !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
llvm.store %0, %arg0 : !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>, !llvm.ptr
llvm.return
}
}
{-#
dialect_resources: {
builtin: {
tensor_2_2.float32: "0x040000009FC167BEE27636BDADC405BE3115353D"
}
}
#-}
My questions here are:
- Why can't the passes that deal with constant folding / constant simplification handle constants defined with dialect_resources?
- Would it be reasonable to extend them to handle dialect_resources?
I don’t know much about this dialect_resources mechanism for constants. I only know that when dealing with Torch models that have been converted via torch-mlir, we end up with MLIR modules that contain constants defined with dialect_resources.
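My naive guess (please correct me) is that most folders read a constant’s values by casting its attribute to DenseElementsAttr, and a dense_resource constant is a different attribute class (DenseResourceElementsAttr) that fails the cast, so the fold silently bails out. A minimal sketch of the pattern I mean, not the actual linalg folding code:

#include "mlir/IR/BuiltinAttributes.h"

// Hypothetical illustration of my guess, not actual upstream code: many
// folds read constant values through DenseElementsAttr; a dense_resource
// constant is a DenseResourceElementsAttr, fails the dyn_cast, and the
// fold gives up, leaving the computation for runtime.
static mlir::Attribute foldTransposeOfConstant(mlir::Attribute input) {
  if (auto dense = mlir::dyn_cast<mlir::DenseElementsAttr>(input)) {
    // dense<[[1.0, 2.0], [3.0, 4.0]]> lands here: read
    // dense.getValues<float>(), apply the permutation, and return the
    // transposed DenseElementsAttr.
  }
  return {}; // dense_resource<...> falls through: "cannot fold".
}

Is that guess right?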
2) Where exactly is this dialect_resources mechanism defined? I’m asking both in terms of the documentation that describes it (in particular, the exact meaning of the first 4 bytes of the blob) and in terms of where it is implemented.
I did find the file llvm-project/mlir/include/mlir/IR/AsmState.h, but everything there is still pretty cryptic.
I also found the file llvm-project/mlir/unittests/Parser/ResourceTest.cpp, which implements some resource-renaming tests and lets me guess a bit of the encoding’s organization. In particular:
"test.use1"() {attr = #test.e1di64_elements<blob1> : tensor<3xi64> } : () -> ()
{-#
dialect_resources: {
test: {
blob1: "0x08000000010000000000000002000000000000000300000000000000"
}
}
#-}
This seems to suggest that there are 4 bytes of metadata (08000000 here) followed by 8 bytes (= 16 hex digits) for each int64:
0100000000000000 for the first int64, 0200000000000000 for the second int64, and 0300000000000000 for the third int64.
But how exactly are we supposed to interpret the metadata (0x08000000), which I assume somehow encodes the number of elements (3) and/or their type (int64)?
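For what it’s worth, if I decode the blob by hand, assuming (and this is only my guess) that the layout is a 4-byte little-endian word followed by a raw little-endian i64 payload, I do recover 1, 2, 3. The leading word decodes to 8, and the f32 blob above starts with 0x04000000 (= 4), which matches the element sizes, though I may be reading too much into that. A throwaway decoder:

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Hand-decode the blob from ResourceTest.cpp, assuming (my guess, not
// documented behavior) the layout [4-byte LE word][raw LE i64 payload].
// Also assumes a little-endian host, so memcpy reinterprets correctly.
int main() {
  const uint8_t blob[] = {
      0x08, 0x00, 0x00, 0x00,     // leading word: decodes to 8
      0x01, 0, 0, 0, 0, 0, 0, 0,  // first i64: 1
      0x02, 0, 0, 0, 0, 0, 0, 0,  // second i64: 2
      0x03, 0, 0, 0, 0, 0, 0, 0}; // third i64: 3
  uint32_t meta;
  std::memcpy(&meta, blob, sizeof(meta));
  std::printf("metadata word = %u\n", meta); // prints 8

  std::vector<int64_t> values((sizeof(blob) - 4) / 8);
  std::memcpy(values.data(), blob + 4, sizeof(blob) - 4);
  for (int64_t v : values)
    std::printf("%lld\n", (long long)v);     // prints 1, 2, 3
  return 0;
}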
3)
Imagine the more general situation where you have some code, in whatever form (probably some C++ code, compiled for a specific target, say Hexagon, and provided as part of the runtime), that performs some rearrangement of data.
I’m not sure of the mechanism for calling external functions, but I assume something like this exists in MLIR.
Now, when such a rearrangement is applied to a constant tensor, it would be reasonable to want the computation to happen at compile time.
What would be the best way to obtain that? Is there some infrastructure already in place in the MLIR ecosystem for this? The important point here is that these computations could be anything, and could potentially be external to the MLIR ecosystem (i.e., not implemented within any dialect).
If nothing exists for this currently, what would you suggest taking into account when designing it? I would like it to be general, so that people with various needs for computations on constants could use it. Many architectures, both GPUs and domain-specific processors, expect data to be arranged in a specific format for some operations, so I’m thinking something like this could be useful for various targets. Am I correct here? Is there interest from the community in this?
I was thinking of implementing something that would flag the constant terms, export their defining computations to a new module, compile that module for the host (x86), run it on the host, splice the results back into the IR as new constants, and continue with the normal flow of compiling for the target; a rough skeleton of the pass I have in mind is sketched below. Does that sound reasonable, or do you see better approaches?
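Everything in this skeleton is hypothetical: the pass name and the isTargetSpecificRearrangement() predicate are placeholders I made up, not existing MLIR APIs; only PassWrapper, the walk, and ExecutionEngine are real MLIR pieces.

#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/Pass.h"

namespace {
// Hypothetical pass: precompute target-specific data rearrangements of
// constants on the host, at compile time.
struct PrecomputeConstantRearrangements
    : mlir::PassWrapper<PrecomputeConstantRearrangements,
                        mlir::OperationPass<mlir::ModuleOp>> {
  // Placeholder predicate: however we decide to flag the rearrangement ops
  // whose results should be baked in at compile time.
  static bool isTargetSpecificRearrangement(mlir::Operation *op) {
    return false;
  }

  void runOnOperation() override {
    getOperation().walk([&](mlir::arith::ConstantOp cst) {
      for (mlir::Operation *user : cst->getUsers()) {
        if (!isTargetSpecificRearrangement(user))
          continue;
        // 1. Outline `user` and its constant operands into a fresh module.
        // 2. Compile and run that module on the host, e.g. via
        //    mlir::ExecutionEngine.
        // 3. Materialize the result as a new arith.constant, replace all
        //    uses of `user`'s result with it, and erase `user`.
      }
    });
  }
};
} // namespace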
Any suggestion is of course very welcome!
Many thanks.