[RFC] New `stack` dialect

I’d like to receive some feedback on a possible new dialect for modeling stack-based languages, such as CIL, JVM, and Ethereum bytecode.

Goal

Being able to convert between a stack-based language’s semantics and MLIR’s SSA-based semantics.

New types

This dialect introduces one new type, !stack.stack, which represents an opaque value used to implement monadic handling of the stack state.

New operations

This dialect introduces two new operations: stack.push : (!stack.stack, T) -> !stack.stack and stack.pop : (!stack.stack) -> (!stack.stack, T). Their semantics are what you would expect: they manipulate the stack they receive as input and return the modified stack. Both are Pure operations.
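For instance, a push followed by a pop threads the stack state explicitly (written in the generic form used throughout this post; %stk0 stands for some incoming stack value):

%c = arith.constant 42 : i32
%stk1 = "stack.push"(%stk0, %c) : (!stack.stack, i32) -> !stack.stack
%stk2, %v = "stack.pop"(%stk1) : (!stack.stack) -> (!stack.stack, i32)
// %v is 42, and %stk2 represents the same stack state as %stk0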

Example

The following pseudocode implements Euclid’s algorithm for computing the greatest common divisor:

:start
dup
push 0
eq
cbr :end :loop

:loop
dup
rol3
mod
br :start

:end
pop   ; discard the zero
; the top of the stack now holds the result, a
ret

which can be translated into the following MLIR module:

builtin.module {
  func.func @main(%arg : !stack.stack) -> !stack.stack {
    "cf.br"(%arg) [^2] : (!stack.stack) -> ()
  ^2(%0 : !stack.stack):
    %stk, %1 = "stack.pop"(%0) : (!stack.stack) -> (!stack.stack, i32)
    %stk_1 = "stack.push"(%stk, %1) : (!stack.stack, i32) -> !stack.stack
    %stk_2 = "stack.push"(%stk_1, %1) : (!stack.stack, i32) -> !stack.stack
    %2 = arith.constant 0 : i32
    %stk_3 = "stack.push"(%stk_2, %2) : (!stack.stack, i32) -> !stack.stack
    %stk_4, %3 = "stack.pop"(%stk_3) : (!stack.stack) -> (!stack.stack, i32)
    %stk_5, %4 = "stack.pop"(%stk_4) : (!stack.stack) -> (!stack.stack, i32)
    %5 = arith.cmpi eq, %3, %4 : i32
    %stk_6 = "stack.push"(%stk_5, %5) : (!stack.stack, i1) -> !stack.stack
    %stk_7, %6 = "stack.pop"(%stk_6) : (!stack.stack) -> (!stack.stack, i1)
    "cf.cond_br"(%6, %stk_7, %stk_7) [^0, ^1] <{"operandSegmentSizes" = array<i32: 1, 1, 1>}> : (i1, !stack.stack, !stack.stack) -> ()
  ^0(%7 : !stack.stack):
    %stk_8, %8 = "stack.pop"(%7) : (!stack.stack) -> (!stack.stack, i32)
    func.return %stk_8 : !stack.stack
  ^1(%9 : !stack.stack):
    %stk_9, %10 = "stack.pop"(%9) : (!stack.stack) -> (!stack.stack, i32)
    %stk_10 = "stack.push"(%stk_9, %10) : (!stack.stack, i32) -> !stack.stack
    %stk_11 = "stack.push"(%stk_10, %10) : (!stack.stack, i32) -> !stack.stack
    %stk_12, %11 = "stack.pop"(%stk_11) : (!stack.stack) -> (!stack.stack, i32)
    %stk_13, %12 = "stack.pop"(%stk_12) : (!stack.stack) -> (!stack.stack, i32)
    %stk_14, %13 = "stack.pop"(%stk_13) : (!stack.stack) -> (!stack.stack, i32)
    %stk_15 = "stack.push"(%stk_14, %12) : (!stack.stack, i32) -> !stack.stack
    %stk_16 = "stack.push"(%stk_15, %11) : (!stack.stack, i32) -> !stack.stack
    %stk_17 = "stack.push"(%stk_16, %13) : (!stack.stack, i32) -> !stack.stack
    %stk_18, %14 = "stack.pop"(%stk_17) : (!stack.stack) -> (!stack.stack, i32)
    %stk_19, %15 = "stack.pop"(%stk_18) : (!stack.stack) -> (!stack.stack, i32)
    %16 = arith.remui %14, %15 : i32
    %stk_20 = "stack.push"(%stk_19, %16) : (!stack.stack, i32) -> !stack.stack
    "cf.br"(%stk_20) [^2] : (!stack.stack) -> ()
  }
}

which is then transformed into the following:

builtin.module {
  func.func @main(%arg : !stack.stack, %a : i32, %b : i32) -> (!stack.stack, i32) {
    "cf.br"(%arg, %a, %b) [^2] : (!stack.stack, i32, i32) -> ()
  ^2(%0 : !stack.stack, %17 : i32, %18 : i32):
    %2 = arith.constant 0 : i32
    %5 = arith.cmpi eq, %2, %18 : i32
    "cf.cond_br"(%5, %0, %17, %18, %0, %17, %18) [^0, ^1] <{"operandSegmentSizes" = array<i32: 1, 3, 3>}> : (i1, !stack.stack, i32, i32, !stack.stack, i32, i32) -> ()
  ^0(%7 : !stack.stack, %19 : i32, %20 : i32):
    func.return %7, %19 : !stack.stack, i32
  ^1(%9 : !stack.stack, %21 : i32, %22 : i32):
    %16 = arith.remui %21, %22 : i32
    "cf.br"(%9, %22, %16) [^2] : (!stack.stack, i32, i32) -> ()
  }
}

Interactions with other dialects

I currently only have a PoC implementation (in xDSL) of some transformations that interact with the cf dialect. When these transformations are applied repeatedly in the correct order, they reach a fixpoint at which the program is completely rewritten in SSA form.

Push-Pop folding:

%stk1 = "stack.push"(%stk, %x) : (!stack.stack, T) -> !stack.stack
%stk2, %y = "stack.pop"(%stk1) : (!stack.stack) -> (!stack.stack, T)
foo(%stk2, %y)

can be rewritten as

foo(%stk, %x)

Common subexpression elimination

%stk1, %x = "stack.pop"(%stk) : (!stack.stack) -> (!stack.stack, T)
%stk2, %y = "stack.pop"(%stk) : (!stack.stack) -> (!stack.stack, T)
foo(%stk1, %x)
foo(%stk2, %y)

can be rewritten as

%stk1, %x = "stack.pop"(%stk) : (!stack.stack) -> (!stack.stack, T)
foo(%stk1, %x)
foo(%stk1, %x)

Arg pushing

   %stk1 = "stack.push"(%stk, %x) : (!stack.stack, T) -> !stack.stack
   "cf.br"(%stk1) [^1] : (!stack.stack) -> ()
^1(%stk2 : !stack.stack):
   foo(%stk2)

can be rewritten as

   %stk1 = "stack.push"(%stk, %x) : (!stack.stack, T) -> !stack.stack
   "cf.br"(%stk1) [^1] : (!stack.stack) -> ()
^1(%stk2 : !stack.stack):
   %stk3, %y = "stack.pop"(%stk2) : (!stack.stack) -> (!stack.stack, T)
   %stk4 = "stack.push"(%stk3, %y) : (!stack.stack, T) -> !stack.stack
   foo(%stk4)

Pop hoisting

   "cf.br"(%stk) [^1] : (!stack.stack) -> ()
^1(%stk1 : !stack.stack):
   %stk2, %x = "stack.pop"(%stk1) : (!stack.stack) -> (!stack.stack, T)
   foo(%stk2, %x)

can be rewritten as

   %stk1, %x = "stack.pop"(%stk) : (!stack.stack) -> (!stack.stack, T)
   "cf.br"(%stk1, %x) [^1] : (!stack.stack, T) -> ()
^1(%stk2 : !stack.stack, %y: T):
   %stk3 = "stack.push"(%stk2, %y) : (!stack.stack, T) -> !stack.stack
   %stk4, %z = "stack.pop"(%stk3) : (!stack.stack) -> (!stack.stack, T)
   foo(%stk4, %z)

How does the lowering work? It seems similar to the bufferization algorithm: there could be multiple stacks alive at the same time with this representation.

I have not thought about the lowering procedure yet, since that is not currently one of the use cases I have in mind for this.

In any case, I would expect any lowering procedure to fail if it detects multiple stacks with different states alive at the same time. I am not familiar with the bufferization algorithm, so I cannot really comment on that.

Maybe we should start there: what are the use cases?

My specific use cases would be to perform static analysis and maybe recompilation of stack-based bytecode programs, but this seems like a generic enough idea that I’d be interested in hearing other opinions on how it could be made more useful.

But wouldn’t recompilation require addressing the lowering question?

I would think that a stack-based bytecode would manipulate a stack with side-effecting operations instead of the SSA form you’re using here?

If the recompilation were to target another stack-based VM, that is true. If I wanted to target e.g. LLVM, that would not be needed.

Targeting LLVM is also a “lowering path”: how would such a lowering path avoid the issue?

I think I may be misinterpreting the meaning of “lowering path” here…

Here’s what I meant: in the example I made in the first post, I took a fully stack-based pseudocode program, translated it literally to MLIR using the stack dialect, then applied the transforms and arrived at a representation that only uses regular SSA form. What happens next is up to the final user; if I wanted to lower this further to LLVM, I guess I’d have to apply some conversion passes to turn the arith and func ops into LLVM ops.

The transformed code still has a value of !stack.stack type, even though it looks dead. If I understand correctly, your approach expects that all operations from this dialect will ultimately be folded/canonicalized away. How is this guaranteed?

I’d also encourage you to write something up about semantics of the stack type and the corresponding operations. It appears, for example, that it is a sort of functional representation where the stack object itself is immutable, allowing things like common subexpression elimination when popping from the same stack object, but this is not explained. Note that an imperative interpretation of a stack would be reading from/writing to memory and mutating the stack pointer, so side-effecting. This is what is ultimately available on most lowering targets, hence Mehdi’s question about lowering.

I’m also not very familiar with stack-based IRs, so the arg pushing and pop hoisting look strange to me. I’d rather want the inverse transformations that decrease the amount of IR, especially if the goal is to remove all of it. Could you elaborate on why they are useful?

I could see how a dialect like this could be useful for the purpose of lowering any of the stack-based IRs into MLIR. I am also curious whether there are any other use cases besides the “Stack → SSA” transformation.

One thing that is unclear to me in the above post is whether any constraints regarding stack depth are imposed on the stack dialect.
What makes the stack-based IRs of languages like JVM bytecode or Python bytecode easy to lower to SSA-based IRs is the constraint that the stack depth must be constant regardless of the control-flow path taken. I see no such constraint in the above dialect, making me curious how something like the following would be handled:

func.func @test() -> i32 {
  %empty = stack.empty_I_just_made_up()
  cf.br ^bb0(%empty : !stack.stack)

^bb0(%arg0 : !stack.stack):
  %i32 = call @get_random_i32() : () -> i32
  %pushed = "stack.push"(%arg0, %i32)
  %1 = call @random_cond() : () -> i1
  cf.cond_br %1, ^bb0(%pushed : !stack.stack), ^bb1(%pushed : !stack.stack)

^bb1(%arg1 : !stack.stack):
  %s, %value = "stack.pop"(%arg1) : (!stack.stack) -> (!stack.stack, i32)
  return %value : i32
}

The above would e.g. not be allowed in JVM bytecode. Is it a goal of the dialect to be able to lower this as well?

If you were to introduce a “constant stack depth regardless of control flow” constraint, then you could use any existing SSA construction algorithm. You could add this constraint to the dialect by making the stack depth a parameter of the stack type and having every push operation add 1 to the stack depth and every pop operation subtract 1. A typical “integer add” would then look like:

%empty = stack.empty() -> !stack.stack<0>
%pushed1 = stack.push %empty, %arg0 : (!stack.stack<0>, i32) -> !stack.stack<1>
%pushed2 = stack.push %pushed1, %arg1 : (!stack.stack<1>, i32) -> !stack.stack<2>
%popped1, %rhs = stack.pop %pushed2 : (!stack.stack<2>) -> (!stack.stack<1>, i32)
%popped2, %lhs = stack.pop %popped1 : (!stack.stack<1>) -> (!stack.stack<0>, i32)
%result = arith.addi %lhs, %rhs : i32

Control flow path dependent stacks would be impossible as you could not make a block argument with type mismatches:

  cf.br ^bb0(%empty : !stack.stack<0>)
^bb0(%arg0 : !stack.stack<???>):
  ...
  cf.cond_br %1, ^bb0(%pushed : !stack.stack<1>), ...

If your SSA lowering algorithm shouldn’t have to deal with type unsafety, you could also make the stack type parameterized on the list of types pushed so far, with the push operation appending a new type at the back of the type list and the pop operation removing the last type in the list.
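For illustration, pushes and pops on such a type-list-parameterized stack could look like this (hypothetical syntax, in the style of the example above):

%empty = stack.empty() -> !stack.stack<[]>
%pushed1 = stack.push %empty, %arg0 : (!stack.stack<[]>, i32) -> !stack.stack<[i32]>
%pushed2 = stack.push %pushed1, %arg1 : (!stack.stack<[i32]>, i1) -> !stack.stack<[i32, i1]>
%popped, %top = stack.pop %pushed2 : (!stack.stack<[i32, i1]>) -> (!stack.stack<[i32]>, i1)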

It isn’t. Much like the “CF-to-SCF lifter” can’t guarantee that no cf operation is left after the pass, this transformation can’t guarantee that every stack operation will be eliminated. In fact, given that this is supposed to run in a loop to find a fixpoint, I think it will actually never converge if the stack operations are not properly balanced.

I’m not claiming this is the only way to implement this transformation, merely that it is theoretically possible, and possibly useful for somebody else.

The idea is that arg pushing introduces pop operations in all successor blocks of a branch instruction, while pop hoisting converts those pops into block arguments, thus making the data flow explicit.
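To make the interplay concrete, here is a sketch in the same notation as above: starting from the “arg pushing” example

   %stk1 = "stack.push"(%stk, %x) : (!stack.stack, T) -> !stack.stack
   "cf.br"(%stk1) [^1] : (!stack.stack) -> ()
^1(%stk2 : !stack.stack):
   foo(%stk2)

a round of arg pushing, pop hoisting and push-pop folding converges to

   "cf.br"(%stk, %x) [^1] : (!stack.stack, T) -> ()
^1(%stk2 : !stack.stack, %y : T):
   foo(%stk2, %y)

where the pushed value has become an ordinary block argument.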

I’m unsure whether we want to add this sort of constraint to the dialect itself, or whether we want end users to be able to express whatever they need. I agree it would be useful to have these constraints stated upfront.

That is an interesting approach. For now I’ve just been doing unification as I go along; I wonder if this would play nicer, though.

Generally, I think a stack dialect would be interesting, but such a dialect is usually strongly tied to the source languages’ semantics, and it can be cumbersome to run existing passes with the dialect intermixed with arith, math, etc.

From the little I know of stack-based languages and IRs, they use the stack as if it were registers, so pop and push become load and store, and we can’t just reorder them without doing some data-dependency analysis.

It’d also be interesting to investigate a Wasm implementation in MLIR, where @zero9178’s point about stack restrictions would be stressed from a sandboxing point of view.

Yeah, I tried to come up with a design for this, but other than having stack.stack carry a list of the current types (and being too verbose), we fall back to not having anything and enforcing type safety as a validation/canonicalization pass in between transforms.

I’m not sure I follow on pop and push as load and store: stack-based languages may still have a concept of memory, separate from the stack, on which they execute load and store.

But you’re absolutely correct that the reordering is a no-no: it’s why I envisioned the push and pop operations behaving monadically. Making the flow from one stack state to the next explicit eliminates the possibility of reordering them without modifying behavior.
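For example, in the chain below the second pop consumes the stack value produced by the first, so moving it above the first pop would break the use-def chain:

%stk1, %x = "stack.pop"(%stk0) : (!stack.stack) -> (!stack.stack, i32)
// this pop uses %stk1, so it cannot be reordered above the previous one
%stk2, %y = "stack.pop"(%stk1) : (!stack.stack) -> (!stack.stack, i32)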

This was my mental model, probably wrong, ignore me.

Interesting. How do you deal with conditionals, specifically in the context of a stack.stack semantics?

Example:

  %stk_0 = stack.push 10
  %stk_1 = stack.push 20
  if (cond) {
    %stk_2 = stack.push 30
  }
  %stk_3, %0 = stack.pop // 20 or 30?

This becomes three basic blocks with PHI nodes, with %0 being either 20 or 30. The users of %0 can be arbitrarily complex, and because the state of each pop is now associated with the PHI node %stk_3 = φ(%stk_1, %stk_2), every time you modify the CFG above %stk_3 you may have to re-run the analysis on the state of the whole stack downstream, no?

Again, assume ignorance where things don’t make sense. But my impression of stack-based IRs is that they’re usually low-level, like assembly, because it’s hard to run too many optimizations on them, especially high-level ones.

It’s possible your intention with this is to just plug it in at the end of some pipeline, avoiding all high-level transforms, but you still need to stop other passes from (unknowingly) destroying the state of the stack.

The way I would rewrite your example in this potential dialect is the following (pardon the pseudo-MLIR):

^0:
  %stk = stack.init
  %stk_0 = stack.push %stk, 10
  %stk_1 = stack.push %stk_0, 20
  cond_br %cond, [^1](%stk_1), [^2](%stk_1)

^1(%stk_2):
  %stk_3 = stack.push %stk_2, 30
  br [^2](%stk_3)

^2(%stk_4):
  %stk_5, %0 = stack.pop %stk_4

Now the various pushes before the branches can be pushed forward across the branch, and the resulting pops (including the one already present) can be hoisted backwards. This process will need to run until a satisfactory state is reached; again, it’s not necessarily a terminating process. For example, in this case one path will always result in the stack being larger than in the other when the control flow reaches ^2.

I believe (without proof) that recovering a perfect SSA form in the presence of arbitrary control flow, without guarantees about the size of the stack at the beginning and end of each block, is undecidable. I think that most generated code will not fall into this “pathological” category, though.

Yes! This sums up what I was trying to say, thanks!

My worry is not that we have to deal with this now; it’s that we need to protect against other passes unintentionally introducing pathological cases via transformations of the surrounding code.

This would have to be a design decision on the stack operations to prevent such rewrites (possibly enough to just tag them with memory side effects or something). So, all in all, not a huge deal, just another point in the design space.

Some colleagues have mentioned that creating a new dialect might not be the best idea, and that I should instead think about creating a set of interfaces / traits and making the transformations operate on those.

I have not looked into it in much detail, but I’d love to hear other opinions on the subject.

If you create interfaces / traits, you’ll have to add them to existing ops. The question then becomes: “which ops would have stack traits?”

One way would be to have something like memref.alloc possess an address space that relates to a “stack”, encoded in some generic way (like a push-pop-semantics trait).
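As a rough sketch of that direction (the #stack.frame memory space and the associated trait are made up for illustration):

// hypothetical: an allocation tagged with a dedicated "stack" memory space,
// whose accesses would carry push-pop ordering semantics via a trait
%slot = memref.alloc() : memref<i32, #stack.frame>
memref.store %x, %slot[] : memref<i32, #stack.frame>
%v = memref.load %slot[] : memref<i32, #stack.frame>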

But what you propose is for scalar types (and bytecode), so I’m not sure how you’d do that without an op to control it. Perhaps try to add this to the LLVM dialect and misuse llvm.alloca? That doesn’t sound like an attractive option.

You should also design how you’ll lower this to LLVM. That will probably be the most enlightening next step: it’ll tell you what you really need in the end, and may even show you that you don’t need the dialect in the first place (been there, done that, three times).

Best case scenario, it will show you very clearly what ops/interfaces/traits you need (if any), with use cases to help you design the space. There’s nothing quite like a good PoC. :smile:

This seems to be part of the effort to make MLIR suitable for general-purpose languages, while research based on MLIR has mainly focused on optimizing computations used in AI. As I am trying to adapt MLIR to implement a general-purpose language, your suggestion is quite interesting to me.

Here’s what I thought about this while implementing my stack-based language that lowers to MLIR.
Many compiled languages distinguish between l-values and r-values, where l-values are normally stored on the stack and r-values are treated as SSA values in LLVM IR. In MLIR, there wasn’t a clear way of doing this, since there were no operations that could model this difference (except for memref, which is limited to representing tensor-like data).

The reason for doing this is that l-values can be used anywhere in the program where the stack is valid, and it’s easier to implement mutable variables, since they are stored in memory rather than as SSA values. The specific semantics will depend on the programming model, but it might not be a bad idea to provide an interface in MLIR for someone who wants to implement these kinds of semantics.

Therefore, I would suggest extending this idea with l-value types, where l-values are always stored on the stack (via the stack.push operation in this example) and later optimized with the mem2reg pass.

Suggestions

  1. Make the stack.push operation return !stack.lvalue<type>, i.e. stack.push(%input) -> !stack.lvalue<type>
    • Instead of modeling !stack.stack as a type, it may be more natural to view the stack as part of the program state and make push return the stack-allocated value, which can be accessed later as an l-value.
  2. Provide stack.push only, and make values pop automatically when their scope ends.
    • We should only allow using them in region-based operations with AutomaticAllocationScope.
    • This would remove the need to pop stack-allocated data explicitly, reducing the chance of memory leaks or segmentation faults.
  3. Provide a stack.lookup { offset : index } (%input) -> !stack.lvalue<type> operation which would return memory from the current stack frame.
    • This operation can look up contents stored on the stack, starting from the top. The offset would be a static value, since it is derived from the name of a variable in the frontend language, which is determined statically.
    • It’s quite natural for language compilers to look up local variables by tracking their offsets on the stack, given how they track local variables.
    • Maybe we could make stack.push return nothing, and only allow compilers to access stack contents through this operation.

Examples (& interaction with other dialects)

The following code written in Rust can be compiled as follows:

fn gcd(a: i32, b: i32) -> i32 {
    if b != 0 {
        let divided = a % b;
        gcd(b, divided)
    } else {
        a
    }
}

Here is an interpretation of the same program in MLIR using the stack dialect.

func.func @gcd(%a : i32, %b : i32) -> (i32) {
  // push a and b on stack since they are l-values
  stack.push %a
  stack.push %b
  // Perform a not-equal comparison to generate the condition.
  // We need not push this on the stack since it's an r-value.
  %cond = "arith.cmpi"(%a, %b) {predicate = 1 : i64} : (i32, i32) -> i1

  %x = scf.if %cond -> (i32) {
    // Look up the value at the top of the stack, which is 'b'
    %b_lv = stack.lookup { offset = 0 } : i32
    // Look up the value at top - 1 of the stack, which is 'a'
    %a_lv = stack.lookup { offset = 1 } : i32
    %divided = arith.remsi %a_lv, %b_lv : i32

    // Push divided since it's l-value (let divided = ...)
    stack.push %divided
    // 'divided' is now at the top of the stack. Look up the value.
    %divided_lv = stack.lookup { offset = 0 } : i32

    // Call the function
    %res = func.call @gcd(%b_lv, %divided_lv) : (i32, i32) -> i32
    scf.yield %res : i32
  } else {
    // Look up 'a' on the stack
    %a_lv = stack.lookup { offset = 1 } : i32
    scf.yield %a_lv : i32
  }

  // When the scope ends, stack-allocated values are automatically popped by moving the stack pointer in assembly.
  return %x : i32
}

This can be modeled quite naturally in LLVM IR using the llvm.alloca operation and the llvm.load and llvm.store instructions, providing the lowering path towards LLVM IR that rengolin suggested, while also providing the stack-management capability that frabert tried to implement.

The example I gave may seem quite inefficient due to the frequent memory loads and stores, but those could be removed by optimization passes such as mem2reg later on.
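For instance, a push of %a followed by a lookup of it might lower roughly as follows (a sketch; how lookup offsets map to alloca slots is left to the lowering pass):

%c1 = llvm.mlir.constant(1 : i64) : i64
// stack.push %a: reserve a slot in the current frame and store the value
%slot = llvm.alloca %c1 x i32 : (i64) -> !llvm.ptr
llvm.store %a, %slot : i32, !llvm.ptr
// stack.lookup { offset = ... }: reload the value from its slot
%a_lv = llvm.load %slot : !llvm.ptr -> i32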