Playing around with a C dialect and wondering how high-level it should be

Hello all! I’ve really been enjoying using MLIR in a hobby C compiler that I’m writing. I’ve had a lot of fun gradually building out a dialect for C. But I’ve had some lingering doubts as to how “high-level” the dialect ought to be. I’ll provide some examples below and would love to hear your thoughts.

Array-to-pointer conversion

My first prompt for your feedback/thoughts: should array-to-pointer conversion be modeled in the frontend’s AST, or should it be done implicitly in the MLIR dialect for C?

Take the following C program, which features array-to-pointer conversion (a.k.a. “decay”):

int puts(char *);

int main() {
  char s[1];
  s[0] = 'a';
  return puts(s);
}

For this program my hobby compiler produces an AST that’s nearly identical to Clang’s (godbolt link), in that both include AST nodes representing the implicit array-to-pointer conversion:

FunctionDeclaration [3:1,7:2) 'main' int
  CompoundStatement [3:12,7:2)
    DeclarationStatement [4:12,4:13)
      VariableDeclaration [4:8,4:9) 's' char[1]
    ExpressionStatement [5:3,5:14)
      BinaryOperatorExpression [5:3,5:13) assign char rvalue
        ArraySubscriptExpression [5:3,5:7) char lvalue
          ImplicitCastExpression [5:3,5:4) array-to-pointer char* rvalue
            IdentifierExpression [5:3,5:4) 's' char[1] lvalue
          IntegerLiteralExpression [5:5,5:6) `0` int rvalue
        CharacterLiteralExpression [5:10,5:13) `'a'` char rvalue
    ReturnStatement [6:3,6:18)
      CallExpression [6:10,6:17) int rvalue
        IdentifierExpression [6:10,6:14) 'puts' int lvalue
        ImplicitCastExpression [6:15,6:16) array-to-pointer char* rvalue
          IdentifierExpression [6:15,6:16) 's' char[1] lvalue

My hobby compiler lowers each of these AST nodes into an operation. The array-to-pointer implicit cast expression is lowered to a c.decay operation:

func @main() -> !c.int {
  // Like Clang, my hobby compiler uses an alloca for the
  // return value of non-void functions:
  %0 = c.alloca 'return value', !c.int : !c<"int*">
  %1 = c.constant 0 : !c.int
  c.store %1, %0 : !c<"int*">

  // Declaration of `char s[1]`:
  %2 = c.alloca 's', !c<"char[1]"> : !c<"char[1]*">
  // Store of `s[0] = 'a'`:
  %3 = c.constant 97 : !c.char
  %4 = c.decay %2 : !c<"char*">
  %5 = c.constant 0 : !c.int
  %6 = c.subscript %4[%5] : !c<"char*">
  c.store %3, %6 : !c<"char*">
  // Call of `puts(s)`:
  %7 = c.decay %2 : !c<"char*">
  %8 = c.call @puts(%7) : !c.int
  // Return statement `return <result of call to puts>`:
  c.store %8, %0 : !c<"int*">
  br ^bb1

^bb1:  // pred: ^bb0
  %9 = c.load %0 : !c.int
  c.return %9 : !c.int
}

My hobby compiler provides conversion patterns that lower its C dialect operations to the LLVM IR dialect. c.subscript and c.decay are lowered into llvm.getelementptr:

// Declaration of `char s[1]`:
%3 = llvm.mlir.constant(1 : i64) : !llvm.i32
%4 = llvm.alloca %3 x !llvm<"[1 x i8]"> {alignment = 4 : i64} : (!llvm.i32) -> !llvm<"[1 x i8]*">
// Store of `s[0] = 'a'`...
%5 = llvm.mlir.constant(97 : i8) : !llvm.i8
%6 = llvm.mlir.constant(0 : i32) : !llvm.i32
// ...first `c.decay` is lowered as a GEP to retrieve a pointer to `s`...
%7 = llvm.getelementptr %4[%6, %6] : (!llvm<"[1 x i8]*">, !llvm.i32, !llvm.i32) -> !llvm<"i8*">
// ...then `c.subscript` is lowered as a GEP into the 0th element of `s`:
%8 = llvm.mlir.constant(0 : i32) : !llvm.i32
%9 = llvm.getelementptr %7[%8] : (!llvm<"i8*">, !llvm.i32) -> !llvm<"i8*">
llvm.store %5, %9 : !llvm<"i8*">
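
For intuition, the addresses those two GEPs compute can be written directly in C. This is my own illustration, not output from the compiler:

#include <assert.h>

int main(void) {
  char s[1];
  char *decayed = s;           // what `c.decay` computes: a GEP with indices [0, 0]
  char *element = &decayed[0]; // what `c.subscript` computes: a GEP with index [0]
  assert(decayed == element);  // both point at the first element of `s`
  return 0;
}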

Using my hobby compiler’s current behavior (which is nearly identical to Clang’s, except that Clang lowers directly to LLVM IR and doesn’t add the intermediary MLIR), the C dialect does very little. The c.decay and c.subscript operations are “synonyms” for the LLVM IR getelementptr instruction.

However, it occurs to me that it doesn’t need to be this way. An alternative is that c.subscript and c.call automatically decay array arguments, like so:

  // Declaration of `char s[1]`:
  %2 = c.alloca 's', !c<"char[1]"> : !c<"char[1]*">

  // Store of `s[0] = 'a'`...
  %3 = c.constant 97 : !c.char
  %4 = c.constant 0 : !c.int
  // ...'c.subscript' knows to decay
  // `!c<"char[1]*">` to `!c<"char*">`:
  %5 = c.subscript %2[%4] : !c<"char*">
  c.store %3, %5 : !c<"char*">

  // Call of `puts(s)`. Again, the alloca for `s`
  // is automatically decayed:
  %6 = c.call @puts(%2) : !c.int

Performing decay “automatically” in the C IR makes the IR much nicer, and it more closely resembles the source C program. On the other hand, it keeps the implicit conversions implicit, which potentially makes the IR less useful for analysis. I asked on Twitter about how useful these implicit conversions are (link to my tweet), and was told that they were very useful in Microsoft’s implementation of Checked C (link to tweets 1 and 2, link to checkedc project on GitHub).
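
To make the analysis angle concrete, here is a small illustration of my own (not taken from the tweets): once an array decays to a pointer, the element count is no longer part of the type, which is exactly the information a bounds checker like Checked C wants to track. An explicit c.decay operation marks the precise point where that information is dropped.

#include <stddef.h>

size_t count(int *p) {
  // Inside the callee the original array bound is gone:
  // sizeof(p) is the size of a pointer, not of the array.
  return sizeof(p) / sizeof(*p); // not 8
}

int main(void) {
  int a[8];
  return (int)count(a); // the call decays int[8] to int*
}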

Lvalue-to-rvalue conversion

The tweet I linked to above specifically calls out lvalue-to-rvalue conversion as being useful for analysis. My second prompt for your feedback: should a C IR model this explicitly?

Once again here’s a small example program and the AST my hobby compiler produces (which again is nearly identical to Clang’s, as you can see in this godbolt link):

int main() {
  int a;
  a = 1;
  return a;
}
ExpressionStatement [3:3,3:9)
  BinaryOperatorExpression [3:3,3:8) assign int rvalue
    IdentifierExpression [3:3,3:4) 'a' int lvalue
    IntegerLiteralExpression [3:7,3:8) `1` int rvalue
ReturnStatement [4:3,4:12)
  ImplicitCastExpression [4:10,4:11) lvalue-to-rvalue int rvalue
    IdentifierExpression [4:10,4:11) 'a' int lvalue

Here’s my C dialect’s representation of this program:

func @main() -> !c.int {
  %0 = c.alloca 'return value', !c.int : !c<"int*">
  %1 = c.constant 0 : !c.int
  c.store %1, %0 : !c<"int*">

  // Declaration of `int a`:
  %2 = c.alloca 'a', !c.int : !c<"int*">
  // Store of `a = 1`:
  %3 = c.constant 1 : !c.int
  c.store %3, %2 : !c<"int*">
  // Implicit lvalue-to-rvalue conversion of `a`:
  %4 = c.load %2 : !c.int
  // Return of <`a` converted to an rvalue>:
  c.store %4, %0 : !c<"int*">
  br ^bb1

^bb1:  // pred: ^bb0
  %5 = c.load %0 : !c.int
  c.return %5 : !c.int
}

Again, the implicit cast expression is represented concretely as a node in the AST and as a c.load operation in the C IR. And, once again, it strikes me that it doesn’t need to be this way. For example, this could be an alternative:

func @main() -> !c.int {
  // Replace 'c.alloca' with 'c.declare' to declare a variable:
  %0 = c.declare 'return value', !c.int : !c<"int.lvalue">
  %1 = c.constant 0 : !c.int
  c.store %1, %0 : !c<"int.lvalue">

  // Declaration of `int a`:
  %2 = c.declare 'a', !c.int : !c<"int.lvalue">
  // Store of `a = 1`:
  %3 = c.constant 1 : !c.int
  c.store %3, %2 : !c<"int.lvalue">
  // Return of `a`. The alloca in `%2` is implicitly
  // loaded as an rvalue:
  c.store %2, %0 : !c<"int.lvalue">
  br ^bb1

^bb1:  // pred: ^bb0
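  // `%0` holds an lvalue; `c.return` implicitly
  // loads it as an rvalue: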
  c.return %0 : !c.int
}

Once again, the tradeoff I see here is that while the alternative C IR more closely resembles the C source program, it hides the implicit conversions that are taking place.
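
As a small sketch of what gets hidden (my own example, not from the thread): every use of an lvalue in an expression is a separate lvalue-to-rvalue conversion, and the explicit encoding gives each one its own c.load, so a pass can reason about every memory access individually.

int twice(int *p) {
  // Two reads of the lvalue *p, hence two lvalue-to-rvalue
  // conversions: with explicit modeling the C dialect emits two
  // distinct c.load ops, and a pass can decide whether they may
  // be merged. With implicit modeling, a consumer of the SSA value
  // for `p` has to work out from context whether the address or
  // the loaded value is meant.
  return *p + *p;
}

int main(void) {
  int x = 21;
  return twice(&x); // 42
}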

Explicit vs. implicit “usual arithmetic conversions”

C standard 6.3.1.8 describes conversions between arithmetic types that are collectively known as the “usual arithmetic conversions”. This is my last prompt, and as with all the prompts above, my compiler and Clang both model these conversions in the AST:

int main() { return 'c'; }
ReturnStatement [1:14,1:25)
  ImplicitCastExpression [1:21,1:24) arithmetic int rvalue
    CharacterLiteralExpression [1:21,1:24) `'c'` char rvalue

My C dialect models this cast in the IR:

%2 = c.constant 99 : !c.char
%3 = c.cast %2 : !c.int
// Storing to the return value alloca, of type `!c<"int*">`:
c.store %3, %0 : !c<"int*">

But once again, an alternative could be to perform the conversion implicitly:

%2 = c.constant 99 : !c.char
// `!c.char` is implicitly converted to a `!c.int` before being
// stored to the return value alloca, of type `!c<"int*">`:
c.store %2, %0 : !c<"int*">
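
For what it’s worth, here is a classic example (my own, not from the post) of why keeping 6.3.1.8 visible can pay off: the usual arithmetic conversions can silently turn a signed comparison into an unsigned one, and an explicit c.cast gives analyses and diagnostics something concrete to point at.

#include <stdio.h>

int main(void) {
  // By 6.3.1.8, the int operand -1 is converted to unsigned int,
  // i.e. to UINT_MAX, so the comparison below is false.
  if (-1 < 1u)
    puts("signed comparison");
  else
    puts("unsigned comparison"); // this branch is taken
  return 0;
}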

Feedback is welcome!

My gut tells me that modeling these implicit conversions explicitly in the C IR is the right thing to do. This means the C IR does not completely resemble the C source program, and while that makes it less aesthetically pleasing, it makes the IR easier for MLIR passes to analyze and optimize. Let me know if you disagree, or what you think!

This is a very cool project, Brian, thanks for sharing your journey!

I think there is a fairly inherent tension here: you can make things simpler and more orthogonal by making all (implicit) conversions explicit in the IR, or you can bury them in various operations (like subscript, etc.), and both are conceptually identical in expressive capability. The tradeoff here is that an explicit representation makes analysis and transformation simpler and makes the semantics of things like subscript easier to describe (good for everything in the system), but it comes at a higher memory cost.

If I had to do it again, I would make those conversions explicit. As you mention, lvalue-to-rvalue conversion in C has “memory access” semantics, which always has to be materialized for volatile accesses and is useful to know about in static analysis tools.

You mention that you’re concerned about whether the IR “closely resembles the C source program” - I don’t think that is a concern if your IR captures strictly more information than what exists in the source. Things that don’t want to know about the conversions can ignore the implicit_decay ops or whatever.

I’m curious to know what others (e.g. @rjmccall) think about this,

-Chris


Well, it depends on what you’re trying to represent in your MLIR dialect. If it’s meant to represent the underlying high-level semantics of C, I would encourage you to eliminate concepts like array-to-pointer decay and subscripting when lowering to it, since both are explicitly defined in terms of other semantic concepts.
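
For example, the standard defines E1[E2] as (*((E1)+(E2))) (6.5.2.1), with the array operand decaying to a pointer first, so a dialect that models pointer arithmetic and dereference can lower subscripting away entirely:

int main(void) {
  char s[1];
  s[0] = 'a';     // defined to be identical to the line below:
  *(s + 0) = 'a'; // (*((s)+(0))), with s decayed to char*
  return 0;
}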

When you’re just implementing a simplified C, it’s very tempting to model lvalue-to-rvalue conversion implicitly, since there are a ton of interesting places where you can treat it specially in the semantics. Modeling it explicitly becomes a lot more important when you start implementing the effects of interesting qualifiers; many of those qualifiers are extensions, but the core language does have volatile, and it becomes a lot trickier to implement volatile semantics correctly without explicit modeling.
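
A minimal sketch of the volatile case (my example):

volatile int ready; // imagine this is set asynchronously, e.g. by a signal handler

int spin(void) {
  int n = 0;
  // Each test of `ready` is an lvalue-to-rvalue conversion that must
  // be materialized as a distinct load; folding the loads together
  // would miscompile the loop. An explicit load op in the IR states
  // this directly, while an implicit encoding has to special-case
  // volatile everywhere the conversion can occur.
  while (!ready)
    n++;
  return n;
}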


Hi,
This project seems very interesting and relevant to what we need. I see the thread here is from almost 3 years ago, so I wonder if you continued with this.

On and off, yes, but I’ve mostly abandoned it at this point. There’s work being done to implement a Clang IR, which I think is very promising: ClangIR · A new high-level IR for clang.