Replace call stack with an equivalent on the heap?


I’m implementing a custom Haskell-to-LLVM compiler, and in my experimentation, noticed that GHC is much slower than clang certain examples, such as the ackermann function. However, from reading their respective IRs (Cmm for GHC and LLVM for clang), I don’t really see much of a difference. Here is a link to the numbers. (n, m) are the parameters to the ackermann function

One major difference is that GHC uses a “stack-in-the-heap”, in the sense that the a is a chunk of heap memory that functions effectively as call stack. It is managed explicitly by GHC. GHC does not use the native call stack at all. This is to implement continuation passing, in a sense.

I want to check that the difference in the performance is truly from this “stack in the heap” behavior, so I want to teach a backend to generate code that looks like this.

Is code generating like this something that I could reasonably teach the backend? To be specific, I wish to teach the backend to manipulate a “custom stack”. I assume this would affect codegen, because one can no longer use push/pop/call/ret.



If the indirect jumps via this stack are all made from the same location in the GHC runtime, then perhaps this might kill branch prediction. call/ret addresses are duplicated in a branch predictor’s own hardware stack. If the runtime doesn’t use call/ret, not even indirect calls, then this functionality is lost and general branch prediction logic has to kick in. The predictors may then either fail for a long time before a pattern is recognized via the jump site in the runtime, or may even not recognize it and always mispredict. There may be differences between perceptron and other techniques in this case, so see if it’s equally bad on chips that use perceptrons (some AMD?) if yours doesn’t.

There should be a smoking gun somewhere in the performance monitoring registers then, I’d hope. It’d be very obvious - persistent branch misprediction at the level of functions costs dearly.

I have rather cursory knowledge in this area so perhaps the state of the art is way ahead of my imagination, or perhaps I’m dead wrong.

Cheers, Kuba


Thanks for that idea, I had not considered that. I ran perf on the same examples as before. However, I do not see a large difference in branch mis-predicts. Link to gist with numbers here. Is there something else that I missing? The instruction mix, perhaps? The C version has roughly 1 more instruction per cycle. I am not sure how significant that is, however.

Is there some other way to pin down what is going on in terms of slowdown? (read the asm? profile?)