Performance issues with memref.global and LLVM IR

Hello!

I was investigating compilation performance issues in my local project and found something strange.

In my project I use many memref.global constants of different sizes, from small 4-element arrays to large multi-dimensional constants for convolution weights.
In some cases, when the convolution weights were large, the compilation process took a very long time and consumed a lot of RAM. The bottleneck was in the MLIR IR to LLVM IR conversion and in the subsequent LLVM machine code generation. In the first part, most of the time was spent translating the global constants.
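
For reference, the kind of globals involved look roughly like this (the names and shapes here are made up for illustration):

memref.global "private" constant @small_bias : memref<4xf32> = dense<[0.0, 1.0, 2.0, 3.0]>
memref.global "private" constant @conv_weights : memref<256x128x5x5xf32> = dense<1.000000e+00>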

I added a simple pass that merges all constants into a single 1D i8 constant and replaces accesses to the old constants with memref.view operations into that merged constant (a sketch follows the numbers below). That significantly improved compilation time and reduced RAM usage. For example, on one network I got the following improvements:

  • 6 seconds vs 35 minutes compilation time
  • 800 MB vs 20 GB peak memory usage
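
A minimal sketch of what the pass produces for the two globals above (the offset and the zero-splat initializer are placeholders; the real pass emits the concatenated bytes and pads for each constant's alignment):

memref.global "private" constant @merged : memref<3276816xi8> = dense<0>

func.func @use() -> memref<256x128x5x5xf32> {
  // @conv_weights lives at byte offset 16, right after the 16 bytes of @small_bias.
  %base = memref.get_global @merged : memref<3276816xi8>
  %off = arith.constant 16 : index
  %w = memref.view %base[%off][] : memref<3276816xi8> to memref<256x128x5x5xf32>
  return %w : memref<256x128x5x5xf32>
}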

These results look quite strange to me. I expected some improvement, but not this large.

So, my question: why is the processing of large multi-dimensional global constants so inefficient in the MLIR->LLVM translation and the subsequent machine code generation?

Personally, I have one guess: LLVM, in contrast to MLIR, doesn’t support a single flat multi-dimensional constant, if I understood the MLIR->LLVM conversion correctly. So a single memref.global : memref<256x128x5x5> will be converted into 163840 LLVM constants of 5 elements each.
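
For example, after the MemRef->LLVM lowering the global above presumably carries a nested !llvm.array type, something like this (sketch with a splat initializer):

llvm.mlir.global private constant @conv_weights(dense<1.000000e+00> : tensor<256x128x5x5xf32>) {addr_space = 0 : i32} : !llvm.array<256 x array<128 x array<5 x array<5 x f32>>>>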

Interesting. I hit a similar problem when trying to build random dense globals in MLIR that needed to be an array of APFloat on similarly sized memrefs, and I got malloc errors trying to allocate a ridiculous amount of memory.

I could probably generate the random vector in memory, convert it to base64, and initialize it as a string, but in any case this might help you identify the source of the memory consumption (and the performance problems).

I’ve been told DenseResourceElementsAttr should make that easier, but I haven’t had time to investigate yet.
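
From what I’ve read, the idea is that the data lives in a resource blob instead of an inline dense literal, roughly like this (untested sketch: the key name is made up, the blob bytes are elided, and I’m assuming memref.global accepts a dense_resource initializer):

memref.global "private" constant @conv_weights : memref<256x128x5x5xf32> = dense_resource<conv_weights_blob>

{-#
  dialect_resources: {
    builtin: {
      conv_weights_blob: "0x08000000..."
    }
  }
#-}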

This is an MLIR solution to avoid hitting the MLIRContext with constants, but I don’t think that will help the translation to LLVM (assuming you still want the constant in the LLVM IR).

Ouch… Seems unfortunate.
I tried:

$ echo 'llvm.mlir.global external @gv2(dense<[[0.000000e+00, 1.000000e+00, 2.000000e+00], [3.000000e+00, 4.000000e+00, 5.000000e+00]]> : tensor<2x3xf32>) {addr_space = 0 : i32} : !llvm.array<2 x array<3 x f32>>' |  bin/mlir-translate --mlir-to-llvmir 
@gv2 = global [2 x [3 x float]] [[3 x float] [float 0.000000e+00, float 1.000000e+00, float 2.000000e+00], [3 x float] [float 3.000000e+00, float 4.000000e+00, float 5.000000e+00]]

So basically we pay some cost to form the multi-dimensional structure here?
Seems like we could emit a flat constant for the storage and cast it to the right shape maybe?
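
For comparison, a flat 1-D global with the same data already translates to a single flat constant, so the idea would be for the lowering (or the translator) to emit something along these lines and keep the reshaping on the access side (sketch):

llvm.mlir.global external @gv2_flat(dense<[0.000000e+00, 1.000000e+00, 2.000000e+00, 3.000000e+00, 4.000000e+00, 5.000000e+00]> : tensor<6xf32>) {addr_space = 0 : i32} : !llvm.array<6 x f32>

With opaque pointers the accesses go through GEPs on a plain pointer anyway, so (if I’m not mistaken) the value type of the global mostly affects how the initializer is materialized.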

Yes, I also have a feeling that it could be done more efficiently, either in the translator itself or even on the MLIR side in some MemRef->LLVM lowering pass. Should I create a ticket for that?

It can be worthwhile to track this in a ticket, yes (if you can include a reproducer, that would be nice).

I also encountered this issue, and realized that the translation seems to recursively create the LLVM constants one by one: llvm-project/ModuleTranslation.cpp at main · llvm/llvm-project · GitHub

Here is the function that translates global variables; it shows that a string variable is translated in one shot. It looks like if we pack the data as a string, we could potentially get a better result too.
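
Hypothetically, the packed form would look something like this on the LLVM-dialect side (the name is made up and the bytes shown are just two little-endian 1.0f values), which the translator can then turn into a single constant in one shot:

llvm.mlir.global internal constant @packed_weights("\00\00\80\3F\00\00\80\3F") : !llvm.array<8 x i8>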