[RFC][OpenMP] Fix issue in MLIR to LLVMIR translation for delayed privatisation

tblah · September 12, 2024, 2:13pm

I have been working on [flang][OpenMP] Parallel region failure · Issue #106297 · llvm/llvm-project · GitHub and believe this shows a broader problem with the LLVM IR generation for delayed privatization. This RFC is to agree on a solution to this problem.

Here is a minimal reproducer

program bug
  implicit none
  integer :: table(10)
  !$OMP PARALLEL PRIVATE(table)
    table = 50
    if (any(table/=50)) then
      stop 'fail 3'
    end if
  !$OMP END PARALLEL
  print *,'ok'
End Program

I can only reproduce this at -O0 in this case, but I believe the same thing could happen for other code at different optimization levels.

Problem Description

Currently, the translation of privatization for omp.parallel from MLIR to LLVMIR (OpenMPToLLVMIRTranslation.cpp) works something like this

Initially the privatized variable is referenced via an MLIR block argument to the parallel region.

Replace (in MLIR) all usages of this block argument with the non-private copy of the variable (constructor for OmpParallelOpConversionManager). In this case, the shared copy of table is a global variable.
Generate LLVMIR for the body of the parallel region (bodyGenCB) using this incorrect reference to the shared copy of table
Run the privatization callback (privCB) which inlines the region which allocates the private variable and sets its llvmReplacementValue argument with what the private variable instance should be replaced with (in LLVMIR).
Replace references to the shared copy of table with the value for the private copy of table provided by the privatization callback in step 3 (OpenMPIRBuilder::createParallel).
Undo the changes to the MLIR (destructor for OmpParallelOpConversionManager)

The problem comes in step 4. This replacement does not work correctly if the value being replaced is an llvm::Constant instead of an llvm::Instruction. In the code example above, we are replacing the address of a global variable (treated as a constant) with a runtime value (an address returned by alloca). If the llvm::Constant is used by an llvm::Instruction we replace it successfully (e.g. where it is used in getelementptr to lower table/=50). However llvm::Constants can also be used as parts of other llvm::Constants (this is used to create a constant struct initializer used to populate the box passed to the runtime call to Assign for table = 50). In this case we cannot insert the result of the alloca into the compound constant (a constant cannot depend upon a runtime value). The conversion silently skips these cases and so not all references to the shared copy of table are converted.

The result of this is that the assignment table = 50 (or any other runtime call using the box descriptor) is performed on the shared copy of table not the private copy.

But what about clang?

Why doesn’t clang face the same issue? The privCB callback implemented by clang looks like a no-op (it sets the llvmReplacementValue equal to the source value so that no replacement is performed). I haven’t dug into this but presumably clang generates its own code for privatization in the body of the parallel region without relying upon the system in OpenMPIRBuilder. So far as I can tell, MLIR is the only real upstream usage of the privCB argument.

Potential Solutions

1. Implement transformations for instructions using the compound constants

In this case, the compound constant is used in a store. This store instruction can be rewritten to accept a runtime value for the descriptor’s data address field.

This is by far the simplest solution, but I do not think it is general enough. The replacement in step 4 happens in OpenMPIRBuilder, which is intended to share OpenMP LLVMIR generation amongst all LLVM frontends. We therefore cannot make any assumptions about what LLVM IR is contained inside of the parallel region. It is not possible in general to replace all uses of an llvm::Constant with a runtime value. Furthermore, it would be difficult to maintain a set of all of the possible transformations to allow instructions using constants to be changed to use runtime values. I have not found any existing helper in LLVM which already does this (but I am not very experienced in this area and might have missed something).

2. Don’t use `privCB` in `OpenMPToLLVMIRTranslation.cpp`

The steps enumerated in the problem description are too error prone. Compare this to how we create private copies of reduction variables:

Inline and convert to LLVMIR the allocation and initialization of the reduction variables
Set up a mapping from the MLIR block argument for the reduction variable to the LLVM value for the allocated private reduction variable
Generate LLVMIR for the body of the parallel region using the correct value for the private reduction variable from the start

Presumably this was not done for privatization because the privatization callback given to OpenMPIRBuilder runs after the body generation callback and so the llvm values for the private copies of the variable are not available at the time when the body of the parallel region is translated.

This solution would be to move privatization out of privCB and merge it with how reduction variables are processed. This could allow for some code sharing between privatization and reduction. This is my preferred solution.

The drawback of this approach is that it leaves the privCB callback unchanged in OpenMPIRBuilder, and this could lead to the same bug in other front ends.

3. Remove `privCB` from `OpenMPIRBuilder`

Follow on from solution (2) by removing this privatization callback which is no longer used in upstream frontends.

This would require broader buy-in from maintainers of OpenMPIRBuilder and might reduce future code sharing between frontends.

4. Move calls to `privCB` before `bodyGenCB` in `OpenMPIRBuilder`

This would allow keeping this callback design, possibly allowing for more code sharing between frontends, but by calling privCB before bodyGenCB it can work more like reduction variable privatization (described in solution 2).

This would require broader buy-in from maintainers of OpenMPIRBuilder and would be a silent semantics change to downstream code (it would still build but do something different).

kiranchandramohan · September 12, 2024, 2:28pm

@jdoerfert @Meinersbur Could you comment?

CC: @bob.belcher

agozillon · September 12, 2024, 2:38pm

Not sure how relevant/useful this is to the discussion so please do feel free to ignore it if it is, there’s a utility function in LLVM that converts constants to instructions that can help when you have to replace certain constants: llvm-project/llvm/lib/IR/ReplaceConstant.cpp at main · llvm/llvm-project · GitHub thought I’d just mention it incase it’s helpful in someway.

tblah · September 12, 2024, 2:45pm

Thank you this is very useful. This would make solution 1 more viable.

bhandarkar-pranav · September 16, 2024, 6:40pm

Thank you for this RFC because I think you have exposed a problem that I am going to encounter in my PR as well ([mlir][OpenMP] - Implement lowering from MLIR to LLVMIR for `private` clause on `target` constructs by bhandarkar-pranav · Pull Request #105471 · llvm/llvm-project · GitHub). My PR handles delayed privatization for non-allocatables for omp.target and there I inline (in MLIR though) the alloc region of the privatizer into the top of the target region while replacing the block argument of the alloc region with the value to be privatized. However, I suspect this’ll run into the same problem you are describing here.

I prefer Option 2. Would it be possible to move privCB to before bodygen?

tblah · September 17, 2024, 10:38am

Thanks for taking a look!

If I understand your PR correctly, you are replacing the block argument in MLIR not in LLVMIR (via the privatisation callback handling). MLIR does not have special treatment for constants and so I think this should work fine with your approach. However, I still think the approach used for reductions is cleaner because it doesn’t require in place modification of the MLIR.

Option 2 is most similar to your approach, except it performs the inlining and substitution when translating MLIR to LLVMIR instead of performing it in MLIR then afterwards translating the whole thing to LLVMIR.

Moving privCB before bodygen is option 4. I think this is possible but I wouldn’t want to take this approach without approval from people who work on OpenMPIRBuilder.

bhandarkar-pranav · September 17, 2024, 1:56pm

If I understand your PR correctly, you are replacing the block argument in MLIR not in LLVMIR (via the privatisation callback handling).

Yes, that’s correct.

MLIR does not have special treatment for constants and so I think this should work fine with your approach. However, I still think the approach used for reductions is cleaner because it doesn’t require in place modification of the MLIR.

Thank you. I am working on a pass that prepares privatized allocatables for target offload (the descriptor needs to be mapped). Once I am done, I’ll try the inlineConvertOmpRegions approach for my present PR.

Moving privCB before bodygen is option 4. I think this is possible but I wouldn’t want to take this approach without approval from people who work on OpenMPIRBuilder.

Makes sense.

kiranchandramohan · September 26, 2024, 10:53am

ping @jdoerfert Would you have an opinion here?

jansjodi · September 26, 2024, 4:21pm

I think option 2 and 3 make sense. If Clang already does this it would be easier to share code with clang in the future. When working on reductions for the GPU we were able to make Clang and MLIR share the same code generation in the OpenMPIRBuilder. If it is possible to have code shared between reductions and privatization that would also be nice since if there is an established pattern it would hopefully prevent similar bugs in the future.

tblah · October 10, 2024, 2:44pm

Pull request implementing option 2 [mlir][OpenMP] rewrite conversion of privatisation for omp.parallel by tblah · Pull Request #111844 · llvm/llvm-project · GitHub

Topic		Replies	Views
[RFC] Privatisation in OpenMP dialect MLIR	19	1431	February 17, 2022
Delayed privatization for `omp.wsloop` (or the need for pure `alloca` regions in `omp.private`) Flang mlir , openmp	2	55	January 9, 2025
Issues with the current implementation of privatization in OpenMP with Fortran Flang	10	751	May 16, 2022
Dialect for data locality/sharing specifiers/clauses in OpenMP, OpenACC, and `do concurrent` Flang mlir , openmp	9	150	May 5, 2025
[RFC][OpenMP] supporting delayed task execution with firstprivate variables Flang openmp	8	294	January 31, 2025

[RFC][OpenMP] Fix issue in MLIR to LLVMIR translation for delayed privatisation

Problem Description

But what about clang?

Potential Solutions

1. Implement transformations for instructions using the compound constants

2. Don’t use privCB in OpenMPToLLVMIRTranslation.cpp

3. Remove privCB from OpenMPIRBuilder

4. Move calls to privCB before bodyGenCB in OpenMPIRBuilder

Related topics

2. Don’t use `privCB` in `OpenMPToLLVMIRTranslation.cpp`

3. Remove `privCB` from `OpenMPIRBuilder`

4. Move calls to `privCB` before `bodyGenCB` in `OpenMPIRBuilder`