Moving the omp.parallel when converting scf to OpenMP

I’m playing with scf.parallel and then lowering it to OpenMP with the --convert-scf-to-openmp pass.

Although I find it very convenient for parallelizing outer loops, I’m not satisfied with the result when parallelizing inner loops. The problem is that the omp.parallel is always inserted at the loop being parallelized. For example:

scf.for {
  scf.parallel {
    scf.for {
      ...
    }
  }
}

would be converted to

scf.for {
  omp.parallel {
    omp.wsloop {
      scf.for {
        ...
      }
    }
  }
}

This creates and destroys the threads on every iteration of the outer loop, which creates massive overhead. Is there an elegant way of moving the omp.parallel up to the top-level loop to avoid the threading overhead, while keeping the rest of the program single-threaded except for the omp.wsloop? I wonder if there has been any discussion regarding this.

You can do that if you prove it’s safe or insert appropriate guarding, and in either case you always need to insert the appropriate barriers.
See the discussion and example in Section 5: https://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf
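
To make the guarding and barriers concrete, here is a minimal sketch of the hoisted form, written in C with OpenMP pragmas rather than MLIR. The function name hoisted_parallel and the loop bodies are made up for illustration; this is not what the pass emits. It assumes the outer loop's control flow depends only on values every thread computes identically, so all threads can run it redundantly, and it wraps the sequential work in a single construct so only one thread executes it.

```c
#include <assert.h>

/* Hypothetical kernel: each outer step does some sequential work on a[0],
   then adds the step index t to every element of a. */
void hoisted_parallel(double *a, long n, int steps) {
    assert(n > 0); /* a[0] is touched by the sequential part */

    /* One parallel region around the whole nest: the thread team is
       set up once instead of once per outer iteration. */
    #pragma omp parallel
    {
        /* Every thread executes the outer loop redundantly. This is safe
           here only because the trip count depends on `steps` alone and
           `t` is private to each thread. */
        for (int t = 0; t < steps; ++t) {
            /* Guard: side-effecting sequential code must run on exactly
               one thread. `single` ends with an implicit barrier, so the
               update is complete before the worksharing loop starts. */
            #pragma omp single
            {
                a[0] += 1.0; /* stand-in for the sequential work */
            }

            /* Worksharing loop: iterations are divided among the threads
               of the enclosing region. The implicit barrier at its end
               keeps the next outer iteration from starting early. */
            #pragma omp for
            for (long i = 0; i < n; ++i) {
                a[i] += (double)t;
            }
        }
    }
}
```

Both implicit barriers matter here: dropping either one (for example with nowait) would let the guarded update race with the worksharing loop. Without OpenMP enabled, the pragmas are ignored and the function computes the same result serially.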