Confused about -convert-parallel-loops-to-gpu

  1. Why can't -convert-parallel-loops-to-gpu generate gpu.launch for the following MLIR code?
    scf.parallel (%arg1, %arg2) = (%c0_268, %c0_269) to (%c1_270, %c8_271) step (%c1_272, %c1_273) {
      ...
      scf.parallel (%arg3, %arg4) = (%c0_283, %c0_284) to (%72, %73) step (%c1_285, %c1_286) {
        ...
        scf.for %arg5 = %c0 to %c512 step %c16 {
         ...
          %84 = vector.create_mask %74, %75, %c16 : vector<16x16x16xi1>
          %85 = vector.mask %84 { vector.contract {indexing_maps = [#map11, #map2, #map12], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %79, %81, %83 : vector<16x16xf32>, vector<16x16xf32> into vector<16x16xf32> } : vector<16x16x16xi1> -> vector<16x16xf32>
          vector.transfer_write %85, %subview_289[%c0, %c0], %82 {in_bounds = [true, true]} : vector<16x16xf32>, memref<?x?xf32, strided<[1000, 1], offset: ?>>
        }
        scf.reduce 
      } {mapping = [#gpu.loop_dim_map<processor = thread_x, map = (d0) -> (d0), bound = (d0) -> (d0)>, #gpu.loop_dim_map<processor = thread_y, map = (d0) -> (d0), bound = (d0) -> (d0)>]}
      scf.reduce 
    } {mapping = [#gpu.loop_dim_map<processor = block_x, map = (d0) -> (d0), bound = (d0) -> (d0)>, #gpu.loop_dim_map<processor = block_y, map = (d0) -> (d0), bound = (d0) -> (d0)>]}
  2. It seems that scf.parallel doesn't get converted to the gpu dialect when a function is called inside the scf.parallel body. What should I do?
scf.parallel ... {
    call @foo()
}

Hmm, I can't reproduce either of these issues.

The first snippet, slightly reduced for testing, outlines just fine, and the same goes for an example with a function call inside mapped parallel loops.
Nothing obvious comes to mind that could or should prevent outlining here.
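For reference, this is roughly the kind of reduced test I mean: a minimal nested scf.parallel carrying the same mapping attributes as your snippet (the function name, memref shape, and stored value are made up for illustration). With a recent mlir-opt, running it through --convert-parallel-loops-to-gpu produces a gpu.launch, and adding --gpu-kernel-outlining then outlines it into a gpu.launch_func:

```mlir
func.func @minimal(%out: memref<8x8xf32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c8 = arith.constant 8 : index
  %cst = arith.constant 0.0 : f32
  // Outer loop mapped to blocks, inner loop mapped to threads.
  scf.parallel (%i) = (%c0) to (%c8) step (%c1) {
    scf.parallel (%j) = (%c0) to (%c8) step (%c1) {
      memref.store %cst, %out[%i, %j] : memref<8x8xf32>
      scf.reduce
    } {mapping = [#gpu.loop_dim_map<processor = thread_x, map = (d0) -> (d0), bound = (d0) -> (d0)>]}
    scf.reduce
  } {mapping = [#gpu.loop_dim_map<processor = block_x, map = (d0) -> (d0), bound = (d0) -> (d0)>]}
  return
}
```

If a reduction of your code along these lines still fails to convert, the diff between it and this sketch should point at the culprit.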

Which version of LLVM are you using?
If it is relatively up to date, could you share a more complete example?