How to reconcile the unresolved builtin.unrealized_conversion_cast ops that remain after running --reconcile-unrealized-casts?
unrealized_conversion_cast ops usually remain when some other operations haven’t been lowered to the appropriate level, so these casts cannot be folded away.
I would suggest going through the IR (starting with the casts’ producers) and checking whether there is anything unexpected, e.g., a memref op in a sea of llvm ops.
Then it’s a matter of adding more lowering stages before the reconcile-unrealized-casts pass and trying again.
If you have problems with a specific case, adding an example or a reproducer could help with debugging.
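As a rough aid for that inspection, here is a minimal Python sketch that counts the ops in a textual IR dump that are not yet in the llvm dialect. The helper name unlowered_ops and the regular expression are illustrative assumptions, not part of MLIR; the sketch only understands the pretty-printed form that mlir-opt emits by default (one op per line, not the generic quoted form).

```python
import re
from collections import Counter

# Assumption: each op starts its own line, optionally preceded by "%result = ".
# Group 1 is the dialect prefix, group 2 the rest of the op name.
OP_AT_LINE_START = re.compile(
    r"\s*(?:%[\w$.]+(?::\d+)?(?:\s*,\s*%[\w$.]+(?::\d+)?)*\s*=\s*)?"
    r"([a-z_]\w*)\.([\w$.]+)"
)

def unlowered_ops(ir_text, lowered=("llvm", "builtin")):
    """Count ops whose dialect is not in `lowered`, keyed by full op name."""
    counts = Counter()
    for line in ir_text.splitlines():
        m = OP_AT_LINE_START.match(line)
        if m and m.group(1) not in lowered:
            counts[f"{m.group(1)}.{m.group(2)}"] += 1
    return counts

# A few lines in the style of a partially lowered module.
SAMPLE = """\
%18 = llvm.mlir.constant(0 : index) : i64
%19 = builtin.unrealized_conversion_cast %18 : i64 to index
%24 = affine.apply #map()[%21, %23]
gpu.launch blocks(%arg15, %arg16, %arg17) in (%arg21 = %24, %arg22 = %19, %arg23 = %19) {
  %block_id_x = gpu.block_id x
  %28 = llvm.mul %25, %26 : i64
  gpu.terminator
}
"""

for name, count in sorted(unlowered_ops(SAMPLE).items()):
    print(name, count)
```

Any op name this reports is a candidate for a missing lowering pass. To use it on a real dump, read the file produced by mlir-opt and pass its contents as ir_text.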
Hi. Thank you for the response. I am just starting to use mlir and may be making some silly errors. Please bear with me.
The example I have is a simple vector add example:
func.func @vector_add_host(%arg0: memref<?xf32>, %arg1: memref<?xf32>, %arg2: memref<?xf32>) {
%c0 = arith.constant 0 : index
%length = memref.dim %arg0, %c0 : memref<?xf32>
// Define thread block and grid sizes.
// For a simple test, we might choose a fixed block size (e.g., 64)
// and calculate the grid size based on the vector length.
%block_size_x = arith.constant 64 : index
// Calculate grid size: ceil(length / block_size_x)
// Line 13 should be exactly this:
%num_blocks_x = affine.apply affine_map<()[s0, s1] -> (s0 + s1 - 1)>()[%length, %block_size_x]
// Launch the GPU kernel
gpu.launch blocks(%b_x, %b_y, %b_z) in (%grid_x = %num_blocks_x, %grid_y = %c0, %grid_z = %c0)
threads(%t_x, %t_y, %t_z) in (%block_x = %block_size_x, %block_y = %c0, %block_z = %c0) {
// 1. Generate SSA values for GPU intrinsics
%bid_x = gpu.block_id x
%bdim_x = gpu.block_dim x
%tid_x = gpu.thread_id x
// 2. Calculate: block_id * block_dim
// The inputs must be SSA values (e.g., %bid_x, %bdim_x)
%block_offset = arith.muli %bid_x, %bdim_x : index
// 3. Calculate the global 1D index 'i': block_offset + thread_id
%g_id_x = arith.addi %block_offset, %tid_x : index
// Bounds check: ensure index is within the vector length
%in_bounds = arith.cmpi slt, %g_id_x, %length : index
scf.if %in_bounds {
// Load: A[i] and B[i]
%a = memref.load %arg0[%g_id_x] : memref<?xf32>
%b = memref.load %arg1[%g_id_x] : memref<?xf32>
// Compute: A[i] + B[i]
%sum = arith.addf %a, %b : f32
// Store: C[i] = sum
memref.store %sum, %arg2[%g_id_x] : memref<?xf32>
}
gpu.terminator
}
return
}
And the command used is:
mlir-opt cuda_vadd.mlir --lower-affine --convert-scf-to-cf --convert-cf-to-llvm --arith-expand --convert-arith-to-llvm --finalize-memref-to-llvm --convert-func-to-llvm --convert-arith-to-llvm --convert-xevm-to-llvm --reconcile-unrealized-casts -o step1.mlir
I am using the latest llvm git version and the architecture is: x86_64-pc-unknown-linux
The error in mlir-translate after the above step is:
step1.mlir:22:11: error: LLVM Translation failed for operation: builtin.unrealized_conversion_cast
%19 = builtin.unrealized_conversion_cast %18 : i64 to index
It looks like some operation is not getting converted, and then you’re stuck with partly converted/unconverted IR. If these have different types, an unrealized_conversion_cast is inserted at the boundary.
You may be missing a pass in your pipeline. Try using --convert-to-llvm instead of all the individual --convert-abc-to-llvm passes.
Thanks, much. I tried the simplest conversion:
mlir-opt cuda_vadd.mlir --lower-affine --convert-scf-to-cf --convert-to-llvm --reconcile-unrealized-casts -o step1.mlir
Even then, I am still getting unrealized_conversion_cast ops.
Can you show the output of the pass? I.e., the full IR before reconcile-unrealized-casts.
Looks like you’re targeting GPU which requires a few more steps compared to your reproducers.
Generally, you want to run reconcile-unrealized-casts at the end of your pipeline.
I’d suggest having a look at GPUToNVVMPipeline to see which steps might be required or missing, and how such a pipeline can be assembled.
Thank you. Here is the output:
…/mlir/cuda_vadd$ mlir-opt cuda_vadd.mlir --convert-scf-to-cf --convert-to-llvm --reconcile-unrealized-casts -o step1.mlir
…/mlir/cuda_vadd$ cat step1.mlir
#map = affine_map<()[s0, s1] -> (s0 + s1 - 1)>
module {
llvm.func @vector_add_host(%arg0: !llvm.ptr, %arg1: !llvm.ptr, %arg2: i64, %arg3: i64, %arg4: i64, %arg5: !llvm.ptr, %arg6: !llvm.ptr, %arg7: i64, %arg8: i64, %arg9: i64, %arg10: !llvm.ptr, %arg11: !llvm.ptr, %arg12: i64, %arg13: i64, %arg14: i64) {
%0 = llvm.mlir.poison : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%1 = llvm.insertvalue %arg10, %0[0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%2 = llvm.insertvalue %arg11, %1[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%3 = llvm.insertvalue %arg12, %2[2] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%4 = llvm.insertvalue %arg13, %3[3, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%5 = llvm.insertvalue %arg14, %4[4, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%6 = llvm.mlir.poison : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%7 = llvm.insertvalue %arg5, %6[0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%8 = llvm.insertvalue %arg6, %7[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%9 = llvm.insertvalue %arg7, %8[2] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%10 = llvm.insertvalue %arg8, %9[3, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%11 = llvm.insertvalue %arg9, %10[4, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%12 = llvm.mlir.poison : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%13 = llvm.insertvalue %arg0, %12[0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%14 = llvm.insertvalue %arg1, %13[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%15 = llvm.insertvalue %arg2, %14[2] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%16 = llvm.insertvalue %arg3, %15[3, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%17 = llvm.insertvalue %arg4, %16[4, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%18 = llvm.mlir.constant(0 : index) : i64
%19 = builtin.unrealized_conversion_cast %18 : i64 to index
%20 = llvm.extractvalue %17[3, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%21 = builtin.unrealized_conversion_cast %20 : i64 to index
%22 = llvm.mlir.constant(64 : index) : i64
%23 = builtin.unrealized_conversion_cast %22 : i64 to index
%24 = affine.apply #map()[%21, %23]
gpu.launch blocks(%arg15, %arg16, %arg17) in (%arg21 = %24, %arg22 = %19, %arg23 = %19) threads(%arg18, %arg19, %arg20) in (%arg24 = %23, %arg25 = %19, %arg26 = %19) {
%block_id_x = gpu.block_id x
%25 = builtin.unrealized_conversion_cast %block_id_x : index to i64
%block_dim_x = gpu.block_dim x
%26 = builtin.unrealized_conversion_cast %block_dim_x : index to i64
%thread_id_x = gpu.thread_id x
%27 = builtin.unrealized_conversion_cast %thread_id_x : index to i64
%28 = llvm.mul %25, %26 : i64
%29 = llvm.add %28, %27 : i64
%30 = llvm.icmp "slt" %29, %20 : i64
llvm.cond_br %30, ^bb1, ^bb2
^bb1: // pred: ^bb0
%31 = llvm.extractvalue %17[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%32 = llvm.getelementptr inbounds|nuw %31[%29] : (!llvm.ptr, i64) -> !llvm.ptr, f32
%33 = llvm.load %32 : !llvm.ptr -> f32
%34 = llvm.extractvalue %11[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%35 = llvm.getelementptr inbounds|nuw %34[%29] : (!llvm.ptr, i64) -> !llvm.ptr, f32
%36 = llvm.load %35 : !llvm.ptr -> f32
%37 = llvm.fadd %33, %36 : f32
%38 = llvm.extractvalue %5[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%39 = llvm.getelementptr inbounds|nuw %38[%29] : (!llvm.ptr, i64) -> !llvm.ptr, f32
llvm.store %37, %39 : f32, !llvm.ptr
llvm.br ^bb2
^bb2: // 2 preds: ^bb0, ^bb1
gpu.terminator
}
llvm.return
}
}
Thanks. Will explore and report back.
You are missing passes in your pipeline that lower affine.apply and GPU dialect ops. You can see from the IR that the builtin.unrealized_conversion_cast ops are at the boundary between LLVM dialect ops and other (not yet lowered) ops.
Thanks for pointing this out. Will check.
After adding more passes, I get the following output. The issue is still not resolved!
mlir-opt cuda_vadd.mlir --canonicalize --lower-affine --gpu-launch-sink-index-computations --convert-scf-to-cf --convert-gpu-to-spirv --convert-cf-to-llvm --arith-expand --convert-arith-to-llvm --finalize-memref-to-llvm --cse --convert-func-to-llvm --convert-arith-to-llvm --convert-xevm-to-llvm --reconcile-unrealized-casts
module {
llvm.func @vector_add_host(%arg0: !llvm.ptr, %arg1: !llvm.ptr, %arg2: i64, %arg3: i64, %arg4: i64, %arg5: !llvm.ptr, %arg6: !llvm.ptr, %arg7: i64, %arg8: i64, %arg9: i64, %arg10: !llvm.ptr, %arg11: !llvm.ptr, %arg12: i64, %arg13: i64, %arg14: i64) {
%0 = llvm.mlir.poison : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%1 = llvm.insertvalue %arg10, %0[0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%2 = llvm.insertvalue %arg11, %1[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%3 = llvm.insertvalue %arg12, %2[2] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%4 = llvm.insertvalue %arg13, %3[3, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%5 = llvm.insertvalue %arg14, %4[4, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%6 = llvm.mlir.poison : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%7 = llvm.insertvalue %arg5, %6[0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%8 = llvm.insertvalue %arg6, %7[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%9 = llvm.insertvalue %arg7, %8[2] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%10 = llvm.insertvalue %arg8, %9[3, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%11 = llvm.insertvalue %arg9, %10[4, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%12 = llvm.mlir.poison : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%13 = llvm.insertvalue %arg0, %12[0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%14 = llvm.insertvalue %arg1, %13[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%15 = llvm.insertvalue %arg2, %14[2] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%16 = llvm.insertvalue %arg3, %15[3, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%17 = llvm.insertvalue %arg4, %16[4, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%18 = llvm.mlir.constant(64 : index) : i64
%19 = builtin.unrealized_conversion_cast %18 : i64 to index
%20 = llvm.mlir.constant(0 : index) : i64
%21 = builtin.unrealized_conversion_cast %20 : i64 to index
%22 = llvm.mlir.constant(1 : index) : i64
%23 = llvm.extractvalue %17[3] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%24 = llvm.alloca %22 x !llvm.array<1 x i64> : (i64) -> !llvm.ptr
llvm.store %23, %24 : !llvm.array<1 x i64>, !llvm.ptr
%25 = llvm.getelementptr %24[0, %20] : (!llvm.ptr, i64) -> !llvm.ptr, !llvm.array<1 x i64>
%26 = llvm.load %25 : !llvm.ptr -> i64
%27 = llvm.mlir.constant(63 : index) : i64
%28 = llvm.add %26, %27 : i64
%29 = builtin.unrealized_conversion_cast %28 : i64 to index
gpu.launch blocks(%arg15, %arg16, %arg17) in (%arg21 = %29, %arg22 = %21, %arg23 = %21) threads(%arg18, %arg19, %arg20) in (%arg24 = %19, %arg25 = %21, %arg26 = %21) {
%30 = llvm.alloca %22 x !llvm.array<1 x i64> : (i64) -> !llvm.ptr
llvm.store %23, %30 : !llvm.array<1 x i64>, !llvm.ptr
%31 = llvm.getelementptr %30[0, %20] : (!llvm.ptr, i64) -> !llvm.ptr, !llvm.array<1 x i64>
%32 = llvm.load %31 : !llvm.ptr -> i64
%block_id_x = gpu.block_id x
%33 = builtin.unrealized_conversion_cast %block_id_x : index to i64
%block_dim_x = gpu.block_dim x
%34 = builtin.unrealized_conversion_cast %block_dim_x : index to i64
%thread_id_x = gpu.thread_id x
%35 = builtin.unrealized_conversion_cast %thread_id_x : index to i64
%36 = llvm.mul %33, %34 : i64
%37 = llvm.add %36, %35 : i64
%38 = llvm.icmp "slt" %37, %32 : i64
llvm.cond_br %38, ^bb1, ^bb2
^bb1: // pred: ^bb0
%39 = llvm.extractvalue %17[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%40 = llvm.getelementptr inbounds|nuw %39[%37] : (!llvm.ptr, i64) -> !llvm.ptr, f32
%41 = llvm.load %40 : !llvm.ptr -> f32
%42 = llvm.extractvalue %11[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%43 = llvm.getelementptr inbounds|nuw %42[%37] : (!llvm.ptr, i64) -> !llvm.ptr, f32
%44 = llvm.load %43 : !llvm.ptr -> f32
%45 = llvm.fadd %41, %44 : f32
%46 = llvm.extractvalue %5[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%47 = llvm.getelementptr inbounds|nuw %46[%37] : (!llvm.ptr, i64) -> !llvm.ptr, f32
llvm.store %45, %47 : f32, !llvm.ptr
llvm.br ^bb2
^bb2: // 2 preds: ^bb0, ^bb1
gpu.terminator
}
llvm.return
}
}
You’re still missing a pass that lowers GPU dialect ops. E.g., for NVIDIA: -gpu-lower-to-nvvm-pipeline.
For learning purposes, instead of writing a new test case from scratch, I would start experimenting with an existing integration test. E.g.: test/Integration/CUDA/printf.mlir for a “Hello World” test.
Oh! Thanks very much for the suggestion. Will check.
I am currently trying to compile for a non-GPU architecture. I learnt that the NVVM-related features are for GPU architectures.
Yes, NVVM is NVIDIA-specific. If you are compiling for non-GPU architectures, I’d remove all gpu dialect ops from your test case. The lowering of ops such as gpu.launch is quite architecture-specific.