Reconciling unrealized casts

rkkulkarni · October 17, 2025, 7:14am

How do I reconcile the builtin.unrealized_conversion_cast ops that remain after running --reconcile-unrealized-casts?

asiemien · October 17, 2025, 7:42am

unrealized_conversion_cast ops usually remain when some other operations haven’t been lowered to the appropriate level, so the casts cannot be folded away.

I would suggest going through the IR (starting with the casts’ producers) and checking for anything unexpected, e.g., a memref op in a sea of llvm ops.
Then it’s a matter of adding more lowering stages before the reconcile-unrealized-casts pass and trying again.

If you have problems with a specific case, adding an example or a reproducer could help with debugging.
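
One way to do that inspection mechanically is to list every op in the output whose dialect prefix is neither llvm nor builtin. The following is only a sketch: list_unlowered_ops is a hypothetical helper name, and it assumes the textual IR printed by mlir-opt with one op per line. It is a plain text scan, not an MLIR tool.

```shell
# list_unlowered_ops FILE  (hypothetical helper, not part of MLIR)
# Prints the sorted, de-duplicated names of ops that are not in the
# llvm or builtin dialects, i.e. candidates that still need a lowering pass.
list_unlowered_ops() {
  # Strip leading whitespace and an optional "%results = " prefix, then keep
  # the first token if it has the form dialect.op.
  sed -nE 's/^[[:space:]]*(%[^=]*=[[:space:]]*)?([a-z_][a-z0-9_]*\.[a-z_][a-z0-9_.]*).*/\2/p' "$1" |
    grep -vE '^(llvm|builtin)\.' |
    LC_ALL=C sort -u
}
```

Running list_unlowered_ops step1.mlir on a partially lowered file prints the op names that remain outside the LLVM dialect, which are the ones to look for a lowering pass for.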

rkkulkarni · October 17, 2025, 9:05am

Hi. Thank you for the response. I am just starting to use MLIR and may be making some silly errors. Please bear with me.

The example I have is a simple vector add example:

func.func @vector_add_host(%arg0: memref<?xf32>, %arg1: memref<?xf32>, %arg2: memref<?xf32>) {
%c0 = arith.constant 0 : index
%length = memref.dim %arg0, %c0 : memref<?xf32>

// Define thread block and grid sizes.
// For a simple test, we might choose a fixed block size (e.g., 64)
// and calculate the grid size based on the vector length.
%block_size_x = arith.constant 64 : index

// Calculate grid size: ceil(length / block_size_x)
// Line 13 should be exactly this:
%num_blocks_x = affine.apply affine_map<()[s0, s1] -> (s0 + s1 - 1)>()[%length, %block_size_x]

// Launch the GPU kernel
gpu.launch blocks(%b_x, %b_y, %b_z) in (%grid_x = %num_blocks_x, %grid_y = %c0, %grid_z = %c0)
threads(%t_x, %t_y, %t_z) in (%block_x = %block_size_x, %block_y = %c0, %block_z = %c0) {

// 1. Generate SSA values for GPU intrinsics
%bid_x = gpu.block_id x 
%bdim_x = gpu.block_dim x 
%tid_x = gpu.thread_id x 

// 2. Calculate: block_id * block_dim
// The inputs must be SSA values (e.g., %bid_x, %bdim_x)
%block_offset = arith.muli %bid_x, %bdim_x : index

// 3. Calculate the global 1D index 'i': block_offset + thread_id
%g_id_x = arith.addi %block_offset, %tid_x : index

// Bounds check: ensure index is within the vector length
%in_bounds = arith.cmpi slt, %g_id_x, %length : index

scf.if %in_bounds {
  // Load: A[i] and B[i]
  %a = memref.load %arg0[%g_id_x] : memref<?xf32>
  %b = memref.load %arg1[%g_id_x] : memref<?xf32>

  // Compute: A[i] + B[i]
  %sum = arith.addf %a, %b : f32

  // Store: C[i] = sum
  memref.store %sum, %arg2[%g_id_x] : memref<?xf32>
}
gpu.terminator

}
return
}

And the command used is:

mlir-opt cuda_vadd.mlir --lower-affine --convert-scf-to-cf --convert-cf-to-llvm --arith-expand --convert-arith-to-llvm --finalize-memref-to-llvm --convert-func-to-llvm --convert-arith-to-llvm --convert-xevm-to-llvm --reconcile-unrealized-casts -o step1.mlir

I am using the latest llvm git version and the architecture is: x86_64-pc-unknown-linux

The error in mlir-translate after the above step is:
step1.mlir:22:11: error: LLVM Translation failed for operation: builtin.unrealized_conversion_cast
%19 = builtin.unrealized_conversion_cast %18 : i64 to index

matthias-springer · October 17, 2025, 10:07am

It looks like some operation is not getting converted, and then you’re stuck with partly converted/unconverted IR. If these have different types, an unrealized_conversion_cast is inserted at the boundary.

You may be missing a pass in your pipeline. Try using --convert-to-llvm instead of all the other -convert-abc-to-llvm passes.

rkkulkarni · October 17, 2025, 10:29am

Thanks, much. I tried the simplest conversion:
mlir-opt cuda_vadd.mlir --lower-affine --convert-scf-to-cf --convert-to-llvm --reconcile-unrealized-casts -o step1.mlir

Even then, I still get unrealized_conversion_cast ops.

matthias-springer · October 17, 2025, 2:04pm

Can you show the output of the pass? I.e., the full IR before reconcile-unrealized-casts.

asiemien · October 17, 2025, 2:11pm

Looks like you’re targeting GPU, which requires a few more steps than your reproducers include.
Generally, you want to run reconcile-unrealized-casts at the end of your pipeline.

I’d suggest having a look at GPUToNVVMPipeline to see which steps might be required or missing, and how such a pipeline can be assembled.

rkkulkarni · October 18, 2025, 5:03am

Thank you. Here is the output:

…/mlir/cuda_vadd$ mlir-opt cuda_vadd.mlir --convert-scf-to-cf --convert-to-llvm --reconcile-unrealized-casts -o step1.mlir
…/mlir/cuda_vadd$ cat step1.mlir
#map = affine_map<()[s0, s1] -> (s0 + s1 - 1)>
module {
llvm.func @vector_add_host(%arg0: !llvm.ptr, %arg1: !llvm.ptr, %arg2: i64, %arg3: i64, %arg4: i64, %arg5: !llvm.ptr, %arg6: !llvm.ptr, %arg7: i64, %arg8: i64, %arg9: i64, %arg10: !llvm.ptr, %arg11: !llvm.ptr, %arg12: i64, %arg13: i64, %arg14: i64) {
%0 = llvm.mlir.poison : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%1 = llvm.insertvalue %arg10, %0[0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%2 = llvm.insertvalue %arg11, %1[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%3 = llvm.insertvalue %arg12, %2[2] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%4 = llvm.insertvalue %arg13, %3[3, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%5 = llvm.insertvalue %arg14, %4[4, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%6 = llvm.mlir.poison : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%7 = llvm.insertvalue %arg5, %6[0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%8 = llvm.insertvalue %arg6, %7[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%9 = llvm.insertvalue %arg7, %8[2] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%10 = llvm.insertvalue %arg8, %9[3, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%11 = llvm.insertvalue %arg9, %10[4, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%12 = llvm.mlir.poison : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%13 = llvm.insertvalue %arg0, %12[0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%14 = llvm.insertvalue %arg1, %13[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%15 = llvm.insertvalue %arg2, %14[2] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%16 = llvm.insertvalue %arg3, %15[3, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%17 = llvm.insertvalue %arg4, %16[4, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%18 = llvm.mlir.constant(0 : index) : i64
%19 = builtin.unrealized_conversion_cast %18 : i64 to index
%20 = llvm.extractvalue %17[3, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%21 = builtin.unrealized_conversion_cast %20 : i64 to index
%22 = llvm.mlir.constant(64 : index) : i64
%23 = builtin.unrealized_conversion_cast %22 : i64 to index
%24 = affine.apply #map()[%21, %23]
gpu.launch blocks(%arg15, %arg16, %arg17) in (%arg21 = %24, %arg22 = %19, %arg23 = %19) threads(%arg18, %arg19, %arg20) in (%arg24 = %23, %arg25 = %19, %arg26 = %19) {
%block_id_x = gpu.block_id x
%25 = builtin.unrealized_conversion_cast %block_id_x : index to i64
%block_dim_x = gpu.block_dim x
%26 = builtin.unrealized_conversion_cast %block_dim_x : index to i64
%thread_id_x = gpu.thread_id x
%27 = builtin.unrealized_conversion_cast %thread_id_x : index to i64
%28 = llvm.mul %25, %26 : i64
%29 = llvm.add %28, %27 : i64
%30 = llvm.icmp "slt" %29, %20 : i64
llvm.cond_br %30, ^bb1, ^bb2
^bb1: // pred: ^bb0
%31 = llvm.extractvalue %17[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%32 = llvm.getelementptr inbounds|nuw %31[%29] : (!llvm.ptr, i64) -> !llvm.ptr, f32
%33 = llvm.load %32 : !llvm.ptr -> f32
%34 = llvm.extractvalue %11[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%35 = llvm.getelementptr inbounds|nuw %34[%29] : (!llvm.ptr, i64) -> !llvm.ptr, f32
%36 = llvm.load %35 : !llvm.ptr -> f32
%37 = llvm.fadd %33, %36 : f32
%38 = llvm.extractvalue %5[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%39 = llvm.getelementptr inbounds|nuw %38[%29] : (!llvm.ptr, i64) -> !llvm.ptr, f32
llvm.store %37, %39 : f32, !llvm.ptr
llvm.br ^bb2
^bb2: // 2 preds: ^bb0, ^bb1
gpu.terminator
}
llvm.return
}
}

rkkulkarni · October 18, 2025, 5:04am

Thanks. Will explore and get back to you.

matthias-springer · October 18, 2025, 4:23pm

You are missing passes in your pipeline that lower affine.apply and GPU dialect ops. You can see from the IR that the builtin.unrealized_conversion_cast ops are at the boundary between LLVM dialect ops and other (not yet lowered) ops.
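
The direction of each remaining cast hints at which side of the boundary is unlowered. In the IR above, the `i64 to index` casts feed ops that still expect index operands (affine.apply, gpu.launch), while the `index to i64` casts take results of ops that still produce index (gpu.block_id, gpu.block_dim, gpu.thread_id). A small sketch that tallies the casts by direction; summarize_casts is a hypothetical helper name, and it assumes one cast per line in the printed IR:

```shell
# summarize_casts FILE  (hypothetical helper, not part of MLIR)
# Prints one line per distinct cast signature, e.g. "i64 to index: 3".
summarize_casts() {
  awk '/builtin\.unrealized_conversion_cast/ {
         sub(/^.* : /, "")      # keep only "<from-type> to <to-type>"
         count[$0]++
       }
       END { for (k in count) print k ": " count[k] }' "$1" |
    LC_ALL=C sort
}
```

Once every op on both sides of a cast has been lowered, the casts appear in matching pairs and reconcile-unrealized-casts can remove them.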

rkkulkarni · October 21, 2025, 11:35pm

Thanks for pointing this out. Will check.

rkkulkarni · October 22, 2025, 8:15am

After adding more passes, I get the following output. The issue is still not resolved!

mlir-opt cuda_vadd.mlir --canonicalize --lower-affine --gpu-launch-sink-index-computations --convert-scf-to-cf --convert-gpu-to-spirv --convert-cf-to-llvm --arith-expand --convert-arith-to-llvm --finalize-memref-to-llvm --cse --convert-func-to-llvm --convert-arith-to-llvm --convert-xevm-to-llvm --reconcile-unrealized-casts
module {
llvm.func @vector_add_host(%arg0: !llvm.ptr, %arg1: !llvm.ptr, %arg2: i64, %arg3: i64, %arg4: i64, %arg5: !llvm.ptr, %arg6: !llvm.ptr, %arg7: i64, %arg8: i64, %arg9: i64, %arg10: !llvm.ptr, %arg11: !llvm.ptr, %arg12: i64, %arg13: i64, %arg14: i64) {
%0 = llvm.mlir.poison : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%1 = llvm.insertvalue %arg10, %0[0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%2 = llvm.insertvalue %arg11, %1[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%3 = llvm.insertvalue %arg12, %2[2] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%4 = llvm.insertvalue %arg13, %3[3, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%5 = llvm.insertvalue %arg14, %4[4, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%6 = llvm.mlir.poison : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%7 = llvm.insertvalue %arg5, %6[0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%8 = llvm.insertvalue %arg6, %7[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%9 = llvm.insertvalue %arg7, %8[2] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%10 = llvm.insertvalue %arg8, %9[3, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%11 = llvm.insertvalue %arg9, %10[4, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%12 = llvm.mlir.poison : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%13 = llvm.insertvalue %arg0, %12[0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%14 = llvm.insertvalue %arg1, %13[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%15 = llvm.insertvalue %arg2, %14[2] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%16 = llvm.insertvalue %arg3, %15[3, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%17 = llvm.insertvalue %arg4, %16[4, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%18 = llvm.mlir.constant(64 : index) : i64
%19 = builtin.unrealized_conversion_cast %18 : i64 to index
%20 = llvm.mlir.constant(0 : index) : i64
%21 = builtin.unrealized_conversion_cast %20 : i64 to index
%22 = llvm.mlir.constant(1 : index) : i64
%23 = llvm.extractvalue %17[3] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%24 = llvm.alloca %22 x !llvm.array<1 x i64> : (i64) -> !llvm.ptr
llvm.store %23, %24 : !llvm.array<1 x i64>, !llvm.ptr
%25 = llvm.getelementptr %24[0, %20] : (!llvm.ptr, i64) -> !llvm.ptr, !llvm.array<1 x i64>
%26 = llvm.load %25 : !llvm.ptr -> i64
%27 = llvm.mlir.constant(63 : index) : i64
%28 = llvm.add %26, %27 : i64
%29 = builtin.unrealized_conversion_cast %28 : i64 to index
gpu.launch blocks(%arg15, %arg16, %arg17) in (%arg21 = %29, %arg22 = %21, %arg23 = %21) threads(%arg18, %arg19, %arg20) in (%arg24 = %19, %arg25 = %21, %arg26 = %21) {
%30 = llvm.alloca %22 x !llvm.array<1 x i64> : (i64) -> !llvm.ptr
llvm.store %23, %30 : !llvm.array<1 x i64>, !llvm.ptr
%31 = llvm.getelementptr %30[0, %20] : (!llvm.ptr, i64) -> !llvm.ptr, !llvm.array<1 x i64>
%32 = llvm.load %31 : !llvm.ptr -> i64
%block_id_x = gpu.block_id x
%33 = builtin.unrealized_conversion_cast %block_id_x : index to i64
%block_dim_x = gpu.block_dim x
%34 = builtin.unrealized_conversion_cast %block_dim_x : index to i64
%thread_id_x = gpu.thread_id x
%35 = builtin.unrealized_conversion_cast %thread_id_x : index to i64
%36 = llvm.mul %33, %34 : i64
%37 = llvm.add %36, %35 : i64
%38 = llvm.icmp "slt" %37, %32 : i64
llvm.cond_br %38, ^bb1, ^bb2
^bb1: // pred: ^bb0
%39 = llvm.extractvalue %17[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%40 = llvm.getelementptr inbounds|nuw %39[%37] : (!llvm.ptr, i64) -> !llvm.ptr, f32
%41 = llvm.load %40 : !llvm.ptr -> f32
%42 = llvm.extractvalue %11[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%43 = llvm.getelementptr inbounds|nuw %42[%37] : (!llvm.ptr, i64) -> !llvm.ptr, f32
%44 = llvm.load %43 : !llvm.ptr -> f32
%45 = llvm.fadd %41, %44 : f32
%46 = llvm.extractvalue %5[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
%47 = llvm.getelementptr inbounds|nuw %46[%37] : (!llvm.ptr, i64) -> !llvm.ptr, f32
llvm.store %45, %47 : f32, !llvm.ptr
llvm.br ^bb2
^bb2: // 2 preds: ^bb0, ^bb1
gpu.terminator
}
llvm.return
}
}

matthias-springer · October 22, 2025, 8:23am

You’re still missing a pass that lowers GPU dialect ops. E.g., for NVIDIA: -gpu-lower-to-nvvm-pipeline.

For learning purposes, instead of writing a new test case from scratch, I would start experimenting with an existing integration test. E.g.: test/Integration/CUDA/printf.mlir for a “Hello World” test.

rkkulkarni · October 22, 2025, 8:26am

Oh! Thanks very much for the suggestion. Will check.

rkkulkarni · October 23, 2025, 5:14am

I am currently trying to compile for a non-GPU architecture. I learnt that the NVVM-related features are for GPU architectures.

matthias-springer · October 23, 2025, 7:47am

Yes, NVVM is NVIDIA-specific. If you are compiling for non-GPU architectures, I’d remove all gpu dialect ops from your test case. The lowering of ops such as gpu.launch is quite architecture-specific.
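
As an illustration of that suggestion, here is a sketch of what a CPU-only variant of the test case might look like, with the launch region replaced by a sequential scf.for loop. The file names cpu_vadd.mlir and cpu_step1.mlir are hypothetical, and the pipeline simply reuses flags already tried earlier in this thread; it has not been verified against a particular LLVM revision.

```shell
# Write a CPU-only variant of the vector add (hypothetical file name).
cat > cpu_vadd.mlir <<'EOF'
func.func @vector_add_host(%arg0: memref<?xf32>, %arg1: memref<?xf32>, %arg2: memref<?xf32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %length = memref.dim %arg0, %c0 : memref<?xf32>
  // A sequential loop replaces the launch region and the block/thread index math.
  scf.for %i = %c0 to %length step %c1 {
    %a = memref.load %arg0[%i] : memref<?xf32>
    %b = memref.load %arg1[%i] : memref<?xf32>
    %sum = arith.addf %a, %b : f32
    memref.store %sum, %arg2[%i] : memref<?xf32>
  }
  return
}
EOF

# Same lowering flags as used earlier in this thread; skipped if mlir-opt is not on PATH.
if command -v mlir-opt >/dev/null 2>&1; then
  mlir-opt cpu_vadd.mlir --convert-scf-to-cf --convert-to-llvm \
    --reconcile-unrealized-casts -o cpu_step1.mlir
fi
```

With no affine or gpu ops left in the input, every op has a lowering in this pipeline, so the expectation is that no unrealized_conversion_cast survives.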
