Issues with function bodies getting replaced with unreachable after instrumentation

osterhoutan-UofU · March 18, 2024, 11:16pm

I am making a simple trace based data race checker for AMD’s HIP/ROCm GPU coding toolkit.
In the instrumentation of the GPU kernel I am using CloneFunctionInto to make copies of the functions whose signatures are extended by one argument, so that I can pass data into and through the kernel from the host/CPU.

I got the CloneFunctionInto working, and in debug print statements (printing out each new functions) I can see that every function gets copied and properly instrumented.
However, in the final IR that is produced after my pass runs, some but not all function bodies get replaced with unreachable.
The only commonality between functions that get this treatment and those that don’t seems to be whether I make any changes to the body of the cloned function.

Additionally for context in previous versions of the tool I did not need to extend the function signatures and was able to get runnable binaries out that contained my instrumented instructions. So I don’t think it is an issue with the changes I’m making inside the function bodies.

I have tried making new functions to call these cloned functions in case there is a code path optimization issue, but that did not work.
Additionally, yes, I am replacing all call instructions to the old functions with calls to the new functions (RAUW).

I am not sure what to try next to determine what causes this issue or what steps I can do to fix it.
Any help is appreciated.

Artem-B · March 18, 2024, 11:30pm

It would help a lot to understand what’s causing the problem if you could post a reduced reproducer on https://llvm.godbolt.org/

osterhoutan-UofU · March 19, 2024, 8:13pm

Are you asking for a before and after of the instrumented IR or a reduced copy of my instrumentation pass?
I’m not sure Godbolt is useful for, or even supports, the latter, and it seems unnecessary for the former (and currently Godbolt does not support HIP/ROCm).

I can provide a copy of what the original & instrumented signatures look like as well as thinned out snippets of what my cloning code looks like (see below)

Original IR

; Function Attrs: convergent mustprogress nofree norecurse nounwind willreturn
define protected amdgpu_kernel void @_Z15tick_all_kernelPU7_AtomicmPmPli(ptr addrspace(1) nocapture %0, ptr addrspace(1) nocapture writeonly %1, ptr addrspace(1) nocapture %2, i32 %3) local_unnamed_addr #0 !dbg !1756 {
  %5 = addrspacecast ptr addrspace(1) %2 to ptr
  tail call fastcc void @_Z10dummy_workPl(ptr %5) #10, !dbg !1768
; ...
  tail call fastcc void @_Z10dummy_workPl(ptr %5) #10, !dbg !1808
  ret void, !dbg !1809
}

Instrumented:

; Function Attrs: convergent mustprogress nofree norecurse nounwind willreturn
define protected amdgpu_kernel void @_Z15tick_all_kernelPU7_AtomicmPmPli(ptr addrspace(1) nocapture %0, ptr addrspace(1) nocapture writeonly %1, ptr addrspace(1) nocapture %2, i32 %3, ptr %4) local_unnamed_addr #0 !dbg !1874 {
  unreachable
}

The cloning section(s)

    llvm::Function* ScabbardPassPlugin::replace_device_function(llvm::Function& F)
    {
      llvm::Module& M = *F.getParent();
      std::string old_name = F.getName().str();
      F.setName(old_name+"__old__scabbard_instr_replaced__old__");
      auto oldParamTys = F.getFunctionType()->params();
      std::vector<llvm::Type*> paramTys(oldParamTys.begin(), oldParamTys.end());
      paramTys.push_back(llvm::PointerType::get(F.getContext(),0ul));
      auto fn_callee = M.getOrInsertFunction(
                          old_name,
                          llvm::FunctionType::get(
                              F.getFunctionType()->getReturnType(),
                              llvm::ArrayRef<llvm::Type*>(paramTys),
                              F.getFunctionType()->isVarArg()
                            ),
                          F.getAttributes()
                        );
      llvm::Function* fn = llvm::dyn_cast<llvm::Function>(fn_callee.getCallee()); // new function (F is old function)
      llvm::ValueToValueMapTy vMap;
      for (size_t i=0; i<F.arg_size(); ++i)
        vMap[F.getArg(i)] = fn->getArg(i);
      llvm::SmallVector<llvm::ReturnInst*,8> rets;
      llvm::CloneFunctionInto(fn, &F, vMap, llvm::CloneFunctionChangeType::LocalChangesOnly, rets);
      to_replace.push_back(std::make_pair(&F,fn));
      return fn;
    }
    void ScabbardPassPlugin::finish_replacing_old_funcs_device()
    {
      //modify to pass device tracker through as last parameter to all functions defined in this module'
      for (auto& tr : to_replace) {
        auto OldFn = tr.first;
        auto NewFn = tr.second;
        for (auto& u : OldFn->uses()) {
          if (auto CI = llvm::dyn_cast<llvm::CallInst>(u.getUser())) {
            llvm::Function* iFn = CI->getFunction();
            if (iFn->getName().ends_with("__old__scabbard_instr_replaced__old__"))
              continue;
            llvm::SmallVector<llvm::Value*,4u> operands;
            for (auto& op : CI->args())
              operands.push_back(op.get());
            operands.push_back(iFn->getArg(iFn->arg_size()-1));
            auto ci = llvm::CallInst::Create(
                          NewFn->getFunctionType(),
                          NewFn,
                          llvm::ArrayRef<llvm::Value*>(operands)            
                        );
            if (CI->isTailCall())
              ci->setTailCallKind(CI->getTailCallKind());
            if (CI->canReturnTwice())
              ci->setCanReturnTwice();
            ci->insertBefore(CI);
            CI->replaceAllUsesWith(ci);
            ci->setDebugLoc(CI->getDebugLoc());
            CI->eraseFromParent();
          } else {
            LLVM_DEBUG(llvm::errs() << "\n[scabbard.instr.device:DBG] overwritten device function used in non-call instruction!\n";);
          }
        }
        // remove OldFn from module
        OldFn->eraseFromParent();
      }
      // create a bridge to the old section of the dummy function
      IRB.CreateBr(OldBB);
      // clear the list so we're ready for reuse
      to_replace.clear();
    }

The first function runs before the new function is instrumented with calls to the trace code; the second is called after all the original functions have been processed.

jdoerfert · March 19, 2024, 8:26pm

Basically, provide the entire module for the most simple example you can find.
Just after instrumentation.

osterhoutan-UofU · March 19, 2024, 11:32pm

They don’t allow me to upload files because my account is too new, so bear with the long reply.

here is what I get after I run:

hipcc -fpass-plugin=build/instr/libinstr.so -Lbuild/libtrace -ltrace -x hip -std=c++17 -g -emit-llvm -S -o test/managed_clock_test.instr.ll test/managed_clock_test.cpp

This is generated with the line that removes the old function from the module commented out, so that you can see the contents of the old functions. The contents of the new replacement functions stay the same regardless of whether this line is included.

Device/GPU kernel module post instrumentation:

; __CLANG_OFFLOAD_BUNDLE____START__ hip-amdgcn-amd-amdhsa--gfx90a
; ModuleID = 'test/managed_clock_test.cpp'
source_filename = "test/managed_clock_test.cpp"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7"
target triple = "amdgcn-amd-amdhsa"

%"struct.__HIP_Coordinates<__HIP_ThreadIdx>::__X" = type { i8 }
%"struct.scabbard::trace::device::DeviceTracker" = type { %"struct.scabbard::jobId_t", i64, i64, i64, i8, [128 x %"struct.scabbard::TraceData"] }
%"struct.scabbard::jobId_t" = type { i16, i16 }
%"struct.scabbard::TraceData" = type { i64, i16, [6 x i8], %"union.scabbard::ThreadId", ptr, %"struct.scabbard::LocationMetadata", i64 }
%"union.scabbard::ThreadId" = type <{ %"class.std::thread::id", [16 x i8] }>
%"class.std::thread::id" = type { i64 }
%"struct.scabbard::LocationMetadata" = type { i64, i32, i32 }

$scabbard.trace.device.dummyFunc = comdat any

$_ZN17__HIP_CoordinatesI14__HIP_BlockIdxE1xE = comdat any

$_ZN17__HIP_CoordinatesI14__HIP_BlockIdxE1yE = comdat any

$_ZN17__HIP_CoordinatesI14__HIP_BlockIdxE1zE = comdat any

$_ZN17__HIP_CoordinatesI15__HIP_ThreadIdxE1xE = comdat any

$_ZN17__HIP_CoordinatesI15__HIP_ThreadIdxE1yE = comdat any

$_ZN17__HIP_CoordinatesI15__HIP_ThreadIdxE1zE = comdat any

@_ZN8scabbard5trace12_GLOBAL__N_114src_id_reg_tmpE = internal addrspace(1) global i64 84, align 8
@_ZN8scabbard5trace12_GLOBAL__N_115src_id_reg_tmp2E = internal addrspace(1) global i64 84, align 8
@_ZN8scabbard5trace12_GLOBAL__N_115src_id_reg_tmp3E = internal addrspace(1) global i64 84, align 8
@_ZN17__HIP_CoordinatesI14__HIP_BlockIdxE1xE = weak protected local_unnamed_addr addrspace(4) externally_initialized constant %"struct.__HIP_Coordinates<__HIP_ThreadIdx>::__X" undef, comdat, align 1
@_ZN17__HIP_CoordinatesI14__HIP_BlockIdxE1yE = weak protected local_unnamed_addr addrspace(4) externally_initialized constant %"struct.__HIP_Coordinates<__HIP_ThreadIdx>::__X" undef, comdat, align 1
@_ZN17__HIP_CoordinatesI14__HIP_BlockIdxE1zE = weak protected local_unnamed_addr addrspace(4) externally_initialized constant %"struct.__HIP_Coordinates<__HIP_ThreadIdx>::__X" undef, comdat, align 1
@_ZN17__HIP_CoordinatesI15__HIP_ThreadIdxE1xE = weak protected local_unnamed_addr addrspace(4) externally_initialized constant %"struct.__HIP_Coordinates<__HIP_ThreadIdx>::__X" undef, comdat, align 1
@_ZN17__HIP_CoordinatesI15__HIP_ThreadIdxE1yE = weak protected local_unnamed_addr addrspace(4) externally_initialized constant %"struct.__HIP_Coordinates<__HIP_ThreadIdx>::__X" undef, comdat, align 1
@_ZN17__HIP_CoordinatesI15__HIP_ThreadIdxE1zE = weak protected local_unnamed_addr addrspace(4) externally_initialized constant %"struct.__HIP_Coordinates<__HIP_ThreadIdx>::__X" undef, comdat, align 1

; Function Attrs: convergent mustprogress nofree norecurse nounwind willreturn
define protected amdgpu_kernel void @_Z15tick_all_kernelPU7_AtomicmPmPli__old__scabbard_instr_replaced__old__(ptr addrspace(1) nocapture %0, ptr addrspace(1) nocapture writeonly %1, ptr addrspace(1) nocapture %2, i32 %3) local_unnamed_addr #0 !dbg !1756 {
  %5 = addrspacecast ptr addrspace(1) %2 to ptr
  tail call fastcc void @_Z10dummy_workPl__old__scabbard_instr_replaced__old__(ptr %5) #10, !dbg !1768
  %6 = atomicrmw add ptr addrspace(1) %0, i64 1 seq_cst, align 8, !dbg !1769
  %7 = add i64 %6, 1, !dbg !1769
  %8 = tail call i32 @llvm.amdgcn.workitem.id.x(), !dbg !1770, !range !1782, !noundef !1783
  %9 = add i32 %8, %3, !dbg !1784
  %10 = zext i32 %9 to i64, !dbg !1785
  %11 = getelementptr inbounds i64, ptr addrspace(1) %1, i64 %10, !dbg !1785
  store i64 %7, ptr addrspace(1) %11, align 8, !dbg !1786, !tbaa !1787
  fence syncscope("workgroup") release, !dbg !1791
  tail call void @llvm.amdgcn.s.barrier(), !dbg !1806
  fence syncscope("workgroup") acquire, !dbg !1807
  tail call fastcc void @_Z10dummy_workPl__old__scabbard_instr_replaced__old__(ptr %5) #10, !dbg !1808
  ret void, !dbg !1809
}

; Function Attrs: mustprogress nofree noinline norecurse nosync nounwind willreturn memory(argmem: readwrite)
define internal fastcc void @_Z10dummy_workPl__old__scabbard_instr_replaced__old__(ptr nocapture %0) unnamed_addr #1 !dbg !1810 {
  %2 = tail call i32 @llvm.amdgcn.workitem.id.x(), !dbg !1816, !range !1782, !noundef !1783
  %3 = zext i32 %2 to i64, !dbg !1816
  %4 = tail call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr(), !dbg !1817
  %5 = tail call i32 @llvm.amdgcn.workgroup.id.x(), !dbg !1817
  %6 = load i32, ptr addrspace(4) %4, align 4, !dbg !1817, !tbaa !1818
  %7 = icmp ult i32 %5, %6, !dbg !1817
  %8 = select i1 %7, i64 6, i64 9, !dbg !1817
  %9 = getelementptr inbounds i16, ptr addrspace(4) %4, i64 %8, !dbg !1817
  %10 = load i16, ptr addrspace(4) %9, align 2, !dbg !1817, !tbaa !1822
  %11 = zext i16 %10 to i64, !dbg !1817
  %12 = zext i32 %5 to i64, !dbg !1824
  %13 = mul nuw nsw i64 %11, %12, !dbg !1825
  %14 = add nuw nsw i64 %13, %3, !dbg !1826
  %15 = shl i64 %14, 32, !dbg !1827
  %16 = ashr exact i64 %15, 32, !dbg !1827
  %17 = getelementptr inbounds i64, ptr %0, i64 %16, !dbg !1828
  %18 = load i64, ptr %17, align 8, !dbg !1829, !tbaa !1787
  %19 = add nsw i64 %16, %18, !dbg !1829
  store i64 %19, ptr %17, align 8, !dbg !1829, !tbaa !1787
  ret void, !dbg !1830
}

; Function Attrs: convergent mustprogress nofree norecurse nounwind willreturn
define protected amdgpu_kernel void @_Z12dummy_kernelv__old__scabbard_instr_replaced__old__() local_unnamed_addr #0 !dbg !1831 {
  fence syncscope("workgroup") release, !dbg !1832
  tail call void @llvm.amdgcn.s.barrier(), !dbg !1836
  fence syncscope("workgroup") acquire, !dbg !1837
  ret void, !dbg !1838
}

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(argmem: write)
define protected amdgpu_kernel void @_Z10dummy_initPl__old__scabbard_instr_replaced__old__(ptr addrspace(1) nocapture writeonly %0) local_unnamed_addr #2 !dbg !1839 {
  %2 = tail call i32 @llvm.amdgcn.workitem.id.x(), !dbg !1843, !range !1782, !noundef !1783
  %3 = zext i32 %2 to i64, !dbg !1843
  %4 = tail call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr(), !dbg !1844
  %5 = tail call i32 @llvm.amdgcn.workgroup.id.x(), !dbg !1844
  %6 = getelementptr inbounds i16, ptr addrspace(4) %4, i64 6, !dbg !1844
  %7 = load i16, ptr addrspace(4) %6, align 4, !dbg !1844, !tbaa !1822
  %8 = zext i16 %7 to i64, !dbg !1844
  %9 = zext i32 %5 to i64, !dbg !1845
  %10 = mul nuw nsw i64 %8, %9, !dbg !1846
  %11 = add nuw nsw i64 %10, %3, !dbg !1847
  %12 = shl i64 %11, 32, !dbg !1848
  %13 = ashr exact i64 %12, 32, !dbg !1848
  %14 = getelementptr inbounds i64, ptr addrspace(1) %0, i64 %13, !dbg !1848
  store i64 0, ptr addrspace(1) %14, align 8, !dbg !1849, !tbaa !1787
  ret void, !dbg !1850
}

; Function Attrs: convergent mustprogress nocallback nofree nounwind willreturn
declare void @llvm.amdgcn.s.barrier() #3

; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.amdgcn.workitem.id.x() #4

; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare align 4 ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr() #4

; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.amdgcn.workgroup.id.x() #4

; Function Attrs: mustprogress nofree norecurse nounwind willreturn memory(readwrite, inaccessiblemem: none)
define amdgpu_kernel void @scabbard.trace.device.dummyFunc(ptr addrspace(1) nocapture %0, i32 %1, float %2, i16 %3, ptr addrspace(1) %4, ptr addrspace(1) nocapture readnone %5, ptr addrspace(1) nocapture readonly %6) local_unnamed_addr #5 comdat {
  ; this is a manually instrumented in function not corresponding to any other function in the original code, body omitted to meet post character limit
  ret void
}

; Function Attrs: mustprogress nofree noinline norecurse nounwind willreturn memory(argmem: readwrite)
define internal fastcc void @"scabbard.trace.device.trace_append$mem"(ptr nocapture %0, i16 zeroext %1, ptr %2, ptr nocapture readonly %3, i32 %4, i32 %5) unnamed_addr #6 {
  ; this is a manually instrumented in function not corresponding to any other function in the original code, body omitted to meet post character limit
  ret void
}

; Function Attrs: mustprogress nofree noinline norecurse nounwind willreturn memory(argmem: readwrite)
define internal fastcc void @"scabbard.trace.device.trace_append$alloc"(ptr nocapture %0, i16 zeroext %1, ptr %2, ptr nocapture readonly %3, i32 %4, i32 %5, i64 %6) unnamed_addr #6 {
  ; this is a manually instrumented in function not corresponding to any other function in the original code, body omitted to meet post character limit
  ret void
}

; Function Attrs: nocallback nofree nosync nounwind willreturn memory(argmem: readwrite)
declare void @llvm.lifetime.start.p5(i64 immarg, ptr addrspace(5) nocapture) #7

; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.amdgcn.workgroup.id.y() #8

; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.amdgcn.workgroup.id.z() #8

; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.amdgcn.workitem.id.y() #8

; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.amdgcn.workitem.id.z() #8

; Function Attrs: nocallback nofree nounwind willreturn memory(argmem: readwrite)
declare void @llvm.memcpy.p0.p5.i64(ptr noalias nocapture writeonly, ptr addrspace(5) noalias nocapture readonly, i64, i1 immarg) #9

; Function Attrs: nocallback nofree nosync nounwind willreturn memory(argmem: readwrite)
declare void @llvm.lifetime.end.p5(i64 immarg, ptr addrspace(5) nocapture) #7

; Function Attrs: convergent mustprogress nofree norecurse nounwind willreturn
define protected amdgpu_kernel void @_Z15tick_all_kernelPU7_AtomicmPmPli(ptr addrspace(1) nocapture %0, ptr addrspace(1) nocapture writeonly %1, ptr addrspace(1) nocapture %2, i32 %3, ptr %4) local_unnamed_addr #0 !dbg !1874 {
  unreachable
}

; Function Attrs: mustprogress nofree noinline norecurse nosync nounwind willreturn memory(argmem: readwrite)
define dso_local fastcc void @_Z10dummy_workPl(ptr nocapture %0, ptr %1) unnamed_addr #1 !dbg !1880 {
  unreachable
}

; Function Attrs: convergent mustprogress nofree norecurse nounwind willreturn
define protected amdgpu_kernel void @_Z12dummy_kernelv(ptr %0) local_unnamed_addr #0 !dbg !1884 {
  fence syncscope("workgroup") release, !dbg !1885
  tail call void @llvm.amdgcn.s.barrier(), !dbg !1889
  fence syncscope("workgroup") acquire, !dbg !1890
  ret void, !dbg !1891
}

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(argmem: write)
define protected amdgpu_kernel void @_Z10dummy_initPl(ptr addrspace(1) nocapture writeonly %0, ptr %1) local_unnamed_addr #2 !dbg !1892 {
  unreachable
}

attributes #0 = { convergent mustprogress nofree norecurse nounwind willreturn "amdgpu-flat-work-group-size"="1,1024" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx90a" "target-features"="+16-bit-insts,+atomic-buffer-global-pk-add-f16-insts,+atomic-fadd-rtn-insts,+ci-insts,+dl-insts,+dot1-insts,+dot10-insts,+dot2-insts,+dot3-insts,+dot4-insts,+dot5-insts,+dot6-insts,+dot7-insts,+dpp,+gfx8-insts,+gfx9-insts,+gfx90a-insts,+mai-insts,+s-memrealtime,+s-memtime-inst,+wavefrontsize64" "uniform-work-group-size"="true" }
attributes #1 = { mustprogress nofree noinline norecurse nosync nounwind willreturn memory(argmem: readwrite) "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx90a" "target-features"="+16-bit-insts,+atomic-buffer-global-pk-add-f16-insts,+atomic-fadd-rtn-insts,+ci-insts,+dl-insts,+dot1-insts,+dot10-insts,+dot2-insts,+dot3-insts,+dot4-insts,+dot5-insts,+dot6-insts,+dot7-insts,+dpp,+gfx8-insts,+gfx9-insts,+gfx90a-insts,+mai-insts,+s-memrealtime,+s-memtime-inst,+wavefrontsize64" }
attributes #2 = { mustprogress nofree norecurse nosync nounwind willreturn memory(argmem: write) "amdgpu-flat-work-group-size"="1,1024" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx90a" "target-features"="+16-bit-insts,+atomic-buffer-global-pk-add-f16-insts,+atomic-fadd-rtn-insts,+ci-insts,+dl-insts,+dot1-insts,+dot10-insts,+dot2-insts,+dot3-insts,+dot4-insts,+dot5-insts,+dot6-insts,+dot7-insts,+dpp,+gfx8-insts,+gfx9-insts,+gfx90a-insts,+mai-insts,+s-memrealtime,+s-memtime-inst,+wavefrontsize64" "uniform-work-group-size"="true" }
attributes #3 = { convergent mustprogress nocallback nofree nounwind willreturn }
attributes #4 = { mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none) }
attributes #5 = { mustprogress nofree norecurse nounwind willreturn memory(readwrite, inaccessiblemem: none) "amdgpu-flat-work-group-size"="1,1024" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx90a" "target-features"="+16-bit-insts,+atomic-buffer-global-pk-add-f16-insts,+atomic-fadd-rtn-insts,+ci-insts,+dl-insts,+dot1-insts,+dot10-insts,+dot2-insts,+dot3-insts,+dot4-insts,+dot5-insts,+dot6-insts,+dot7-insts,+dpp,+gfx8-insts,+gfx9-insts,+gfx90a-insts,+mai-insts,+s-memrealtime,+s-memtime-inst,+wavefrontsize64" "uniform-work-group-size"="true" }
attributes #6 = { mustprogress nofree noinline norecurse nounwind willreturn memory(argmem: readwrite) "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx90a" "target-features"="+16-bit-insts,+atomic-buffer-global-pk-add-f16-insts,+atomic-fadd-rtn-insts,+ci-insts,+dl-insts,+dot1-insts,+dot10-insts,+dot2-insts,+dot3-insts,+dot4-insts,+dot5-insts,+dot6-insts,+dot7-insts,+dpp,+gfx8-insts,+gfx9-insts,+gfx90a-insts,+mai-insts,+s-memrealtime,+s-memtime-inst,+wavefrontsize64" }
attributes #7 = { nocallback nofree nosync nounwind willreturn memory(argmem: readwrite) }
attributes #8 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
attributes #9 = { nocallback nofree nounwind willreturn memory(argmem: readwrite) }
attributes #10 = { nounwind }

!llvm.dbg.cu = !{!0}
!llvm.module.flags = !{!1747, !1748, !1749, !1750, !1751, !1752, !1753}
!opencl.ocl.version = !{!1754, !1754}
!llvm.ident = !{!1755, !1755}

!0 = distinct !DICompileUnit(language: DW_LANG_C_plus_plus_14, file: !1, producer: "AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.0.0 23483 7208e8d15fbf218deb74483ea8c549c67ca4985e)", isOptimized: true, runtimeVersion: 0, emissionKind: FullDebug, enums: !2, retainedTypes: !11, globals: !28, imports: !72, splitDebugInlining: false, nameTableKind: None)
!1 = !DIFile(filename: "test/managed_clock_test.cpp", directory: "/g/g11/osterhou/repos/scabbard", checksumkind: CSK_MD5, checksum: "84927e4bf98b7e7efaf685dc29fe5570")
!2 = !{!3}
!3 = !DICompositeType(tag: DW_TAG_enumeration_type, name: "_Lock_policy", scope: !5, file: !4, line: 49, baseType: !6, size: 32, elements: !7, identifier: "_ZTSN9__gnu_cxx12_Lock_policyE")
; ...
; removed metadata to meet post character limit
; ...
; __CLANG_OFFLOAD_BUNDLE____END__ hip-amdgcn-amd-amdhsa--gfx90a

I apologize for such a long example module, but the example needed to contain a device function that is called from a global function in order to show that some functions don’t get the unreachable treatment.
In this example the “dummy_work” or @_Z10dummy_workPl / @_Z10dummy_workPl__old__scabbard_instr_replaced__old__ function is a device function that does not get “optimised out” or whatever is happening after my pass runs.

NOTE:
I can confirm via debug print statements during my pass that these new functions contain bodies that are appropriately instrumented before my pass finishes, but the output file generated by the command returns as you see above.

Some other info: my pass should be running after all other optimization passes, given that it is loaded last and registers with the pass manager at the end of the optimization pipeline.
I do return PreservedAnalyses::none() after my pass runs, but changing this does not seem to help at all.

Artem-B · March 19, 2024, 11:59pm

Compiler explorer, AKA http://godbolt.org is your friend. It’s much better suited for demonstration of compiler issues.

For the magically disappearing code I usually run compilation with -print-after-all passed to LLVM (so you will need to use -mllvm before it if you do that from clang) and then see where and how exactly it happens.

This should give you a clue which pass is responsible (usually, but sometimes it’s a chain of events that causes the surprises). Once you know the pass, dump IR just before that pass is invoked, and then use debug version of opt + additional debug options to dig into more details on why the code is gone. You may be able to get useful info with -R options, too.
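As a rough sketch of how that search could be mechanized once the -print-after-all output is saved to a file: the helper below scans the log and reports the first dump in which a given function's body is nothing but `unreachable`. It assumes each dump is introduced by a banner line containing the text `IR Dump After`, and that an emptied function is printed as its `define` line followed directly by `unreachable`, as in the module posted above. The name `first_pass_emptying` is made up for illustration, and quoted function names are not handled.

```cpp
#include <cassert>
#include <istream>
#include <optional>
#include <sstream>
#include <string>

// Strip leading and trailing blanks (spaces, tabs, carriage returns).
static std::string trim(const std::string& s) {
  const auto b = s.find_first_not_of(" \t\r");
  if (b == std::string::npos) return "";
  return s.substr(b, s.find_last_not_of(" \t\r") - b + 1);
}

// Hypothetical helper: scans a -print-after-all log and returns the text of
// the first "IR Dump After ..." banner under which `fn` is dumped with a body
// that is nothing but `unreachable`. Returns nullopt if that never happens.
std::optional<std::string> first_pass_emptying(std::istream& log,
                                               const std::string& fn) {
  assert(!fn.empty());
  const std::string marker = "IR Dump After ";
  const std::string needle = "@" + fn + "(";  // assumes an unquoted name
  std::string line, banner;
  bool at_body_start = false;  // true right after the `define` line of `fn`
  while (std::getline(log, line)) {
    const auto m = line.find(marker);
    if (m != std::string::npos) {
      // Remember which pass this dump belongs to, minus the trailing " ***".
      banner = line.substr(m + marker.size());
      banner.erase(banner.find_last_not_of(" *\t\r") + 1);
      at_body_start = false;
      continue;
    }
    const std::string t = trim(line);
    if (at_body_start) {
      // Skip blank lines, comments, and block labels before the first
      // instruction of the body.
      if (t.empty() || t.front() == ';' || t.back() == ':') continue;
      if (t == "unreachable") return banner;
      at_body_start = false;
    }
    if (t.rfind("define ", 0) == 0 && t.find(needle) != std::string::npos)
      at_body_start = true;
  }
  return std::nullopt;
}
```

Opening the saved log with `std::ifstream` and calling `first_pass_emptying(log, "_Z10dummy_workPl")` would name the pass that finally empties the function. The pass that planted the problem is often one or two dumps earlier, so the dump just before the reported one is the next thing to read.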

osterhoutan-UofU · March 28, 2024, 9:36pm

Useful Info:

@Artem-B your advice to use -print-after-all was just what I needed, even though I could not use Godbolt’s excellent interface because of my dependencies’ requirements (see Side Note below), and because, even without those, I didn’t want to learn how to get a custom LLVM pass plugin working in Godbolt. It works just fine from a normal terminal, no Compiler Explorer necessary. Be sure to do the old stdout+stderr redirect trick ({cmd} >>{out-file} 2>&1) to save all the output to a file for review, as there is a lot of it.

My Discoveries:

I was able to figure out that after my pass runs the InstCombinePass replaces all of my instrumented call instructions with poisoned store instructions that the SimplifyCFGPass promptly removes and determines that there is no reason to keep the rest of the function bodies (since the poison store instructions effectively make all instructions after unreachable).

My Question(s) Now:

Why might the InstCombinePass decide that these call instructions, which are marked as returning and (most of the time) have side effects, can be removed?

What is the goal of the InstCombinePass?

All I can find about its goals is below, and it doesn’t seem relevant to calls to void functions containing side effects.

// InstructionCombining - Combine instructions to form fewer, simple
// instructions. This pass does not modify the CFG, and has a tendency to make
// instructions dead, so a subsequent DCE pass is useful.
//
// This pass combines things like:
//    %Y = add int 1, %X
//    %Z = add int 1, %Y
// into:
//    %Z = add int 2, %X

found in: llvm-project/llvm/include/llvm/Transforms/InstCombine/InstCombine.h:83

Why is this only happening to CallInst’s that get created during instrumentation?
- This did not happen in previous versions of my instrumentation pass where I didn’t need to add a function argument/parameter to pass through to the calls to the utility functions I instrumented in.
  - Is it due to the additional argument? (Is all of this because I don’t add metadata for this new argument?)

Side Note:

While I do agree that using the Compiler Explorer on godbolt.org is a great strategy in most situations, I must once again reiterate that as of March 2024 Godbolt does not support AMD’s ROCm/HIP, which I require for my project. It is pivotal for my issue, as it affects which passes get run and with what configuration, and it defines the target arch and target arch lib.

Contextual Information:

For those of you who don’t know, ROCm is the “new” device library AMD has put out to use in combination with their HIP GPU kernel programming language, which is a blatant ripoff of NVidia’s CUDA toolkit for GPU kernels.

jdoerfert · March 28, 2024, 9:49pm

Your calls are UB. My money is on mismatching calling conventions between the call and the callee.
Other mismatches might be possible too.
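One observation consistent with this: the posted `finish_replacing_old_funcs_device` copies the tail-call kind onto the new call but never the calling convention, while the cloned `@_Z10dummy_workPl` is still `fastcc` in the dump. A quick way to check for this kind of mismatch is to lint the `.ll` emitted right after instrumentation. The following is only a text-level sketch under several assumptions: it recognizes a small, non-exhaustive set of calling-convention keywords, treats anything unmarked as `ccc`, ignores `invoke`, and conservatively skips indirect calls and lines whose return type mentions a `%` name. `find_cc_mismatches` and `Mismatch` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

struct Mismatch { std::string callee, call_cc, callee_cc; };

// Calling-convention keywords this sketch knows about (not exhaustive).
static const std::set<std::string> kKnownCCs = {
    "ccc", "fastcc", "coldcc", "amdgpu_kernel", "spir_func", "spir_kernel"};

// Reads "<attrs/cc/type> @name(" starting at `from`; fills the convention
// (default "ccc") and the global's name. Returns false when unsure.
static bool parse(const std::string& line, std::size_t from,
                  std::string& cc, std::string& name) {
  const auto at = line.find('@', from);
  if (at == std::string::npos) return false;
  const std::string head = line.substr(from, at - from);
  if (head.find('%') != std::string::npos) return false;  // likely indirect
  cc = "ccc";
  std::istringstream toks(head);
  for (std::string t; toks >> t;)
    if (kKnownCCs.count(t)) cc = t;
  std::size_t start = at + 1, end;
  if (start < line.size() && line[start] == '"') {
    ++start;                       // quoted name: @"a$b"
    end = line.find('"', start);
  } else {
    end = line.find('(', start);   // plain name: @foo(
  }
  if (end == std::string::npos) return false;
  name = line.substr(start, end - start);
  return !name.empty();
}

// Reports direct calls whose convention differs from the one on the
// callee's define/declare line in the same module text.
std::vector<Mismatch> find_cc_mismatches(const std::string& ir) {
  std::vector<std::string> lines;
  std::istringstream in(ir);
  for (std::string l; std::getline(in, l);) lines.push_back(l);

  auto is_header = [](const std::string& l, std::size_t& skip) {
    if (l.rfind("define ", 0) == 0) { skip = 7; return true; }
    if (l.rfind("declare ", 0) == 0) { skip = 8; return true; }
    return false;
  };

  std::map<std::string, std::string> callee_cc;
  std::string cc, name;
  std::size_t skip = 0;
  for (const auto& l : lines)
    if (is_header(l, skip) && parse(l, skip, cc, name)) callee_cc[name] = cc;

  std::vector<Mismatch> out;
  for (const auto& l : lines) {
    const auto first = l.find_first_not_of(" \t");
    if (first == std::string::npos || l[first] == ';' || l[first] == '!')
      continue;                    // blank, comment, or metadata line
    if (is_header(l, skip)) continue;
    const auto c = l.find("call ");
    if (c == std::string::npos || !parse(l, c + 5, cc, name)) continue;
    const auto it = callee_cc.find(name);
    if (it != callee_cc.end() && it->second != cc)
      out.push_back({name, cc, it->second});
  }
  return out;
}
```

If this reports anything for the instrumented module, the analogous change inside the pass would presumably be to copy the convention onto each new call, for example `ci->setCallingConv(CI->getCallingConv())` for replaced calls, and to give calls to the trace helpers the convention of their callee.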

Artem-B · March 28, 2024, 9:55pm

Compiler explorer does support HIP compilation, though it’s hiding under CUDA on compiler explorer: Compiler Explorer

The hip version is a bit out of date, but they do accept patches, if you need something more recent.

Artem-B · March 28, 2024, 9:59pm

BTW, if -print-after-all helped, compiler explorer provides similar functionality via “[output window]->Add new->Opt pipeline”

osterhoutan-UofU · April 22, 2024, 8:19pm

To all those interested: the final solution I came up with is a cop-out.

I was able to prevent this issue from arising by moving when my pass happens, from Late optimization to Early link time.

I was able to figure out that it was a combination of a few function passes that resulted in the removal of the function bodies, but they were far too complicated and contradictory to track down what data was missing or wrong.

Therefore this is the only real solution, and it only really works because my tool is an instrumentation tool for verification work and not any form of production optimizer or compilation process.
Time will tell if other issues will arise from moving my pass into the link pass segment (especially with complicated builds), but I will address those issues when the time arises.

Artem-B · April 22, 2024, 8:51pm

but they were far too complicated and contradictory to track down what data was missing or wrong.

Welcome to the world of debugging compilers in general and LLVM in particular.

I assume that the instrumentation is essential for whatever it is you’re using it for. Figuring out what makes LLVM consider your code, and everything after it, unreachable would therefore be something you may really want to figure out. Moving problematic code down the compilation pipeline may keep your code around but does not make it more correct. What’s worse, it makes it harder to debug the root cause. The instrumentation code may still be broken, just in a less obvious way.
