RFC: Offload PGO for HIP (AMDGPU Device-Side Profile Guided Optimization)
Summary
This RFC proposes device-side Profile Guided Optimization (PGO) for HIP/AMDGPU, enabling profile-guided compiler optimizations for GPU kernels.
The key contributions are:
- Device PGO infrastructure – instrumentation, profile collection, and consumption pipeline for AMDGPU device code, using only standard HIP APIs (no CLR patches required).
- Uniformity-aware PGO – a safety mechanism that detects whether branches are uniform (all threads take the same path) or divergent at runtime, and gates certain optimizations accordingly.
The uniformity detection is essential because GPU execution follows the SIMT (Single Instruction, Multiple Threads) model, where standard CPU PGO assumptions about “cold” code paths do not hold. Without this safeguard, PGO-guided optimizations like spill placement can cause performance regressions on divergent branches.
Background
PGO Workflow (Brief Recap)
Profile Guided Optimization (PGO) is a compiler optimization technique that uses runtime profiling data to guide optimization decisions:
- Instrumented build: Compiler inserts counters at basic block entries
- Profile collection: Run representative workload, counters record execution frequencies
- Profile-use build: Compiler reads profile data to optimize hot paths, inline frequently-called functions, and make better register allocation decisions
PGO is well-established for CPU code. This RFC extends PGO to GPU device code.
HIP and AMDGPU
HIP (Heterogeneous-Compute Interface for Portability) is AMD’s GPU programming model, similar to CUDA. HIP programs consist of:
- Host code: Runs on CPU, manages memory and launches kernels
- Device code: Runs on GPU as massively parallel kernels
SIMT Execution Model
AMDGPU uses the SIMT (Single Instruction, Multiple Threads) execution model:
| Term | Definition |
|---|---|
| Lane | A single thread of execution |
| Wave | A group of lanes (32 or 64) executing in lockstep on the same instruction |
| Workgroup | A group of waves that can synchronize and share local memory |
| Divergence | When lanes within a wave take different branch paths |
Key property: When a branch diverges, the wave must execute both paths with some lanes masked off. There is no “skipping” of the cold path if any lane needs it.
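This property can be illustrated with a small host-side sketch (plain C++, not HIP; `pathsExecuted` is a hypothetical helper introduced here for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Host-side sketch (not HIP) of SIMT branch execution for one 32-lane
// wave. Bit i of takenMask is set if lane i takes the branch. The wave
// must execute every side whose lane mask is non-empty; the "cold"
// side cannot be skipped if even one lane needs it.
inline int pathsExecuted(uint32_t takenMask) {
  int paths = 0;
  if (takenMask != 0)
    ++paths; // "then" side runs, non-taking lanes masked off
  if (~takenMask != 0)
    ++paths; // "else" side runs, taking lanes masked off
  return paths;
}
```

For a uniform mask (all zeros or all ones) only one path executes; any mixed mask forces the wave through both.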
Memory Coalescing
GPU memory performance depends heavily on coalescing – combining multiple lane memory accesses into fewer, larger transactions:
- Coalesced access: All lanes access contiguous addresses → single efficient transaction
- Uncoalesced access: Lanes access scattered addresses → multiple slow transactions
When only some lanes are active (partial wave), memory accesses are inherently less efficient because fewer addresses can be combined per transaction.
Motivation
PGO is a well-established technique for CPU code optimization, but GPU kernels have fundamentally different execution characteristics due to the SIMT model.
The Problem: CPU PGO Assumptions Don’t Hold for GPU
On CPUs, PGO identifies “cold” code paths and optimizes accordingly – for example, placing register spills in rarely-executed blocks. This works because cold paths genuinely execute infrequently.
On GPUs with SIMT execution, this assumption breaks down:
- Divergent branches: When threads within a wave take different paths, the wave must execute both paths (with some threads masked off).
- Memory coalescing impact: If spills are placed in a “cold” path that still executes due to divergence, only a subset of threads access memory simultaneously, causing poor coalescing.
- Performance inversion: The “optimization” of moving spills to cold paths can cause significant slowdowns when those paths execute with partial waves.
The Solution: Uniformity-Aware PGO
This RFC addresses the problem by:
- Detecting branch uniformity at runtime – tracking whether blocks are entered uniformly (all threads together) or divergently (partial waves).
- Gating sensitive optimizations – for divergent blocks, certain optimizations (like spill placement bias) are disabled while other PGO benefits are preserved.
This approach ensures PGO provides benefits where safe while preventing regressions on divergent code paths.
Use Cases
HIPRTC Applications (Runtime Compilation)
For applications using HIPRTC (runtime compilation), PGO enables workload-adaptive optimization:
- Measurement phase: The application passes profile-generate options to the HIPRTC compilation API, producing instrumented kernels. These kernels are executed under typical workloads to collect profile data.
- Production phase: The application passes profile-use options with the path to the collected profile data. HIPRTC compiles optimized kernels tailored to the observed workload characteristics.
- Caching: Optimized kernels can be cached and reused as long as workload traits remain stable. When workload characteristics change, the application can repeat the measurement phase to adapt.
This workflow allows the same kernel source code to be optimized differently for different workload patterns (e.g., different input sizes, sparsity patterns, or branch distributions).
Static HIP Applications
For traditionally-compiled HIP applications:
- Instrumented build: Build the application with profile-generate flags:

  ```
  clang++ -fprofile-generate -x hip app.hip -o app_instrumented
  ```

- Profile collection: Run the instrumented application with representative workloads:

  ```
  ./app_instrumented <typical_inputs>
  # Produces: default.profraw, default.amdgcn-amd-amdhsa.0.profraw, ...
  ```

- Merge profiles: Combine profile data from multiple runs:

  ```
  llvm-profdata merge -o app.profdata default.profraw
  llvm-profdata merge -o app.amdgcn-amd-amdhsa.profdata \
      default.amdgcn-amd-amdhsa.*.profraw
  ```

- Optimized build: Rebuild with profile-use flags:

  ```
  clang++ -fprofile-use=app.profdata -x hip app.hip -o app_optimized
  ```
The optimized build will have both host and device code optimized based on the collected profiles.
Scope
This document describes the design for device-side PGO for HIP (AMDGPU):
- how device counters are instrumented and updated,
- how device profile sections are discovered at runtime,
- how device profile data is copied back to the host and written as a `.profraw`,
- how that `.profraw` becomes `.profdata`, and
- how the profile is consumed in a subsequent compilation.
The implementation spans:
- LLVM/Clang (instrumentation + driver behavior),
- compiler-rt profiling runtime (device collection/writing).
No CLR/HIP runtime patches are required. The implementation uses only standard HIP APIs (hipGetSymbolAddress, hipMemcpy) available in stock ROCm.
Terms
- CUID (Compilation Unit ID): a unique identifier assigned to each translation unit (TU) during HIP compilation. Since host and device compilations for each TU are separate processes, CUID serves as a coordination mechanism – it allows host code to reference corresponding device symbols (e.g., registering shadow variables for device globals), distinguishes symbols from different TUs after linking (e.g., `__profc_all_<CUID>`), and prevents symbol collisions when multiple TUs define similarly-named functions.
- host profile: profile data written for CPU code.
- device profile: profile data written for GPU code (AMDGPU).
- shadow variable: a host-side variable registered with HIP so that `hipGetSymbolAddress()` can resolve the corresponding device global.
- section shadow variable: a host-side variable registered with HIP for a specific per-TU device section symbol (e.g. `__profc_all_<CUID>`). Used to pre-register device memory with CLR before `hipMemcpy`.
- static-linked kernels: kernels loaded via standard HIP fatbin registration (`__hipRegisterFatBinary` / `__hip_module_ctor`).
- dynamic modules: device code loaded/unloaded via `hipModuleLoad*` / `hipModuleUnload`.
High-level architecture
Key device symbol
The LLVM instrumentation pass (InstrProfiling.cpp) creates a per-TU device global (lives in GPU memory):
__llvm_offload_prf_<CUID>: a struct containing device pointers (begin/end) to the profile sections.
The symbol uses CUID suffix to avoid collision when multiple TUs are linked into a single device code object. On the host side, a corresponding shadow variable is registered with HIP during module initialization, allowing the runtime to resolve the device address via hipGetSymbolAddress().
For GPU targets these are per-TU contiguous sections (CUID-suffixed):
| Section | Symbol | Contents | Purpose |
|---|---|---|---|
| Counters | `__profc_all_<CUID>` | `[N x i64]` array | Execution counts per basic block (expanded for per-wave slots) |
| Data | `__profd_all_<CUID>` | Array of `__llvm_profile_data` records | Function metadata: name hash, CFG hash, counter count, pointer to counters |
| Names | `__llvm_prf_names_<CUID>` | Compressed string blob | Function names for profile matching |
| Uniform counters | `__profu_all_<CUID>` | `[N x i64]` array | Counts only when all lanes enter block together (for uniformity detection) |
How they are used:
- During instrumented execution, kernel code increments counters in `__profc_all_<CUID>` (and `__profu_all_<CUID>` for uniform entries)
- At program exit, the host runtime locates these sections via `__llvm_offload_prf_<CUID>`
- The runtime copies section data from device to host memory
- The data is written to a `.profraw` file, which `llvm-profdata merge` processes
- During profile-use compilation, the compiler reads the merged profile and uses the data records to match counters to functions by name/hash
Conceptually (8 pointers, begin pointers first, then end pointers):
struct __llvm_offload_prf_<CUID> {
i8* cnts_begin; i8* data_begin; i8* names_begin; i8* ucnts_begin;
i8* cnts_end; i8* data_end; i8* names_end; i8* ucnts_end;
};
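For illustration, the begin/end pairs let the host compute each section's byte size after copying the struct back; a minimal host-side sketch (struct and helper names are illustrative, mirroring the layout above):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side mirror of the per-TU metadata struct: four begin pointers,
// then four end pointers (cnts, data, names, ucnts), matching the
// layout above. Field names are illustrative.
struct OffloadPrf {
  const uint8_t *cnts_begin, *data_begin, *names_begin, *ucnts_begin;
  const uint8_t *cnts_end, *data_end, *names_end, *ucnts_end;
};

// After copying the 8 pointers to the host, section byte sizes fall
// out of the begin/end pairs.
inline size_t cntsSizeBytes(const OffloadPrf &p) {
  return static_cast<size_t>(p.cnts_end - p.cnts_begin);
}
```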
Key host runtime entrypoint
At profile dump time, the host profiling runtime calls:
__llvm_profile_hip_collect_device_data()
That function:
- resolves each TU’s `__llvm_offload_prf_<CUID>` on the GPU via HIP APIs,
- copies device profile sections back to host memory,
- writes a device `.profraw` with a valid header and relocated pointers.
Data flow (end-to-end)
Phase A: Instrumented build (collect)
- Build HIP program with instrumentation enabled (typical PGO “generate” mode).
- Run workload.
- Host runtime writes host `.profraw`.
- Host runtime triggers device collection, producing one device `.profraw` per TU with a TU-indexed device suffix: `.amdgcn-amd-amdhsa.<TUIndex>` (e.g., `.amdgcn-amd-amdhsa.0`, `.amdgcn-amd-amdhsa.1`)
Phase B: Merge
Use llvm-profdata merge to combine all device .profraw files (from all TUs) into a single .profdata:
llvm-profdata merge -o profile.amdgcn-amd-amdhsa.profdata \
profile.*.amdgcn-amd-amdhsa.*.profraw
Phase C: Profile-use build (optimize)
Rebuild using -fprofile-use=<...>.profdata. Clang/HIP ensures the device compilation uses the device-suffixed profdata.
Compiler-side design (LLVM/Clang)
0) IR Instrumentation: How BB counters are defined (PGO-gen phase)
The first step in PGO is instrumenting the IR to insert counter increments at basic blocks.
Entry point
- `llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp` – inserts `@llvm.instrprof.increment` intrinsics
- `llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp` – lowers intrinsics to actual counter operations
Step 1: Edge/block selection
FuncPGOInstrumentation::getInstrumentBBs() identifies which blocks need counters using a minimal spanning tree algorithm on the CFG. This minimizes the number of counters while still allowing full block frequency reconstruction.
NumCounters = NumInstrumentedBlocks + NumSelectInsts
Step 2: Intrinsic insertion
For each instrumented block, PGO inserts:
call void @llvm.instrprof.increment(
i8* @__profn_<func>, ; function name
i64 <hash>, ; CFG hash (detects code changes)
i32 <num_counters>, ; total counters for this function
i32 <counter_idx> ; which counter to increment
)
Step 3: Counter array creation
InstrProfiling.cpp lowers the intrinsics and creates:
| Symbol | Type | Purpose |
|---|---|---|
| `@__profc_<func>` | `[N x i64]` | Counter array (N = num_counters) |
| `@__profd_<func>` | `%__llvm_profile_data` | Function metadata (hash, counter ptr, etc.) |
| `@__profn_<func>` | `[M x i8]` | Function name string |
Step 4: AMDGPU-specific counter expansion
For AMDGPU targets, each basic block counter is expanded into multiple slots to alleviate atomic operation contention:
uint64_t NumCounters = Inc->getNumCounters()->getZExtValue();
if (TT.isAMDGPU())
NumCounters *= (OffloadNumProfilingThreads + 1); // default: x256
Why this is needed: On GPUs, thousands of waves may execute concurrently. If all waves atomically update the same counter, contention becomes a severe bottleneck. By providing 256 slots per counter and distributing waves evenly across slots, contention is reduced by ~256x.
Step 5: Counter index calculation (AMDGPU)
At runtime, lowerIncrementAMDGPU() maps each wave to a slot:
Slot = (blockHi << 5) | warpLocal
CounterIdx = OriginalCounterIdx * KSlots + Slot
Where:
- `KSlots = 1 << OffloadProfilingThreadBitWidth` (default 256)
- `blockHi` = high bits of the workgroup ID
- `warpLocal` = wave index within the workgroup
This ensures waves are distributed evenly across slots.
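The index math above can be sketched as plain C++ (hypothetical helper names; exactly which workgroup-ID bits feed `blockHi` is an implementation detail, so three low bits are used here purely for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the slot mapping with the default K = 256 slots per
// counter: 3 bits derived from the workgroup ID and 5 bits of the
// wave-within-workgroup index select one of 256 slots.
constexpr uint64_t KSlots = 256; // 1 << OffloadProfilingThreadBitWidth

inline uint64_t slotFor(uint64_t workgroupId, uint64_t warpLocal) {
  uint64_t blockHi = workgroupId & 0x7; // 3 bits (illustrative choice)
  return (blockHi << 5) | (warpLocal & 0x1F);
}

inline uint64_t counterIdx(uint64_t originalCounterIdx, uint64_t slot) {
  return originalCounterIdx * KSlots + slot;
}
```

Consecutive base counters thus occupy disjoint 256-slot ranges, and waves land on different slots within each range.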
Step 6: Wave-aggregated increment
Each wave counts its active lanes, but only the wave leader (first active lane) performs the atomic update:
Value *ActiveMask = Builder.CreateCall(BallotI32Fn, {Builder.getTrue()}); // which lanes active
Value *NumActive = Builder.CreateCall(CtpopI32Fn, {ActiveMask});
Value *StepTimesActive = Builder.CreateMul(IncStep, NumActive);
// Leader performs: counter[slot] += StepTimesActive
This approach:
- Counts all active lanes accurately – each lane contributes to `NumActive` via the ballot intrinsic
- Reduces atomics by 32x (or 64x) – only the leader updates, not every lane
- Distributes across slots – waves map to different slots, reducing contention
Observed overhead: ~30% vs. uninstrumented code in practice.
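The net effect of the wave-aggregated update can be simulated on the host (a sketch; the ballot intrinsic is replaced by a plain popcount, and the helper name is hypothetical):

```cpp
#include <cassert>
#include <bitset>
#include <cstdint>

// Host-side simulation of one wave-aggregated counter update: instead
// of up to 32 per-lane atomic adds of `step`, the leader lane performs
// a single add of step * popcount(activeMask). The resulting counter
// value is identical either way.
inline uint64_t waveAggregatedAdd(uint64_t counter, uint32_t activeMask,
                                  uint64_t step) {
  uint64_t numActive = std::bitset<32>(activeMask).count(); // active lanes
  return counter + step * numActive; // the leader's single atomic add
}
```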
1) Clang HIP driver: device profile-use filename rewriting
When the user passes -fprofile-use=<file>.profdata on a HIP compilation, the device compilation is given:
-fprofile-instr-use=<file>.amdgcn-amd-amdhsa.profdata
This avoids mixing CPU and GPU profile data in a single filename and makes the device profile consumption explicit.
2) LLVM instrumentation: __llvm_offload_prf_<CUID> creation (device side)
On AMDGPU device targets, LLVM creates per-TU profile data structures with contiguous allocation to ensure reliable section boundaries.
The Linker Ordering Problem
Unlike host PGO which relies on linker-generated __start_/__stop_ symbols, GPU code objects have two problems:
- `__start_`/`__stop_` symbols are NOTYPE – HIP cannot find them via `hipGetSymbolAddress`
- Linker reorders symbols freely within a section – explicit anchor variables may not end up at section boundaries
For example, if we create __prf_start_cnts before counters and __prf_end_cnts after, the linker may still place them as:
__prf_end_cnts <- Linker placed this first!
__profc_funcA
__profc_funcB
__prf_start_cnts <- Linker placed this last!
Solution: Per-TU Contiguous Allocation
Instead of creating separate global variables for each function’s counters (which the linker can reorder), we allocate one contiguous array per TU for each profile section type.
Per-TU structure:
; All counters in one contiguous array (no reordering possible)
@__profc_all_<CUID> = [TotalCounters x i64] zeroinitializer, section "__llvm_prf_cnts_<CUID>"
; All data entries in one contiguous array
@__profd_all_<CUID> = [NumFunctions x %__llvm_profile_data] zeroinitializer, section "__llvm_prf_data_<CUID>"
; All uniform counters in one contiguous array
@__profu_all_<CUID> = [TotalCounters x i64] zeroinitializer, section "__llvm_prf_ucnts_<CUID>"
; Metadata struct with known boundaries
; Note: the struct field order is standardized as:
; begin pointers first, then end pointers.
@__llvm_offload_prf_<CUID> = {
ptr @__profc_all_<CUID>, ; cnts_begin
ptr @__profd_all_<CUID>, ; data_begin
ptr @__llvm_prf_names_<CUID>, ; names_begin
ptr @__profu_all_<CUID>, ; ucnts_begin
ptr <end of counters>, ; cnts_end (computed: base + size)
ptr <end of data>, ; data_end
ptr <end of names>, ; names_end
ptr <end of uniform counters> ; ucnts_end
}
Key benefits:
| Approach | Symbol Count | Linker Reordering | Boundary Reliability |
|---|---|---|---|
| Separate per-function | N symbols | Can reorder | Unreliable |
| Contiguous per-TU | 1 symbol | N/A (single symbol) | Reliable |
Implementation in InstrProfiling.cpp:
The InstrProfiling pass is a module pass, meaning all instrumentation intrinsics are visible before creating the contiguous arrays:
// Phase 1: Count total counters needed across all functions
void InstrLowerer::lower() {
// First pass: collect all instrprof intrinsics, count total slots
size_t TotalCounters = 0;
for (auto &Inc : IncrementIntrinsics) {
TotalCounters += Inc->getNumCounters() * KSlots;
}
// Allocate single contiguous array
AllCounters = new GlobalVariable(
ArrayType::get(Int64Ty, TotalCounters), ...);
// Second pass: assign offsets to each function
size_t Offset = 0;
for (auto &Inc : IncrementIntrinsics) {
FunctionOffsets[Inc->getName()] = Offset;
Offset += Inc->getNumCounters() * KSlots;
}
}
// Counter access uses offset into contiguous array
Value *getCounterAddress(InstrProfCntrInstBase *I) {
size_t FuncOffset = FunctionOffsets[I->getName()];
size_t CounterIdx = FuncOffset + I->getIndex() * KSlots + Slot;
return GEP(AllCounters, CounterIdx);
}
Profile matching across builds:
The contiguous allocation is an implementation detail for GPU section discovery and does not change the fundamental matching semantics:
- Matching is by function name/hash + counter count, not by GPU memory addresses.
- Offsets/pointers are relocated when writing the `.profraw` so the raw reader can recover the intended counter ranges.
- Source changes are handled normally – if a function’s CFG changes, its hash changes and profile mismatch warnings apply as usual.
Multi-TU programs:
Each TU produces its own CUID-suffixed device global __llvm_offload_prf_<CUID>, and the host-side ctor injects a distinct host shadow variable per TU and registers it with the runtime. At profile collection time, the runtime iterates all registered shadow variables and resolves each TU’s device global via hipGetSymbolAddress.
TU1: shadow var #0 -> device `__llvm_offload_prf_<CUID1>` -> `__profc_all_<CUID1>`, `__profd_all_<CUID1>`, ...
TU2: shadow var #1 -> device `__llvm_offload_prf_<CUID2>` -> `__profc_all_<CUID2>`, `__profd_all_<CUID2>`, ...
Each TU’s profile is written to a separate file with a TU index suffix (e.g., .0.profraw, .1.profraw), which are then merged with llvm-profdata merge.
CUID (Compilation Unit ID):
For HIP, LLVM first tries to extract a TU-unique CUID from a module global named like __hip_cuid_<hash>. If none is present, LLVM falls back to hashing the module name. This CUID is used to suffix per-TU section names (and a few per-TU symbol names) so they remain distinct across linked TUs.
The implementation includes “keep-alive” techniques so the symbols are emitted/retained by the linker.
3) LLVM instrumentation: host-side registration of __llvm_offload_prf_<CUID>
On the host side (i.e., not the GPU module), LLVM injects code in __hip_module_ctor to:
- call `__hipRegisterVar(...)` to register a host shadow variable for the device global `__llvm_offload_prf_<CUID>`
- call `__llvm_profile_hip_register_shadow_variable(host_shadow_ptr)`
The shadow-variable mechanism allows the profiling runtime to resolve the device global via:
hipGetSymbolAddress(&dev_ptr, host_shadow_ptr)
Additional: host-side registration of per-TU section symbols (HIP memcpy safety)
To avoid any HSA dependency and keep device->host copies HIP-only, the runtime must ensure
the device section pointers are tracked by CLR. This is achieved by:
- Creating/registering section shadow variables for the per-TU symbols: `__profc_all_<CUID>`, `__profd_all_<CUID>`, `__profu_all_<CUID>`, and `__llvm_prf_names_<CUID>`
- Registering each shadow variable with HIP via `__hipRegisterVar(...)`
- Registering each shadow variable with compiler-rt via `__llvm_profile_hip_register_section_shadow_variable(host_shadow_ptr)`
At collection time, compiler-rt calls hipGetSymbolAddress on these section shadow variables to
pre-register the underlying device allocations with CLR before using hipMemcpy for section copies.
4) Counter update lowering on AMDGPU
The AMDGPU lowering strategy is wave-aggregated:
- elect a leader lane per wave,
- compute `popcount(activeMask)`,
- update the counter by `step * numActive` using:
  - atomic leader mode (default)
  - an alternate sampling/overflow mode (legacy/debug experimentation)
This reduces overhead vs. naive per-lane atomic increments.
Runtime-side design (compiler-rt)
1) Dynamic HIP symbol resolution
The ROCm-specific profiling runtime (InstrProfilingPlatformROCm.c) loads HIP at runtime:
- `dlopen("libamdhip64.so")`
- `dlsym()` for: `hipGetSymbolAddress`, `hipMemcpy`, `hipModuleGetGlobal`, and (optionally) `hipMemcpyFromSymbol`
This keeps compiler-rt decoupled from linking directly against HIP.
2) Static-linked collection via shadow variables
The runtime maintains a list of registered shadow variables:
__llvm_profile_hip_register_shadow_variable(void *ptr)
On collection:
- for each shadow variable, resolve the device address via `hipGetSymbolAddress`
- call `ProcessDeviceOffloadPrf(device_ptr)`
Section symbol pre-registration (HIP memcpy-only)
In addition, the runtime maintains a list of section shadow variables:
__llvm_profile_hip_register_section_shadow_variable(void *ptr)
Before copying section payloads, the runtime calls hipGetSymbolAddress() on each of these
shadow variables to ensure CLR tracks the underlying device allocations.
3) Dynamic module support
The runtime tracks dynamically loaded HIP modules:
- `__llvm_profile_hip_register_dynamic_module(int rc, void **hipModuleOutPtr)`
  - uses `hipModuleGetGlobal(..., "__llvm_offload_prf_<CUID>")` to locate the module’s profile global
- `__llvm_profile_hip_unregister_dynamic_module(void *hipModule)`
  - triggers `ProcessDeviceOffloadPrf(...)` at unload time

The LLVM instrumentation inserts these calls adjacent to:
- `hipModuleLoad*` (after successful load)
- `hipModuleUnload` (before unload)
4) Writing the device .profraw
ProcessDeviceOffloadPrf():
- copies the 8 pointers in `__llvm_offload_prf_<CUID>` device->host via `hipMemcpy`
- computes section sizes
- pre-registers per-TU section symbols via `hipGetSymbolAddress` (side effect: CLR memory tracking)
- copies sections device->host via `hipMemcpy` (HIP-only; no HSA APIs)
- writes a `.profraw` file with:
  - a correct `__llvm_profile_header`
  - relocated `__llvm_profile_data.CounterPtr` fields so they are valid offsets within the written file layout
Filename convention
The device writer uses the base filename from __llvm_profile_get_filename() (usually derived from LLVM_PROFILE_FILE) and inserts:
.amdgcn-amd-amdhsa.<TUIndex>
Each TU writes to a separate file with a TU index suffix to support multi-TU programs.
Example (single-TU):
- base: `default.profraw`
- device: `default.amdgcn-amd-amdhsa.0.profraw`

Example (multi-TU with 2 TUs):
- base: `default.profraw`
- TU 0: `default.amdgcn-amd-amdhsa.0.profraw`
- TU 1: `default.amdgcn-amd-amdhsa.1.profraw`
Use llvm-profdata merge to combine profiles from all TUs.
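The suffix insertion can be sketched as follows (hypothetical helper; for simplicity it assumes the base filename ends in `.profraw`, which a real implementation would not rely on):

```cpp
#include <cassert>
#include <string>

// Sketch of the device filename derivation: insert
// ".amdgcn-amd-amdhsa.<TUIndex>" before the ".profraw" extension.
// Assumes the base name ends in ".profraw" (illustrative only).
inline std::string deviceProfileName(const std::string &base,
                                     unsigned tuIndex) {
  const std::string ext = ".profraw";
  std::string stem = base.substr(0, base.size() - ext.size());
  return stem + ".amdgcn-amd-amdhsa." + std::to_string(tuIndex) + ext;
}
```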
CLR/HIP runtime interactions (no patches required)
The implementation uses only stock HIP APIs – no CLR patches are needed.
1) Why hipGetSymbolAddress is central
The profiling runtime relies on:
hipGetSymbolAddress(&device_ptr, host_shadow_ptr)
This requires that the host shadow variable be registered via __hipRegisterVar(...) during fatbin registration, which is why the LLVM host-side ctor injection is required. This uses standard HIP symbol registration available in all ROCm versions.
2) Verified with stock CLR
E2E tests pass with stock ROCm CLR (tested with a stock ROCm Docker image). No special environment variables or CLR patches are required.
Environment variables / debug knobs
- `LLVM_PROFILE_FILE`: base profile output filename template (host); device output is derived from this.
- `LLVM_PROFILE_VERBOSE=1`: enables verbose logging in the ROCm profiling runtime.
Profile Reading: NumCounters Expansion
When reading GPU device profraw files, the reader must account for the counter expansion:
The Dual Representation
The numeric values below are an example for a kernel/function with B = 5 base counters (roughly “basic blocks”) and the default K = 256 offload profiling slots:

```
counters in file = B * K,  and  NumOffloadProfilingThreads = K - 1
```
| Field | Stored Value | Meaning |
|---|---|---|
| `Data->NumCounters` | 5 | Example base counter count B (used for profile matching) |
| `Data->NumOffloadProfilingThreads` | 255 | Example K − 1, where K is slots-per-counter (default K = 256) |
| Actual counters in file | 1280 | Example B × K = 5 × 256 |
Reader Implementation (InstrProfReader.cpp)
The readRawCounts() function must expand NumCounters when reading GPU profiles:
template <class IntPtrT>
Error RawInstrProfReader<IntPtrT>::readRawCounts(InstrProfRecord &Record) {
uint32_t NumCounters = swap(Data->NumCounters);
if (NumCounters == 0)
return error(instrprof_error::malformed, "number of counters is zero");
// For GPU profiles with per-slot counters, the actual number of counter
// entries in the file is NumCounters * (NumOffloadProfilingThreads + 1).
uint16_t NumOffloadThreads = swap(Data->NumOffloadProfilingThreads);
if (NumOffloadThreads > 0)
NumCounters *= (NumOffloadThreads + 1);
// ... read NumCounters values from file
}
Why this is necessary:
- `createDataVariable()` stores `NumCounters = B` (base count) for profile matching
- The actual counter array has B × K slots for concurrent wave updates (default K = 256 on AMDGPU)
- Without this expansion, the reader would only read B counters, leaving B × K − B values unread
- This would corrupt subsequent data entries and crash during profile processing
Profile Format Version 14
This RFC introduces Version 14 of the indexed profile format (bumped from Version 13).
Note on raw .profraw format versioning: the raw format version remains INSTR_PROF_RAW_VERSION = 10. The new NumOffloadProfilingThreads field is placed to preserve the overall __llvm_profile_data record size (it occupies previously-unused alignment/padding space between NumValueSites[] and NumBitmapBytes). This avoids ABI/layout churn, but correct interpretation of offload profiles requires a toolchain that understands the new semantics (counter expansion + reduction and uniformity gating).
New Fields in Version 14
| Field | Location | Type | Purpose |
|---|---|---|---|
| `NumOffloadProfilingThreads` | `__llvm_profile_data` struct | `uint16_t` | Slots-per-counter minus one (255 for the default K = 256 on AMDGPU) |
| `UniformityBits` | Indexed profile record | `uint8_t[]` | 1 bit per block indicating uniform (1) or divergent (0) |
Data Record Layout Change
The INSTR_PROF_DATA macro in InstrProfData.inc adds a new field after NumValueSites:
INSTR_PROF_DATA(const uint16_t, llvm::Type::getInt16Ty(Ctx),
NumOffloadProfilingThreads,
ConstantInt::get(llvm::Type::getInt16Ty(Ctx),
NumOffloadProfilingThreadsVal))
This field is stored in the raw profile (__llvm_profile_data entries) and used by the reader to expand counters correctly.
Indexed Profile Record Extension
When writing indexed profiles, Version 14 adds UniformityBits after BitmapBytes:
[FuncHash: 8 bytes]
[NumCounters: 8 bytes]
[Counts: NumCounters x 8 bytes]
[NumBitmapBytes: 8 bytes]
[BitmapBytes: padded to 8-byte alignment]
[NumUniformityBits: 8 bytes] // NEW in Version 14
[UniformityBits: padded to 8-byte alignment] // NEW in Version 14
[ValueProfData: variable]
Backward Compatibility
| Scenario | Behavior |
|---|---|
| Old reader, new profile | Recognizes Version 14 as unknown, may reject or skip |
| New reader, old profile | Works correctly; UniformityBits not present, defaults to empty |
| `--write-prev-version` | Writes Version 13 format, skipping UniformityBits for compatibility with older toolchains |
New Section Kind
Added IPSK_ucnts section for uniform counters:
INSTR_PROF_SECT_ENTRY(IPSK_ucnts,
INSTR_PROF_QUOTE(INSTR_PROF_UCNTS_COMMON),
INSTR_PROF_UCNTS_COFF, "__DATA,")
Section names:
- ELF: `__llvm_prf_ucnts`
- COFF: `.lprfuc$M`
Known Limitations
- Performance validation scope: The performance results in this RFC come from synthetic microbenchmarks designed to stress register spilling. Real-world application performance will vary depending on register pressure, branch patterns, and other kernel characteristics. Validation on production HIP applications (e.g., rocBLAS, MIGraphX) is future work.
- Dynamic module multi-TU: Dynamic modules loaded via `hipModuleLoad` use `hipModuleGetGlobal` to find the profile symbol. The current implementation looks for a fixed pattern; multi-TU dynamic modules would require symbol enumeration (future enhancement).
- Binary IDs: the device `.profraw` writer currently sets `BinaryIdsSize = 0` (not supported yet).
- Value profiling: not supported for device in this prototype (header fields set to 0).
- Dynamic module robustness: dynamic modules are expected to be collected at unload; a warning is emitted if unload processing is missed.
- Partial wave handling: Partial waves (last wave in dispatch) are conservatively marked as divergent.
- Wave32-only uniformity check (current): the current instrumentation uses wave32-specific intrinsics/masks (e.g. `llvm.amdgcn.ballot.i32` and the `0xFFFFFFFF` full-wave mask). Wave64 support is future work.
- Uniformity-to-MBB mapping (current): SpillPlacement currently uses an approximation (`ProfileIdx = MBBNum % UniformityStr.size()`) because block structure can change after IR profiling (e.g. during ISel).
Verification checklist
Pipeline sanity
- Run a known offload-PGO workload / microbenchmark with:
  - optionally `LLVM_PROFILE_VERBOSE=1` for debug output
- Confirm both host and device `.profraw` files are created.
- Merge:

  ```
  llvm-profdata merge -o <out>.profdata <in1>.profraw [more...]
  ```

- Rebuild with `-fprofile-use=<out>.profdata` and confirm:
  - device compilation consumes `<out>.amdgcn-amd-amdhsa.profdata` (driver rewrite).
E2E Verification
Example end-to-end output from a synthetic microbenchmark (designed to stress register spilling):
========================================
PGO Spill E2E Test (high pressure)
========================================
LLVM: <your clang/llvm-profdata toolchain>
Arch: <your target GPU arch>
N=262144, Iterations=10000, Runs=5
[4/7] Merging profile...
Found .unifcnts files (uniformity detection enabled)
NumCounters: 1280, NumOffloadProfilingThreads: 255 # example numbers
_Z15spill_kernel_v3Pdii: 5/5 blocks uniform (from .unifcnts) # example
InstrProfRecord::merge: reduced 1280 slots to 5 counters # example reduction
[7/7] Benchmarking...
Baseline: XXX ms
PGO: YYY ms
Change: ZZ% FASTER/SLOWER
Key verification points (pipeline correctness, not specific speedup values):
- `NumCounters: 1280` – example showing the reader expanded 5 × 256 slots
- `5/5 blocks uniform` – example uniformity output
- `reduced 1280 slots to 5 counters` – example merge reduction
- PGO build completes successfully with profile consumption
Experimental Validation
Important caveat: The performance results in this section come from synthetic microbenchmarks specifically designed to stress register spilling and validate the uniformity gating mechanism. These results demonstrate that the approach works for its intended purpose, but should not be interpreted as representative of general HIP application performance. Real-world performance will depend on application-specific factors including register pressure, branch patterns, and kernel characteristics.
Microbenchmark Design
The validation microbenchmarks were designed to isolate the spill placement effect by:
- Forcing register spills via tight VGPR limits (`__attribute__((amdgpu_num_vgpr(16)))`)
- Creating symmetric hot/cold paths with equal register pressure (16 VGPRs each, 32 total needed)
- Varying branch uniformity to test uniform vs. divergent behavior
This design intentionally creates worst-case conditions for the problem we’re solving – kernels where spill placement significantly impacts performance.
Observed Results (Synthetic Benchmarks)
The following results were observed on the synthetic microbenchmarks:
| Test Case | PGO Effect | Notes |
|---|---|---|
| Uniform branches | ~12% faster | All threads spill together → good coalescing |
| Divergent branches (with gating) | ~53-67% faster | Gating prevents regression; other PGO benefits apply |
| Without uniformity gating (IR test) | ~3.7x slower | Demonstrates the problem this RFC addresses |
The “3.7x slower” result without gating validates why uniformity detection is necessary – it shows the regression that can occur when CPU PGO assumptions are applied to divergent GPU code.
Why These Results May Not Generalize
- Register pressure is artificial – real kernels may not have such tight VGPR limits
- Branch structure is symmetric – real kernels often have asymmetric hot/cold paths
- Spill-dominated performance – real kernels have other bottlenecks (memory bandwidth, compute, etc.)
Uniformity Definitions
| Level | Definition | PGO Safe? |
|---|---|---|
| Wave-uniform | All threads in wave take same path (waves may differ) | Yes |
| Workgroup-uniform | All waves in workgroup take same path | Yes |
| Grid-uniform | All threads in grid take same path | Yes |
| Divergent | Threads within wave take different paths | No |
Wave-uniformity is the critical threshold – if all threads in a wave agree, memory coalescing is good regardless of what other waves do.
Note on Counter-Based Heuristics (Design Rationale)
An earlier approach considered using existing per-slot counters to detect uniformity. The AMDGPU counter lowering tracks active lanes:
counter[slot] += step * NumActive // NumActive = popcount(activeMask)
Naive heuristic: If a block’s per-slot counter deltas are always multiples of WaveSize (32 or 64), then each entry likely executed with a full wave active.
// Detection pseudocode
bool isWaveUniform(ArrayRef<uint64_t> slotCounters, unsigned waveSize) {
for (uint64_t count : slotCounters) {
if (count != 0 && count % waveSize != 0)
return false; // Partial wave execution detected
}
return true;
}
Uniformity-Aware PGO (Implemented)
Problem Statement
The counter-sum-based heuristic above is not robust for divergence detection because per-slot counters aggregate over all iterations. This can mask divergence (e.g., partial-wave entries can sum to a multiple of WaveSize). Therefore, we treat it as a fallback/approximation only and implement a dedicated uniform-entry signal.
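A small host-side sketch (the masks and wave size are hypothetical examples, and `looksWaveUniform` merely restates the heuristic above) shows how aggregation can mask divergence:

```cpp
#include <cstdint>
#include <vector>

// Restatement of the naive heuristic: a slot looks "uniform" if its
// accumulated count is a multiple of the wave size.
bool looksWaveUniform(const std::vector<uint64_t> &slotCounters,
                      unsigned waveSize) {
  for (uint64_t count : slotCounters)
    if (count != 0 && count % waveSize != 0)
      return false;
  return true;
}

// Simulate two block entries on a wave32 target, each with only half the
// lanes active (hypothetical masks).
uint64_t simulateMaskedSum() {
  uint32_t masks[] = {0x0000FFFFu, 0xFFFF0000u}; // two partial-wave entries
  uint64_t counter = 0;
  for (uint32_t mask : masks)
    counter += __builtin_popcount(mask); // counter[slot] += NumActive
  return counter; // 16 + 16 = 32: a wave-size multiple, despite divergence
}
```

The heuristic accepts the resulting count of 32 as "uniform" even though neither entry was a full wave, which is exactly why a dedicated uniform-entry signal is needed.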
Solution: Block-Level Divergence Tracking
We implemented a parallel counter array (UniformCounters) that tracks uniform block entries – incremented only when all active lanes enter a block together.
Implementation Overview
1. Instrumentation (InstrProfiling.cpp)
// At each block entry, check if execution is uniform
Value *ActiveMask = Builder.CreateCall(BallotI32Fn, {Builder.getTrue()});
Value *FullWaveMask = ConstantInt::get(Int32Ty, 0xFFFFFFFF);
Value *IsUniform = Builder.CreateICmpEQ(ActiveMask, FullWaveMask);
// Always increment regular counter
counter[slot] += step * NumActive;
// Conditionally increment uniform counter
if (IsUniform)
  uniform_counter[slot] += step * NumActive;
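A host-side simulation of this dual-counter update (wave32 assumed; the struct and function names are illustrative, not from the implementation) shows how the uniform counter exposes the divergence that the regular counter hides:

```cpp
#include <cstdint>

struct Counters {
  uint64_t regular = 0;
  uint64_t uniform = 0;
};

// Mirror of the instrumentation: always bump the regular counter by the
// number of active lanes; bump the uniform counter only on full-wave entry.
void recordBlockEntry(Counters &c, uint32_t activeMask) {
  uint32_t numActive = __builtin_popcount(activeMask);
  c.regular += numActive;          // counter[slot] += step * NumActive
  if (activeMask == 0xFFFFFFFFu)   // ballot == full wave => uniform entry
    c.uniform += numActive;        // uniform_counter[slot] += step * NumActive
}

// Two half-wave entries: the regular counter sums to a wave-size multiple
// (32), but the uniform counter stays 0, correctly flagging divergence.
Counters simulateDivergentBlock() {
  Counters c;
  recordBlockEntry(c, 0x0000FFFFu);
  recordBlockEntry(c, 0xFFFF0000u);
  return c;
}
```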
2. Profile Section (InstrProfData.inc)
Added new section kind for uniform counters:
INSTR_PROF_SECT_ENTRY(IPSK_ucnts, "__llvm_prf_ucnts", "__DATA,")
Extended __llvm_offload_prf_<CUID> structure from 6 to 8 pointers, and standardized
field order as begin pointers first, then end pointers (matches runtime):
struct __llvm_offload_prf_<CUID> {
void *cnts_begin;
void *data_begin;
void *names_begin;
void *ucnts_begin; // Uniform counters (NEW)
void *cnts_end;
void *data_end;
void *names_end;
void *ucnts_end;
};
3. Runtime (InstrProfilingPlatformROCm.c)
The runtime:
- pre-registers per-TU section symbols via `hipGetSymbolAddress` (CLR tracking)
- copies sections device->host via `hipMemcpy` (HIP-only; no HSA dependency)
- writes uniform counters to a separate `.unifcnts` file:
profile.123.amdgcn-amd-amdhsa.profraw # Regular profile
profile.123.amdgcn-amd-amdhsa.unifcnts # Uniform counters
.unifcnts file format:
| Field | Type | Description |
|---|---|---|
| Magic | uint64 | 0x55434E5450524F46 (“UCNTPROF”) |
| Version | uint64 | 1 |
| NumCounters | uint64 | Number of counter values |
| CountersSize | uint64 | Size in bytes |
| Counters | uint64[NumCounters] | Raw counter data |
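A minimal parser sketch for this header (assuming the four fields are written as native-endian uint64 values in the order shown; the struct and function names are illustrative, not from the implementation):

```cpp
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

struct UnifCntsFile {
  uint64_t Version;
  std::vector<uint64_t> Counters;
};

static const uint64_t kUCntMagic = 0x55434E5450524F46ULL; // "UCNTPROF"

// Parse a .unifcnts image from memory; returns std::nullopt on a bad magic,
// an inconsistent CountersSize, or a truncated counter payload.
std::optional<UnifCntsFile> parseUnifCnts(const uint8_t *data, size_t size) {
  uint64_t hdr[4]; // Magic, Version, NumCounters, CountersSize
  if (size < sizeof(hdr))
    return std::nullopt;
  std::memcpy(hdr, data, sizeof(hdr));
  if (hdr[0] != kUCntMagic || hdr[3] != hdr[2] * sizeof(uint64_t) ||
      size < sizeof(hdr) + hdr[3])
    return std::nullopt;
  UnifCntsFile out;
  out.Version = hdr[1];
  out.Counters.resize(hdr[2]);
  std::memcpy(out.Counters.data(), data + sizeof(hdr), hdr[3]);
  return out;
}
```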
4. Profile Merge (llvm-profdata.cpp)
During merge, the tool reads both files and computes per-block uniformity:
// Read .unifcnts file alongside .profraw
std::vector<uint64_t> UniformCounters;
readUniformCountersFile(ProfileFilename, UniformCounters);
// For each block, compute uniformity ratio
for (size_t BlockIdx = 0; BlockIdx < NumBlocks; ++BlockIdx) {
  uint64_t TotalCount = 0, UniformCount = 0;
  for (size_t Slot = 0; Slot < Slots; ++Slot) {
    TotalCount += Counts[BlockIdx * Slots + Slot];
    UniformCount += UniformCounters[BlockIdx * Slots + Slot];
  }
  // 90% threshold for uniformity (never-executed blocks default to uniform)
  bool IsUniform = (TotalCount == 0) ||
                   ((double)UniformCount / TotalCount >= 0.9);
  UniformityBits[BlockIdx] = IsUniform ? 1 : 0;
}
5. Compiler Consumption
Uniformity is propagated to SpillPlacement via function attribute:
// PGOInstrumentation.cpp (PGO-use phase)
F.addFnAttr("amdgpu-block-uniformity", "UUDUU"); // U=Uniform, D=Divergent
// SpillPlacement.cpp
BlockFrequency getAdjustedFrequency(unsigned MBBNum) {
// If -amdgpu-flatten-spill-frequency is enabled: flatten ALL blocks.
//
// Otherwise, if `amdgpu-block-uniformity` is present: flatten only divergent
// blocks and keep PGO frequency for uniform blocks.
//
// Note: current implementation approximates mapping from MBB -> profile block
// index via `ProfileIdx = MBBNum % UniformityStr.size()` due to ISEL block
// splitting/merging; this can be improved by carrying a stable mapping.
}
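A standalone sketch of the gating decision described above (the helper, its parameters, and the `Flattened` value are illustrative; the modulo mapping mirrors the stated approximation):

```cpp
#include <cstdint>
#include <string>

// Frequency used by SpillPlacement for one machine basic block.
// UniformityStr comes from the "amdgpu-block-uniformity" attribute,
// e.g. "UUDUU" (U = uniform, D = divergent). Values are illustrative.
uint64_t getAdjustedFrequency(unsigned MBBNum, uint64_t PGOFreq,
                              const std::string &UniformityStr,
                              bool FlattenAll) {
  const uint64_t Flattened = 1; // flat frequency: no hot/cold preference
  if (FlattenAll)               // -amdgpu-flatten-spill-frequency
    return Flattened;
  if (UniformityStr.empty())    // no uniformity info: keep PGO frequency
    return PGOFreq;
  // Approximate MBB -> profile block mapping (ISEL may split/merge blocks).
  size_t ProfileIdx = MBBNum % UniformityStr.size();
  return UniformityStr[ProfileIdx] == 'D' ? Flattened : PGOFreq;
}
```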
Test Results
| Kernel | Branch Condition | Uniformity Detection | Block Pattern |
|---|---|---|---|
| Uniform microbenchmark (example) | `i % 10000 == 0` | 5/5 blocks uniform | [U, U, U, U, U] |
| Divergent microbenchmark (example) | `(gid+i) % 10000 == 0` | 4/5 blocks uniform | [D, D, U, U, U] |
Uniformity Detection Justification
Why Uniformity Gating is Necessary
The uniformity detection mechanism exists to prevent a specific failure mode: PGO-guided spill placement causing regressions on divergent branches.
The failure chain:
- PGO tells regalloc “cold path is rare”
- Regalloc places spills preferentially in cold path
- With divergent branches, cold path still executes (partial waves in SIMT)
- Partial wave scratch access = poor memory coalescing
- Memory latency dominates, causing slowdown instead of speedup
Synthetic microbenchmarks (designed to isolate this effect using IR-level branch_weights metadata) confirm that this regression can be severe on spill-heavy kernels with divergent branches.
Uniformity gating (-amdgpu-flatten-spill-frequency) addresses this by:
- Flattening spill placement frequencies for divergent blocks
- Preserving PGO frequencies for other optimizations (scheduling, layout, etc.)
- Allowing PGO benefits on uniform blocks where spill placement is safe
E2E Test Design
The e2e microbenchmarks (internal tracking) use kernels designed to force register spills via tight VGPR limits:
__attribute__((amdgpu_num_vgpr(16))) // Force spills
__global__ void spill_kernel(...) {
// Hot path: 8 doubles = 16 VGPRs
double h0, h1, h2, h3, h4, h5, h6, h7;
// Cold path: 8 doubles = 16 VGPRs
double c0, c1, c2, c3, c4, c5, c6, c7;
for (int i = 0; i < iterations; i++) {
if (CONDITION) {
// HOT PATH (99.99%): update h0-h7
} else {
// COLD PATH (0.01%): update c0-c7
}
}
}
Key design points:
- VGPR limit of 16 cannot fit both paths (16 + 16 = 32 needed) → must spill
- Symmetric structure: both paths have equal register pressure
- Cold path frequency: 1 in 10000 iterations
Branch conditions determine uniformity:
| Test | Branch Condition | Uniformity | Why |
|---|---|---|---|
| Uniform v3 | `i % 10000 != 0` | All threads same | Loop counter `i` is identical for all threads |
| Divergent v3 | `(gid+i) % 10000 != 0` | Threads differ | `gid` (thread ID) makes condition thread-dependent |
| Divergent HOT v3 | `(gid+i) % 100 != 0` | Both paths divergent | Higher cold frequency → both blocks entered divergently |
These tests verify that:
- The uniformity detection correctly identifies uniform vs. divergent blocks
- PGO optimization applies successfully without regression on divergent code
- The full pipeline (profile collection, merge, consumption) works end-to-end
Why Divergent Kernels Can Benefit from PGO
Even when SpillPlacement frequencies are flattened for divergent blocks, PGO can still provide benefits.
This is because uniformity gating is surgical – it only affects SpillPlacement:
| Pass | Uses PGO Frequencies? | Gated by Uniformity? |
|---|---|---|
| SpillPlacement | Yes | YES (flattened for divergent blocks) |
| MachineScheduler | Yes | No |
| BasicBlock Layout | Yes | No |
| Loop Transformations | Yes | No |
| Inlining | Yes | No |
| s_clause Grouping | Yes | No |
Benefits still realized for divergent code:
- Better instruction scheduling – MachineScheduler uses PGO frequencies
- Better code layout – hot code placed in `.text.hot` section (cache benefits)
- Better clause grouping – `s_clause` sizes improved (0x4 vs 0x3 observed)
- Branch probability effects – compiler knows which path is likely
Heuristic Summary
The current approach uses a simple, safe heuristic:
for each basic block:
if block is DIVERGENT (detected from .unifcnts):
flatten frequency for SpillPlacement
else:
use actual PGO frequency
This is sufficient because:
- Symmetric kernels (balanced pressure): Flattening has no effect (frequencies already similar)
- Asymmetric kernels (cold has higher pressure): Flattening prevents catastrophic regression
- All kernels: Other PGO passes still benefit from frequencies
More sophisticated pressure-aware heuristics were considered but deemed unnecessary:
- The simple heuristic prevents the worst-case regression
- Adding complexity provides marginal benefit
- Safe default is more important for initial upstream
Future Work
- Enhancements
  - Support wave64 targets (gfx9)
  - Handle partial waves correctly (capture initial mask at kernel entry)
  - Integrate with static uniformity analysis as fallback
- Profile format improvements
  - Consider merging `.unifcnts` into the main `.profraw` format
  - Add uniformity to the indexed profile format
References
LLVM Pull Request
Code Entrypoints
LLVM/Clang:
- `clang/lib/Driver/ToolChains/HIPAMD.cpp` – profile-use filename rewrite
- `llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp` – AMDGPU lowering, uniform counters, `__llvm_offload_prf_<CUID>`
- `llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp` – consumes `UniformityBits` and sets `amdgpu-block-uniformity`
- `llvm/lib/CodeGen/SpillPlacement.cpp` – uniformity-gated spill placement (`-amdgpu-flatten-spill-frequency`)
- `llvm/include/llvm/ProfileData/InstrProfData.inc` – profile section definitions (`IPSK_ucnts`)
- `llvm/include/llvm/ProfileData/InstrProf.h` – `UniformityBits` / `NumOffloadProfilingThreads` fields + `isBlockUniform()`
- `llvm/lib/ProfileData/InstrProfReader.cpp` – raw reader expands counters using `NumOffloadProfilingThreads`
- `llvm/lib/ProfileData/InstrProfWriter.cpp` – indexed writer serializes `UniformityBits` (omitted under `--write-prev-version`)
- `llvm/lib/ProfileData/InstrProf.cpp` – merge reduces per-slot counters and carries `UniformityBits`
- `llvm/tools/llvm-profdata/llvm-profdata.cpp` – `.unifcnts` reading and uniformity computation
compiler-rt:
- `compiler-rt/lib/profile/InstrProfilingPlatformROCm.c` – HIP device collection, `.unifcnts` writing, TU-indexed filenames
- `compiler-rt/lib/profile/InstrProfilingFile.c` – calls `__llvm_profile_hip_collect_device_data()`
CLR (HIP Runtime) – no changes required:
- Uses stock `hipGetSymbolAddress` and `hipMemcpy` APIs
- Verified working with stock ROCm CLR