RFC: Offload PGO for HIP (AMDGPU Device-Side Profile Guided Optimization)
Summary
This RFC proposes device-side Profile Guided Optimization (PGO) for HIP/AMDGPU, enabling profile-guided compiler optimizations for GPU kernels.
The key contributions are:
- Device PGO infrastructure – instrumentation, profile collection, and consumption pipeline for AMDGPU device code, using only standard HIP APIs (no CLR patches required).
- Uniformity-aware PGO – a safety mechanism that detects whether branches are uniform (all threads take the same path) or divergent at runtime, and gates certain optimizations accordingly.
The uniformity detection is essential because GPU execution follows the SIMT (Single Instruction, Multiple Threads) model, where standard CPU PGO assumptions about “cold” code paths do not hold. Without this safeguard, PGO-guided optimizations like spill placement can cause performance regressions on divergent branches.
Background
PGO Workflow (Brief Recap)
Profile Guided Optimization (PGO) is a compiler optimization technique that uses runtime profiling data to guide optimization decisions:
- Instrumented build: Compiler inserts counters at basic block entries
- Profile collection: Run representative workload, counters record execution frequencies
- Profile-use build: Compiler reads profile data to optimize hot paths, inline frequently-called functions, and make better register allocation decisions
PGO is well-established for CPU code. This RFC extends PGO to GPU device code.
HIP and AMDGPU
HIP (Heterogeneous-Compute Interface for Portability) is AMD’s GPU programming model, similar to CUDA. HIP programs consist of:
- Host code: Runs on CPU, manages memory and launches kernels
- Device code: Runs on GPU as massively parallel kernels
SIMT Execution Model
AMDGPU uses the SIMT (Single Instruction, Multiple Threads) execution model:
| Term | Definition |
|---|---|
| Lane | A single thread of execution |
| Wave | A group of lanes (32 or 64) executing in lockstep on the same instruction |
| Workgroup | A group of waves that can synchronize and share local memory |
| Divergence | When lanes within a wave take different branch paths |
Key property: When a branch diverges, the wave must execute both paths with some lanes masked off. There is no “skipping” of the cold path if any lane needs it.
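This property can be illustrated with a small host-side sketch (plain C++, not HIP; `pathsExecuted` is a hypothetical helper introduced here for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Host-side sketch (not HIP) of SIMT branch execution for one 32-lane
// wave. Bit i of takenMask is set if lane i takes the branch. The wave
// must execute every side whose lane mask is non-empty; the "cold"
// side cannot be skipped if even one lane needs it.
inline int pathsExecuted(uint32_t takenMask) {
  int paths = 0;
  if (takenMask != 0)
    ++paths; // "then" side runs, non-taking lanes masked off
  if (~takenMask != 0)
    ++paths; // "else" side runs, taking lanes masked off
  return paths;
}
```

For a uniform mask (all zeros or all ones) only one path executes; any mixed mask forces the wave through both.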
Memory Coalescing
GPU memory performance depends heavily on coalescing – combining multiple lane memory accesses into fewer, larger transactions:
- Coalesced access: All lanes access contiguous addresses → single efficient transaction
- Uncoalesced access: Lanes access scattered addresses → multiple slow transactions
When only some lanes are active (partial wave), memory accesses are inherently less efficient because fewer addresses can be combined per transaction.
Motivation
PGO is a well-established technique for CPU code optimization, but GPU kernels have fundamentally different execution characteristics due to the SIMT model.
The Problem: CPU PGO Assumptions Don’t Hold for GPU
On CPUs, PGO identifies “cold” code paths and optimizes accordingly – for example, placing register spills in rarely-executed blocks. This works because cold paths genuinely execute infrequently.
On GPUs with SIMT execution, this assumption breaks down:
- Divergent branches: When threads within a wave take different paths, the wave must execute both paths (with some threads masked off).
- Memory coalescing impact: If spills are placed in a “cold” path that still executes due to divergence, only a subset of threads access memory simultaneously, causing poor coalescing.
- Performance inversion: The “optimization” of moving spills to cold paths can cause significant slowdowns when those paths execute with partial waves.
The Solution: Uniformity-Aware PGO
This RFC addresses the problem by:
- Detecting branch uniformity at runtime – tracking whether blocks are entered uniformly (all threads together) or divergently (partial waves).
- Gating sensitive optimizations – for divergent blocks, certain optimizations (like spill placement bias) are disabled while other PGO benefits are preserved.
This approach ensures PGO provides benefits where safe while preventing regressions on divergent code paths.
Use Cases
HIPRTC Applications (Runtime Compilation)
For applications using HIPRTC (runtime compilation), PGO enables workload-adaptive optimization:
- Measurement phase: The application passes profile-generate options to the HIPRTC compilation API, producing instrumented kernels. These kernels are executed under typical workloads to collect profile data.
- Production phase: The application passes profile-use options with the path to the collected profile data. HIPRTC compiles optimized kernels tailored to the observed workload characteristics.
- Caching: Optimized kernels can be cached and reused as long as workload traits remain stable. When workload characteristics change, the application can repeat the measurement phase to adapt.
This workflow allows the same kernel source code to be optimized differently for different workload patterns (e.g., different input sizes, sparsity patterns, or branch distributions).
Static HIP Applications
For traditionally-compiled HIP applications:
- Instrumented build: Build the application with profile-generate flags:

  ```
  clang++ -fprofile-generate -x hip app.hip -o app_instrumented
  ```

- Profile collection: Run the instrumented application with representative workloads:

  ```
  ./app_instrumented <typical_inputs>
  # Produces: default.profraw, default.amdgcn-amd-amdhsa.0.profraw, ...
  ```

- Merge profiles: Combine profile data from multiple runs:

  ```
  llvm-profdata merge -o app.profdata default.profraw
  llvm-profdata merge -o app.amdgcn-amd-amdhsa.profdata \
      default.amdgcn-amd-amdhsa.*.profraw
  ```

- Optimized build: Rebuild with profile-use flags:

  ```
  clang++ -fprofile-use=app.profdata -x hip app.hip -o app_optimized
  ```
The optimized build will have both host and device code optimized based on the collected profiles.
Scope
This document describes the design for device-side PGO for HIP (AMDGPU):
- how device counters are instrumented and updated,
- how device profile sections are discovered at runtime,
- how device profile data is copied back to the host and written as a `.profraw`,
- how that `.profraw` becomes `.profdata`, and
- how the profile is consumed in a subsequent compilation.
The implementation spans:
- LLVM/Clang (instrumentation + driver behavior),
- compiler-rt profiling runtime (device collection/writing).
No CLR/HIP runtime patches are required. The implementation uses only standard HIP APIs (hipGetSymbolAddress, hipMemcpy) available in stock ROCm.
Terms
- CUID (Compilation Unit ID): a unique identifier assigned to each translation unit (TU) during HIP compilation. Since host and device compilations for each TU are separate processes, CUID serves as a coordination mechanism – it allows host code to reference corresponding device symbols (e.g., registering shadow variables for device globals), distinguishes symbols from different TUs after linking (e.g., `__profc_all_<CUID>`), and prevents symbol collisions when multiple TUs define similarly-named functions.
- host profile: profile data written for CPU code.
- device profile: profile data written for GPU code (AMDGPU).
- shadow variable: a host-side variable registered with HIP so that `hipGetSymbolAddress()` can resolve the corresponding device global.
- section shadow variable: a host-side variable registered with HIP for a specific per-TU device section symbol (e.g. `__profc_all_<CUID>`). Used to pre-register device memory with CLR before `hipMemcpy`.
- static-linked kernels: kernels loaded via standard HIP fatbin registration (`__hipRegisterFatBinary` / `__hip_module_ctor`).
- dynamic modules: device code loaded/unloaded via `hipModuleLoad*` / `hipModuleUnload`.
High-level architecture
Key device symbol
The LLVM instrumentation pass (InstrProfiling.cpp) creates a per-TU device global (lives in GPU memory):
__llvm_offload_prf_<CUID>: a struct containing device pointers (begin/end) to the profile sections.
The symbol uses CUID suffix to avoid collision when multiple TUs are linked into a single device code object. On the host side, a corresponding shadow variable is registered with HIP during module initialization, allowing the runtime to resolve the device address via hipGetSymbolAddress().
For GPU targets these are per-TU contiguous sections (CUID-suffixed):
| Section | Symbol | Contents | Purpose |
|---|---|---|---|
| Counters | `__profc_all_<CUID>` | `[N x i64]` array | Execution counts per basic block (expanded for per-wave slots) |
| Data | `__profd_all_<CUID>` | Array of `__llvm_profile_data` records | Function metadata: name hash, CFG hash, counter count, pointer to counters |
| Names | `__llvm_prf_names_<CUID>` | Compressed string blob | Function names for profile matching |
| Uniform counters | `__profu_all_<CUID>` | `[N x i64]` array | Counts only when all lanes enter block together (for uniformity detection) |
How they are used:
- During instrumented execution, kernel code increments counters in `__profc_all_<CUID>` (and `__profu_all_<CUID>` for uniform entries)
- At program exit, the host runtime locates these sections via `__llvm_offload_prf_<CUID>`
- The runtime copies section data from device to host memory
- The data is written to a `.profraw` file, which `llvm-profdata merge` processes
- During profile-use compilation, the compiler reads the merged profile and uses the data records to match counters to functions by name/hash
Conceptually (8 pointers, begin pointers first, then end pointers):
struct __llvm_offload_prf_<CUID> {
i8* cnts_begin; i8* data_begin; i8* names_begin; i8* ucnts_begin;
i8* cnts_end; i8* data_end; i8* names_end; i8* ucnts_end;
};
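For illustration, the begin/end pairs let the host compute each section's byte size after copying the struct back; a minimal host-side sketch (struct and helper names are illustrative, mirroring the layout above):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side mirror of the per-TU metadata struct: four begin pointers,
// then four end pointers (cnts, data, names, ucnts), matching the
// layout above. Field names are illustrative.
struct OffloadPrf {
  const uint8_t *cnts_begin, *data_begin, *names_begin, *ucnts_begin;
  const uint8_t *cnts_end, *data_end, *names_end, *ucnts_end;
};

// After copying the 8 pointers to the host, section byte sizes fall
// out of the begin/end pairs.
inline size_t cntsSizeBytes(const OffloadPrf &p) {
  return static_cast<size_t>(p.cnts_end - p.cnts_begin);
}
```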
Key host runtime entrypoint
At profile dump time, the host profiling runtime calls:
__llvm_profile_hip_collect_device_data()
That function:
- resolves each TU’s `__llvm_offload_prf_<CUID>` on the GPU via HIP APIs,
- copies device profile sections back to host memory,
- writes a device `.profraw` with a valid header and relocated pointers.
Data flow (end-to-end)
Phase A: Instrumented build (collect)
- Build HIP program with instrumentation enabled (typical PGO “generate” mode).
- Run workload.
- Host runtime writes host `.profraw`.
- Host runtime triggers device collection, producing one device `.profraw` per TU with a TU-indexed device suffix: `.amdgcn-amd-amdhsa.<TUIndex>` (e.g., `.amdgcn-amd-amdhsa.0`, `.amdgcn-amd-amdhsa.1`)
Phase B: Merge
Use llvm-profdata merge to combine all device .profraw files (from all TUs) into a single .profdata:
llvm-profdata merge -o profile.amdgcn-amd-amdhsa.profdata \
profile.*.amdgcn-amd-amdhsa.*.profraw
Phase C: Profile-use build (optimize)
Rebuild using -fprofile-use=<...>.profdata. Clang/HIP ensures the device compilation uses the device-suffixed profdata.
Compiler-side design (LLVM/Clang)
0) IR Instrumentation: How BB counters are defined (PGO-gen phase)
The first step in PGO is instrumenting the IR to insert counter increments at basic blocks.
Entry point
- `llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp` – inserts `@llvm.instrprof.increment` intrinsics
- `llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp` – lowers intrinsics to actual counter operations
Step 1: Edge/block selection
FuncPGOInstrumentation::getInstrumentBBs() identifies which blocks need counters using a minimal spanning tree algorithm on the CFG. This minimizes the number of counters while still allowing full block frequency reconstruction.
NumCounters = NumInstrumentedBlocks + NumSelectInsts
Step 2: Intrinsic insertion
For each instrumented block, PGO inserts:
call void @llvm.instrprof.increment(
i8* @__profn_<func>, ; function name
i64 <hash>, ; CFG hash (detects code changes)
i32 <num_counters>, ; total counters for this function
i32 <counter_idx> ; which counter to increment
)
Step 3: Counter array creation
InstrProfiling.cpp lowers the intrinsics and creates:
| Symbol | Type | Purpose |
|---|---|---|
| `@__profc_<func>` | `[N x i64]` | Counter array (N = num_counters) |
| `@__profd_<func>` | `%__llvm_profile_data` | Function metadata (hash, counter ptr, etc.) |
| `@__profn_<func>` | `[M x i8]` | Function name string |
Step 4: AMDGPU-specific counter expansion
For AMDGPU targets, each basic block counter is expanded into multiple slots to alleviate atomic operation contention:
uint64_t NumCounters = Inc->getNumCounters()->getZExtValue();
if (TT.isAMDGPU())
NumCounters *= (OffloadNumProfilingThreads + 1); // default: x256
Why this is needed: On GPUs, thousands of waves may execute concurrently. If all waves atomically update the same counter, contention becomes a severe bottleneck. By providing 256 slots per counter and distributing waves evenly across slots, contention is reduced by ~256x.
Step 5: Counter index calculation (AMDGPU)
At runtime, lowerIncrementAMDGPU() maps each wave to a slot:
Slot = (blockHi << 5) | warpLocal
CounterIdx = OriginalCounterIdx * KSlots + Slot
Where:
- `KSlots = 1 << OffloadProfilingThreadBitWidth` (default 256)
- `blockHi` = high bits of the workgroup ID
- `warpLocal` = wave index within the workgroup
This ensures waves are distributed evenly across slots.
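The index math above can be sketched as plain C++ (hypothetical helper names; exactly which workgroup-ID bits feed `blockHi` is an implementation detail, so three low bits are used here purely for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the slot mapping with the default K = 256 slots per
// counter: 3 bits derived from the workgroup ID and 5 bits of the
// wave-within-workgroup index select one of 256 slots.
constexpr uint64_t KSlots = 256; // 1 << OffloadProfilingThreadBitWidth

inline uint64_t slotFor(uint64_t workgroupId, uint64_t warpLocal) {
  uint64_t blockHi = workgroupId & 0x7; // 3 bits (illustrative choice)
  return (blockHi << 5) | (warpLocal & 0x1F);
}

inline uint64_t counterIdx(uint64_t originalCounterIdx, uint64_t slot) {
  return originalCounterIdx * KSlots + slot;
}
```

Consecutive base counters thus occupy disjoint 256-slot ranges, and waves land on different slots within each range.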
Step 6: Wave-aggregated increment
Each wave counts its active lanes, but only the wave leader (first active lane) performs the atomic update:
Value *ActiveMask = Builder.CreateCall(BallotI32Fn, {Builder.getTrue()}); // which lanes active
Value *NumActive = Builder.CreateCall(CtpopI32Fn, {ActiveMask});
Value *StepTimesActive = Builder.CreateMul(IncStep, NumActive);
// Leader performs: counter[slot] += StepTimesActive
This approach:
- Counts all active lanes accurately – each lane contributes to `NumActive` via the ballot intrinsic
- Reduces atomics by 32x (or 64x) – only the leader updates, not every lane
- Distributes across slots – waves map to different slots, reducing contention
Observed overhead: ~30% vs. uninstrumented code in practice.
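The net effect of the wave-aggregated update can be simulated on the host (a sketch; the ballot intrinsic is replaced by a plain popcount, and the helper name is hypothetical):

```cpp
#include <cassert>
#include <bitset>
#include <cstdint>

// Host-side simulation of one wave-aggregated counter update: instead
// of up to 32 per-lane atomic adds of `step`, the leader lane performs
// a single add of step * popcount(activeMask). The resulting counter
// value is identical either way.
inline uint64_t waveAggregatedAdd(uint64_t counter, uint32_t activeMask,
                                  uint64_t step) {
  uint64_t numActive = std::bitset<32>(activeMask).count(); // active lanes
  return counter + step * numActive; // the leader's single atomic add
}
```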
1) Clang HIP driver: device profile-use filename rewriting
When the user passes -fprofile-use=<file>.profdata on a HIP compilation, the device compilation is given:
-fprofile-instr-use=<file>.amdgcn-amd-amdhsa.profdata
This avoids mixing CPU and GPU profile data in a single filename and makes the device profile consumption explicit.
2) LLVM instrumentation: __llvm_offload_prf_<CUID> creation (device side)
On AMDGPU device targets, LLVM creates per-TU profile data structures with contiguous allocation to ensure reliable section boundaries.
The Linker Ordering Problem
Unlike host PGO which relies on linker-generated __start_/__stop_ symbols, GPU code objects have two problems:
- `__start_`/`__stop_` symbols are NOTYPE – HIP cannot find them via `hipGetSymbolAddress`
- Linker reorders symbols freely within a section – explicit anchor variables may not end up at section boundaries
For example, if we create __prf_start_cnts before counters and __prf_end_cnts after, the linker may still place them as:
__prf_end_cnts <- Linker placed this first!
__profc_funcA
__profc_funcB
__prf_start_cnts <- Linker placed this last!
Solution: Per-TU Contiguous Allocation
Instead of creating separate global variables for each function’s counters (which the linker can reorder), we allocate one contiguous array per TU for each profile section type.
Per-TU structure:
; All counters in one contiguous array (no reordering possible)
@__profc_all_<CUID> = [TotalCounters x i64] zeroinitializer, section "__llvm_prf_cnts_<CUID>"
; All data entries in one contiguous array
@__profd_all_<CUID> = [NumFunctions x %__llvm_profile_data] zeroinitializer, section "__llvm_prf_data_<CUID>"
; All uniform counters in one contiguous array
@__profu_all_<CUID> = [TotalCounters x i64] zeroinitializer, section "__llvm_prf_ucnts_<CUID>"
; Metadata struct with known boundaries
; Note: the struct field order is standardized as:
; begin pointers first, then end pointers.
@__llvm_offload_prf_<CUID> = {
ptr @__profc_all_<CUID>, ; cnts_begin
ptr @__profd_all_<CUID>, ; data_begin
ptr @__llvm_prf_names_<CUID>, ; names_begin
ptr @__profu_all_<CUID>, ; ucnts_begin
ptr <end of counters>, ; cnts_end (computed: base + size)
ptr <end of data>, ; data_end
ptr <end of names>, ; names_end
ptr <end of uniform counters> ; ucnts_end
}
Key benefits:
| Approach | Symbol Count | Linker Reordering | Boundary Reliability |
|---|---|---|---|
| Separate per-function | N symbols | Can reorder | Unreliable |
| Contiguous per-TU | 1 symbol | N/A (single symbol) | Reliable |
Implementation in InstrProfiling.cpp:
The InstrProfiling pass is a module pass, meaning all instrumentation intrinsics are visible before creating the contiguous arrays:
// Phase 1: Count total counters needed across all functions
void InstrLowerer::lower() {
// First pass: collect all instrprof intrinsics, count total slots
size_t TotalCounters = 0;
for (auto &Inc : IncrementIntrinsics) {
TotalCounters += Inc->getNumCounters() * KSlots;
}
// Allocate single contiguous array
AllCounters = new GlobalVariable(
ArrayType::get(Int64Ty, TotalCounters), ...);
// Second pass: assign offsets to each function
size_t Offset = 0;
for (auto &Inc : IncrementIntrinsics) {
FunctionOffsets[Inc->getName()] = Offset;
Offset += Inc->getNumCounters() * KSlots;
}
}
// Counter access uses offset into contiguous array
Value *getCounterAddress(InstrProfCntrInstBase *I) {
size_t FuncOffset = FunctionOffsets[I->getName()];
size_t CounterIdx = FuncOffset + I->getIndex() * KSlots + Slot;
return GEP(AllCounters, CounterIdx);
}
Profile matching across builds:
The contiguous allocation is an implementation detail for GPU section discovery and does not change the fundamental matching semantics:
- Matching is by function name/hash + counter count, not by GPU memory addresses.
- Offsets/pointers are relocated when writing the `.profraw` so the raw reader can recover the intended counter ranges.
- Source changes are handled normally – if a function’s CFG changes, its hash changes and profile mismatch warnings apply as usual.
Multi-TU programs:
Each TU produces its own CUID-suffixed device global __llvm_offload_prf_<CUID>, and the host-side ctor injects a distinct host shadow variable per TU and registers it with the runtime. At profile collection time, the runtime iterates all registered shadow variables and resolves each TU’s device global via hipGetSymbolAddress.
TU1: shadow var #0 -> device `__llvm_offload_prf_<CUID1>` -> `__profc_all_<CUID1>`, `__profd_all_<CUID1>`, ...
TU2: shadow var #1 -> device `__llvm_offload_prf_<CUID2>` -> `__profc_all_<CUID2>`, `__profd_all_<CUID2>`, ...
Each TU’s profile is written to a separate file with a TU index suffix (e.g., .0.profraw, .1.profraw), which are then merged with llvm-profdata merge.
CUID (Compilation Unit ID):
For HIP, LLVM first tries to extract a TU-unique CUID from a module global named like __hip_cuid_<hash>. If none is present, LLVM falls back to hashing the module name. This CUID is used to suffix per-TU section names (and a few per-TU symbol names) so they remain distinct across linked TUs.
The implementation includes “keep-alive” techniques so the symbols are emitted/retained by the linker.
3) LLVM instrumentation: host-side registration of __llvm_offload_prf_<CUID>
On the host side (i.e., not the GPU module), LLVM injects code in __hip_module_ctor to:
- call `__hipRegisterVar(...)` to register a host shadow variable for the device global `__llvm_offload_prf_<CUID>`
- call `__llvm_profile_hip_register_shadow_variable(host_shadow_ptr)`
The shadow-variable mechanism allows the profiling runtime to resolve the device global via:
hipGetSymbolAddress(&dev_ptr, host_shadow_ptr)
Additional: host-side registration of per-TU section symbols (HIP memcpy safety)
To avoid any HSA dependency and keep device->host copies HIP-only, the runtime must ensure
the device section pointers are tracked by CLR. This is achieved by:
- Creating/registering section shadow variables for the per-TU symbols: `__profc_all_<CUID>`, `__profd_all_<CUID>`, `__profu_all_<CUID>`, and `__llvm_prf_names_<CUID>`
- Registering each shadow variable with HIP via `__hipRegisterVar(...)`
- Registering each shadow variable with compiler-rt via `__llvm_profile_hip_register_section_shadow_variable(host_shadow_ptr)`
At collection time, compiler-rt calls hipGetSymbolAddress on these section shadow variables to
pre-register the underlying device allocations with CLR before using hipMemcpy for section copies.
4) Counter update lowering on AMDGPU
The AMDGPU lowering strategy is wave-aggregated:
- elect a leader lane per wave,
- compute `popcount(activeMask)`,
- update the counter by `step * numActive` using:
  - atomic leader mode (default)
  - an alternate sampling/overflow mode (legacy/debug experimentation)
This reduces overhead vs. naive per-lane atomic increments.
Runtime-side design (compiler-rt)
1) Dynamic HIP symbol resolution
The ROCm-specific profiling runtime (InstrProfilingPlatformROCm.c) loads HIP at runtime:
- `dlopen("libamdhip64.so")`
- `dlsym()` for: `hipGetSymbolAddress`, `hipMemcpy`, `hipModuleGetGlobal`, and (optionally) `hipMemcpyFromSymbol`
This keeps compiler-rt decoupled from linking directly against HIP.
2) Static-linked collection via shadow variables
The runtime maintains a list of registered shadow variables:
__llvm_profile_hip_register_shadow_variable(void *ptr)
On collection:
- for each shadow variable, resolve the device address via `hipGetSymbolAddress`
- call `ProcessDeviceOffloadPrf(device_ptr)`
Section symbol pre-registration (HIP memcpy-only)
In addition, the runtime maintains a list of section shadow variables:
__llvm_profile_hip_register_section_shadow_variable(void *ptr)
Before copying section payloads, the runtime calls hipGetSymbolAddress() on each of these
shadow variables to ensure CLR tracks the underlying device allocations.
3) Dynamic module support
The runtime tracks dynamically loaded HIP modules:
- `__llvm_profile_hip_register_dynamic_module(int rc, void **hipModuleOutPtr)`
  - uses `hipModuleGetGlobal(..., "__llvm_offload_prf_<CUID>")` to locate the module’s profile global
- `__llvm_profile_hip_unregister_dynamic_module(void *hipModule)`
  - triggers `ProcessDeviceOffloadPrf(...)` at unload time

The LLVM instrumentation inserts these calls adjacent to:
- `hipModuleLoad*` (after successful load)
- `hipModuleUnload` (before unload)
4) Writing the device .profraw
ProcessDeviceOffloadPrf():
- copies the 8 pointers in `__llvm_offload_prf_<CUID>` device->host via `hipMemcpy`
- computes section sizes
- pre-registers per-TU section symbols via `hipGetSymbolAddress` (side effect: CLR memory tracking)
- copies sections device->host via `hipMemcpy` (HIP-only; no HSA APIs)
- writes a `.profraw` file with:
  - a correct `__llvm_profile_header`
  - relocated `__llvm_profile_data.CounterPtr` fields so they are valid offsets within the written file layout
Filename convention
The device writer uses the base filename from __llvm_profile_get_filename() (usually derived from LLVM_PROFILE_FILE) and inserts:
.amdgcn-amd-amdhsa.<TUIndex>
Each TU writes to a separate file with a TU index suffix to support multi-TU programs.
Example (single-TU):
- base: `default.profraw`
- device: `default.amdgcn-amd-amdhsa.0.profraw`

Example (multi-TU with 2 TUs):
- base: `default.profraw`
- TU 0: `default.amdgcn-amd-amdhsa.0.profraw`
- TU 1: `default.amdgcn-amd-amdhsa.1.profraw`
Use llvm-profdata merge to combine profiles from all TUs.
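The suffix insertion can be sketched as follows (hypothetical helper; for simplicity it assumes the base filename ends in `.profraw`, which a real implementation would not rely on):

```cpp
#include <cassert>
#include <string>

// Sketch of the device filename derivation: insert
// ".amdgcn-amd-amdhsa.<TUIndex>" before the ".profraw" extension.
// Assumes the base name ends in ".profraw" (illustrative only).
inline std::string deviceProfileName(const std::string &base,
                                     unsigned tuIndex) {
  const std::string ext = ".profraw";
  std::string stem = base.substr(0, base.size() - ext.size());
  return stem + ".amdgcn-amd-amdhsa." + std::to_string(tuIndex) + ext;
}
```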
CLR/HIP runtime interactions (no patches required)
The implementation uses only stock HIP APIs – no CLR patches are needed.
1) Why hipGetSymbolAddress is central
The profiling runtime relies on:
hipGetSymbolAddress(&device_ptr, host_shadow_ptr)
This requires that the host shadow variable be registered via __hipRegisterVar(...) during fatbin registration, which is why the LLVM host-side ctor injection is required. This uses standard HIP symbol registration available in all ROCm versions.
2) Verified with stock CLR
E2E tests pass with stock ROCm CLR (tested with a stock ROCm Docker image). No special environment variables or CLR patches are required.
Environment variables / debug knobs
- `LLVM_PROFILE_FILE`: base profile output filename template (host); device output is derived from this.
- `LLVM_PROFILE_VERBOSE=1`: enables verbose logging in the ROCm profiling runtime.
Profile Reading: NumCounters Expansion
When reading GPU device profraw files, the reader must account for the counter expansion:
The Dual Representation
The numeric values below are an example for a kernel/function with B = 5 base counters (roughly “basic blocks”) and the default K = 256 offload profiling slots:

```
counters in file = B * K,  and  NumOffloadProfilingThreads = K - 1
```
| Field | Stored Value | Meaning |
|---|---|---|
| `Data->NumCounters` | 5 | Example base counter count B (used for profile matching) |
| `Data->NumOffloadProfilingThreads` | 255 | Example K − 1, where K is slots-per-counter (default K = 256) |
| Actual counters in file | 1280 | Example B × K = 5 × 256 |
Reader Implementation (InstrProfReader.cpp)
The readRawCounts() function must expand NumCounters when reading GPU profiles:
template <class IntPtrT>
Error RawInstrProfReader<IntPtrT>::readRawCounts(InstrProfRecord &Record) {
uint32_t NumCounters = swap(Data->NumCounters);
if (NumCounters == 0)
return error(instrprof_error::malformed, "number of counters is zero");
// For GPU profiles with per-slot counters, the actual number of counter
// entries in the file is NumCounters * (NumOffloadProfilingThreads + 1).
uint16_t NumOffloadThreads = swap(Data->NumOffloadProfilingThreads);
if (NumOffloadThreads > 0)
NumCounters *= (NumOffloadThreads + 1);
// ... read NumCounters values from file
}
Why this is necessary:
- `createDataVariable()` stores `NumCounters = B` (base count) for profile matching
- The actual counter array has B × K slots for concurrent wave updates (default K = 256 on AMDGPU)
- Without this expansion, the reader would only read B counters, leaving B × K − B values unread
- This would corrupt subsequent data entries and crash during profile processing
Profile Format Version 14
This RFC introduces Version 14 of the indexed profile format (bumped from Version 13).
Note on raw .profraw format versioning: the raw format version remains INSTR_PROF_RAW_VERSION = 10. The new NumOffloadProfilingThreads field is placed to preserve the overall __llvm_profile_data record size (it occupies previously-unused alignment/padding space between NumValueSites[] and NumBitmapBytes). This avoids ABI/layout churn, but correct interpretation of offload profiles requires a toolchain that understands the new semantics (counter expansion + reduction and uniformity gating).
New Fields in Version 14
| Field | Location | Type | Purpose |
|---|---|---|---|
| `NumOffloadProfilingThreads` | `__llvm_profile_data` struct | `uint16_t` | Slots-per-counter minus one (255 for the default K = 256 on AMDGPU) |
| `UniformityBits` | Indexed profile record | `uint8_t[]` | 1 bit per block indicating uniform (1) or divergent (0) |
Data Record Layout Change
The INSTR_PROF_DATA macro in InstrProfData.inc adds a new field after NumValueSites:
INSTR_PROF_DATA(const uint16_t, llvm::Type::getInt16Ty(Ctx),
NumOffloadProfilingThreads,
ConstantInt::get(llvm::Type::getInt16Ty(Ctx),
NumOffloadProfilingThreadsVal))
This field is stored in the raw profile (__llvm_profile_data entries) and used by the reader to expand counters correctly.
Indexed Profile Record Extension
When writing indexed profiles, Version 14 adds UniformityBits after BitmapBytes:
[FuncHash: 8 bytes]
[NumCounters: 8 bytes]
[Counts: NumCounters x 8 bytes]
[NumBitmapBytes: 8 bytes]
[BitmapBytes: padded to 8-byte alignment]
[NumUniformityBits: 8 bytes] // NEW in Version 14
[UniformityBits: padded to 8-byte alignment] // NEW in Version 14
[ValueProfData: variable]
Backward Compatibility
| Scenario | Behavior |
|---|---|
| Old reader, new profile | Recognizes Version 14 as unknown, may reject or skip |
| New reader, old profile | Works correctly; UniformityBits not present, defaults to empty |
| `--write-prev-version` | Writes Version 13 format, skipping UniformityBits for compatibility with older toolchains |
New Section Kind
Added IPSK_ucnts section for uniform counters:
INSTR_PROF_SECT_ENTRY(IPSK_ucnts,
INSTR_PROF_QUOTE(INSTR_PROF_UCNTS_COMMON),
INSTR_PROF_UCNTS_COFF, "__DATA,")
Section names:
- ELF: `__llvm_prf_ucnts`
- COFF: `.lprfuc$M`
Known Limitations
- Performance validation scope: The performance results in this RFC come from synthetic microbenchmarks designed to stress register spilling. Real-world application performance will vary depending on register pressure, branch patterns, and other kernel characteristics. Validation on production HIP applications (e.g., rocBLAS, MIGraphX) is future work.
- Dynamic module multi-TU: Dynamic modules loaded via `hipModuleLoad` use `hipModuleGetGlobal` to find the profile symbol. The current implementation looks for a fixed pattern; multi-TU dynamic modules would require symbol enumeration (future enhancement).
- Binary IDs: the device `.profraw` writer currently sets `BinaryIdsSize = 0` (not supported yet).
- Value profiling: not supported for device in this prototype (header fields set to 0).
- Dynamic module robustness: dynamic modules are expected to be collected at unload; a warning is emitted if unload processing is missed.
- Partial wave handling: Partial waves (last wave in dispatch) are conservatively marked as divergent.
- Wave32-only uniformity check (current): the current instrumentation uses wave32-specific intrinsics/masks (e.g. `llvm.amdgcn.ballot.i32` and the `0xFFFFFFFF` full-wave mask). Wave64 support is future work.
- Uniformity-to-MBB mapping (current): SpillPlacement currently uses an approximation (`ProfileIdx = MBBNum % UniformityStr.size()`) because block structure can change after IR profiling (e.g. during ISel).
Verification checklist
Pipeline sanity
- Run a known offload-PGO workload / microbenchmark with:
  - optionally `LLVM_PROFILE_VERBOSE=1` for debug output
- Confirm both host and device `.profraw` files are created.
- Merge:

  ```
  llvm-profdata merge -o <out>.profdata <in1>.profraw [more...]
  ```

- Rebuild with `-fprofile-use=<out>.profdata` and confirm:
  - device compilation consumes `<out>.amdgcn-amd-amdhsa.profdata` (driver rewrite).
E2E Verification
Example end-to-end output from a synthetic microbenchmark (designed to stress register spilling):
========================================
PGO Spill E2E Test (high pressure)
========================================
LLVM: <your clang/llvm-profdata toolchain>
Arch: <your target GPU arch>
N=262144, Iterations=10000, Runs=5
[4/7] Merging profile...
Found .unifcnts files (uniformity detection enabled)
NumCounters: 1280, NumOffloadProfilingThreads: 255 # example numbers
_Z15spill_kernel_v3Pdii: 5/5 blocks uniform (from .unifcnts) # example
InstrProfRecord::merge: reduced 1280 slots to 5 counters # example reduction
[7/7] Benchmarking...
Baseline: XXX ms
PGO: YYY ms
Change: ZZ% FASTER/SLOWER
Key verification points (pipeline correctness, not specific speedup values):
- `NumCounters: 1280` – example showing the reader expanded 5 × 256 slots
- `5/5 blocks uniform` – example uniformity output
- `reduced 1280 slots to 5 counters` – example merge reduction
- PGO build completes successfully with profile consumption
Experimental Validation
Important caveat: The performance results in this section come from synthetic microbenchmarks specifically designed to stress register spilling and validate the uniformity gating mechanism. These results demonstrate that the approach works for its intended purpose, but should not be interpreted as representative of general HIP application performance. Real-world performance will depend on application-specific factors including register pressure, branch patterns, and kernel characteristics.
Microbenchmark Design
The validation microbenchmarks were designed to isolate the spill placement effect by:
- Forcing register spills via tight VGPR limits (`__attribute__((amdgpu_num_vgpr(16)))`)
- Creating symmetric hot/cold paths with equal register pressure (16 VGPRs each, 32 total needed)
- Varying branch uniformity to test uniform vs. divergent behavior
This design intentionally creates worst-case conditions for the problem we’re solving – kernels where spill placement significantly impacts performance.
Observed Results (Synthetic Benchmarks)
The following results were observed on the synthetic microbenchmarks:
| Test Case | PGO Effect | Notes |
|---|---|---|
| Uniform branches | ~12% faster | All threads spill together → good coalescing |
| Divergent branches (with gating) | ~53-67% faster | Gating prevents regression; other PGO benefits apply |
| Without uniformity gating (IR test) | ~3.7x slower | Demonstrates the problem this RFC addresses |
The “3.7x slower” result without gating validates why uniformity detection is necessary – it shows the regression that can occur when CPU PGO assumptions are applied to divergent GPU code.
Why These Results May Not Generalize
- Register pressure is artificial – real kernels may not have such tight VGPR limits
- Branch structure is symmetric – real kernels often have asymmetric hot/cold paths
- Spill-dominated performance – real kernels have other bottlenecks (memory bandwidth, compute, etc.)
Uniformity Definitions
| Level | Definition | PGO Safe? |
|---|---|---|
| Wave-uniform | All threads in wave take same path (waves may differ) | Yes |
| Workgroup-uniform | All waves in workgroup take same path | Yes |
| Grid-uniform | All threads in grid take same path | Yes |
| Divergent | Threads within wave take different paths | No |
Wave-uniformity is the critical threshold – if all threads in a wave agree, memory coalescing is good regardless of what other waves do.
Note on Counter-Based Heuristics (Design Rationale)
An earlier approach considered using existing per-slot counters to detect uniformity. The AMDGPU counter lowering tracks active lanes:
counter[slot] += step * NumActive // NumActive = popcount(activeMask)
Naive heuristic: If a block’s per-slot counter deltas are always multiples of WaveSize (32 or 64), then each entry likely executed with a full wave active.
// Detection pseudocode
bool isWaveUniform(ArrayRef<uint64_t> slotCounters, unsigned waveSize) {
for (uint64_t count : slotCounters) {
if (count != 0 && count % waveSize != 0)
return false; // Partial wave execution detected
}
return true;
}
Uniformity-Aware PGO (Implemented)
Problem Statement
The counter-sum-based heuristic above is not robust for divergence detection because per-slot counters aggregate over all iterations. This can mask divergence (e.g., partial-wave entries can sum to a multiple of WaveSize). Therefore, we treat it as a fallback/approximation only and implement a dedicated uniform-entry signal.
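A small host-side sketch (the masks and wave size are hypothetical examples, and `looksWaveUniform` merely restates the heuristic above) shows how aggregation can mask divergence:

```cpp
#include <cstdint>
#include <vector>

// Restatement of the naive heuristic: a slot looks "uniform" if its
// accumulated count is a multiple of the wave size.
bool looksWaveUniform(const std::vector<uint64_t> &slotCounters,
                      unsigned waveSize) {
  for (uint64_t count : slotCounters)
    if (count != 0 && count % waveSize != 0)
      return false;
  return true;
}

// Simulate two block entries on a wave32 target, each with only half the
// lanes active (hypothetical masks).
uint64_t simulateMaskedSum() {
  uint32_t masks[] = {0x0000FFFFu, 0xFFFF0000u}; // two partial-wave entries
  uint64_t counter = 0;
  for (uint32_t mask : masks)
    counter += __builtin_popcount(mask); // counter[slot] += NumActive
  return counter; // 16 + 16 = 32: a wave-size multiple, despite divergence
}
```

The heuristic accepts the resulting count of 32 as "uniform" even though neither entry was a full wave, which is exactly why a dedicated uniform-entry signal is needed.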
Solution: Block-Level Divergence Tracking
We implemented a parallel counter array (UniformCounters) that tracks uniform block entries – incremented only when all active lanes enter a block together.
Implementation Overview
1. Instrumentation (InstrProfiling.cpp)
// At each block entry, check if execution is uniform
Value *ActiveMask = Builder.CreateCall(BallotI32Fn, {Builder.getTrue()});
Value *FullWaveMask = ConstantInt::get(Int32Ty, 0xFFFFFFFF);
Value *IsUniform = Builder.CreateICmpEQ(ActiveMask, FullWaveMask);
// Always increment regular counter
counter[slot] += step * NumActive;
// Conditionally increment uniform counter
if (IsUniform)
  uniform_counter[slot] += step * NumActive;
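A host-side simulation of this dual-counter update (wave32 assumed; the struct and function names are illustrative, not from the implementation) shows how the uniform counter exposes the divergence that the regular counter hides:

```cpp
#include <cstdint>

struct Counters {
  uint64_t regular = 0;
  uint64_t uniform = 0;
};

// Mirror of the instrumentation: always bump the regular counter by the
// number of active lanes; bump the uniform counter only on full-wave entry.
void recordBlockEntry(Counters &c, uint32_t activeMask) {
  uint32_t numActive = __builtin_popcount(activeMask);
  c.regular += numActive;          // counter[slot] += step * NumActive
  if (activeMask == 0xFFFFFFFFu)   // ballot == full wave => uniform entry
    c.uniform += numActive;        // uniform_counter[slot] += step * NumActive
}

// Two half-wave entries: the regular counter sums to a wave-size multiple
// (32), but the uniform counter stays 0, correctly flagging divergence.
Counters simulateDivergentBlock() {
  Counters c;
  recordBlockEntry(c, 0x0000FFFFu);
  recordBlockEntry(c, 0xFFFF0000u);
  return c;
}
```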
2. Profile Section (InstrProfData.inc)
Added new section kind for uniform counters:
INSTR_PROF_SECT_ENTRY(IPSK_ucnts, "__llvm_prf_ucnts", "__DATA,")
Extended __llvm_offload_prf_<CUID> structure from 6 to 8 pointers, and standardized
field order as begin pointers first, then end pointers (matches runtime):
struct __llvm_offload_prf_<CUID> {
void *cnts_begin;
void *data_begin;
void *names_begin;
void *ucnts_begin; // Uniform counters (NEW)
void *cnts_end;
void *data_end;
void *names_end;
void *ucnts_end;
};
3. Runtime (InstrProfilingPlatformROCm.c)
The runtime:
- pre-registers per-TU section symbols via `hipGetSymbolAddress` (CLR tracking)
- copies sections device->host via `hipMemcpy` (HIP-only; no HSA dependency)
- writes uniform counters to a separate `.unifcnts` file:
profile.123.amdgcn-amd-amdhsa.profraw # Regular profile
profile.123.amdgcn-amd-amdhsa.unifcnts # Uniform counters
.unifcnts file format:
| Field | Type | Description |
|---|---|---|
| Magic | uint64 | 0x55434E5450524F46 (“UCNTPROF”) |
| Version | uint64 | 1 |
| NumCounters | uint64 | Number of counter values |
| CountersSize | uint64 | Size in bytes |
| Counters | uint64[NumCounters] | Raw counter data |
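A minimal parser sketch for this header (assuming the four fields are written as native-endian uint64 values in the order shown; the struct and function names are illustrative, not from the implementation):

```cpp
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

struct UnifCntsFile {
  uint64_t Version;
  std::vector<uint64_t> Counters;
};

static const uint64_t kUCntMagic = 0x55434E5450524F46ULL; // "UCNTPROF"

// Parse a .unifcnts image from memory; returns std::nullopt on a bad magic,
// an inconsistent CountersSize, or a truncated counter payload.
std::optional<UnifCntsFile> parseUnifCnts(const uint8_t *data, size_t size) {
  uint64_t hdr[4]; // Magic, Version, NumCounters, CountersSize
  if (size < sizeof(hdr))
    return std::nullopt;
  std::memcpy(hdr, data, sizeof(hdr));
  if (hdr[0] != kUCntMagic || hdr[3] != hdr[2] * sizeof(uint64_t) ||
      size < sizeof(hdr) + hdr[3])
    return std::nullopt;
  UnifCntsFile out;
  out.Version = hdr[1];
  out.Counters.resize(hdr[2]);
  std::memcpy(out.Counters.data(), data + sizeof(hdr), hdr[3]);
  return out;
}
```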
4. Profile Merge (llvm-profdata.cpp)
During merge, the tool reads both files and computes per-block uniformity:
// Read .unifcnts file alongside .profraw
std::vector<uint64_t> UniformCounters;
readUniformCountersFile(ProfileFilename, UniformCounters);
// For each block, compute uniformity ratio
for (size_t BlockIdx = 0; BlockIdx < NumBlocks; ++BlockIdx) {
  uint64_t TotalCount = 0, UniformCount = 0;
  for (size_t Slot = 0; Slot < Slots; ++Slot) {
    TotalCount += Counts[BlockIdx * Slots + Slot];
    UniformCount += UniformCounters[BlockIdx * Slots + Slot];
  }
  // 90% threshold for uniformity (never-executed blocks default to uniform)
  bool IsUniform = (TotalCount == 0) ||
                   ((double)UniformCount / TotalCount >= 0.9);
  UniformityBits[BlockIdx] = IsUniform ? 1 : 0;
}
5. Compiler Consumption
Uniformity is propagated to SpillPlacement via function attribute:
// PGOInstrumentation.cpp (PGO-use phase)
F.addFnAttr("amdgpu-block-uniformity", "UUDUU"); // U=Uniform, D=Divergent
// SpillPlacement.cpp
BlockFrequency getAdjustedFrequency(unsigned MBBNum) {
// If -amdgpu-flatten-spill-frequency is enabled: flatten ALL blocks.
//
// Otherwise, if `amdgpu-block-uniformity` is present: flatten only divergent
// blocks and keep PGO frequency for uniform blocks.
//
// Note: current implementation approximates mapping from MBB -> profile block
// index via `ProfileIdx = MBBNum % UniformityStr.size()` due to ISEL block
// splitting/merging; this can be improved by carrying a stable mapping.
}
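A standalone sketch of the gating decision described above (the helper, its parameters, and the `Flattened` value are illustrative; the modulo mapping mirrors the stated approximation):

```cpp
#include <cstdint>
#include <string>

// Frequency used by SpillPlacement for one machine basic block.
// UniformityStr comes from the "amdgpu-block-uniformity" attribute,
// e.g. "UUDUU" (U = uniform, D = divergent). Values are illustrative.
uint64_t getAdjustedFrequency(unsigned MBBNum, uint64_t PGOFreq,
                              const std::string &UniformityStr,
                              bool FlattenAll) {
  const uint64_t Flattened = 1; // flat frequency: no hot/cold preference
  if (FlattenAll)               // -amdgpu-flatten-spill-frequency
    return Flattened;
  if (UniformityStr.empty())    // no uniformity info: keep PGO frequency
    return PGOFreq;
  // Approximate MBB -> profile block mapping (ISEL may split/merge blocks).
  size_t ProfileIdx = MBBNum % UniformityStr.size();
  return UniformityStr[ProfileIdx] == 'D' ? Flattened : PGOFreq;
}
```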
Test Results
| Kernel | Branch Condition | Uniformity Detection | Block Pattern |
|---|---|---|---|
| Uniform microbenchmark (example) | `i % 10000 == 0` | 5/5 blocks uniform | [U, U, U, U, U] |
| Divergent microbenchmark (example) | `(gid+i) % 10000 == 0` | 4/5 blocks uniform | [D, D, U, U, U] |
Uniformity Detection Justification
Why Uniformity Gating is Necessary
The uniformity detection mechanism exists to prevent a specific failure mode: PGO-guided spill placement causing regressions on divergent branches.
The failure chain:
- PGO tells regalloc “cold path is rare”
- Regalloc places spills preferentially in cold path
- With divergent branches, cold path still executes (partial waves in SIMT)
- Partial wave scratch access = poor memory coalescing
- Memory latency dominates, causing slowdown instead of speedup
Synthetic microbenchmarks (designed to isolate this effect using IR-level branch_weights metadata) confirm that this regression can be severe on spill-heavy kernels with divergent branches.
Uniformity gating (-amdgpu-flatten-spill-frequency) addresses this by:
- Flattening spill placement frequencies for divergent blocks
- Preserving PGO frequencies for other optimizations (scheduling, layout, etc.)
- Allowing PGO benefits on uniform blocks where spill placement is safe
E2E Test Design
The e2e microbenchmarks (internal tracking) use kernels designed to force register spills via tight VGPR limits:
__attribute__((amdgpu_num_vgpr(16))) // Force spills
__global__ void spill_kernel(...) {
// Hot path: 8 doubles = 16 VGPRs
double h0, h1, h2, h3, h4, h5, h6, h7;
// Cold path: 8 doubles = 16 VGPRs
double c0, c1, c2, c3, c4, c5, c6, c7;
for (int i = 0; i < iterations; i++) {
if (CONDITION) {
// HOT PATH (99.99%): update h0-h7
} else {
// COLD PATH (0.01%): update c0-c7
}
}
}
Key design points:
- VGPR limit of 16 cannot fit both paths (16 + 16 = 32 needed) → must spill
- Symmetric structure: both paths have equal register pressure
- Cold path frequency: 1 in 10000 iterations
Branch conditions determine uniformity:
| Test | Branch Condition | Uniformity | Why |
|---|---|---|---|
| Uniform v3 | `i % 10000 != 0` | All threads same | Loop counter `i` is identical for all threads |
| Divergent v3 | `(gid+i) % 10000 != 0` | Threads differ | `gid` (thread ID) makes condition thread-dependent |
| Divergent HOT v3 | `(gid+i) % 100 != 0` | Both paths divergent | Higher cold frequency → both blocks entered divergently |
These tests verify that:
- The uniformity detection correctly identifies uniform vs. divergent blocks
- PGO optimization applies successfully without regression on divergent code
- The full pipeline (profile collection, merge, consumption) works end-to-end
Why Divergent Kernels Can Benefit from PGO
Even when SpillPlacement frequencies are flattened for divergent blocks, PGO can still provide benefits.
This is because uniformity gating is surgical – it only affects SpillPlacement:
| Pass | Uses PGO Frequencies? | Gated by Uniformity? |
|---|---|---|
| SpillPlacement | Yes | YES (flattened for divergent blocks) |
| MachineScheduler | Yes | No |
| BasicBlock Layout | Yes | No |
| Loop Transformations | Yes | No |
| Inlining | Yes | No |
| s_clause Grouping | Yes | No |
Benefits still realized for divergent code:
- Better instruction scheduling – MachineScheduler uses PGO frequencies
- Better code layout – hot code placed in `.text.hot` section (cache benefits)
- Better clause grouping – `s_clause` sizes improved (0x4 vs 0x3 observed)
- Branch probability effects – compiler knows which path is likely
Heuristic Summary
The current approach uses a simple, safe heuristic:
for each basic block:
if block is DIVERGENT (detected from .unifcnts):
flatten frequency for SpillPlacement
else:
use actual PGO frequency
This is sufficient because:
- Symmetric kernels (balanced pressure): Flattening has no effect (frequencies already similar)
- Asymmetric kernels (cold has higher pressure): Flattening prevents catastrophic regression
- All kernels: Other PGO passes still benefit from frequencies
More sophisticated pressure-aware heuristics were considered but deemed unnecessary:
- The simple heuristic prevents the worst-case regression
- Adding complexity provides marginal benefit
- Safe default is more important for initial upstream
Future Work
- Enhancements
  - Support wave64 targets (gfx9)
  - Handle partial waves correctly (capture initial mask at kernel entry)
  - Integrate with static uniformity analysis as fallback
- Profile format improvements
  - Consider merging `.unifcnts` into the main `.profraw` format
  - Add uniformity to the indexed profile format
References
LLVM Pull Request
Code Entrypoints
LLVM/Clang:
- `clang/lib/Driver/ToolChains/HIPAMD.cpp` – profile-use filename rewrite
- `llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp` – AMDGPU lowering, uniform counters, `__llvm_offload_prf_<CUID>`
- `llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp` – consumes `UniformityBits` and sets `amdgpu-block-uniformity`
- `llvm/lib/CodeGen/SpillPlacement.cpp` – uniformity-gated spill placement (`-amdgpu-flatten-spill-frequency`)
- `llvm/include/llvm/ProfileData/InstrProfData.inc` – profile section definitions (`IPSK_ucnts`)
- `llvm/include/llvm/ProfileData/InstrProf.h` – `UniformityBits` / `NumOffloadProfilingThreads` fields + `isBlockUniform()`
- `llvm/lib/ProfileData/InstrProfReader.cpp` – raw reader expands counters using `NumOffloadProfilingThreads`
- `llvm/lib/ProfileData/InstrProfWriter.cpp` – indexed writer serializes `UniformityBits` (omitted under `--write-prev-version`)
- `llvm/lib/ProfileData/InstrProf.cpp` – merge reduces per-slot counters and carries `UniformityBits`
- `llvm/tools/llvm-profdata/llvm-profdata.cpp` – `.unifcnts` reading and uniformity computation
compiler-rt:
- `compiler-rt/lib/profile/InstrProfilingPlatformROCm.c` – HIP device collection, `.unifcnts` writing, TU-indexed filenames
- `compiler-rt/lib/profile/InstrProfilingFile.c` – calls `__llvm_profile_hip_collect_device_data()`
CLR (HIP Runtime) – no changes required:
- Uses stock `hipGetSymbolAddress` and `hipMemcpy` APIs
- Verified working with stock ROCm CLR