This is an RFC for ClangIR. It deals with address spaces, an important feature for both languages and targets, with a wide range of users. We would like feedback from a broad audience of address space users within the Clang community.
Authors: @jopperm @v.lomuller @seven-mile
Glossary
AS, addrspace
Abbreviations for “address space”.
default address space (of a language)
Most languages do not care about address spaces. Even the languages that do care define a default behavior for the “absence of AS qualifier”, which usually selects one AS as the default.
Background
There are several abstractions of address spaces adopted by Clang FE, LLVM IR, and MLIR GPU Dialect.
addrspace(x) construct in LLVM IR
This is simply an integer attached to other IR constructs. LLVM IR carries most target-specific semantics on this integer, interpreted by the backend. The goal is to make few assumptions about it while not missing general optimization opportunities that every target would benefit from, like null checks.
There is a great talk from a past LLVM Dev Mtg that covers every detail you need.
Address space design in MLIR gpu dialect
This models the common hierarchical memory spaces of GPGPU: Private, WorkGroup, and Global. All GPGPU-oriented languages have equivalents of these.
clang::LangAS
In Clang FE, address spaces are represented as an enumeration clang::LangAS. It was originally used to encode the address space qualifiers from the source code.
It includes:
- A Default case, representing essentially “absence of AS qualifier”. It can also have different semantics depending on the context.
- Many language-specific address spaces like opencl_local and cuda_shared.
- A target-specific region starting from FirstTargetAddressSpace, which fits the need of __attribute__((address_space(x))) and provides an “escape hatch” for various needs from special targets.
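The layout described above can be sketched as follows. This is a reduced stand-in, not the real clang::LangAS: the enum lists only three of the cases, and the helper names isTargetAS, fromTargetAS, and toTargetAS are made up for this illustration. It only shows how a target-specific region starting at FirstTargetAddressSpace can carry the x from __attribute__((address_space(x))).

```cpp
#include <cassert>
#include <cstdint>

// Reduced stand-in for clang::LangAS; the real enum has many more cases.
enum class LangAS : std::uint32_t {
  Default = 0,             // "absence of AS qualifier"
  opencl_local,            // language-specific cases
  cuda_shared,
  FirstTargetAddressSpace  // start of the target-specific region
};

// Hypothetical helper: true if the value lies in the target-specific region.
constexpr bool isTargetAS(LangAS as) {
  return as >= LangAS::FirstTargetAddressSpace;
}

// Hypothetical helper: encode address_space(x) as an offset into that region.
constexpr LangAS fromTargetAS(std::uint32_t x) {
  return static_cast<LangAS>(
      static_cast<std::uint32_t>(LangAS::FirstTargetAddressSpace) + x);
}

// Hypothetical helper: recover x from a value in the target-specific region.
constexpr std::uint32_t toTargetAS(LangAS as) {
  return static_cast<std::uint32_t>(as) -
         static_cast<std::uint32_t>(LangAS::FirstTargetAddressSpace);
}
```

Because the target region is just an offset, any integer address space can pass through the enum without a dedicated case.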
The design works well with clang Sema. However, it causes two main issues when lowering to CIR:
1. Different treatments of default AS for OpenCL C / CUDA / SYCL
Most addrspace-aware languages map their default AS to LangAS::Default, which is ideal. For example, there are no cuda_generic or sycl_generic cases in LangAS because they are both actually LangAS::Default, a language-agnostic case.
However, the default AS of OpenCL C switched from private to generic starting with CL 2.0. Besides, the pointers to automatic variables need a special treatment. Consider the following example:
void func() {
int *foo; // <- This is a pointer to generic AS
int bar;
&bar; // <- But this is a pointer to private AS
}
As a result, the frontend does not use the Default case but adds both opencl_private and opencl_generic. When deducing addrspace, both factors, the CL-std version and corner case for auto vars, are taken into consideration to attach the correct address space qualifier to pointer types.
OpenCL gets it done in Sema, but CUDA and SYCL choose to resolve the correct AS in CodeGen, which means ClangIR will have to implement the same resolution in CIRGen.
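The two factors in the OpenCL deduction can be written down as a small decision function. This is only a sketch of the rule as stated above, not the logic in Clang Sema: the names OpenCLAS and deduceOpenCLDefaultAS are hypothetical, the version encoding (200 for CL 2.0) is an assumption of this example, and everything else the real deduction considers is ignored.

```cpp
#include <cassert>

// Hypothetical, reduced set of OpenCL address spaces.
enum class OpenCLAS { Private, Generic };

// Hypothetical helper: picks the AS when no qualifier is written.
// clVersion is assumed to be encoded as 100 * major + 10 * minor.
// isAutoVarAddress is true when the pointer is the address of an
// automatic variable, which always lives in private memory.
constexpr OpenCLAS deduceOpenCLDefaultAS(int clVersion, bool isAutoVarAddress) {
  if (isAutoVarAddress)
    return OpenCLAS::Private;
  return clVersion >= 200 ? OpenCLAS::Generic : OpenCLAS::Private;
}
```

In terms of the example above, foo corresponds to deduceOpenCLDefaultAS(200, false) and &bar to deduceOpenCLDefaultAS(200, true).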
2. Duplication of addrspace cases that are actually the same
We don’t need to distinguish between opencl_local and cuda_shared. These language-specific cases produce extra noise in ClangIR. Moreover, when lowering to the MLIR std gpu dialect, we eventually have to merge these anyway.
Target-specific alloca address space
The TargetCodeGenInfo of the original Clang CodeGen provides a virtual method called getASTAllocaAddressSpace(). For most targets, it returns LangAS::Default, meaning the allocated variable does not carry any address space qualifier. For SPIR and AMDGPU, this method returns the alloca address space (an integer) encoded in the target data layout to align semantics.
This target-specific aspect needs addressing when we split the original Clang CodeGen into a long pipeline in ClangIR. We should have two representations: one before the “TargetLowering” pass and another after it.
Our Approach
We propose a unified address space design for ClangIR to model what clang::LangAS aims to represent, but in a clear and extensible way.
Overview
The conversion pipeline of the proposed AS can be illustrated as follows:
                Clang to CIR AS mapping   CIR to LLVM AS mapping
                           v                        v
Clang language-specific AS -> CIR Unified AS (ours) -> LLVM Target-specific AS
            |                                                     ^
            |                                                     |
            -------------------- "target" case --------------------
Merge all duplications as much as possible
If several language-specific cases are actually language-agnostic, we merge them into a single case. For example:
opencl_local, sycl_local, cuda_shared → gpu_shared
If there is no duplication in some language-specific cases, they are still acceptable, e.g., wasm_funcref.
Redefine Default case semantics
In ClangIR, we don’t really care about questions like “is this address space qualifier present?”, because answering them requires us to further reason about the actual semantics of its absence. We should instead define each such semantic as an individual enum case.
Add a special target-agnostic case alloca
To provide precise semantics for alloca pointers and defer the target-specific bit from CIRGen, we add a special enum case encoded as alloca.
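To make the deferral concrete, here is a minimal sketch of how a later target lowering step could resolve the alloca case. All names (UnifiedAS, TargetDesc, lowerToTargetAS) are hypothetical and not part of ClangIR; the point is only that CIRGen emits the target-agnostic case and the concrete integer is chosen once the target is known.

```cpp
#include <cassert>

// Hypothetical subset of the unified address space before target lowering.
enum class UnifiedAS { GpuPrivate, GpuGlobal, Alloca };

// Hypothetical target description. allocaAS stands for the alloca address
// space recorded in the target data layout.
struct TargetDesc {
  unsigned privateAS;
  unsigned globalAS;
  unsigned allocaAS;
};

// Resolve a unified case to the integer used after target lowering.
unsigned lowerToTargetAS(UnifiedAS as, const TargetDesc &target) {
  switch (as) {
  case UnifiedAS::GpuPrivate:
    return target.privateAS;
  case UnifiedAS::GpuGlobal:
    return target.globalAS;
  case UnifiedAS::Alloca:
    return target.allocaAS;  // the only place the data layout is consulted
  }
  return 0;  // not reached for the enum values listed above
}
```

A target whose allocas live in address space 0 and one whose allocas live in a dedicated private address space would share the same CIR up to this point.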
Keep the target-specific region as an “escape hatch”
This keeps the design generic enough to cover future needs. Note that the original Clang pipeline already does some of this; we aim to improve on it.
It will also help with incremental implementation, as discussed in the section “Implementation”.
Proposed final address space mapping
Here we propose a design of Clang to CIR AS mapping, which naturally gives the design of Unified AS and its conversion into LLVM AS.
Some entries in the mapping behave differently for different languages:
- For non-offloading languages, Default → None
- For CUDA and SYCL, Default → gpu_private or gpu_generic (depending on the result of resolution in CIRGen)
The remaining have a static definition:
- opencl_generic → gpu_generic
- opencl_global, sycl_global, opencl_global_host, sycl_global_host, opencl_global_device, cuda_device, sycl_global_device → gpu_global
- opencl_local, cuda_shared, sycl_local → gpu_local
- opencl_constant, cuda_constant → gpu_constant
- opencl_private, sycl_private → gpu_private
- target<x> → target<x>
- Extra address space for alloca: alloca
And some hints for the future design, if we ever get to it:
- hlsl_groupshared → gpu_local (it shares the same target AS with opencl_local, defined in DirectX.h)
- * → * (passthrough for Microsoft and WASM stuff)
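The Default handling and the static entries of the mapping can be read as a single switch. The sketch below is an illustration under assumptions, not the CIRGen implementation: the enums are reduced copies written for this example, LangKind and the defaultIsGeneric flag are hypothetical stand-ins for the language mode and for the result of the CIRGen resolution, and target<x>, alloca, and the future hints are left out.

```cpp
#include <cassert>

// Reduced copy of the LangAS cases named in the mapping.
enum class LangAS {
  Default,
  opencl_generic,
  opencl_global, sycl_global, opencl_global_host, sycl_global_host,
  opencl_global_device, cuda_device, sycl_global_device,
  opencl_local, cuda_shared, sycl_local,
  opencl_constant, cuda_constant,
  opencl_private, sycl_private
};

// Hypothetical spelling of the unified cases.
enum class UnifiedAS {
  None, GpuPrivate, GpuLocal, GpuGlobal, GpuConstant, GpuGeneric
};

// Hypothetical language classification, used only for the Default case.
enum class LangKind { NonOffloading, CudaOrSycl };

UnifiedAS toUnifiedAS(LangAS as, LangKind lang, bool defaultIsGeneric) {
  switch (as) {
  case LangAS::Default:
    if (lang == LangKind::NonOffloading)
      return UnifiedAS::None;
    // For CUDA and SYCL the choice depends on the resolution in CIRGen.
    return defaultIsGeneric ? UnifiedAS::GpuGeneric : UnifiedAS::GpuPrivate;
  case LangAS::opencl_generic:
    return UnifiedAS::GpuGeneric;
  case LangAS::opencl_global:
  case LangAS::sycl_global:
  case LangAS::opencl_global_host:
  case LangAS::sycl_global_host:
  case LangAS::opencl_global_device:
  case LangAS::cuda_device:
  case LangAS::sycl_global_device:
    return UnifiedAS::GpuGlobal;
  case LangAS::opencl_local:
  case LangAS::cuda_shared:
  case LangAS::sycl_local:
    return UnifiedAS::GpuLocal;
  case LangAS::opencl_constant:
  case LangAS::cuda_constant:
    return UnifiedAS::GpuConstant;
  case LangAS::opencl_private:
  case LangAS::sycl_private:
    return UnifiedAS::GpuPrivate;
  }
  return UnifiedAS::None;  // not reached for the cases listed above
}
```

Only the Default branch needs any context; every other entry is a pure table lookup.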
Attribute design
The implementation of this proposal is still based on the PR clangir#682, which is a one-to-one modeling from LangAS.
- Encodes the None case as a null attribute.
- Uses an integer-parameterized attribute AddressSpaceAttr to hide all implementation details and ensure memory efficiency.
- The conversions between LangAS and text-form CIR would be modified as proposed above.
Open Questions
Possibly better handling for alloca
The originally proposed individual alloca case probably yields extra address space conversions from itself to None or gpu_private, because LangAS::Default will actually be mapped to them.
Here is an alternative method for the design of alloca: rewrite both current getASTAllocaAddressSpace implementations (the default one and the one querying the data layout) to return the unified AS, which will be None and gpu_private respectively.
We are unsure which one is better: the individual alloca case is neutral but generates extra address space conversions, while the alternative is opinionated but needs no extra conversion.
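For discussion, the alternative could look roughly like this. The types below are hypothetical stand-ins and do not reflect the real TargetCodeGenInfo interface; the sketch only shows the two implementations returning a unified AS directly instead of a LangAS or a data layout integer.

```cpp
#include <cassert>

// Hypothetical subset of the unified address space.
enum class UnifiedAS { None, GpuPrivate };

// Hypothetical stand-in for the default target hook.
struct TargetCodeGenInfoSketch {
  virtual ~TargetCodeGenInfoSketch() = default;
  // Default implementation: allocas carry no address space.
  virtual UnifiedAS getAllocaAddressSpace() const { return UnifiedAS::None; }
};

// Hypothetical stand-in for targets that today query the data layout,
// such as SPIR and AMDGPU.
struct GpuTargetCodeGenInfoSketch : TargetCodeGenInfoSketch {
  UnifiedAS getAllocaAddressSpace() const override {
    return UnifiedAS::GpuPrivate;
  }
};
```

With this shape, an alloca pointer already has the same AS that Default maps to, so no conversion is needed, at the cost of baking the private-memory assumption into the hook.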