[RFC][ClangIR] Unified Address Space Design in ClangIR

This is an RFC for ClangIR. It deals with address spaces, an important feature for both languages and targets, with a wide range of users. We are seeking feedback from the broad range of address space users within the Clang community.

Authors: @jopperm @v.lomuller @seven-mile

Glossary

AS, addrspace

Abbreviations for “address space”.

default address space (of a language)

Most languages do not care about address spaces. Even those that do care define a behavior for the “absence of AS qualifier”, usually by selecting one AS as the default.

Background

There are several abstractions of address spaces adopted by the Clang frontend, LLVM IR, and the MLIR gpu dialect.

addrspace(x) construct in LLVM IR

This is simply an integer attached to other IR constructs. LLVM IR attaches most target-specific semantics to this integer, which is interpreted by the backend. The goal is to make as few assumptions about it as possible while not missing general optimization opportunities that every target would benefit from, such as null-pointer checks.

There is a great talk from a past LLVM Developers' Meeting that covers every detail you need.
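To make this concrete, here is a minimal sketch using Clang's generic address_space attribute, which lowers to addrspace(x) pointers in LLVM IR. The number 3 is arbitrary and chosen only for demonstration; its meaning is entirely target-defined:

// A pointer type into LLVM address space 3; what "3" means
// (e.g. shared memory on some GPU targets) is up to the backend.
typedef __attribute__((address_space(3))) int as3_int;

int load_from_as3(as3_int *p) {
  // This becomes a load through `ptr addrspace(3)` in LLVM IR;
  // target-independent passes must stay conservative about the integer.
  return *p;
}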

Address space design in MLIR gpu dialect

This models the common hierarchical memory spaces of GPGPU: Private, WorkGroup, and Global. All GPGPU-oriented languages have equivalents of these.
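For reference, the dialect exposes this three-level model as a small C++ enum. The following is a sketch only; the member names match the dialect's terminology, but the exact upstream definition and numeric values may differ:

#include <cstdint>

// Sketch of the MLIR gpu dialect's address space model.
enum class GPUAddressSpace : uint32_t {
  Global,    // device-visible memory
  Workgroup, // shared within a workgroup / thread block
  Private,   // per-thread memory
};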

clang::LangAS

In the Clang frontend, address spaces are represented by the enumeration clang::LangAS, originally introduced to encode the address space qualifiers from the source code (an abridged sketch is shown after the list below).

It includes:

  • A Default case, representing essentially “absence of AS qualifier”. It can also have different semantics depending on the context.
  • Many language-specific address spaces like opencl_local and cuda_shared.
  • A target-specific region starting from FirstTargetAddressSpace, which fits the need of __attribute__((address_space(x))) and provides an “escape hatch” for the various needs of special targets.
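For reference, an abridged sketch of the enum (the actual definition lives in clang/include/clang/Basic/AddressSpaces.h; only a subset of the members is shown here):

// Abridged sketch of clang::LangAS, for illustration only.
enum class LangAS : unsigned {
  Default = 0,             // "absence of AS qualifier"; context-dependent
  opencl_global,
  opencl_local,
  opencl_constant,
  opencl_private,
  opencl_generic,
  cuda_device,
  cuda_constant,
  cuda_shared,
  // ... more language-specific cases (SYCL, HLSL, WASM, ...) ...
  FirstTargetAddressSpace, // start of the target<x> "escape hatch" region
};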

The design works well with Clang’s Sema. However, it causes two main issues when lowering to CIR:

1. Different treatments of default AS for OpenCL C / CUDA / SYCL

Most addrspace-aware languages map their default AS to LangAS::Default, which is ideal. For example, there are no cuda_generic or sycl_generic cases in LangAS because they are both actually LangAS::Default, a language-agnostic case.

However, the default AS of OpenCL C switched from private to generic starting with OpenCL 2.0. Besides, pointers to automatic variables need special treatment. Consider the following example:

void func() {
  int *foo; // <- This is a pointer to generic AS
  int bar;
  &bar; // <- But this is a pointer to private AS
}

As a result, the frontend does not use the Default case but adds both opencl_private and opencl_generic. When deducing the address space, both factors (the CL standard version and the corner case for automatic variables) are taken into account to attach the correct address space qualifier to pointer types.

OpenCL gets this done in Sema, but CUDA and SYCL choose to resolve the correct AS in CodeGen, which means the corresponding logic will land in CIRGen in the future.

2. Duplication of addrspace cases that are actually the same

We don’t need to distinguish between opencl_local and cuda_shared. These language-specific cases produce extra noise in ClangIR. Moreover, when lowering to the upstream MLIR gpu dialect, we eventually have to merge them anyway.

Target-specific alloca address space

The TargetCodeGenInfo of the original Clang CodeGen provides a virtual method called getASTAllocaAddressSpace(). For most targets, it returns LangAS::Default, meaning the allocated variable does not carry any address space qualifier. For SPIR and AMDGPU, this method returns the alloca address space (an integer) encoded in the target data layout, to align with the target’s semantics.
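For illustration, the shape of this hook looks roughly as follows. This is a simplified sketch, not the verbatim upstream code; in particular, the real AMDGPU override queries the data layout rather than hardcoding the number:

#include "clang/Basic/AddressSpaces.h" // clang::LangAS, getLangASFromTargetAS

// Simplified sketch of the hook in Clang's TargetCodeGenInfo.
class TargetCodeGenInfo {
public:
  virtual ~TargetCodeGenInfo() = default;
  // Default behavior: automatic variables carry no AS qualifier.
  virtual clang::LangAS getASTAllocaAddressSpace() const {
    return clang::LangAS::Default;
  }
};

class AMDGPUTargetCodeGenInfo : public TargetCodeGenInfo {
public:
  // SPIR and AMDGPU instead report the alloca AS from the target data
  // layout, mapped into the target<x> region (5 is AMDGPU's private AS).
  clang::LangAS getASTAllocaAddressSpace() const override {
    return clang::getLangASFromTargetAS(/*TargetAS=*/5);
  }
};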

This target-specific aspect needs addressing when we split the original Clang CodeGen into a longer pipeline in ClangIR. We need two representations: one before the “TargetLowering” pass and another after it.

Our Approach

We propose a unified address space design for ClangIR to model what clang::LangAS aims to represent, but in a clear and extensible way.

Overview

The conversion pipeline of the proposed AS can be depicted as follows:

                Clang to CIR AS mapping  CIR to LLVM AS mapping
                           v                        v
Clang language-specific AS -> CIR Unified AS (ours) -> LLVM Target-specific AS
        |                                                      ^
        |                                                      |
        +-------------------- "target" case ------------------+

Merge duplicated cases as much as possible

If cases are language-agnostic duplicates of one another, we merge them into a single case. For example:

  • opencl_local, sycl_local, cuda_shared → gpu_local

Language-specific cases without duplication are still acceptable, e.g., wasm_funcref.

Redefine Default case semantics

In ClangIR, we don’t really care about questions like “is this address space qualifier present?”, because answering them requires us to further reason about the actual semantics of the qualifier’s absence. We should instead define that semantics as an individual enum case.

Add a special target-agnostic case alloca

To provide precise semantics for alloca pointers and defer the target-specific decision out of CIRGen, we add a special enum case spelled alloca.

Keep the target-specific region as an “escape hatch”

This keeps the design generic enough to cover future needs. Note that the original Clang pipeline is already doing some of this. We are making it better.

It will also help with incremental implementation, as discussed in the section “Implementation”.

Proposed final address space mapping

Here we propose a design for the Clang to CIR AS mapping, which naturally determines the design of the unified AS and its conversion into LLVM AS.

Some entries in the mapping behave differently for different languages:

  • For non-offloading languages, Default → None
  • For CUDA and SYCL, Default → gpu_private or gpu_generic (depending on the result of resolution in CIRGen)

The remaining entries have a static definition:

  • opencl_generic → gpu_generic
  • opencl_global, sycl_global, opencl_global_host, sycl_global_host, opencl_global_device, cuda_device, sycl_global_device → gpu_global
  • opencl_local, cuda_shared, sycl_local → gpu_local
  • opencl_constant, cuda_constant → gpu_constant
  • opencl_private, sycl_private → gpu_private
  • target<x> → target<x>
  • Extra address space for alloca: alloca

And some hints for the future design, if we ever get to it:

  • hlsl_groupshared → gpu_local (it shares the same target AS with opencl_local, defined in DirectX.h)
  • * → * (passthrough for the Microsoft and WASM address spaces)

Attribute design

The implementation of this proposal is still based on PR clangir#682, which is a one-to-one modeling of LangAS (a hypothetical usage sketch follows the list below).

  • Encodes the None case as a null attribute.
  • Uses an integer-parameterized attribute AddressSpaceAttr to hide all implementation details and ensure memory efficiency.
  • The conversions between LangAS and text-form CIR would be modified as proposed above.
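As a rough illustration of the intended usage in CIRGen, consider the following sketch. The attribute name follows the PR, but the builder API, the case grouping, and the integer encoding shown here are assumptions for illustration, not the actual clangir interface:

#include "clang/Basic/AddressSpaces.h"
#include "llvm/Support/ErrorHandling.h"
#include "mlir/IR/Attributes.h"
#include "mlir/IR/MLIRContext.h"
// ... plus the (hypothetical) header declaring cir::AddressSpaceAttr
//     from clangir#682.

mlir::Attribute buildCIRAddrSpaceAttr(mlir::MLIRContext *ctx,
                                      clang::LangAS langAS) {
  switch (langAS) {
  case clang::LangAS::Default:
    // None: encoded as the absence of the attribute (a null Attribute).
    return {};
  case clang::LangAS::opencl_local:
  case clang::LangAS::cuda_shared:
  case clang::LangAS::sycl_local:
    // Merged into the single unified gpu_local case; the integer
    // encoding here is a placeholder, hidden by the attribute.
    return cir::AddressSpaceAttr::get(ctx, /*gpu_local=*/3);
  // ... remaining unified cases elided ...
  default:
    llvm_unreachable("address space NYI");
  }
}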

Open Questions

Possibly better handling for alloca

The individual alloca case, as originally proposed, probably yields extra address space conversions from itself to None or gpu_private, because LangAS::Default is actually mapped to those cases.

Here is an alternative design for alloca: rewrite both current getASTAllocaAddressSpace implementations (the default one and the one querying the data layout) to return the unified AS, which would be None and gpu_private respectively.

We are unsure which one is better: neutral but generating extra address space conversions, versus opinionated but generating no extra conversions.
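To make the trade-off concrete, here is a hypothetical sketch of both options; all names are placeholders rather than actual CIR API:

// Placeholder unified-AS enum, for illustrating the trade-off only.
enum class UnifiedAS { None, gpu_generic, gpu_global, gpu_local,
                       gpu_constant, gpu_private, alloca_as };

// Option A (neutral): always produce the dedicated alloca case and let a
// later pass insert conversions to None / gpu_private where needed.
UnifiedAS allocaAddrSpaceOptionA() { return UnifiedAS::alloca_as; }

// Option B (opinionated): mirror the two getASTAllocaAddressSpace
// implementations and resolve the answer up front, avoiding conversions.
UnifiedAS allocaAddrSpaceOptionB(bool targetEncodesAllocaAS) {
  return targetEncodesAllocaAS ? UnifiedAS::gpu_private : UnifiedAS::None;
}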


CC @bcardosolopes @lanza for their opinions


Naming them gpu_ may be a little bit narrow: OpenCL can also run on FPGAs or NPUs. In Clang, either device_ or offload_ has been used for similar situations; offload_ is more favorable. OpenCL can be treated as a special case of offloading where device code and host code are separate.


The fact that these are called opencl is unfortunate. I don’t think it’s correct to call it offload either, because these are properties of the targets themselves. OpenCL isn’t an offloading language, and you can directly compile C/C++ just fine (just do clang --target=nvptx64-nvidia-cuda). I really don’t know of a compelling reason we couldn’t just call it addrspace_generic; if a target needs a new one, it can just add a new enum member.

I am not an address space expert. How does this design interact with the OpenMP and OpenACC dialects and the future OpenCL, CUDA, HIP, SYCL, … dialects?


Thanks for putting this up! I already gave some feedback as part of the PRs, but overall this seems like a good plan (after you incorporate feedback given by other experts). Perhaps @AnastasiaStulova also has some extra feedback for you.

Looks like this would go a bit against your nice unified design.

This might be better instead.


Thanks for working on this. This proposal generally makes sense.

Just a few comments to explain the language specific differences below.

OpenCL is very explicit about address spaces. Initially, no conversions between address spaces were allowed, and every type had to have an address space. This ensures support for a diverse range of hardware that may require special instructions to access certain memories and may not support dynamic address translation. To simplify the implementation, Clang’s Default address space was used as a synonym for OpenCL’s private, partially because the inference rules assumed an address space to be private if it was not explicitly present in the source code. This later led to some issues, because the generic address space was introduced, which permitted conversions with some other address spaces and was also inferred implicitly in some cases. So Default in some cases became generic, but the parser was never fully fixed to add the address space explicitly in the types where needed. Overall, there shouldn’t be any uses of the Default address space in OpenCL. Aside from the inference rules and the deviation in the default address space, OpenCL strictly follows the rules defined in Embedded C (ISO/IEC TR 18037:2008). Note that OpenCL 3.0 makes the generic address space optional.

In CUDA, address space conversions involving Default are permissive, and therefore there are no complex inference rules during parsing.

Having unified address spaces in IR generally makes sense, but it also means the lowering happens early. This might be reasonable depending on the type of optimisations in CIR that affect address spaces.

I don’t have a lot of experience with the alloca address space, but I always felt it was meant to be some form of language-specific address space. I feel it could be a subset of, or even identical to, OpenCL’s private. Maybe it could be added as a C/C++ address space for automatic storage? That would need some work in the Clang parser though.
