[RFC] Changing X86 data layout for address spaces

Amy, when you say "implement MSVC's extensions", is that MSVC/LLVM or
are you going to add these to clang? OpenVMS has dual-sized pointers as
well, and we want to see them in clang. One of my engineers noticed that
address space 4 was a 32-bit pointer address space (or at least it used
to be). I haven't seen any formal discussion of that over in the cfe
dev list. I'm happy to help out where we can.


Yes, the address spaces are intended to be generally useful for other applications. Glad to hear that you would find it useful. :slight_smile:

For now we were only going to add them for x86, but it should be easy to add such address spaces to other targets.

Please don’t bake knowledge of a specific address space into LLVM unless there’s a strong need. Clang can assign whatever meaning to an AS it wishes, and their properties are configurable via data layout configuration.

The standards of review are much lower for a Clang-only change, as it can be revised arbitrarily without breaking other uses of LLVM. Building first-class knowledge into LLVM itself is potentially a breaking change for other frontends.


Clang can assign whatever meaning to an AS it wishes, and their properties are configurable via data layout configuration.

I think Clang checks that its data layout matches the LLVM target data layout, so I’m not sure how to go about implementing the address spaces as a change in Clang.

The other issue I was running into is how to do the address space casts in clang - my current thought was to lower the address space casts in LLVM to the sign extension / zero extension / truncation.
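
To make the three conversions concrete, here is a minimal sketch of what the casts would compute. Plain integers stand in for pointer values, and the function names are hypothetical; this is not clang or LLVM code, just the arithmetic that a lowering to sign extension / zero extension / truncation would perform.

```cpp
#include <cstdint>

// Hypothetical helpers: integers stand in for pointer bit patterns.

// 32-bit "signed" pointer -> 64-bit pointer: sign-extend the top bit.
// (The unsigned -> signed conversion is modular on all mainstream compilers
// and is guaranteed to be so from C++20 on.)
inline std::uint64_t extendSigned(std::uint32_t p32) {
  return static_cast<std::uint64_t>(
      static_cast<std::int64_t>(static_cast<std::int32_t>(p32)));
}

// 32-bit "unsigned" pointer -> 64-bit pointer: zero-extend.
inline std::uint64_t extendUnsigned(std::uint32_t p32) {
  return static_cast<std::uint64_t>(p32);
}

// 64-bit pointer -> 32-bit pointer: drop the upper 32 bits.
inline std::uint32_t truncateTo32(std::uint64_t p64) {
  return static_cast<std::uint32_t>(p64);
}
```

The two extensions only differ for values with the top bit set: `0x80000000` becomes `0xFFFFFFFF80000000` when sign-extended and `0x0000000080000000` when zero-extended.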


That is basically what we do today to provide mixed sized pointers with
our legacy frontends. They generate IR to our old code generator which
has ADDR32 and ADDR64 datatypes. We use a 64-bit address data layout
and then typecast the 32-bit forms to/from the underlying 64-bit
addresses. I have been warned that such rampant typecasting might
interfere with certain optimizations or TBAA data. We haven't
investigated that yet. Since clang doesn't go through our IR-to-IR
converter, we'll have to teach clang to do the same.

Thanks for the info–
It seems like the way to do this is for clang to use address spaces to represent different sized pointers in the IR. If that’s the case, then LLVM still has to know about those specific address spaces. I don’t see a clear way to avoid adding knowledge of the address spaces to LLVM.

I think that’s the approach we’ll go with, unless there are more objections.

Please review the properties of an address space which are configurable via the data layout. For example, bitwidth is one of those parameters. If that parameter space covers your needs, then you do not need LLVM side support.
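
For readers less familiar with the data layout string, here is a rough sketch of how pointer width per address space is encoded in it. The data layout string uses `p[n]:<size>:<abi>` components separated by `-`. The parser below is a simplified stand-in written for illustration (it is not the LLVM `DataLayout` class), and the address space numbers in the usage note are made up for the example, since this thread does not name them.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Simplified sketch: collect "address space -> pointer size in bits" from the
// p[n]:<size>:<abi> components of a data layout string. Everything else in
// the string is ignored, and malformed input is not handled.
std::map<unsigned, unsigned> pointerSizes(const std::string &layout) {
  std::map<unsigned, unsigned> sizes;
  std::istringstream in(layout);
  std::string spec;
  while (std::getline(in, spec, '-')) {
    if (spec.empty() || spec[0] != 'p')
      continue;  // not a pointer spec
    std::size_t colon = spec.find(':');
    if (colon == std::string::npos)
      continue;
    // A bare "p" (no number) refers to address space 0.
    unsigned addrSpace =
        colon > 1 ? static_cast<unsigned>(std::stoul(spec.substr(1, colon - 1)))
                  : 0u;
    // std::stoul stops at the next ':', so this reads only the size field.
    unsigned bits = static_cast<unsigned>(std::stoul(spec.substr(colon + 1)));
    sizes[addrSpace] = bits;
  }
  return sizes;
}
```

With an illustrative layout such as `"e-p270:32:32-p271:32:32-p272:64:64-i64:64"`, this yields 32 bits for address spaces 270 and 271 and 64 bits for 272. Address spaces without an explicit `p` entry are not reported by this sketch.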


The datalayout itself is currently not considered a point of configurability. It’s a static, backend/codegen defined property. The target machine returns the datalayout for the given triple and is the source of truth. The front end is responsible for creating a module with an exactly matching datalayout and is not free to customize the datalayout this way. I’m not sure what there is to gain by avoiding putting this information in the backend.


I agree with everything here, but did you mean frontend instead of backend in the last sentence? As a consequence of the fact that each target defines its data layout, that means we have to put the new data layout in the backend and the frontend if we are going to use address spaces.

We could implement this purely in the frontend with ptrtoint / inttoptr, zext, sext, trunc, i32, i64 etc., but the IR for it is quite awful and unoptimizable. It breaks several invariants and the extension seems to be naturally representable with address spaces.

I was addressing the concern about making an llvm backend change for this feature. From my perspective these are the same thing. There is one datalayout that should be consistent everywhere, regardless of the implementation details of how clang sets it on the module. This should definitely use a different sized address space. I don’t think there should be any real problem changing the datalayout, though there may need to be some long overdue datalayout upgrade work (e.g., IIRC llvm-link warns about mismatched datalayouts based on a dumb string comparison).


Looks like I made my point in an accidentally really confusing way. Let me try again w/Matt’s correction in mind.

I want to make sure that the middle end optimizer code is driven by the data layout. I am not trying to express an opinion on how that data layout is populated (frontend, backend, black magic, what have you). I just want to make sure that we don’t end up with the middle end having to know that address space “56” has special meaning beyond what is encoded in the data layout. To say it differently, I want to make sure that a different frontend targeting a different backend is not affected by the proposed changes.


Got it. :slight_smile: Yes, we have no intention of teaching the middle end about these address spaces. The usual rules around addrspacecast should apply.

I really hate to dip my toe in here, because it will only reveal my total ignorance, but….

Do the “usual rules around addrspacecast” say when different address spaces can alias? I remember somebody using address spaces to represent something like special off-to-the-side device memory, which obviously could never alias main memory, whereas other uses like the 32/64-bit thing will certainly have the 32-bit space aliasing (likely disjoint parts of) the 64-bit space, and the exact mapping of 32-to-64 might vary across OSes.


By default, nothing (i.e., everything across address spaces is may-alias). But individual AS pairs may define alternate aliasing rules. I don’t know that we have a good way to make that pluggable today, though.
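
As a sketch of what "may-alias by default, with per-pair overrides" could look like, here is a small standalone table. The class and method names are hypothetical and this is not an existing LLVM interface; it only illustrates the rule that two address spaces are assumed to alias unless a target has explicitly declared that pair disjoint.

```cpp
#include <algorithm>
#include <set>
#include <utility>

// Hypothetical sketch: conservative aliasing rules keyed on address space pairs.
class AddrSpaceAliasTable {
public:
  // Record that pointers in address spaces a and b can never alias.
  void addDisjoint(unsigned a, unsigned b) { disjoint_.insert(key(a, b)); }

  // Conservative default: any pair not explicitly marked disjoint may alias.
  bool mayAlias(unsigned a, unsigned b) const {
    return disjoint_.count(key(a, b)) == 0;
  }

private:
  // Order the pair so that (a, b) and (b, a) map to the same entry.
  static std::pair<unsigned, unsigned> key(unsigned a, unsigned b) {
    return {std::min(a, b), std::max(a, b)};
  }

  std::set<std::pair<unsigned, unsigned>> disjoint_;
};
```

A target with off-to-the-side device memory could register that pair as disjoint, while 32-bit and 64-bit views of the same memory would simply stay at the may-alias default.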

(forgot to llvm-dev…)

Would it make sense to have TTI answer queries about whether two address spaces may alias? This is something I’ve asked myself in the past, but I’m not sure if it follows the philosophy of what TTI should do.

We do have the ability for targets to add custom AA logic, and the AMDGPU backend uses this to add some address-space-based logic (see AMDGPUAliasAnalysis.cpp). If it were just adding the AS-based aliasing logic, then the ratio of boilerplate to actual logic would certainly speak in favor of a TTI hook. The AMDGPU AA also overrides pointsToConstantMemory and that seems more involved. Regardless, for now, we could have an X86AliasAnalysis and insert that into the pipeline.


Sorry, this thread disappeared into my email filters.

So to clarify/summarize, the proposed change here is to change the x86 backend data layout to encode the pointer sizes for these three address spaces. It seems like there’s general agreement that this is fine, as long as there aren’t other changes to the ‘middle end’–