RFC: GEP as canonical form for pointer addressing

RFC: GEP as canonical form for pointer addressing

I would like to propose that we designate GEPs as the canonical form for pointer addressing in LLVM IR before CodeGenPrepare.

Corollaries
1) It is legal for an optimizer to convert inttoptr+arithmetic+inttoptr sequences to GEPs, but not vice versa.
2) Input IR which does not contain inttoptr instructions will never contain inttoptr instructions (before CodeGenPrepare.)

I've spoken with Nick Lewycky & Owen Anderson offline at the last social. On first reflection, both were okay with the proposal, but I'd like broader buy-in and discussion. Nick & Owen, if I've accidentally misrepresented our discussion or you've had second thoughts since, please speak up.

Background & Motivation

We want to support precise garbage collection(1) in LLVM. To do so, we have written a pass which inserts safepoints, read, and write barriers as appropriate. This pass needs to be able to reliably(2) identify pointer vs non-pointer values. Its advantageous to run this pass as late as practical in the optimization pipeline, but we can schedule it before lowering begins (i.e. before CodeGenPrepare).

We control the initial IR which is generated and can ensure that it does not contain any inttoptr instructions. We're looking to have a guarantee(*) that a random LLVM optimization pass will not decide to replace GEPs with a sequence of ptrtoint, int arithmetic, and inttoptr which are hard for us to reason about.

* "guarantee" isn't really the right word here. I'm really just looking to make sure that the community is comfortable with GEPs as canonical form. If some pass decides to insert inttoptr instructions into otherwise clean IR, I want some assurance a patch fixing that would stand a good chance of being accepted. I'm happy to do any cleanup required.

In addition to my own use case, here's a few others which might come up:
- Backends for targets which support different operations on pointers vs integers. Examples would be some of the older mainframe architectures. (There'd be a lot more work needed to support this.)
- Various security related applications (e.g. CFI w.r.t. function pointers)

I don't really want to get into these applications in detail, mostly because I'm not particularly knowledgeable on those topics. I'd appreciate any other applications anyone wants to throw out, but lets try to keep from derailing the discussion. (As I did to Nick's original thread on DataLayout. :))

Notes:
1) We're not using the existing gc.root implementation strategy. I plan on explaining why in a lot more detail once we're closer to having a complete implementation that we can upstream. That should be coming relatively shortly. (i.e. months, not weeks, not years)

2) As Nick pointed out in a separate thread, other types of typecasts can obscure pointer vs integer classifications. (i.e. casting the base type of a pointer we then load through could load a field of the "wrong" type") I plan on responding to his point separately, but let's leave that out of this discussion for the moment. Having GEPs as canonical form is a step forward by itself, even if I decide to propose something further down the road.

Philip

From: "Philip Reames" <listmail@philipreames.com>
To: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Friday, February 14, 2014 7:18:21 PM
Subject: [LLVMdev] RFC: GEP as canonical form for pointer addressing

RFC: GEP as canonical form for pointer addressing

I would like to propose that we designate GEPs as the canonical form
for
pointer addressing in LLVM IR before CodeGenPrepare.

Is this not already the case? I did not think that any passes introduce inttoptr+arithmetic+inttoptr prior to CGP. On the other hand, we don't convert inttoptr+arithmetic+inttoptr into GEP when we can (which is PR14226 -- where Eli (cc'd) said this is unsafe in general).

-Hal

If it is, my proposal is even less controversial than I’d hoped it would be. :slight_smile: However, given the number of folks who I’ve talked to about this who haven’t said so, I expecting the answer is no. To my knowledge, none currently do. I want to keep it that way. To be clear, this is somewhat separate from my proposal. I only care that inttoptr+arithmetic+inttoptr sequences aren’t inserted in the place of GEPs. Having said that…, it’s still an interesting point to discuss. I read through the PR and I have to admit I don’t understand why the pointer aliasing rules would prevent such a transform. The final GEP is already “based on”* the base of the original GEP. The transformation doesn’t effect that at all. Can you (or Eli) spell this out a bit for me? I’m missing something. * from “A pointer value formed by an is on all pointer values that contribute (directly or indirectly) to the computation of the pointer’s value.” in the LangRef Philip

RFC: GEP as canonical form for pointer addressing

I would like to propose that we designate GEPs as the canonical form for pointer addressing in LLVM IR before CodeGenPrepare.

Corollaries
1) It is legal for an optimizer to convert inttoptr+arithmetic+inttoptr sequences to GEPs, but not vice versa.
2) Input IR which does not contain inttoptr instructions will never contain inttoptr instructions (before CodeGenPrepare.)

I've spoken with Nick Lewycky & Owen Anderson offline at the last social. On first reflection, both were okay with the proposal, but I'd like broader buy-in and discussion. Nick & Owen, if I've accidentally misrepresented our discussion or you've had second thoughts since, please speak up.

FWIW, I think it would be nice if standard optimization passes have this property of being well behaved with respect to pointer types, and I don’t see a good reason for canonical IR passes to lose pointer types. I also think it’s the only way to mix the optimization of pointer values with precise GC. It seems that you just want LLVM developers to generally agree that certain passes will be well behaved (you can disable any others). It may just be a matter of documenting those passes. Ideally we could formalize this by declaring a pass as pointer-safe and verifying. Can we easily verify that no memory access is based on inttoptr?

-Andy

Not directly related, but our canonical form for loops involving pointers[1] turns a loop that contains a GEP with the loop induction variable into a GEP with the increment inside the loop. This has two annoying properties for code generation:

- The GEP with the induction variable as the offset maps cleanly to CPU addressing modes and so we generate better code if we don't do this canonicalisation, and therefore end up trying to undo it in the back end (yuck).

- If the source is the start of an object, then this behaviour is GC-hostile because it means that IR that contains a pointer to an object start now only contains a pointer to the middle, requiring the GC to deal with inner pointers.

It would be nice if we could have canonical forms such that if the front end ensures that there are no inner pointers without pointers to the object's start in the IR, the optimisers don't break this.

David

[1] Are canonical forms actually documented anywhere, or are they simply undocumented implicit contracts?

I would say whatever form is currently generated by IR passes is defined as canonical. It’s not easy to specify. At some points in the pipeline (early and late) it’s fine to permit multiple forms of the same expression as long as it’s canonical-enough for the downstream analysis.

If some pass is generating a suboptimal form, it’s good to question whether it’s really necessary for any analysis. If not, we should change it. Without a test case, I can’t say what issue you’re running into above.

-Andy

RFC: GEP as canonical form for pointer addressing

I would like to propose that we designate GEPs as the canonical form for pointer addressing in LLVM IR before CodeGenPrepare.

Corollaries
1) It is legal for an optimizer to convert inttoptr+arithmetic+inttoptr sequences to GEPs, but not vice versa.
2) Input IR which does not contain inttoptr instructions will never contain inttoptr instructions (before CodeGenPrepare.)

I've spoken with Nick Lewycky & Owen Anderson offline at the last social. On first reflection, both were okay with the proposal, but I'd like broader buy-in and discussion. Nick & Owen, if I've accidentally misrepresented our discussion or you've had second thoughts since, please speak up.

FWIW, I think it would be nice if standard optimization passes have this property of being well behaved with respect to pointer types, and I don’t see a good reason for canonical IR passes to lose pointer types. I also think it’s the only way to mix the optimization of pointer values with precise GC. It seems that you just want LLVM developers to generally agree that certain passes will be well behaved (you can disable any others). It may just be a matter of documenting those passes.

You could phrase it this way. I would push for "everything before CodeGenPrepare", but am open to counter argument on why a smaller set should be selected. :slight_smile:

Ideally we could formalize this by declaring a pass as pointer-safe and verifying. Can we easily verify that no memory access is based on inttoptr?

Yes. The only slightly complication is dealing with phi nodes and selects (which feed into GEPs), but assuming you're willing to accept a slightly conservative answer, it's definitely doable.

I have a pass locally which effectively does this. It's a side effect of it's primary purpose, but getting that extracted as a distinct pass (or part of the verifier) shouldn't be difficult.

To correct myself…

This was in fact one of the corollaries I listed in my original proposal. I can drop it without effecting my core use case. This is what I should have said rather than the above.

Philip

While I agree that from a stylistic point of view this would be an improvement, we don't actual *need* this to support precise GC. It would definitely result in cleaner code generation than our current scheme though.

Philip

David, do you happen to have a test case on hand? I know I've seen this before, but my attempt to write out a quick example from memory failed.

Philip

I don't have one to hand - it's something I see in things with fairly complex loop structures. I'll try to find one next time I'm chasing performance issues.

David

If you’re running LSR, all bets are off. When Philip says he wants canonical IR before CodeGenPrepare, I take that to mean anything scheduled in TargetPassConfig::addIRPasses, including LSR, constant hoisting, etc.

-Andy

I’m not opposed to preserving the GEP’s original base as a matter of convention when there’s no good reason not to. But, in general passes expect to be able to break-up GEPs into smaller steps. We can’t guarantee that the original base will be directly referenced at every point of use.

We should certainly avoid generating out-of-bounds GEPs without retaining some in-bounds pointer, because that would break everyone’s conservative GC as well.

-Andy

This is exactly what I meant. Thanks for the clarification. Philip

I’m not opposed to preserving the GEP’s original base as a matter of convention when there’s no good reason not to. But, in general passes expect to be able to break-up GEPs into smaller steps. We can’t guarantee that the original base will be directly referenced at every point of use.

Agreed. From my perspective, having the direct base improves code quality, but is not required for correctness.

We should certainly avoid generating out-of-bounds GEPs without retaining some in-bounds pointer, because that would break everyone’s conservative GC as well.

I have no objection to this, but it's not required either. We can reconstruct a base pointer using a custom transform pass and then preserve this into the GC stack map. This deals with both in bounds and out of bounds geps. It does result in more messy code being generated though.

Philip

Philip Reames <listmail <at> philipreames.com> writes:

RFC: GEP as canonical form for pointer addressing

<snip>

In addition to my own use case, here's a few others which might come up:
- Backends for targets which support different operations on pointers vs
integers. Examples would be some of the older mainframe architectures.
(There'd be a lot more work needed to support this.)

Philip

It's not just old mainframes, it's some of the newest architecture as well.
The Mill general-purpose architecture (http://ootbcomp.com) has non-integer
pointers and distinct pointer operations too. That LLVM loses pointerhood is
the biggest problem that we have identified while looking into using LLVM as
our supported compiler. It may be a killer, and we may have to fall back to
gcc. That would be a shame, but it does appear that the ir makes rash
assumptions about machine architecture.

"There'd be a lot more work needed to support this" is not encouraging to
see, especially for a startup company with limited resources and little
prior exposure to LLVM internals.

Ivan

Just to add, I spend a fair bit of my time in the computer architecture research community, and pointers that are not integers are an increasingly common model. They simplify various dependency analysis paths in the pipeline (giving fewer pipeline flushes) and make certain kinds of security features significantly easier to implement.

Architectures that separate pointers from integers are note becoming rarer. They are increasingly common in application-specific processors and likely to reappear in mainstream processors over the next 5-10 years.

We have managed to get LLVM working (and building nontrivial amounts of code) on a MIPS-derived architecture that has non-integer pointers, and the representation in the IR itself is fine. We have a few hacks in optimisations that are far too coarse grained (i.e. don't do this optimisation if you're dealing with this kind of pointer, even though many of them [SCEV in particular] should work but the code makes invalid assumptions). We do end up having to add more after every merge.

We start to hit problems when we get to SelectionDAG, which makes a lot of assumptions about the underlying architecture and has an annoying habit of thinking it knows better than the back end and undoing transformations that the back end has done.

David

P.S. The Mill is a very interesting architecture, but I'm very glad I'm not the one responsible for instruction scheduling on it...

We have managed to get LLVM working (and building nontrivial amounts of code) on a MIPS-derived architecture that has non-integer pointers, and the representation in the IR itself is fine. We have a few hacks in optimisations that are far too coarse grained (i.e. don't do this optimisation if you're dealing with this kind of pointer, even though many of them [SCEV in particular] should work but the code makes invalid assumptions). We do end up having to add more after every merge.

Any chance you'd be willing to share patches? Or even just a list of optimizations effected? This is work I'm likely be duplicating in the very near future.

We start to hit problems when we get to SelectionDAG, which makes a lot of assumptions about the underlying architecture and has an annoying habit of thinking it knows better than the back end and undoing transformations that the back end has done.

We looked at trying to preserve pointer vs integer information post SelectionDAG and quickly gave up. I believe this to be the right long term direction - i.e. years from now - but we didn't believe it would be viable in the near term. Instead, we've chosen to encode the information we actually need - which values to rewrite - at an earlier phase and construct the IR such that - we hope - nothing can insert uses after our insert safepoints.

The fact you've gotten this working all the way though is an impressive accomplishment and gives me hope for the long term direction.

Philip

Our LLVM and Clang repositories are here:

https://github.com/CTSRD-CHERI/llvm
https://github.com/CTSRD-CHERI/clang

(the cheri branch in both is currently the active one, but it will be renamed head soon - we've just made some changes to our ISA and are waiting for everyone to have the updated version before we sync everything). I'm in the process of cleaning up the MIPS IV support for upstreaming, and I'm happy to upstream anything else that is more generally useful. I haven't yet for two reasons:

- Without an architecture that has a pointer-integer distinction in tree (and a lot of tests!), the support will likely bit rot. Our architecture does not yet have a stable ISA (and, since it's a research platform, probably never will), so would not be a good choice.

- Lots of things have '//FIXME: This is a really ugly hack!' above them and, while they do make things work for us, they are not always the right approach for long-term support.

I would be interested in working with anyone who wants to get better support for this kind of architecture into the architecture-neutral parts of LLVM. Having the address space cast instruction has simplified things for us quite a bit (we still actually lower this to an inttoptr or ptrtoint in the back end, but at least the optimisers don't randomly elide the casts or break things anymore, because they assume that ptrtoint -> inttoptr is a bitcast, even if they ended up in different address spaces [which had different sizes]).

David