RFC: GEP as canonical form for pointer addressing
I would like to propose that we designate GEPs as the canonical form for pointer addressing in LLVM IR before CodeGenPrepare.
1) It is legal for an optimizer to convert ptrtoint+arithmetic+inttoptr sequences to GEPs, but not vice versa.
2) Input IR which does not contain inttoptr instructions will never have inttoptr instructions introduced by the optimizer (before CodeGenPrepare).
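To make the two forms concrete, here is a minimal sketch of the same byte-offset computation written both ways. The function names are made up for illustration, and the sketch assumes 64-bit pointers and the current opaque-pointer syntax; adjust for your target and LLVM version.

```llvm
; Integer round-trip form: the pointer identity is hidden behind
; integer arithmetic between the ptrtoint and the inttoptr.
define ptr @offset_via_int(ptr %base, i64 %n) {
  %addr = ptrtoint ptr %base to i64
  %sum  = add i64 %addr, %n
  %p    = inttoptr i64 %sum to ptr
  ret ptr %p
}

; Proposed canonical form: the same byte offset expressed as a GEP,
; so %p is visibly derived from %base.
define ptr @offset_via_gep(ptr %base, i64 %n) {
  %p = getelementptr i8, ptr %base, i64 %n
  ret ptr %p
}
```

Under the proposal, an optimizer could rewrite the first function into the second, but would not rewrite the second into the first before CodeGenPrepare.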
I've spoken with Nick Lewycky & Owen Anderson offline at the last social. On first reflection, both were okay with the proposal, but I'd like broader buy-in and discussion. Nick & Owen, if I've accidentally misrepresented our discussion or you've had second thoughts since, please speak up.
Background & Motivation
We want to support precise garbage collection(1) in LLVM. To do so, we have written a pass which inserts safepoints and read and write barriers as appropriate. This pass needs to be able to reliably(2) identify pointer vs non-pointer values. It's advantageous to run this pass as late as practical in the optimization pipeline, but we can schedule it before lowering begins (i.e. before CodeGenPrepare).
We control the initial IR which is generated and can ensure that it does not contain any inttoptr instructions. We're looking to have a guarantee(*) that a random LLVM optimization pass will not decide to replace GEPs with a sequence of ptrtoint, int arithmetic, and inttoptr, which is hard for us to reason about.
* "guarantee" isn't really the right word here. I'm really just looking to make sure that the community is comfortable with GEPs as canonical form. If some pass decides to insert inttoptr instructions into otherwise clean IR, I want some assurance a patch fixing that would stand a good chance of being accepted. I'm happy to do any cleanup required.
In addition to my own use case, here are a few others which might come up:
- Backends for targets which support different operations on pointers vs integers. Examples would be some of the older mainframe architectures. (There'd be a lot more work needed to support this.)
- Various security related applications (e.g. CFI w.r.t. function pointers)
I don't really want to get into these applications in detail, mostly because I'm not particularly knowledgeable on those topics. I'd appreciate any other applications anyone wants to throw out, but let's try to keep from derailing the discussion. (As I did to Nick's original thread on DataLayout. :))
1) We're not using the existing gc.root implementation strategy. I plan on explaining why in a lot more detail once we're closer to having a complete implementation that we can upstream. That should be coming relatively shortly. (i.e. months, not weeks, not years)
2) As Nick pointed out in a separate thread, other types of typecasts can obscure pointer vs integer classifications. (i.e. casting the base type of a pointer we then load through could load a field of the "wrong" type.) I plan on responding to his point separately, but let's leave that out of this discussion for the moment. Having GEPs as canonical form is a step forward by itself, even if I decide to propose something further down the road.