Moving towards a singular pointer type

It’s an idea been thrown around in a few different threads (including Rafael’s recent http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20141201/247285.html and Chandler’s http://llvm.org/viewvc/llvm-project?rev=226781&view=rev ) so I’m putting up my hand to volunteer to do the work & interested in getting a bit more feedback, thoughts on best approaches, timing, etc.

For some more detail: pointer types (i32*, %foo*, etc) complicate IR canonicalization. store + load should be the same instructions given the same number of bytes stored to memory, but instead we can have store float, store int, etc, etc. Another point Chandler made was that the bitcasts involved when a pointer isn’t of the right type results in extra IR instructions we don’t really need.

So the general idea is that all pointers would just be called “ptr” (pointer? void*?).

Is this something everyone feels is the right direction? Any reason why it wouldn’t be?

Beyond that, I’m trying to think about how to do this & I haven’t hit on a terribly convincing way to do this incrementally. I could introduce the alternative form of “store” that provides the magic pointer type, then set about adding overloads (is that possible?) of any instruction consuming a pointer type, writing the usual LLVM regression tests as I go. Eventually, once this looks like it’s functioning, I could start porting IRbuilder and Clang over to the new store operations & other sources of pointers. Then remove the old stuff.

Are IR instructions overloadable like this? If not, would it be worthwhile to introduce separate names for the typeless-pointer forms (gep_ptr, store_ptr, etc) as a temporary means to have both sets of semantics then rename them all back once the old ones are removed?

Other ideas/thoughts?

  • David

How would GEP's look like in this scheme? Concretely, what would be
the equivalent of

%inner.ptr = getelementptr i8** %x, i32 1

assuming we're doing target independent optimizations and do not know
the size of a pointer?

-- Sanjoy

How would GEP's look like in this scheme? Concretely, what would be
the equivalent of

%inner.ptr = getelementptr i8** %x, i32 1

assuming we're doing target independent optimizations and do not know
the size of a pointer?

Haven't we been moving towards always having at least some basic target
information like datalayout and such (which includes pointer size)?

-- Sean Silva

The difference is that %x would be just a generic pointer type, but the type would still need to be supplied as an argument to the GEP / load / store.

%inner.ptr = getelementptr i8**, %x, i32 1

You could think of it as having an implicit bitcast on every pointer operation.

Actually, store doesn’t need a type, as that’s supplied by the value being stored. But load needs to know the type of the loaded value, and gep needs to know the memory layout to perform target specific offset calculations.

The difference is that %x would be just a generic pointer type, but the type
would still need to be supplied as an argument to the GEP / load / store.

%inner.ptr = getelementptr i8**, %x, i32 1

You could think of it as having an implicit bitcast on every pointer
operation.

That makes a lot of sense, thanks!

-- Sanjoy

It's an idea been thrown around in a few different threads (including
Rafael's recent
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20141201/247285.html
and Chandler's http://llvm.org/viewvc/llvm-project?rev=226781&view=rev )
so I'm putting up my hand to volunteer to do the work & interested in
getting a bit more feedback, thoughts on best approaches, timing, etc.

For some more detail: pointer types (i32*, %foo*, etc) complicate IR
canonicalization. store + load should be the same instructions given the
same number of bytes stored to memory, but instead we can have store float,
store int, etc, etc. Another point Chandler made was that the bitcasts
involved when a pointer isn't of the right type results in extra IR
instructions we don't really need.

So the general idea is that all pointers would just be called "ptr"
(pointer? void*?).

Is this something everyone feels is the right direction? Any reason why it
wouldn't be?

Beyond that, I'm trying to think about how to do this & I haven't hit on a
terribly convincing way to do this incrementally. I could introduce the
alternative form of "store" that provides the magic pointer type, then set
about adding overloads (is that possible?) of any instruction consuming a
pointer type, writing the usual LLVM regression tests as I go. Eventually,
once this looks like it's functioning, I could start porting IRbuilder and
Clang over to the new store operations & other sources of pointers. Then
remove the old stuff.

Are IR instructions overloadable like this? If not, would it be worthwhile
to introduce separate names for the typeless-pointer forms (gep_ptr,
store_ptr, etc) as a temporary means to have both sets of semantics then
rename them all back once the old ones are removed?

Other ideas/thoughts?

I'm wondering how geps will work when the pointer type doesn't encode the
stride that each index must take. Maybe it would have to be (strawman
syntax) gep<%T> off of a typeless base pointer?

-- Sean Silva

Actually, store doesn't need a type, as that's supplied by the value being
stored. But load needs to know the type of the loaded value, and gep needs

Yup, agreed. Even now one of types in "store i32 %value, i32* %ptr"
is redundant -- given either the value or the pointer type the type
system fixes the other.

-- Sanjoy

I’m more concerned about the impact on the front end api. It’s sometimes convenient at the moment to allow LLVM to track types, even if only in memory, or only as a hint, and not serialised to/from bitcode.

Perhaps the right first step is to merge bitcasts into each instruction that takes a pointer argument, with the default type taken from the current value. And turn pointer bitcasts into something like a constant expression.

I think we should keep GEP essentially the same, but disassociate the type being GEPd over from the type of the operands. So, assuming the new ptr type is spelled “ptr”, we could use this syntax:
%inner.ptr = getelementptr ptr, ptr %x, i32 1

Or if I was adding 1 to a “struct A*” value in C:
%next_elt = getelementptr %struct.A, ptr %x, i32 1

Ditto for all other instructions that care about pointee types, like load and store:
%v = load i32, ptr %p ; loads already know (and store!) their loaded type internally
store i32 %v, ptr %p ; no need to duplicate that %p points to, we have the type on %v

I don’t think this can be incremental, I think it all goes at once. I think you might need to add a new GEP bitcode opcode, since that instruction grows a new type operand that doesn’t come from an operand type or result type. It also wouldn’t be too hard to accept the old .ll syntax, since the upgrade path mostly discards information.

I think we should keep GEP essentially the same, but disassociate the type
being GEPd over from the type of the operands. So, assuming the new ptr
type is spelled "ptr", we could use this syntax:
%inner.ptr = getelementptr ptr, ptr %x, i32 1

Or if I was adding 1 to a "struct A*" value in C:
%next_elt = getelementptr %struct.A, ptr %x, i32 1

Ditto for all other instructions that care about pointee types, like load
and store:
%v = load i32, ptr %p ; loads already know (and store!) their loaded type
internally
store i32 %v, ptr %p ; no need to duplicate that %p points to, we have the
type on %v

Emphatically agree. No instruction should really change semantics here.
GEPs should keep working the exact same way, the type involved should just
be separate from the pointer's type.

I don't think this can be incremental, I think it all goes at once.

I have some ideas of how to make it incremental:

I think you might need to add a new GEP bitcode opcode, since that
instruction grows a new type operand that doesn't come from an operand type
or result type.

Yep. And you can add this first, defining the semantics to be as-if the
pointer operand was bit casted to a pointer to the new type. Then we could
(in theory, not in practice!) even use and test it with the current IR,
passing an i8* or any other pointer type operand.

Same goes for the load instruction. We could support the new syntax first.

Next, I think you might be able to introduce a generic pointer type,
spelled as 'ptr' which would verifier fail if used with the old load or gep
instructions. You might have to synthesize an unnamable pointee type to use
for the in-memory representation, but that seems not beyond reason. Then
you can wire up all the asm parsing and printing and bitcode stuff
incrementally without any disruption.

The remaining parts are more interesting and maybe harder to do
incrementally, but still seem at least somewhat decomposable:

- switching all of the LLVM optimizer and all of Clang to produce the new
forms of GEP and load rather than the old forms
- switching all of the optimizer and clang to use the new boring pointer
type now that they never form old gep and load instructions
- switching all the auto-upgrade functionality on
- removing the in-memory support for the old forms
- simplifying a ton of the in-memory support and the optimizer now that the
old forms can't show up

Also:

It also wouldn't be too hard to accept the old .ll syntax, since the

upgrade path mostly discards information.

I agree here. If only because of th eregression test suite, and the
*incredible* tediousness of updating the pointers. The auto-upgrade for
this kind of thing is essentially perfect.

-Chandler

I think we should definitely script an update of the test suite if we do
this. I really don't want to deal with reading old style IR in 98% of
existing tests.

Oh, we should. I just don't want to have to lock-step everything.

Also, there are a *lot* of out-of-tree test suites. =] It'd be nice if they
could upgrade when convenient rather than in a hurry.

Or the offsets to gep could all be in bytes and you could introduce two new types of constants: FieldOffset (type, field) and ArrayStride(element type) – The ArrayStride constant would be used with multiply (and would probably have other uses). I’ve seen this work well in other compilers and to be truthful GEP is a little weird for source languages that allow expression extents for aggregates.

It's an idea been thrown around in a few different threads (including
Rafael's recent
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20141201/247285.html
and Chandler's http://llvm.org/viewvc/llvm-project?rev=226781&view=rev )
so I'm putting up my hand to volunteer to do the work & interested in
getting a bit more feedback, thoughts on best approaches, timing, etc.

For some more detail: pointer types (i32*, %foo*, etc) complicate IR
canonicalization. store + load should be the same instructions given the
same number of bytes stored to memory, but instead we can have store float,
store int, etc, etc. Another point Chandler made was that the bitcasts
involved when a pointer isn't of the right type results in extra IR
instructions we don't really need.

FWIW: This is actually a real source of annoyance when doing value
numbering/PRE/insertion based stuff.

Current GVN essentially inserts them willy nilly to get the right types in
the right places when playing with loads/stores.
It also spends a bunch of time trying to look through them.

Essentially, it's just another instruction to walk that gives you nothing,
but has to be "worked around" if it ended up in the wrong place, etc.

It's like GCC's NOP_EXPR.

Here's what happened in the GCC world because of this:

grep -i strip_nops *|wc -l
     229

It's essentially littered around the codebase

One reservation about this being “singular” - we are still going to have a different pointer type per address space, right?

I imagine/assume that distinct pointer types per address space are still necessary, but I haven’t looked at address spaces in any particular detail.

We probably have the weirdest pointer requirements (an out-of-tree target with pointers that are not integers[1]). I don't believe that this would cause problems for us, as long as we can differentiate pointers in different address spaces.

Being able to strip out all of the casts to and from i8* that various bits of clang insert because certain intrinsics take an i8* and so pointers must be cast and cast back would simplify things a lot for us (in particular, clang often picks an i8* in the wrong AS - it wouldn't if there were no bitcast at all).

I wonder what the effects would be on TBAA? Would pointers still have some metadata associated with them to indicate what their types were, or would this just be attached to the corresponding load / store instructions?

David

[1] Not sure when slides and videos will appear for this, but hopefully soon: FOSDEM 2015 - Adventures with LLVM in a magical land where pointers are not integers

From: llvmdev-bounces@cs.uiuc.edu [mailto:llvmdev-bounces@cs.uiuc.edu]
On Behalf Of David Chisnall

[..]

[1] Not sure when slides and videos will appear for this, but hopefully soon:
FOSDEM 2015 - Adventures with LLVM in a magical land where pointers are not integers

The slides can be found here:

http://www.llvm.org/devmtg/2015-02/

(hidden in the 'meetings' link on the left of the website (The LLVM Compiler Infrastructure Project))

Greetings,

Jeroen Dobbelaere

I’m not sure we need the new instructions at all.

Really what we want is for Load and GEP to carry around a Type* alongside all their operands. This can be added now to the existing Load and GEP insts, and to their constructors/create methods.

Going through all of the codebase to ensure we pass in the Type* in all the right places was going to have to be done anyway for a new instruction, but now can be done on the existing instructions. You can also assert (temporarily) to ensure that the type passed in matches the pointer operand type.

I do agree the actual final switch could be tricky, but I think we should do what I’ve suggested here first, then see what happens if we just make the switch (which if we get everything in place really just means turning off a bunch of asserts and parser error checks).

Personally I think this is far less work and less disruptive to out of tree targets, which are otherwise going to have to find all their own uses of the old style GEP instruction and add another GEP instruction to switches, etc.

I agree in principle, but I think the precedent here is dangerous.

Thanks,
Pete