Two important changes to the getelementptr instruction

Hi all,

I just checked in a series of patches that makes some substantial changes
to the LLVM getelementptr instruction. In particular, in LLVM 1.2 and
earlier, the getelementptr instruction required index operands for
structure types to be 'ubyte' constants and index operands for sequential
types to be 'long' values. This had several problems, most notably that
it was impossible to represent a structure with more than 256 fields, and
performance for 32-bit targets suffered.

Now in CVS, structure indices are required to be 'uint' constants, which
permits structures with up to 2^32 elements. Sequential type indices are
allowed to be any int, uint, long, and ulong values. This is documented
in the LLVM language reference here:
http://llvm.cs.uiuc.edu/docs/LangRef.html#i_getelementptr

This change required modifying the encoding of getelementptr instruction
in bytecode files, so LLVM 1.3 bytecode files won't be readable by LLVM
1.2 and earlier. LLVM .ll files and .bc files from earlier releases, as
usual, will work fine with the newer version. In particular, we will
always support .ll files that use ubyte indexes for structure types, as
the parser auto-upgrades them to uint indexes.

This change fixes the following bugs:
http://llvm.cs.uiuc.edu/PR82
http://llvm.cs.uiuc.edu/PR309

It also results in cleaner and simpler LLVM code, allowing many integer
casts that were present for array indexes to go away. Though the encoding
of getelementptr instructions is slightly less efficient than before
(because we have to store what type the various sequential type operands
are), the reduction in cast instructions actually results in a bytecode
size reduction of about 1%.

If you have any questions about this change, please let me know,

-Chris

Hi Chris,

Congrats on getting this taken care of finally. I know its something
you've wanted to do since 1.0.

I have one question. How does LLVM disambiguate between a uint used for
a structure and a uint used for an array? My assumption is that LLVM is
aware of the type of the thing being indexed all the way through the
dereference so it doesn't really matter what index type is being used.
If that's the case, then why restrict structure indexes to uint and not
make it "any integer type"? This might further reduce casting and
generalizes the programming model.

Mostly just curious ...

Reid.

Congrats on getting this taken care of finally. I know its something
you've wanted to do since 1.0.

It certainly has been on the grand TODO list for a long time :slight_smile:

I have one question. How does LLVM disambiguate between a uint used for
a structure and a uint used for an array?

It depends on the operand number of the getelementptr instruction. For
example:

%P2 = getelementptr {int}* %P, uint 4

In the above example, we know that the uint is indexing "through" the
pointer in the getelementptr instruction, so it is not an index into the
structure. In this case:

%P3 = getelementptr {int}* %P, uint 4, uint 0

The first index is just like the previous case, but the 'uint 0' is now
indexing through the structure, so the number is a field number. The big
distinction is that you can index with variables in arrays and pointers,
but only constant integers in structures.

My assumption is that LLVM is aware of the type of the thing being
indexed all the way through the dereference so it doesn't really matter
what index type is being used.

Very true.

If that's the case, then why restrict structure indexes to uint and not
make it "any integer type"? This might further reduce casting and
generalizes the programming model.

The idea is that structure indices must be *constants*. As such, it
doesn't really matter what type we make them, as long as there is enough
"addressing space" to represent enough fields. Allowing structure fields
to be "any type" is possible, but wouldn't give us anything: uint 7 means
the same thing as sbyte 7.

The disadvantage of allowing any integer type for the structure index is
that we have to encode the integer type used in the bytecode file, which
takes space. Right now the type of the index is implicitly encoded into
the bytecode file based on what operand of the getelementptr it is.
Another advantage is that clients of the LLVM IR can assume that all
structure indexes are instances of ConstantUInt, which simplifies
construction and modification of the IR a bit.

-Chris

.....

The big
distinction is that you can index with variables in arrays and pointers,
but only constant integers in structures.

Doh! That is a big distinction. Thanks for reminding me.

...
> If that's the case, then why restrict structure indexes to uint and not
> make it "any integer type"? This might further reduce casting and
> generalizes the programming model.

The idea is that structure indices must be *constants*. As such, it
doesn't really matter what type we make them, as long as there is enough
"addressing space" to represent enough fields. Allowing structure fields
to be "any type" is possible, but wouldn't give us anything: uint 7 means
the same thing as sbyte 7.

The disadvantage of allowing any integer type for the structure index is
that we have to encode the integer type used in the bytecode file, which
takes space. Right now the type of the index is implicitly encoded into
the bytecode file based on what operand of the getelementptr it is.
Another advantage is that clients of the LLVM IR can assume that all
structure indexes are instances of ConstantUInt, which simplifies
construction and modification of the IR a bit.

Wow, its just as if you've thought this stuff through :slight_smile:

You are, of course, right. Its simpler for language implementers to just
deal with ConstantUInt and simpler for LLVM too. Since they have to be
constants, there's no point allowing other types. When I made that
suggestion I was think any integer expression could be used to index a
structure field but that doesn't make sense, as you've pointed out.

Thanks for satisfying my curiosity.

Reid.