How to handle size_t in front ends?

For the most part, it appears that writing a front end can be almost entirely platform-independent. For example, my front end doesn't know how big a pointer is, and for the most part doesn't care. All of the platform-specific aspects of compilation have, up to this point, been hidden behind the IR.

However, where things get tricky is in calling C functions that take a size_t as a parameter. Since a size_t is the same size as a pointer, it may be either 32 or 64 bits depending on the target platform. All of a sudden, I can no longer generate 'target-neutral' IR, but instead must introduce into the front end knowledge about the target platform.

So my questions are:

1) Is there a way to declare an integer type in the IR that represents "an int the same size as a pointer" without specifying exactly the size of a pointer?

2) Assuming the above answer is "no", what's the best way for the front end to measure the size of a pointer?

-- Talin

> For the most part, it appears that writing a front end can be almost
> entirely platform-independent. For example, my front end doesn't know
> how big a pointer is, and for the most part doesn't care. All of the
> platform-specific aspects of compilation have, up to this point, been
> hidden behind the IR.

Nice.

> However, where things get tricky is in calling C functions that take a
> size_t as a parameter. Since a size_t is the same size as a pointer, it
> may be either 32 or 64 bits depending on the target platform. All of a
> sudden, I can no longer generate 'target-neutral' IR, but instead must
> introduce into the front end knowledge about the target platform.

Right, this is an ugliness of C. :( Remember that even the size of int (and certainly the size of long!) is target-specific as well. This implies that you cannot generate truly portable interfaces to C code if they take an int.

> So my questions are:
>
> 1) Is there a way to declare an integer type in the IR that represents
> "an int the same size as a pointer" without specifying exactly the size
> of a pointer?

No.

> 2) Assuming the above answer is "no", what's the best way for the front
> end to measure the size of a pointer?

It doesn't help for interfacing to C, but if you want to generate the size as a constant Value* in the IR, you can do things like this:
http://llvm.org/docs/tutorial/LangImpl8.html#offsetofsizeof

For emitting the right sized integer type, I'm afraid that your front-end needs to know the size of the integer.

If you are willing to accept an ugly hack that works almost always on common targets, you can just declare the size_t argument *as being a pointer*, and use inttoptr to pass it. This works because common ABIs pass integers and pointers the same way.

-Chris
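To make the "front end needs to know the size" point concrete, here is a minimal sketch in C of a front end choosing the IR integer type for size_t from a target pointer width. The helper names (size_t_ir_type, emit_malloc_decl) are hypothetical, the emitted declaration is only illustrative IR text, and the pointer width is assumed to come from whatever target description the front end already has.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: map a target pointer width in bits to the name of
   the IR integer type used for size_t. Returns NULL for widths this
   sketch does not handle. */
const char *size_t_ir_type(unsigned pointer_bits) {
    switch (pointer_bits) {
    case 32: return "i32";
    case 64: return "i64";
    default: return NULL;
    }
}

/* Hypothetical helper: write an illustrative IR declaration for C's
   `void *malloc(size_t)` into buf. Returns 0 on success, -1 on failure. */
int emit_malloc_decl(char *buf, size_t len, unsigned pointer_bits) {
    const char *ty = size_t_ir_type(pointer_bits);
    int n;
    assert(buf != NULL);
    if (ty == NULL) return -1;
    n = snprintf(buf, len, "declare i8* @malloc(%s)", ty);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Calling emit_malloc_decl(buf, sizeof buf, 64) fills buf with a declaration taking i64; passing 32 yields i32 instead. The target dependence is confined to one parameter, but it cannot be removed.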

Chris:

There are other languages that specify a "word" type along these lines.
Would it be worth considering adding such a type to the IR, or is there
a reason not to do so that I am failing to see?

shap

What would this be used for? How is it defined? How does arithmetic work on it?

-Chris

What it would be used for is emitting platform-neutral IR for languages
that require a platform-sensitive integral type. BitC, for example,
explicitly specifies all integer sizes other than word, and specifies
that word is an integral value having at least as many bits as the
native pointer type. BitC does not permit (at the source level)
inter-operation between Word and other integral types, but that can be
dealt with entirely as a restriction in the front-end.

The reason that we introduced type Word, aside from systems-level
requirements, was that the type of array and vector indices cannot be
stated without this. The index type needs to be something that can span
an arbitrary character vector. It doesn't make sense for that type to be
i32 on a 64-bit machine, because i32 isn't big enough there.
Conversely it doesn't make sense for it to be i64 on a 32-bit machine.

Meaning: on a machine having 32-bit registers, iWord is a type treated
by the IR as indistinguishable from i32. On a machine having 64 bit
registers, iWord is a type treated by the IR as indistinguishable
from i64. Arithmetic works in the usual way. If "iWord" is "i32" on your
target, then it is acceptable in any position and condition where "i32"
would be acceptable in the IR specification. In short, iWord can be
substituted for the appropriate integral type the instant you commit to
a particular target.

In practice, I don't really think that a magic type of this sort is
necessary unless we want to have the capability for an
architecture-neutral IR. Assuming we know the target architecture, we
already know what size to emit. But it would be pleasant to be able to
emit and pickle machine-independent IR for BitC.

shap

> There are other languages that specify a "word" type along these lines. Would it be worth considering adding such a type to the IR, or is there a reason not to do so that I am failing to see?

> What would this be used for? How is it defined? How does arithmetic work on it?

Looking up the intptr type via TargetData is not a significant issue for me, but I can see the appeal, and how its absence could constitute a significant barrier to generating portable IR (provided, of course, a portable language). Regardless, it would allow me to hardcode a good deal more codegen if the LLVM IR had an intptr type. The semantics I would imagine for an intptr type are:

• Lowered to i32 or i64 for code generation.
• Treated as an ordinary integer for all operations except casts.
• Can be the operand to ptrtoint, but not the result.
• Can be the result of inttoptr, but not the operand.
• Can be bitcast to an actual pointer type.
• Whether sext, zext, and trunc are applicable, I could be convinced either way. It muddies the semantics of these operations.

My comment wasn't a solution; it was a value judgement on the proposed solution.

This doesn't seem like what we are after for BitC. In the BitC case,
WORD is not an integer type that contains a pointer value. It is an
integer type that is guaranteed to describe an arbitrary vector index.
The size dependency is a consequence of the fact that not all address
spaces are the same.

We allow Word <-> integer conversion through explicit conversion
operators, but the language spec does not allow Word to be intermixed
with other integral types for arithmetic operations.

shap

> What would this be used for? How is it defined? How does
> arithmetic work on it?

> Looking up the intptr type via TargetData is not a significant issue
> for me, but I can see the appeal, and how its absence could constitute
> a significant barrier to generating portable IR (provided, of course,
> a portable language). Regardless, it would allow me to hardcode a good
> deal more codegen if the LLVM IR had an intptr type. The semantics I
> would imagine for an intptr type are:

Querying TargetData only works if you know the size of the pointer. :)

> • Lowered to i32 or i64 for code generation.

Ok

> • Treated as an ordinary integer for all operations except casts.

Ok. What does this mean for add? This basically means that an intptr add cannot have usefully defined semantics. Can you give an example of when it is useful?

> • Can be the operand to ptrtoint, but not the result.
> • Can be the result of inttoptr, but not the operand.

I assume these are backwards. intptr_t is an integer, not a pointer.

> • Can be bitcast to an actual pointer type.

No. int <-> ptr is done with inttoptr and ptrtoint.

> • Whether sext, zext, and trunc are applicable, I could be convinced
> either way. It muddies the semantics of these operations.

Right.

-Chris

> Querying TargetData only works if you know the size of the pointer. :)

Yes. For BitC purposes, querying TargetData would be sufficient as long
as we don't care whether the emitted IR is neutral w.r.t. pointer size.
Given this, I think that introducing an iWord type is not yet
sufficiently well motivated from the BitC perspective.

But it would sure be convenient if we could query TargetData at compile
time to determine the target pointer size. Not essential, by any means,
but it seems unnecessary to encode the knowledge redundantly (in both
the IR layer and the front end).

In the end, the use case that concerns me is things like character
vectors, because of the fact that the index spans depend on the address
space size. I'm not clear whether it is a goal to have an IR that is
capable of being a neutral representation w.r.t. address space size. If
it *is* a goal, then I don't see how to do it without some form of
iIntPtr or iWord type, but I'm still very new to all this.

> • Lowered to i32 or i64 for code generation.

> Ok

> • Treated as an ordinary integer for all operations except casts.

> Ok. What does this mean for add? This basically means that an intptr add
> cannot have usefully defined semantics. Can you give an example of when
> it is useful?

I had the same reaction. If it is lowered, then it should work normally
for add. It is not quite as useless as you suggest, because things of
the form add iIntPtr x iIntPtr -> iIntPtr will still work correctly
after lowering is performed. I also see no reason why casts should be
excluded at the IR level. That seems to me like a front end issue. At
the IR level iIntPtr is just a late-bound integral type like any other.

Perhaps Mike and I are thinking about unrelated things.

> • Can be the operand to ptrtoint, but not the result.
> • Can be the result of inttoptr, but not the operand.

> I assume these are backwards. intptr_t is an integer, not a pointer.

I agree, but Mike was consistent enough here that I wondered if I had
failed to understand what he was after properly.

> • Whether sext, zext, and trunc are applicable, I could be convinced
> either way. It muddies the semantics of these operations.

These seem important in order to allow explicit conversions to the
normal integer types.

shap

> Querying TargetData only works if you know the size of the pointer. :)

> In the end, the use case that concerns me is things like character
> vectors, because of the fact that the index spans depend on the address
> space size. I'm not clear whether it is a goal to have an IR that is
> capable of being a neutral representation w.r.t. address space size. If
> it *is* a goal, then I don't see how to do it without some form of
> iIntPtr or iWord type, but I'm still very new to all this.

i64 should be big enough for this. Just use i64.

> • Treated as an ordinary integer for all operations except casts.

> Ok. What does this mean for add? This basically means that an intptr add cannot have usefully defined semantics. Can you give an example of when it is useful?

> I had the same reaction. If it is lowered, then it should work normally
> for add. It is not quite as useless as you suggest, because things of
> the form add iIntPtr x iIntPtr -> iIntPtr will still work correctly
> after lowering is performed. I also see no reason why casts should be
> excluded at the IR level. That seems to me like a front end issue. At
> the IR level iIntPtr is just a late-bound integral type like any other.

i64 is available now. When you inttoptr an i64 to a pointer on a 32-bit system, it takes the low bits.

-Chris
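The "takes the low bits" behaviour can be modelled in plain C. This is only a sketch of the arithmetic, not of LLVM itself: narrow_to_pointer_width is a hypothetical name, and the pointer width is passed in explicitly rather than taken from the host.

```c
#include <assert.h>
#include <stdint.h>

/* Model of narrowing a 64-bit integer to a pointer of pointer_bits bits:
   only the low pointer_bits bits survive. A width of 64 is the identity. */
uint64_t narrow_to_pointer_width(uint64_t value, unsigned pointer_bits) {
    assert(pointer_bits <= 64);
    if (pointer_bits == 64) return value;
    return value & ((UINT64_C(1) << pointer_bits) - 1);
}
```

With a 32-bit width, the value 0x1122334455667788 narrows to 0x55667788, so a front end that always carries indices as i64 loses nothing as long as the values it produces actually fit in the target's address space.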

On a 32-bit platform, doesn't one want to use i32?

That was the point that I was trying to make -- on a 32-bit platform one
wants to use i32, while on a 64-bit platform one wants to use i64. If
one is trying to generate neutral IR, there really isn't a "right"
choice here.

shap

Why? What is wrong with i64?

-Chris

On its face, the problem is that it doesn't fit in a native register...
or is there something here that I am failing to understand?

The code generator deletes dead upper parts. For example:

int test(long long a, long long b) {
   return a+b; // 64-bit add, taking low 32-bits
}

compiles to a single add, even on x86-32:

_test:
   movl 12(%esp), %eax
   addl 4(%esp), %eax
   ret

-Chris
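The reason the upper half is dead in the example above is that the low 32 bits of a sum depend only on the low 32 bits of the operands. A small sketch in C (the function names are made up for illustration, and unsigned types are used so that wraparound is well defined):

```c
#include <assert.h>
#include <stdint.h>

/* Add in 64 bits, then keep the low 32 bits. */
uint32_t add_wide_then_truncate(uint64_t a, uint64_t b) {
    return (uint32_t)(a + b);
}

/* Keep the low 32 bits of each operand, then add with 32-bit wraparound. */
uint32_t truncate_then_add(uint64_t a, uint64_t b) {
    uint32_t lo_a = (uint32_t)a;
    uint32_t lo_b = (uint32_t)b;
    uint32_t sum = lo_a + lo_b;
    /* Both routes agree for every input, which is what lets a code
       generator drop the upper half of the wide add. */
    assert(sum == (uint32_t)(a + b));
    return sum;
}
```

The same identity holds for subtraction and multiplication, but not for division or right shifts, which is where the later objections in this thread come from.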

I understand the reduction that you are doing. I like it, but this
reduction often is not possible for vector accesses (the case at hand)
when the index argument to the "vector get" primitive is defined as an
i64. Even when the operations can be compiled away, the reduction does
not address the difference in storage overhead and the implications for
structure layout differences when iWord fields appear within structs.

Your reduction will work wonderfully on simple for loops that iterate
over vectors -- even, with care, in doacross situations -- but it won't
help at all when vector indices are stored into variables for later use
elsewhere in the program (that is: the array cursor pattern). In
critical systems codes, this pattern is very common, which is why I'm
concerned about it.

shap
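The storage concern can be seen directly in C. The sketch below uses two hypothetical cursor structs that differ only in the width of the stored index; exact sizes depend on the ABI (8 and 16 bytes are typical on x86-64), so the code only relies on the wider one being larger.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A stored array cursor with a 32-bit index. */
struct cursor32 {
    uint32_t index;
    uint32_t tag;
};

/* The same cursor when the index is always carried as 64 bits. */
struct cursor64 {
    uint64_t index;
    uint32_t tag;
};

/* Extra bytes paid per stored cursor when the index is widened. */
size_t cursor_overhead(void) {
    assert(sizeof(struct cursor64) > sizeof(struct cursor32));
    return sizeof(struct cursor64) - sizeof(struct cursor32);
}
```

Dead-code elimination of upper halves does not help here: once the index is stored in memory, its declared width fixes the struct layout.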

> What would this be used for? How is it defined? How does arithmetic work on it?

> Looking up the intptr type via TargetData is not a significant issue for me, but I can see the appeal, and how its absence could constitute a significant barrier to generating portable IR (provided, of course, a portable language). Regardless, it would allow me to hardcode a good deal more codegen if the LLVM IR had an intptr type. The semantics I would imagine for an intptr type are:

> Querying TargetData only works if you know the size of the pointer. :)

Exactly. :) I'm going to play devil's advocate here for a moment. intptr would tidy up my own output a smidgen, but I do have other target dependencies, so it's of no great concern to me.

But I could see how someone wanting LLVM bitcode to play the role of Java bytecode or MSIL might find it important or even essential. And the question has come up many times.

I can also see how this is entirely useless in C and thus less than interesting. :)

> • Treated as an ordinary integer for all operations except casts.

> Ok. What does this mean for add?

Sure. %x = add intptr %a, %b is semantically identical to:

%tmp1 = bitcast intptr %a to i8*
%tmp2 = getelementptr i8* %tmp1, intptr %b
%x = bitcast i8* %tmp2 to intptr

Or, put another way, it's an i32 add on a 32-bit host and an i64 add on a 64-bit host.

> This basically means that an intptr add cannot have usefully defined semantics.

How do you figure? I consider getelementptr to have usefully defined semantics, even though they are target-dependent. :)

> Can you give an example of when it is useful?

Sure, grep for getIntPtrType.

But seriously, any situation where a front-end language would use size_t, ptrdiff_t, System.IntPtr, a value in a tagged object model, etc… it could use this type instead of conditionally selecting i32 or i64. This is not applicable to Java or C, which either have no such pointer-sized integer type, or have no portable representation. But it would be applicable to many other languages that do.

The advantage provided is improved portability of bitcode and (very slightly) reduced complexity in front-end compilers. I don't consider these overwhelming advantages, given that bitcode is pretty non-portable as-is.

> • Can be the operand to ptrtoint, but not the result.
> • Can be the result of inttoptr, but not the operand.

> I assume these are backwards. intptr_t is an integer, not a pointer.

They are not.

> • Can be bitcast to an actual pointer type.

> No. int <-> ptr is done with inttoptr and ptrtoint.

No. These cast behaviors are unique semantics.

Let me be more explicit. To be useful, an intptr type would need conversions to and from both fixed-width integer types and pointers. It's not necessary to overload existing casts. If we chose to, the casts applicable to pointers are closer matches than the casts applicable to integers, semantically. This is because they correctly reflect the potential data loss between the fixed-width integer type and the target-dependent type.

== Pointer conversions ==
For pointer conversions, bitcast has the correct semantics.

     void *p;
     (void *) (ptrdiff_t) p; // This is a no-op on every platform.
     (void *) (int32_t) p;   // This is target-dependent and could truncate.

Pointer-to-intptr-to-pointer conversions can be condensed or eliminated in the same way that bitcasts between pointer types can. By contrast, inttoptr(ptrtoint) cannot be converted to a bitcast or noop because if the integer type is smaller than the pointer type, the conversion is lossy.
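The lossless versus lossy distinction can be checked in C. This sketch swaps in uintptr_t for the ptrdiff_t used above, because uintptr_t is the type for which the C standard guarantees the pointer round trip; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Pointer -> pointer-sized integer -> pointer. For a valid void pointer
   the result compares equal to the original. */
void *round_trip_uintptr(void *p) {
    void *back = (void *)(uintptr_t)p;
    assert(back == p);
    return back;
}

/* Returns 1 if an address value would survive a detour through a 16-bit
   integer, 0 if the narrowing would discard set bits. */
int fits_in_u16(uintptr_t address) {
    return (uintptr_t)(uint16_t)address == address;
}
```

The first conversion can always be folded away; the second can only be folded once the target's pointer width is known, which is the asymmetry described above.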

== Conversions to fixed-width integer types ==

     size_t ip;
     (uint16_t) ip;
     (uint32_t) ip;

This has the same semantics as ptrtoint: Depending on the target, it could be an extend or a truncate or a noop.

     size_t ip;
     ssize_t sip;
     (uint64_t) ip;
     (int64_t) sip;

However, signed intptr types do exist, so it's quite arguable that sign extension behavior should not be fixed as it is in ptrtoint and gep sign extension.
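To show why the choice of extension matters, the sketch below models a 32-bit target's pointer-sized integer with uint32_t and int32_t (an assumption made so the example behaves the same on any host) and widens the same bit pattern both ways.

```c
#include <assert.h>
#include <stdint.h>

/* Widen an unsigned 32-bit "size_t": the upper bits are filled with zeros. */
uint64_t widen_unsigned32(uint32_t value) {
    return (uint64_t)value;
}

/* Widen a signed 32-bit "ssize_t": the upper bits copy the sign bit. */
int64_t widen_signed32(int32_t value) {
    int64_t wide = (int64_t)value;
    assert((wide < 0) == (value < 0)); /* sign is preserved */
    return wide;
}
```

The all-ones 32-bit pattern becomes 0x00000000FFFFFFFF through the unsigned route and -1 (all ones in 64 bits) through the signed route, so a cast with a single fixed extension rule cannot serve both types.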

For Ocaml, this might be beneficial, actually; a great many ptrtoint and inttoptr operations occur due to the tagged object model. Since these are lossy casts, it might be beneficial if they could be recognized target-independently as no-ops.

== Conversions from fixed-width integer types ==

     int16_t s; int32_t i; int64_t l;
     (size_t) s;
     (size_t) i;
     (size_t) l;

Same issues as with conversions to fixed-width integer types:
• inttoptr is a better match for the semantics.
• But sign extension behavior should be controllable.

> On a 32-bit platform, doesn't one want to use i32?

> Why? What is wrong with i64?

Lots of things, actually.

It doesn't have the proper semantics for arithmetic. As a concrete example, System.IntPtr.operator/ in .NET is quite distinct from either System.Int32.operator/ or System.Int64.operator/.

Nor does it have the correct size in memory or as an argument, although converting to a pointer is a usable workaround in both cases. Likewise, alignment.

Finally, computing 64-bit intermediate results on 32-bit platforms in order to preserve unwanted i64 semantics is quite undesirable. Consider this, a reasonable sort of thing to compute with an intptr:

     int f(void *p, void *q, int i, int j) {
       size_t ip = (size_t) p, iq = (size_t) q;
       return (iq - (ip + i)) / j;
     }

If size_t is defined as int64_t and sizeof(void*) = 4, the divide must be computed in 64 bits (even though the high portion will be discarded) in order to preserve semantics in the uninteresting case that iq - (ip + i) > 0xFFFFFFFFU. Now imagine a target without a 64-bit divider. :) I guess each intermediate result could be cast back to a pointer and then back to an integer, but that seems unlovely.
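A sketch of the case where the two widths disagree, written with unsigned types so that wraparound is well defined. The function names are illustrative, the 32-bit operands stand in for pointer values on a 32-bit target, and j must be nonzero.

```c
#include <assert.h>
#include <stdint.h>

/* Compute (iq - (ip + i)) / j in 64 bits, then keep the low 32 bits. */
uint32_t divide_in_64(uint32_t iq, uint32_t ip, uint32_t i, uint32_t j) {
    uint64_t diff;
    assert(j != 0);
    diff = (uint64_t)iq - ((uint64_t)ip + i); /* wraps modulo 2^64 */
    return (uint32_t)(diff / j);
}

/* Compute the same expression entirely in 32 bits. */
uint32_t divide_in_32(uint32_t iq, uint32_t ip, uint32_t i, uint32_t j) {
    uint32_t sum, diff;
    assert(j != 0);
    sum = ip + i;    /* wraps modulo 2^32 */
    diff = iq - sum; /* wraps modulo 2^32 */
    return diff / j;
}
```

For iq = 0, ip = 1, i = 1, j = 2 the 64-bit route yields 0xFFFFFFFF while the 32-bit route yields 0x7FFFFFFF. Unlike addition, division lets the high bits influence the low bits of the result, so the upper half of the computation is not dead.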

It would also make the human-readable format more readable by actual humans, which might be the best reason of all for adding some kind of pointer-integer type. It could make the work of maintaining and improving language front ends a little easier. Every little bit there helps.

On a related topic: The source-level debugging descriptors require you
to know up front what the sizeof pointer types are. Is there any hope of
the frontend remaining blissfully unaware of platform details?

I really don't know how to do this. The current debug info stuff depends on emitting size info into the IR. At this point, I don't think there is a good way around this. Improvements to the design are welcome of course.

-Chris

_______________________________________________
LLVM Developers mailing list
LLVMdev@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
