inline asm semantics: output constraint width smaller than input

The below patch is to build the kernel for x86_64, with the attached
.config,
using llvm-gcc (trunk, with patch from
2989 – weak attribute improper handled when return type is pointer type).

The .config has KVM turned off, because I didn't know how to change
x86_emulate.c so that LLVM builds it
(3373 – [dragonegg] cannot tie inline asm operands of different widths)
For 32-bit some more changes are required.

The resulting kernel image are of the same size
$ ls -l vmlinux.patched
-rwxr-xr-x 1 edwin edwin 11277206 2009-01-24 17:58 vmlinux.patched
$ ls -l vmlinux
-rwxr-xr-x 1 edwin edwin 11277206 2009-01-24 18:01 vmlinux

They aren't identical though, a disassembly shows that the address of
most of functions changed,
also some register assignments changed (r14 instead of r15, and so on).

Are these changes correct, and are they acceptable?

Best regards,
--Edwin

.config-llvm-64 (52.5 KB)

>>>
>>>
>> i'd not mind it at all if the kernel could be built with other open-source
>> compilers too.
>>
>> Now in this case the patch you suggest might end up hurting the end result
>> so it's not an unconditional 'yes'. But ... how much it actually matters
>> depends on the circumstances.
>>
>> So could you please send a sample patch for some of most common inline
>> assembly statements that are affected by this, so that we can see:
>>
>> 1) how ugly the LLVM workarounds are
>>
>>
>
> Ok, I will prepare a patch for both cases.
>
>
>> 2) how they affect the generated kernel image in practice
>>
>> My gut feeling is that it's going to be acceptable with a bit of thinking
>> (we might even do some wrappers to do this cleanly) - but i'd really like
>> to see it before giving you that judgement.
>>

The below patch is to build the kernel for x86_64, with the attached
.config,
using llvm-gcc (trunk, with patch from
2989 – weak attribute improper handled when return type is pointer type).

The .config has KVM turned off, because I didn't know how to change
x86_emulate.c so that LLVM builds it
(3373 – [dragonegg] cannot tie inline asm operands of different widths)
For 32-bit some more changes are required.

The resulting kernel image are of the same size
$ ls -l vmlinux.patched
-rwxr-xr-x 1 edwin edwin 11277206 2009-01-24 17:58 vmlinux.patched
$ ls -l vmlinux
-rwxr-xr-x 1 edwin edwin 11277206 2009-01-24 18:01 vmlinux

They aren't identical though, a disassembly shows that the address of
most of functions changed,
also some register assignments changed (r14 instead of r15, and so on).

Are these changes correct, and are they acceptable?

Best regards,
--Edwin

---
arch/x86/include/asm/uaccess.h | 10 ++++++----
arch/x86/lib/delay.c | 2 +-
2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 69d2757..28280de 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -154,7 +154,7 @@ extern int __get_user_bad(void);

#define get_user(x, ptr) \
({ \
- int __ret_gu; \
+ unsigned long __ret_gu; \
     unsigned long __val_gu; \
     __chk_user_ptr(ptr); \
     might_fault(); \
@@ -176,7 +176,7 @@ extern int __get_user_bad(void);
         break; \
     } \
     (x) = (__typeof__(*(ptr)))__val_gu; \
- __ret_gu; \
+ (int)__ret_gu; \
})

#define __put_user_x(size, x, ptr, __ret_pu) \
@@ -239,11 +239,13 @@ extern void __put_user_8(void);
  */
#define put_user(x, ptr) \
({ \
- int __ret_pu; \
+ __typeof__(*(ptr)) __ret_pu; \

This does not look right. We can sometimes have put_user() of non-integer
types (say structures). How does the (int)__ret_pu cast work in that case?
We'll fall into this branch in that case:

        default: \
                __put_user_x(X, __pu_val, ptr, __ret_pu); \
                break; \

and __ret_pu has a nonsensical type in that case.

     __typeof__(*(ptr)) __pu_val; \
     __chk_user_ptr(ptr); \
     might_fault(); \
     __pu_val = x; \
+ /* return value is 0 or -EFAULT, both fit in 1 byte, and \
+ * are sign-extendable to int */ \
     switch (sizeof(*(ptr))) { \
     case 1: \
         __put_user_x(1, __pu_val, ptr, __ret_pu); \
@@ -261,7 +263,7 @@ extern void __put_user_8(void);
         __put_user_x(X, __pu_val, ptr, __ret_pu); \
         break; \
     } \
- __ret_pu; \
+ (int)__ret_pu; \
})

#define __put_user_size(x, ptr, size, retval, errret) \
diff --git a/arch/x86/lib/delay.c b/arch/x86/lib/delay.c
index f456860..12d27f8 100644
--- a/arch/x86/lib/delay.c
+++ b/arch/x86/lib/delay.c
@@ -112,7 +112,7 @@ EXPORT_SYMBOL(__delay);

inline void __const_udelay(unsigned long xloops)
{
- int d0;
+ unsigned long d0;

     xloops *= 4;
     asm("mull %%edx"

Is this all that you need (plus the 16-bit setup code tweaks) to get LLVM
to successfully build a 64-bit kernel image?

If yes then this doesnt look all that bad or invasive at first sight (if
the put_user() workaround can be expressed in a cleaner way), but in any
case it would be nice to hear an LLVM person's opinion about roughly when
this is going to be solved in LLVM itself.

  Ingo

  

#define put_user(x, ptr) \
({ \
- int __ret_pu; \
+ __typeof__(*(ptr)) __ret_pu; \
    
This does not look right. We can sometimes have put_user() of non-integer
types (say structures).

I didn't encounter it with my .config, but it is certainly possible.
I think using __builtin_choose_expr would be better than the switch

See a new patch at the end of this mail, using __builtin_choose_expr.
[vmlinux size still same]
It also includes some 32-bit stuff, but that is not complete yet.

How does the (int)__ret_pu cast work in that case?
  
It would fail at compile time, with an error message that you can't cast
aggregates to ints, so my patch is not good.

We'll fall into this branch in that case:

        default: \
                __put_user_x(X, __pu_val, ptr, __ret_pu); \
                break; \

and __ret_pu has a nonsensical type in that case.
  
That branch is a call to a non-existent function __put_user_X, and
should give error at link time, right?
In the new patch below I used (void)-EFAULT so that you would get an
error at compile time (as suggested
in __builtin_choose_expr in gcc's manual), if that branch would ever get
expanded. Does that sound right?

     __typeof__(*(ptr)) __pu_val; \
     __chk_user_ptr(ptr); \
     might_fault(); \
     __pu_val = x; \
+ /* return value is 0 or -EFAULT, both fit in 1 byte, and \
+ * are sign-extendable to int */ \
     switch (sizeof(*(ptr))) { \
     case 1: \
         __put_user_x(1, __pu_val, ptr, __ret_pu); \
@@ -261,7 +263,7 @@ extern void __put_user_8(void);
         __put_user_x(X, __pu_val, ptr, __ret_pu); \
         break; \
     } \
- __ret_pu; \
+ (int)__ret_pu; \
})

#define __put_user_size(x, ptr, size, retval, errret) \
diff --git a/arch/x86/lib/delay.c b/arch/x86/lib/delay.c
index f456860..12d27f8 100644
--- a/arch/x86/lib/delay.c
+++ b/arch/x86/lib/delay.c
@@ -112,7 +112,7 @@ EXPORT_SYMBOL(__delay);

inline void __const_udelay(unsigned long xloops)
{
- int d0;
+ unsigned long d0;

     xloops *= 4;
     asm("mull %%edx"
    
Is this all that you need (plus the 16-bit setup code tweaks)

The 16-bit setup code is compiled, but obviously doesn't work.
I think the best approach would be for LLVM to give a warning/error, when
-fno-unit-at-a-time is used, since it doesn't support that.

to get LLVM
to successfully build a 64-bit kernel image?
  
With the .config I sent previously, yes. With some other .config most
likely more changes are needed,
for example the SMP code, KVM code, but as I said in my previous email I
don't know how to "fix" the inline asm in that case.

There's something wrong with building some modules also, I keep getting
an "idr_init" undefined error, but the symbol is present in my vmlinux.
If I turn make those modules built into the kernel it works then (sctp,
w1, thermalsysfs).

It looks like I'll also have to submit some patches for ARCH=um, because
I get undefined references to __bad_size, __guard, and

__stack_smash_handler. __bad_size is probably because LLVM didn't inline expand + DCE something GCC did.

With an unpatched LLVM I would also need weak attributes to be on the
function type, instead of the return type,
but I think thats an LLVM bug, and a one-line patch corrects it.

If yes then this doesnt look all that bad or invasive at first sight (if
the put_user() workaround can be expressed in a cleaner way), but in any
case it would be nice to hear an LLVM person's opinion about roughly when
this is going to be solved in LLVM itself.
  
Yes, that is why LLVMDev is on Cc:, somebody will eventually reply :wink:

Not-Signed-off-by: Török Edwin <edwintorok@gmail.com>

Hi Ingo,

We would like to support some more specific cases (e.g. when tying a pointer/int to a different size pointer/int) but we currently don't intend to support all cases (e.g. tying a FP value to int). We are in this position because the semantics are very vague and hard to reason about (and change based on target endianness) and we had many subtle bugs in the corner cases.

Instead of having silent miscompiles, we decided to just reject all the "hard" cases and add them back one by one as there is demand. That way users could choose to modify their asms instead of having them be potentially silently miscompiled.

LLVM 2.5 is in its release process right now, so it will not have improvements in this area, but LLVM 2.6 certainly could. If there is interest in building the kernel with 2.5, I think taking the patches would be worthwhile. If that is hopeless anyway, waiting for the LLVM-side fixes should be fine.

-Chris

Török Edwin <edwintorok@gmail.com> writes:

@@ -239,11 +239,13 @@ extern void __put_user_8(void);
  */
#define put_user(x, ptr) \
({ \
- int __ret_pu; \
+ __typeof__(*(ptr)) __ret_pu; \
     __typeof__(*(ptr)) __pu_val; \
     __chk_user_ptr(ptr); \
     might_fault(); \
     __pu_val = x; \
+ /* return value is 0 or -EFAULT, both fit in 1 byte, and \
+ * are sign-extendable to int */ \

That does not work when *ptr is unsigned (char or short).

Andreas.

Chris Lattner wrote:

We would like to support some more specific cases (e.g. when tying a
pointer/int to a different size pointer/int) but we currently don't
intend to support all cases (e.g. tying a FP value to int). We are in
this position because the semantics are very vague and hard to reason
about (and change based on target endianness) and we had many subtle
bugs in the corner cases.

Instead of having silent miscompiles, we decided to just reject all the
"hard" cases and add them back one by one as there is demand. That way
users could choose to modify their asms instead of having them be
potentially silently miscompiled.

The case that matters for the kernel is integer to integer, when a
register is re-used from input to output.

LLVM 2.5 is in its release process right now, so it will not have
improvements in this area, but LLVM 2.6 certainly could. If there is
interest in building the kernel with 2.5, I think taking the patches
would be worthwhile. If that is hopeless anyway, waiting for the
LLVM-side fixes should be fine.

The patches don't look all that bad to me, but I really want to make
sure we don't keep littering the kernel with workarounds for N different
compilers without getting a track to have them cleaned up.

  -hpa

This is indeed a bug in llvm. We'll find a way to get it fixed.

Hi,

If yes then this doesnt look all that bad or invasive at first sight (if
the put_user() workaround can be expressed in a cleaner way), but in any
case it would be nice to hear an LLVM person's opinion about roughly when
this is going to be solved in LLVM itself.

one thing that seems to be clear to everyone except me is... what are the
semantics supposed to be? [My understanding is that what is being discussed
is when you have an asm with a register as input and output, but with integer
types of different width for the input and output, but I saw some mention of
struct types in this thread...]. Presumably this is something obvious, but
it would be good to have someone spell it out in small words that even someone
like me can understand :slight_smile:

Thanks,

Duncan.

I don't know of any other semantic other than, if they are supposed to be in the same register, then they have to be in the same register.

Sounds logical! But what is the discussion about then?

Ciao,

Duncan.

LLVM's Codegen is rejecting inline asm when input/output constraints
have different bitwidths.

For example in the Linux kernel calls to the various __put_user_
functions take %al, %ax, %eax, %rax/ (%eax:%edx) as input parameter,
and the output parameter is always an int (%eax). (hope I explained this
right)

But if I got it right, the input can also be a struct (who's size is 1,
2, 4, 8-byte) that fits in a register.
Not sure if this ever occurs in practice.

#define __put_user_x(size, x, ptr, __ret_pu) \
    asm volatile("call __put_user_" #size : "=a" (__ret_pu) \
             :"0" ((typeof(*(ptr)))(x)), "c" (ptr) : "ebx")

#define put_user(x, ptr) \
({ \
    int __ret_pu; \
    __typeof__(*(ptr)) __pu_val; \
    __chk_user_ptr(ptr); \
    __pu_val = x; \
    switch (sizeof(*(ptr))) { \
    case 1: \
        __put_user_x(1, __pu_val, ptr, __ret_pu); \
        break; \
    case 2: \
        __put_user_x(2, __pu_val, ptr, __ret_pu); \
        break; \
    case 4: \
        __put_user_x(4, __pu_val, ptr, __ret_pu); \
        break; \
    case 8: \
        __put_user_x8(__pu_val, ptr, __ret_pu); \
        break; \
    default: \
        __put_user_x(X, __pu_val, ptr, __ret_pu); \
        break; \
    } \
    __ret_pu; \
})

Best regards,
--Edwin

As near as I can tell, it is about trying to stir up enough interest in the issues involved, so that we can get a clean compile of the linux kernel. This just requires bug fixing until it works. Some of the people capable of bug fixing, would like simple, easy testcases to facilitate such bug fixing. And the linux people would like to hear from us when we expect to have it fixed. That's how I see it.

Right, the interesting wrinkle is that I think llvm's codegen should only see matching constraints of the same type. If the front-end sees an i8 tied to an i32, it should extend the i8 or do whatever else it needs to do.

-Chris

My only concern is whether codegen will be able to optimize away extra
extensions/truncations or not?
Consider an i8 input, and i32 output. If the frontend extends the i8 so
that types match, the codegen should optimize away the extend eventually.
Otherwise we end up with useless extends in the generated assembly.

Best regards,
--Edwin

Duncan Sands wrote:

Hi,

If yes then this doesnt look all that bad or invasive at first sight (if the put_user() workaround can be expressed in a cleaner way), but in any case it would be nice to hear an LLVM person's opinion about roughly when this is going to be solved in LLVM itself.

one thing that seems to be clear to everyone except me is... what are the
semantics supposed to be? [My understanding is that what is being discussed
is when you have an asm with a register as input and output, but with integer
types of different width for the input and output, but I saw some mention of
struct types in this thread...]. Presumably this is something obvious, but
it would be good to have someone spell it out in small words that even someone
like me can understand :slight_smile:

I don't know about struct types, but the situation I'm talking about is assembly statements of the form:

asm("foo" : "=r" (bar) : "0" (baz));

Here, "bar" and "baz" are constrained to be in the same hardware register (from the "0" constraint in "baz"). The types of "bar" and "baz" are otherwise unrelated.

I assume the difficulty here comes from how this needs to be handled from the point of view of the register allocator. If both types fit inside a single allocatable hardware register, the issue is trivial; "bar" and "baz" form a single logical register for the purpose of register allocation.

However, things get a bit ugly in the case of different widths that affect individually scheduled registers, like 32- and 64-bit types on a 32-bit machine. Consider the case above where "bar" is a 64-bit type and "baz" is a 32-bit type, then you functionally have, at least on x86:

  uint64_t tmp = bar;
  asm("foo" : "+r" (tmp));
  baz = (uint32_t)tmp;

One could possibly argue that the latter case should be
"baz = (uint32_t)(tmp >> 32);" on a bigendian machine... since this is a gcc syntax it probably should be "whatever gcc does" in that case, as opposed to what might make sense.

(I'm afraid I don't have a bigendian box readily available at the moment, so I can't test it out to see what gcc does. I have a powerpc machine, but it's at home and turned off.)

  -hpa

Right, the interesting wrinkle is that I think llvm's codegen should
only see matching constraints of the same type. If the front-end sees
an i8 tied to an i32, it should extend the i8 or do whatever else it
needs to do.

My only concern is whether codegen will be able to optimize away extra
extensions/truncations or not?

For truncates, yes, extensions no.

Consider an i8 input, and i32 output. If the frontend extends the i8 so
that types match, the codegen should optimize away the extend eventually.
Otherwise we end up with useless extends in the generated assembly.

The extend could be easily optimized away if llvm ir had a concept of "any extend" like codegen does.

-Chris

Actually, PPC64 boxes basically don't care... the usable GPRs are all
either 32-bit (for PPC32) or 64-bit (for PPC64), the <=32-bit
instructions are identical across both, they just
truncate/sign-extend/etc based on the lower 32-bits of the register.
Also, you would only do a right-shift if you were going all the way
out to memory as 64-bit and all the way back in as 32-bit... within a
single register it's kept coherent.

Structs are basically irrelevant for inline ASM as you can't pass a
struct to one... you can only pass the *address* of a struct, which is
always pointer-sized.

I think that really the only sane solution (which is hopefully what
GCC does) for integer types is to use a register the same size as the
larger of the two integers. Then you copy the value to/from the
smaller register (or just mask it on PPC64-alike architectures) before
or after the inline ASM.

Cheers,
Kyle Moffett

Kyle Moffett wrote:

Actually, PPC64 boxes basically don't care... the usable GPRs are all
either 32-bit (for PPC32) or 64-bit (for PPC64), the <=32-bit
instructions are identical across both, they just
truncate/sign-extend/etc based on the lower 32-bits of the register.
Also, you would only do a right-shift if you were going all the way
out to memory as 64-bit and all the way back in as 32-bit... within a
single register it's kept coherent.

Think about a 64-bit integer on ppc32. It will by necessity kept in two registers. On gcc I believe it will always be a consecutive pair of registers (AFAIK that's a hard-coded assumption in gcc, with the result that gcc has a nonstandard internal register numbering for x86 since the commonly used dx:ax pair is actually registers 2:0 in the hardware numbering.)

Structs are basically irrelevant for inline ASM as you can't pass a
struct to one... you can only pass the *address* of a struct, which is
always pointer-sized.

Right, of course.

I think that really the only sane solution (which is hopefully what
GCC does) for integer types is to use a register the same size as the
larger of the two integers. Then you copy the value to/from the
smaller register (or just mask it on PPC64-alike architectures) before
or after the inline ASM.

Pretty much. Then you can do conventional copy propagation and elimination after expanding subregisters to get rid of the extra ops in the common case.

  -hpa

Hi Chris,

> LLVM's Codegen is rejecting inline asm when input/output constraints
> have different bitwidths.
>
> For example in the Linux kernel calls to the various __put_user_
> functions take %al, %ax, %eax, %rax/ (%eax:%edx) as input parameter,
> and the output parameter is always an int (%eax). (hope I explained
> this
> right)

Right, the interesting wrinkle is that I think llvm's codegen should
only see matching constraints of the same type. If the front-end sees
an i8 tied to an i32, it should extend the i8 or do whatever else it
needs to do.

if I understand right the two cases are (for example):

(1): input (I) is i32, output (O) is i8. In this case the asm output
type should be changed to i32 (O'), and O is obtained by truncating
O' to i8.

(2): input (I) is i8, output (O) is i32. In this case the asm input
type should be changed to i32 (I'), and I' is obtained from I by
extending I to i32.

Is that it?

Thanks,

Duncan.

Yes exactly. On big endian targets a shift is also needed.

-Chris