clang trunk: extern "C"/static problem

Hi,

it seems there is a bug in current trunk with regard to the symbols that are generated. clang 3.1, gcc 4.3.4 and gcc 4.8 all behave differently from trunk here.

test.cpp:

extern "C"
{

static void __attribute__((__used__)) func(char *a, char b)
{
}

}

clang++ test.cpp

nm -C test*.o shows these generated symbols for the different compilers:

test_clang31.o:
0000000000000000 t func

test_clang33.o:
0000000000000000 t func(char*, char)

test_gcc43.o:
                 U __gxx_personality_v0
0000000000000000 t func

test_gcc48.o:
0000000000000000 t func

It looks like the static in conjunction with the extern "C" is causing a problem here. Should I file a bug?

Best regards,
Martin

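The C++ standard
(http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf)
says

All function types, function names with external linkage, and variable
names with external linkage have a language linkage.

Since static functions have internal linkage, they don't have a
language linkage. That is, the extern "C" doesn't apply.

This was recently implemented in clang, and there was some discussion
about maybe trying to change the standard instead. What problems is
this causing?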

Hi Rafael,

in this case the affected function is called from another function which is implemented in inline assembler, within the extern "C" scope. These symbols are emitted:

                 U func
0000000000000110 t func(long*, long)

Afterwards the link will fail because "func" is missing. The exact error is:

"relocation R_X86_64_PC32 against undefined symbol `func' can not be used when making a shared object; recompile with -fPIC"

(I compiled with -fPIC)

Best regards,
Martin

Ah, you know, I can't believe we didn't think about inline assembly when we were enumerating potential places where this would bite us. Too fixated on the JIT case, I suppose.

Personally, I think this basically kills the idea of treating these as overloadable, although I suppose you could try to hook it in to __attribute__((used)) if you *really* want to maintain that.

John.

From: Rafael Espíndola [mailto:rafael.espindola@gmail.com]
Sent: Donnerstag, 14. März 2013 15:42
To: Richtarsky, Martin
Cc: cfe-dev@cs.uiuc.edu Developers; John McCall; Richard Smith
Subject: Re: [cfe-dev] clang trunk: extern “C”/static problem

The C++ standard
(http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3376.pdf)
says

All function types, function names with external linkage, and variable
names with external linkage have a language linkage.

Since static functions have internal linkage, they don’t have a
language linkage. That is, the extern “C” doesn’t apply.

This was recently implemented in clang, and there was some discussion
about maybe trying to change the standard instead. What problems is
this causing?

Hi Rafael,

in this case the affected function is called from another function which is implemented in inline assembler, within the extern “C” scope.

Ah, you know, I can’t believe we didn’t think about inline assembly when we were enumerating potential places where this would bite us. Too fixated on the JIT case, I suppose.

We did think about it (although I’m not sure whether that made it to the list); in fact, this happened in some cases inside LLVM. Our conclusion was: if you want to use a function from inline asm, you should use an asm label on that function, otherwise it might get mangled unexpectedly. This is true independent of the static/extern “C” issue, due to some platforms prepending an underscore to symbol names, etc.
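The asm-label advice above can be sketched roughly as follows. This is only an illustration that assumes the GCC/Clang `__asm__` label extension; the names `pinned_func`, `thread_example_func`, and `call_pinned` are hypothetical and do not come from the code under discussion.

```cpp
// Sketch only: relies on the GCC/Clang asm-label extension.
// All names here (pinned_func, thread_example_func, call_pinned) are hypothetical.

extern "C" {

// The asm label goes on a declaration that precedes the definition.
// It fixes the assembler-level name of the function, so inline assembly
// can refer to "thread_example_func" no matter how the compiler would
// otherwise mangle or prefix the name.
static int pinned_func(int a, int b) __asm__("thread_example_func");

// 'used' keeps the function in the object file even if no C++ code calls it.
static int __attribute__((__used__)) pinned_func(int a, int b)
{
    return a + b;
}

} // extern "C"

// Ordinary C++ callers are unaffected by the label.
int call_pinned(int a, int b)
{
    return pinned_func(a, b);
}
```

With a label like this in place, inline assembly in the same translation unit would reference `thread_example_func` literally rather than guessing at the emitted name.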

I do not remember it coming up about inline assembly. I remember it coming up about dynamic symbol lookups, and we were willing to wave our hands about those because there is very little code in the world that does this sort of symbol lookup into the enclosing process. I am not willing to hand-wave away inline assembly.

Our conclusion was: if you want to use a function from inline asm, you should use an asm label on that function, otherwise it might get mangled unexpectedly. This is true independent of the static/extern “C” issue, due to some platforms prepending an underscore to symbol names, etc.

Conveniently enough, inline assembly usually can’t be shared between platforms that do Pascal mangling and those that don’t, because significantly different platforms usually have significantly different compilers and with significantly different inline assembly syntax. What you’re doing is making it more awkward to port inline assembly between compilers on the same platform by introducing a totally spurious hurdle, based on a line from the standard that’s inconsistent with an overwhelmingly dominant existing practice.

John.

Considering dynamic linking seems a bit strange; we were only ever talking about internal linkage functions. Here’s where I mentioned inline asm, and asm labels, previously:

http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20130218/074660.html

Our conclusion was: if you want to use a function from inline asm, you should use an asm label on that function, otherwise it might get mangled unexpectedly. This is true independent of the static/extern “C” issue, due to some platforms prepending an underscore to symbol names, etc.

Conveniently enough, inline assembly usually can’t be shared between platforms that do Pascal mangling and those that don’t, because significantly different platforms usually have significantly different compilers and with significantly different inline assembly syntax.

Inconveniently, we have to cope with exactly that when we call into the problematic symbols in llvm/lib/Target/X86/X86JITInfo.cpp (this is what the message I quoted above was referring to). Search for calls to LLVMX86CompilationCallback2, and note the ASMPREFIX hack.
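For readers unfamiliar with this kind of prefix workaround, here is a minimal sketch of the general idea. It is not the actual code from X86JITInfo.cpp; the `EXAMPLE_*` macros and `asm_name_of_func` are hypothetical names, and the empty-prefix fallback for compilers that lack `__USER_LABEL_PREFIX__` is an assumption.

```cpp
#include <string>

// Sketch only: not the actual LLVM ASMPREFIX code.
// __USER_LABEL_PREFIX__ is predefined by GCC and Clang and expands to the
// prefix the platform prepends to C symbol names (empty on typical ELF
// targets, an underscore on Mach-O).
#if defined(__USER_LABEL_PREFIX__)
#define EXAMPLE_STRINGIFY2(x) #x
#define EXAMPLE_STRINGIFY(x) EXAMPLE_STRINGIFY2(x)
#define EXAMPLE_ASM_PREFIX EXAMPLE_STRINGIFY(__USER_LABEL_PREFIX__)
#else
#define EXAMPLE_ASM_PREFIX ""  // assumption: no prefix on other compilers
#endif

// Builds the assembler-level spelling of a C symbol as a string literal
// by concatenating the platform prefix with the stringified name.
#define EXAMPLE_ASM_NAME(name) EXAMPLE_ASM_PREFIX #name

// Returns the spelling that inline assembly would need for a C symbol "func".
std::string asm_name_of_func()
{
    return EXAMPLE_ASM_NAME(func);
}
```

Because the result is a string literal, it could be pasted into an inline assembly template next to other literals. Note that this only handles the platform prefix; it does nothing about C++ name mangling, which is the separate issue debated in this thread.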

What you’re doing is making it more awkward to port inline assembly between compilers on the same platform by introducing a totally spurious hurdle, based on a line from the standard that’s inconsistent with an overwhelmingly dominant existing practice.

I don’t agree that g++ counts as “overwhelmingly dominant existing practice”, especially given that EDG does not follow g++ here in its g++-compatible mode. This would not be the first g++ bug which people have come to rely on, which we could support at the expense of being subtly non-conforming, but choose not to. Plus, there is a simple, trivial fix to the offending code which allows it to be accepted by us, g++, and EDG.

I think the right question is, is this a battle worth fighting? Is one inconvenienced user enough that we should give up any hope of ever conforming in this area?

Personally, I think this basically kills the idea of treating these as overloadable, although I suppose you could try to hook it in to __attribute__((used)) if you *really* want to maintain that.

I would prefer not to. I think we should decide whether or not to give
language linkage to internal functions and variables and let the rest
follow. The one hack that I think could be used as a compromise is
special casing the mangling of non overloaded static functions if
someone can suggest a convenient way of implementing it.

If we do go the gcc way and give these functions and variables
language linkage, can someone volunteer to push this into the
standard?

Thanks,
Rafael

Considering dynamic linking seems a bit strange; we were only ever talking about internal linkage functions.

Hmm; I was remembering that there was a platform where dlsym searched internal-linkage functions based on call site, but I think I’m just flat-out wrong.

Our conclusion was: if you want to use a function from inline asm, you should use an asm label on that function, otherwise it might get mangled unexpectedly. This is true independent of the static/extern “C” issue, due to some platforms prepending an underscore to symbol names, etc.

Conveniently enough, inline assembly usually can’t be shared between platforms that do Pascal mangling and those that don’t, because significantly different platforms usually have significantly different compilers and with significantly different inline assembly syntax.

Inconveniently, we have to cope with exactly that when we call into the problematic symbols in llvm/lib/Target/X86/X86JITInfo.cpp (this is what the message I quoted above was referring to). Search for calls to LLVMX86CompilationCallback2, and note the ASMPREFIX hack.

Okay, so projects that care about platform portability already have workarounds for this, but:

  • many projects don’t currently care about platform portability and
  • most projects that do will already have perfectly functioning workarounds using __USER_LABEL_PREFIX__ because
  • projects have to use __USER_LABEL_PREFIX__ anyway if they want to use library functions.

So basically, all such projects should have yet another (subtle) obstacle to porting to clang just so that we can conform to the standard in the same way as AFAICT precisely one other compiler which happens to be famously pedantic. All of this is in an area where there’s already massive and permanent non-conformance, because the standard’s interest in allowing overloaded static extern “C” functions is to make it easier to define functions whose function type has C language linkage.

What you’re doing is making it more awkward to port inline assembly between compilers on the same platform by introducing a totally spurious hurdle, based on a line from the standard that’s inconsistent with an overwhelmingly dominant existing practice.

I don’t agree that g++ counts as “overwhelmingly dominant existing practice”,

g++ and MSVC and probably every other compiler in the world except apparently EDG. That would be EDG, the compiler whose authors took it as a point of pride to spend several programmer-years implementing template export.

The C++ committee is very bad about standardizing existing practice instead of just making things up to serve their own fantasies. It’s a serious problem. It doesn’t mean we should ignore everything they say, but yes, I think it means we have to actually consider things before just implementing them as specified. The fact that we usually do is pretty well evidenced by the number of DRs we’ve authored.

especially given that EDG does not follow g++ here in its g++-compatible mode.

Yeah, I somehow suspect EDG not being fully compatible with g++ in its g++-compatible mode is considered a bug over in New Jersey, not some deeply principled stand.

Can you find any compilers besides EDG that mangle static symbols within extern “C”?

This would not be the first g++ bug which people have come to rely on, which we could support at the expense of being subtly non-conforming, but choose not to. Plus, there is a simple, trivial fix to the offending code which allows it to be accepted by us, g++, and EDG.

It would also not be the first GCC quirk that we emulated because we saw no real value in deviating. We didn’t have to implement GCC’s weird bitfield promotion rules. We didn’t have to copy that bizarre thing where GCC passes part of a union in the upper half of an SSE register. We could have implemented the actual C++ rules about function types with language linkage.

We chose to fight users about the template model because we saw value in implementing the model as specified and because GCC’s model was badly enough broken that it would have seriously interfered with reasonable extension (and they have since gotten much better). We implemented C++11 as specified because it was new and therefore a chance to start afresh. I do not think these things mean that we’re ready to go to the barricades over every last piece of minutia.

I think the right question is, is this a battle worth fighting? Is one inconvenienced user enough that we should give up any hope of ever conforming in this area?

Given that most people are not, in fact, randomly recompiling their code with a top-of-tree clang several months before the next open-source release, two known inconvenienced users in two weeks seems pretty significant to me.

John.

Inconveniently, we have to cope with exactly that when we call
into the problematic symbols in llvm/lib/Target/X86/X86JITInfo.cpp (this is what
the message I quoted above was referring to). Search for calls to
LLVMX86CompilationCallback2, and note the ASMPREFIX hack.

FYI, this is the same file where I ran into this problem (LLVM 3.1 is part of the project I compiled). I saw that it has been fixed in clang trunk by just removing the static.

Best regards,
Martin

Given that most people are not, in fact, randomly recompiling their code
with a top-of-tree clang several months before the next open-source release,
two known inconvenienced users in two weeks seems pretty significant to me.

The attached patch gives language linkage to functions and variables
with linkage. In the last two weeks I have added some tests that
depended on internal symbols not having language linkage, so the test
changes in the patch probably document all the interesting user
visible changes.

Cheers,
Rafael

t.patch (6.73 KB)

I don’t agree that g++ counts as “overwhelmingly dominant existing practice”, especially given that EDG does not follow g++ here in its g++-compatible mode. This would not be the first g++ bug which people have come to rely on, which we could support at the expense of being subtly non-conforming, but choose not to. Plus, there is a simple, trivial fix to the offending code which allows it to be accepted by us, g++, and EDG.

The existing practice is g++, MSVC (did anyone confirm this?), and all released Clang versions. EDG is the outlier here, and (of the compilers we’re talking about), the one with the smallest install base, so I think it’s fairly safe to say that existing practice is to not mangle these names.

I think the right question is, is this a battle worth fighting? Is one inconvenienced user enough that we should give up any hope of ever conforming in this area?

This first question goes both ways. Do Clang’s users benefit from a change in this area? It seems that users porting from g++ or MSVC, or upgrading their Clang do not benefit (at least not immediately) because they will need to change their code, and that change won’t necessarily make their code that much more portable.

We have a conflict between existing practice and the C++ standard, and I don’t think we have a strong case for changing Clang’s behavior in the name of conformance. I think the C++ committee needs to decide whether to adapt the standard to cover existing practice or to reaffirm that this aspect of the language-linkage model is intended despite conflicting with existing practice.

- Doug

The existing practice is g++, MSVC (did anyone confirm this?), and all released Clang versions. EDG is the outlier here, and (of the compilers we’re talking about), the one with the smallest install base, so I think it’s fairly safe to say that existing practice is to not mangle these names.

True. However, EDG vends an ostensibly drop-in replacement for GCC, which has a significant user base. That makes me find it hard to believe this is a significant problem – if it were, I would have expected that EDG would have been informed of it and would have fixed it by now.

I think the right question is, is this a battle worth fighting? Is one inconvenienced user enough that we should give up any hope of ever conforming in this area?

This first question goes both ways. Do Clang’s users benefit from a change in this area? It seems that users porting from g++ or MSVC, or upgrading their Clang do not benefit (at least not immediately) because they will need to change their code, and that change won’t necessarily make their code that much more portable.

We have a conflict between existing practice and the C++ standard, and I don’t think we have a strong case for changing Clang’s behavior in the name of conformance. I think the C++ committee needs to decide whether to adapt the standard to cover existing practice or to reaffirm that this aspect of the language-linkage model is intended despite conflicting with existing practice.

The committee has already reaffirmed this once (albeit quite a long time ago).

Having just discussed this at length with fellow CWG member James Dennett, we think (hopefully James will correct me if I’m misstating something):

  • Relying on the names of internal-linkage symbols does not seem particularly reasonable,
  • The status quo (Clang rejecting the code in question) also seems far from ideal,
  • We are about an order of magnitude below having enough evidence to justify a change to the standard,
  • It’s not reasonable for Clang to be permanently non-conforming here.

Based on the above, I’d like to suggest a solution: we teach CodeGenModule to keep track of the internal-linkage functions and variables which are declared within C language linkage blocks, and when we reach the end of the module, for each such name, if (1) we saw exactly one function or variable with that name, and (2) that name is not yet defined in the module, we emit an internal linkage alias mapping the “expected” name to the mangled name.

This should be a pretty minimal and non-invasive change, and allows us to both conform and accept the code in question. Does that seem OK to everyone?
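The two conditions in the proposal can be modeled with a small standalone sketch. This is a toy illustration of the bookkeeping only, not Clang's CodeGenModule; the type `InternalExternCDecl`, the function `computeStaticExternCAliases`, and the mangled names used with it are hypothetical.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Sketch only: a toy model of the proposed bookkeeping, not Clang's code.
// All type and function names here are hypothetical.

struct InternalExternCDecl {
    std::string sourceName;   // name as written in the source, e.g. "func"
    std::string mangledName;  // name actually emitted for the entity
};

// Returns a map (expected name -> mangled name) with one entry per alias
// that would be emitted at the end of the module.
std::map<std::string, std::string> computeStaticExternCAliases(
    const std::vector<InternalExternCDecl>& decls,
    const std::set<std::string>& definedSymbols)
{
    // Group the distinct mangled names by their source-level name.
    std::map<std::string, std::set<std::string>> byName;
    for (const auto& d : decls)
        byName[d.sourceName].insert(d.mangledName);

    std::map<std::string, std::string> aliases;
    for (const auto& entry : byName) {
        // (1) exactly one function or variable was seen with that name
        const bool unique = entry.second.size() == 1;
        // (2) the unmangled name is not already defined in the module
        const bool nameFree = definedSymbols.count(entry.first) == 0;
        if (unique && nameFree)
            aliases[entry.first] = *entry.second.begin();
    }
    return aliases;
}
```

In this model, a single `static` function in an `extern "C"` block gets an alias, while overloaded names and names that collide with an existing symbol get none, so the overloading that the standard permits remains possible.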

I think I am ok with it. One problem with going this way is that we
don't have any semantic checking for the "pseudo language linkage" going
on, so it is not very clear what "declared within C language linkage
blocks" means. For example, should this 'f' be on the list to get an alias?

static void f();
extern "C" {
  void f();
}
void f() {
}
void use() {
  f();
}

Cheers,
Rafael

I’m fine with requiring the definition to be lexically within an extern “C” block; I view this as being entirely a quality-of-implementation issue, though, so doing better than that would also work for me.

That would work. I still have a preference for one of the two "basic"
solutions: The status quo or just giving internal functions and
variables language linkage.

The problem that I see with the proposal of adding aliases is that we
are creating a new concept: functions (and variables) that have no
language linkage but are defined in an extern C context. If we
implement that, it would not be unreasonable for users to ask for a
warning in the previous example: the function was just declared (and
not defined) in an extern C context.

Cheers,
Rafael

That seems acceptable to me.

John.