Cannot parse the Linux kernel

Hi there,

there are heavily used versions of the Linux kernel that
cannot be parsed by clang due to the following bugs:

http://llvm.org/bugs/show_bug.cgi?id=4236
http://llvm.org/bugs/show_bug.cgi?id=3429

Are there plans to fix them? Note that changing the
kernel code is, unfortunately, not an option.
Thanks,

    Roberto

We definitely intend to implement __label__ support relatively soon;
it's just a matter of someone writing the patch. (It's non-trivial,
but it shouldn't be particularly hard.)

As for PR4236, I'm not really sure about the best way to go about
fixing it, or whether we even want to. It'd require some
platform-specific code to figure out how to translate situations like
that into LLVM inline asm in a gcc-compatible way, and we don't really
want to add weird hacks for rare edge cases. If you don't care about
CodeGen, though, you can just comment out the relevant error-checking
code in SemaStmt.cpp; the resulting AST will still be well-formed.

On a side note, what are you trying to do that a one-line change
would somehow break? If you don't want to touch your kernel
sources, you could always write a wrapper around clang that patches
the file in question, then pipes it into clang.
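
A minimal sketch of such a wrapper, in Python. Everything here is an
assumption made for illustration: the names patch_source and
run_clang_on_patched are hypothetical, the "patch" is a plain one-time
string replacement, and clang is only asked to check syntax
(-fsyntax-only) on source read from stdin (-x c -).

```python
import subprocess
import sys
from pathlib import Path


def patch_source(text, old, new):
    """Return text with the first occurrence of old replaced by new.

    Raises ValueError if old is absent, so a stale patch is noticed
    instead of silently handing the unmodified file to clang.
    """
    if old not in text:
        raise ValueError(f"pattern not found: {old!r}")
    return text.replace(old, new, 1)


def run_clang_on_patched(path, old, new, clang_args=()):
    """Patch the file in memory and pipe the result into clang.

    The file on disk is never modified. Assumes a `clang` binary on PATH.
    """
    patched = patch_source(Path(path).read_text(), old, new)
    # "-x c -" tells clang to treat standard input as C source.
    cmd = ["clang", "-fsyntax-only", *clang_args, "-x", "c", "-"]
    return subprocess.run(cmd, input=patched, text=True).returncode


# Hypothetical usage: wrapper.py FILE OLD NEW [extra clang args...]
if __name__ == "__main__" and len(sys.argv) >= 4:
    path, old, new, *rest = sys.argv[1:]
    sys.exit(run_clang_on_patched(path, old, new, rest))
```

Because the source arrives on stdin, #include "..." lookups that depend
on the file's own directory may need an extra -I argument passed
through to clang.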

-Eli

Eli Friedman wrote:

We definitely intend to implement __label__ support relatively soon;
it's just a matter of someone writing the patch. (It's non-trivial,
but it shouldn't be particularly hard.)

Great!

As for PR4236, I'm not really sure about the best way to go about
fixing it, or whether we even want to. It'd require some
platform-specific code to figure out how to translate situations like
that into LLVM inline asm in a gcc-compatible way, and we don't really
want to add weird hacks for rare edge cases. If you don't care about
CodeGen, though, you can just comment out the relevant error-checking
code in SemaStmt.cpp; the resulting AST will still be well-formed.

Thanks a lot Eli: this gives us a very useful workaround.

On a side note, what are you trying to do that a one-line change
would somehow break? If you don't want to touch your kernel
sources, you could always write a wrapper around clang that patches
the file in question, then pipes it into clang.

We are writing a program analyzer that should be able to analyze
existing code in widespread use as it is, without having to patch
it, no matter how insignificant the patch is.

This should be compatible with the design goals of clang. Of course, if
specific gcc constructs that are not implementable in clang are
encountered, clang will have to give up. But whenever a fatal failure
is avoidable, it should be avoided. This is, IMHO, a key point for the
success of clang and for the success of every project that uses it.

In the specific case, we think that simply converting the fatal failure
into a warning and matching the semantics of gcc is the best option.

Of course we can maintain a patch for clang doing that, but we think
that the best thing for clang itself is to follow the motto "never
die for source code quality nitpicking if gcc can parse it and that
source is in widespread use." It serves no purpose to risk potential
users giving up on clang (and on our analyzer) because "out of
the box it doesn't compile my kernel and who knows how many other
sources need to be patched in obscure ways."

All the best,

    Roberto

We are writing a program analyzer that should be able to analyze
existing code in widespread use as it is, without having to patch
it, no matter how insignificant the patch is.

Okay, I guess that makes sense.

This should be compatible with the design goals of clang. Of course, if
specific gcc constructs that are not implementable in clang are
encountered, clang will have to give up. But whenever a fatal failure
is avoidable, it should be avoided. This is, IMHO, a key point for the
success of clang and for the success of every project that uses it.

Well, it might be nice in some sense, I guess, but we have to draw the
line for sanity somewhere.
http://clang.llvm.org/docs/UsersManual.html#c_unsupp_gcc is a list of
stuff we're simply not intending to support; it's not set in stone,
but there are limits to how far we are planning to go for gcc
compatibility.

In the specific case, we think that simply converting the fatal failure
into a warning and matching the semantics of gcc is the best option.

If we could come up with some reasonable way to model gcc's behavior
for PR4236, we wouldn't really even need to warn. The issue is that we
don't want to accept code that doesn't make sense, and we don't want
to accept code we know IRGen can't deal with.

-Eli