MCJIT RemoteMemoryManager Failures on ARM

Hi Andrew & Takumi,

I’m including you guys because you seem to know what should and what shouldn’t work, given your recent changes to the XFAILs. Feel free to exclude yourselves, or include other folks.

I’m having a problem with a self-hosting ARM bot described here:

http://llvm.org/bugs/show_bug.cgi?id=18057

Basically, allocateSpace() sometimes works, sometimes doesn’t. I couldn’t see much wrong with the code, and the call graph looks valid, so I’m guessing it’s something in the higher logic (message passing) that is non-deterministically failing.

I even ran it manually several times in a row: sometimes it fails, sometimes it doesn’t, so it’s not a hard error in code generation. However, the GCC-generated binary NEVER fails, while the Clang-generated ones sometimes do.

I want to introduce a self-hosting buildbot on ARM, but this is stopping me from doing so, since the tests fail randomly at every run. Any ideas?

Thanks,
–renato

IMHO, tests may be suppressed with lit.local.cfg only if:

  - The issue is filed -- PR18057
  - There is at least one person responsible for watching the issue.

Suppression is not a good idea, but it can be done anyway as long as a
responsible person is watching. I can't help you here, since I don't have
any ARM boxes.

Thank you.

IMHO, tests may be suppressed with lit.local.cfg only if:

They were disabled, but then re-enabled because they don't fail on the normal
build bots, only on the self-host bots. So I think this is a more serious
issue than just MCJIT; I think it's a Clang miscompilation. I'll try on x86
to see whether the self-hosting problem appears there.

  - The issue is filed -- PR18057

  - There is at least one person responsible for watching the issue.

Yes, but without priority, the bug will be forgotten.

I can't help you here, since I don't have any ARM boxes.

I have an ARM box which you can SSH into, if you need it. My knowledge of how
MCJIT should behave is very limited, but if you're willing to take a look,
I'll do my best to help you get it working on ARM.

cheers,
--renato

I’d be OK with disabling these tests if they can be disabled specifically for ARM with Clang.

The main purpose of these tests is to ensure that none of the MCJIT/RuntimeDyld/*MemoryManager layering gets rearranged in a way that makes remote execution impossible.

I would also note that the failure isn’t actually in anything MCJIT-specific. Aside from the fact that it seems to be clang-specific, the code that is failing is specific to the lli remote implementation. It’s not clear to me why it would fail under aggressive optimization with clang, but I wouldn’t characterize that code as particularly robust.

I just updated the bugzilla report with a few comments about the failure. The short of it is that there’s nothing MCJIT-specific about this failure. It’s most likely a pipe I/O problem. I think it’s possible that the clang optimizations are just exposing a timing-related vulnerability in the pipe handling.

-Andy

I would also note that the failure isn’t actually in anything
MCJIT-specific. Aside from the fact that it seems to be clang-specific,
the code that is failing is specific to the lli remote implementation.
It’s not clear to me why it would fail under aggressive optimization with
clang, but I wouldn’t characterize that code as particularly robust.

I agree. I think this is more likely a codegen fault on Clang's side that
crashes the client, rather than a bug in the remote implementation, which,
crude as it is, has very little room for a failure of that magnitude.

I just updated the bugzilla report with a few comments about the failure.

The short of it is that there’s nothing MCJIT-specific about this failure.
It’s most likely a pipe I/O problem. I think it’s possible that the clang
optimizations are just exposing a timing-related vulnerability in the pipe
handling.

Ok, I'll disable those tests for ARM for now and will look into the bug.

I don't know much about how MCJIT works, so creating the reduced test case
will prove difficult. But I'll keep at it, because I do want MCJIT to work
well on ARM, and disabling tests is the wrong way to go. :wink:

cheers,
--renato

Looking at the code, one obvious source of intermittent failure is that the Linux implementations of ReadBytes and WriteBytes don’t check for EINTR. I doubt that’s the failure you’re seeing, because it would be more randomly distributed, but it’s something that should be fixed.

A more likely cause of the failure in your case is that read is returning fewer bytes than requested. In theory, this can happen if we read from one end of the pipe while the other end is being written, but the current code doesn’t check for it. A race condition like this seems more likely than a code generation problem.

I’m attaching a patch (which I haven’t even tried to compile) that I think addresses these issues. Can you try it out and see if it fixes this problem for you?

If this doesn’t do the trick, stepping through the remote case in the debugger will show you what communication leads up to the failure. From there it should be relatively simple to use just the RemoteTargetExternal class to create a test driver that communicates with the child process in the same way. This ought to give you a failing test case completely independent of any significant part of LLVM (unless the failure is entirely timing-dependent).

Thanks,

Andy

lli-remote-comm.patch (2.93 KB)

Looking at the code, one obvious source of intermittent failure is that
the Linux implementations of ReadBytes and WriteBytes don’t check for
EINTR. I doubt that’s the failure you’re seeing because it would be more
randomly distributed but it’s something that should be fixed.

Agreed.

More likely as the cause of failure in your case is that read is returning
less than the number of bytes requested. In theory, this can happen if we
read one end of the pipe while the other end is being written, but the
current code doesn’t check for it. A race condition like this seems more
likely than a code generation problem.

Right. What I meant by a codegen problem was not *just* a crash in the
client, but code movement that could induce instability, like moving things
past memory barriers, etc. However, I agree that the code, as it is, is not
robust enough, and a more aggressive compiler could upset the lucky balance
it has now.

I’m attaching a patch (which I haven’t even tried to compile) that I think
addresses these issues. Can you try it out and see if it fixes this
problem for you?

The patch indeed fixes the problem, but it introduces lock-ups in other
(random) tests when they run simultaneously, though not when I run them
independently.

cheers,
--renato