Reporting bugs which only affect (semi-proprietary) downstream consumers.

Hi all,

I’ve recently taken over maintenance of my company’s llvm+lldb branch, where we have added support for our in-house architecture (in llvm) as well as support for debugging through both hardware and our simulator. Our llvm fork is public/open source; however, many of our runtime libraries and drivers (which are linked into lldb, clang, etc., and provide built-ins and driver support) are not.

While attempting to update our branch from llvm-11 to llvm-12 we came across a commit[1] in lldb which quite reliably causes a deadlock when we launch a process to debug a core dump. Luckily, said commit simply modifies some concurrency primitives, and reverting it is sufficient to fix the bug without any further effects. We are quite confident that the commit is the issue, as we performed a thorough bisect which maintained “our” code unchanged throughout.

Unfortunately, we are unable to reproduce this bug on any “open” architectures (such as x86-64, AArch64, etc.), so we are not entirely sure how we should go about reporting it. Additionally, it makes it difficult to open a discussion regarding whether the commit is correct (and thus we may need to modify our additions to lldb to match new implicit behaviour), as third parties may be unable to reproduce the issue. Finally, as the bug results in a deadlock (which requires a SIGKILL to end) we won’t (as I understand it) be able to use a “Reproducer” to demonstrate the bug to third parties.

Although we are able to “solve” the issue locally (by reverting the commit), we feel that the better solution would be to feed back our findings to the community and solve the issue, rather than (privately) sweeping it under the rug. As components of our compiler are proprietary, however, this process becomes difficult due to the reasons listed above.

To summarise, there are two main questions that I feel unable to answer:

  • Is there an existing process for reporting bugs that only affect third parties, and which cannot be reproduced in “core” targets?
  • To what extent is it possible to discuss (or report) bugs “on faith” - as in without any concrete evidence that a third party can reproduce?

We are currently looking into opening up our build process so that we are able to distribute binary libraries to enable third parties to build our compiler + debugger, but as this is currently a work-in-progress it is unfortunately not a solution to this issue.

Many thanks in advance for any and all advice.
Yours,

Hi Adam,

I think the best idea is to comment on the commit on Phabricator ( reviews.llvm.org ) as it seems to be a relatively recent change. Otherwise, if you can somehow provide a way to reproduce the deadlock using only code you can share + LLVM.org sources, then filing a bug would be an option too.

Regarding what information you should provide: Pretty much everything that you can share would help. At least the backtrace of all threads in the deadlocked state would be good to know. And of course the commit your bisect stopped at if it’s a bug report. From there people might have an idea how to reproduce the issue in a unit test or via the SB API (or what could be going wrong in your downstream fork).
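As an illustration of what a shareable SB API reproduction might look like, here is a minimal Python sketch that loads a core file and lists every frame of every thread. The function name and both paths are hypothetical, and it assumes the `lldb` Python module that ships with LLDB is importable. If the deadlock reproduces, the `LoadCore` call is presumably where the script would hang.

```python
def load_core_and_list_threads(exe_path, core_path):
    """Load a core file through the SB API and return one line per stack frame."""
    # Imported lazily: the lldb module ships with LLDB, not with Python itself.
    import lldb

    debugger = lldb.SBDebugger.Create()
    # Synchronous mode: each call returns only once LLDB has finished the step.
    debugger.SetAsync(False)
    try:
        target = debugger.CreateTarget(exe_path)
        if not target.IsValid():
            raise RuntimeError("could not create a target for " + exe_path)

        # Assumption: this is the step that hangs when the deadlock reproduces.
        process = target.LoadCore(core_path)
        if not process.IsValid():
            raise RuntimeError("could not load core file " + core_path)

        lines = []
        for thread in process:      # SBProcess iterates over its threads
            for frame in thread:    # SBThread iterates over its frames
                lines.append("thread #%d frame #%d: %s" % (
                    thread.GetIndexID(),
                    frame.GetFrameID(),
                    frame.GetFunctionName(),
                ))
        return lines
    finally:
        lldb.SBDebugger.Destroy(debugger)
```

A script of this shape, together with the exact executable and core file used, is the kind of "exact APIs" detail that lets others try the same sequence on their own targets.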

And I believe you can’t use the reproducer feature here as that requires having the respective LLDB binary to replay (which you probably can’t share).

- Raphael

  • Is there an existing process for reporting bugs that only affect third parties, and which cannot be reproduced in “core” targets?

I don’t believe there is a formal process for this, though I would suggest just submitting a bug and attaching stack traces of your deadlock. Loading a core file is very similar across all targets, so I can’t imagine this being hard to reproduce with another core file. Is there something special about your core file or setup? I know that logging used to be able to cause deadlocks because Module::GetDescription(…) tried to take the module lock. It no longer does this at top of tree.
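To make the logging hazard concrete, here is a generic Python sketch of one way a description helper that takes a lock can hang its caller: the caller already holds the same non-reentrant lock. The `Module` class below is a hypothetical stand-in and not LLDB's implementation; LLDB's actual locking (and how the old deadlock arose) may well differ. A timeout is used so the sketch reports the hang instead of blocking forever.

```python
import threading


class Module:
    """Hypothetical stand-in for illustration; not LLDB's Module class."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()  # non-reentrant lock

    def get_description(self, timeout=0.1):
        # The helper takes the module lock before reading state.
        if not self._lock.acquire(timeout=timeout):
            return None  # without the timeout this call would never return
        try:
            return "module " + self.name
        finally:
            self._lock.release()

    def load_with_logging(self):
        # The lock is held while a log message is built, and building the
        # message calls back into get_description(), which wants the same lock.
        with self._lock:
            return self.get_description()
```

Calling `get_description()` on its own succeeds, while `load_with_logging()` returns `None` because the inner acquire times out. In a real deadlock there is no timeout, which is why a backtrace of every thread is the most useful thing to attach.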

  • To what extent is it possible to discuss (or report) bugs “on faith” - as in without any concrete evidence that a third party can reproduce?

I would go ahead and debug the deadlock, attach repro steps for how you are loading your core file (exact commands or APIs that are being used) and then maybe attach the output of “bt all” so we can see all of the threads and see what is deadlocking your LLDB.
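One possible way to capture that output is to attach a second, working lldb to the hung one in batch mode. The Python sketch below only builds and runs that command line; the helper names and the output path are hypothetical, and it assumes an `lldb` binary on the PATH that is not itself affected by the deadlock.

```python
import subprocess


def backtrace_command(pid):
    """Build an lldb command line that attaches to `pid` and dumps all threads."""
    return [
        "lldb", "--batch",   # run the -o commands below, then exit
        "-p", str(pid),      # attach to the hung process by PID
        "-o", "bt all",      # backtrace of every thread
        "-o", "detach",      # leave the hung process as it was
    ]


def capture_backtraces(pid, out_path):
    """Run the command and save its output so it can be attached to a report."""
    result = subprocess.run(backtrace_command(pid), capture_output=True, text=True)
    with open(out_path, "w") as f:
        f.write(result.stdout)
    return result.returncode
```

For example, `capture_backtraces(12345, "deadlock-bt.txt")` would write the traces for a hung lldb with PID 12345 to a file that can be attached to the bug report alongside the repro steps.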

Greg

Hi Raphael,

Thanks for the advice!

I think the best idea is to comment on the commit on Phabricator ( reviews.llvm.org ) as it seems to be a relatively recent change. Otherwise, if you can somehow provide a way to reproduce the deadlock using only code you can share + LLVM.org sources, then filing a bug would be an option too.

I’ll definitely leave a comment then, as at the very least I should be able to get some feedback on the commit itself. I can’t (sadly) reproduce the deadlock using public code - I’m still looking into how we can share our (private) llvm/lldb dependencies so that public parties can build them, so I may hold off on filing a bug until I have sorted that.

At least the backtrace of all threads in the deadlocked state would be good to know. And of course the commit your bisect stopped at if it’s a bug report.

I can absolutely share all of these, and I’ll make sure to include them in any bug report I file.

And I believe you can’t use the reproducer feature here as that requires having the respective LLDB binary to replay (which you probably can’t share).

Our LLDB binaries are publicly available; however, there are a number of static libraries that we link into our LLVM backend whose source is proprietary, which is why I cannot reproduce the bug using public code.

Thanks,
Adam

Hi Greg,

Thanks for the advice!

[…] I would suggest just submitting a bug and attaching stack traces of your deadlock. Loading a core file is very similar across all targets, so I can’t imagine this being hard to reproduce with another core file?

Glad to hear this - I’ll do so soon then. I also imagine that this bug affects other “backends”, but I can’t confirm that myself (due to lack of experience with other lldb backends), so hopefully others will be able to verify it if I file a bug.

Is there something special about your core file or setup?

As I understand it there is not that much “weird” about our LLDB integrations. We have made some specific additions to be able to debug threads/processes running on our co-processor and allow printf/debugging information to be passed back to the host, but aside from that we haven’t touched any of the core code.

It is, however, possible that we’ve incorrectly subclassed one of the native thread/process classes and violated some concurrency invariant. This is part of my hesitation in filing a bug report, as I’m not sure whether the commit itself was at fault, or whether we accidentally relied on some incorrect concurrency behaviour which has now been corrected, leaving our plugin broken.

I would go ahead and debug the deadlock, attach repro steps for how you are loading your core file (exact commands or APIs that are being used) and then maybe attach the output “bt all” so we can see all of the threads and see what is deadlocking your LLDB.

Okay, thanks for the advice regarding what would be good to include. I’ll make sure to add as much of this as I can when I file the bug report.

Thanks again,
Adam