Upstreaming basic support for accelerators

Meta and NVIDIA have been working together on a couple of branches that provide basic support for debugging AMD and NVIDIA GPUs:

We have reached a state in which LLDB can be used to debug kernels quite decently, but with some restrictions. For NVIDIA, the restrictions are: one kernel at a time, fewer than ~20-30K threads (beyond that it gets too slow; fixable by adding denser gdb-remote packets, etc.), and no step-over (for now you can only set breakpoints and resume). At least for debugging small to medium kernels, this is already very useful. There’s still a lot of work ahead, but we think that now is a good moment to start upstreaming our changes, start getting contributions and ideas, foster collaboration within our own companies (yep, not everyone internally can use our fork directly), and at the same time lower our internal rebase burden.

The good news is that so far we have made very few changes to the core of LLDB and 95% of our code is in the form of plugins. So I have a good feeling about the upstreamability of this. I will summarize the most important changes to get some early feedback and avoid wasting cycles during code review.

  • lldb-server plugins: We added a feature that allows a CPU target to initialize an lldb-server plugin based on some event. For example, in the NVIDIA case, if the CPU target hits a certain breakpoint (cuda_init), then the corresponding plugin gets initialized. We are using this mechanism to create a secondary accelerator target that debugs only the GPU. Plugins can also be enabled and disabled by the user, so you only pay the initialization cost of a plugin if you actually want to use it.
  • accelerator actions: The CPU and accelerator targets can execute actions on each other using accelerator actions. They come in a few flavors: e.g. synchronize the targets upon a stop, set a breakpoint, fetch some symbols when a breakpoint is hit, etc. They are meant to simplify the complex interactions between targets.
  • address space support: Accelerators like GPUs often have multiple address spaces, which means that whenever data needs to be read, an address space identifier needs to be set. Each accelerator plugin provides an address space spec that LLDB uses to do the correct data fetching and DWARF parsing (a rough, hypothetical sketch of what this could look like follows this list). We also made some changes to DWARF expressions to make use of these spaces, but they are very localized.
  • mock accelerator plugin: NVIDIA and AMD support need to link lldb-server against some specific libraries, which makes generic testing a problem. Because of this, we created a mock accelerator plugin that can exercise most code paths without relying on any external library.
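
To make the address space point a bit more concrete, here is a minimal sketch of what an accelerator-provided address space spec and a space-tagged memory read could look like. These names and signatures are purely illustrative assumptions on my part, not the actual interfaces in our branches or anything that exists in LLDB today:

```cpp
// Hypothetical sketch only -- none of these names exist in LLDB.
// The idea is that an accelerator plugin describes its address spaces so
// that generic code (memory reads, DWARF evaluation) can tag every access
// with a space identifier.
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// An address space as advertised by an accelerator plugin.
struct AddressSpaceSpec {
  std::uint32_t id;     // numeric id used when tagging reads
  std::string name;     // e.g. "global", "shared", "local" on a GPU
  bool is_thread_local; // whether reads need a thread/lane context
};

// Minimal interface an accelerator plugin could implement.
class AcceleratorPlugin {
public:
  virtual ~AcceleratorPlugin() = default;

  // Address spaces this accelerator exposes.
  virtual std::vector<AddressSpaceSpec> GetAddressSpaces() const = 0;

  // Every read carries an explicit address-space id in addition to the
  // plain load address.
  virtual bool ReadMemory(std::uint32_t address_space, std::uint64_t addr,
                          void *buf, std::size_t size) = 0;
};
```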

With this, we have been able to enable debugging of accelerators by having one target for the CPU and one target for the accelerator. We could theoretically support multiple accelerators at once for the same target, but we don’t have a good use case for that yet. We also didn’t want to mix the CPU and the accelerator in a single target because that opens a can of worms: the platforms are different; both might get events simultaneously from signals or the driver, making synchronization a tad difficult if they share a target; and the way you navigate each set of threads is different. For example, it’s common to traverse CPU threads by id, but for GPUs you traverse by coordinates, such as blockIdx (1, 2, 123) / threadIdx (0, 0, 100) (see the sketch below). It’s relatively easy to add custom behavior to the user interaction via Platform plugins when the targets are independent. Not only that, we also want to support accelerator-only debugging, in which case the CPU target is detached upon the initialization of the accelerator.
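
As a purely illustrative example of that difference in thread identity (hypothetical types, not actual LLDB code), a CPU thread is just a flat id, while a GPU thread is naturally addressed by its block and thread coordinates:

```cpp
// Hypothetical illustration only; not an actual LLDB type.
#include <array>
#include <cstdint>

// CPU threads are usually identified by a flat id...
using CpuThreadId = std::uint64_t;

// ...while a GPU thread is naturally identified by its coordinates,
// e.g. blockIdx (1, 2, 123) / threadIdx (0, 0, 100).
struct GpuThreadCoordinate {
  std::array<std::uint32_t, 3> block;  // blockIdx  (x, y, z)
  std::array<std::uint32_t, 3> thread; // threadIdx (x, y, z)
};
```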

The long-term vision we have for this is to add the necessary foundation so that anyone can enable their own accelerator in LLDB. Besides that, we want to eventually add search and filtering capabilities to LLDB. When doing GPU debugging, you are dealing with potentially more than a million threads, so the big question is how the debugger can help you quickly find the threads that satisfy the conditions you care about, or narrow down the set of threads that are interesting to you as a way to reduce noise. We have a few ideas, like adding a query language that can optimize searches by executing some actions on the server and some others in the client (a purely illustrative sketch follows). But we don’t want to overthink this; we’d first like to get some widespread user feedback to figure out solutions to these problems.
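
To show the flavor of filtering such a query language might enable, here is a hypothetical sketch, under the assumption of a coordinate-based thread identity like the one above; nothing here reflects a decided design or an existing API:

```cpp
// Purely hypothetical sketch of server-side thread filtering; no such API
// exists, and the query language itself is still an open question.
#include <array>
#include <cstdint>
#include <functional>
#include <vector>

struct GpuThreadCoordinate {
  std::array<std::uint32_t, 3> block;
  std::array<std::uint32_t, 3> thread;
};

struct StoppedThread {
  GpuThreadCoordinate coord;
  std::uint64_t pc; // program counter at the stop
};

// A cheap predicate that could run on the server so that only matching
// threads are reported back to the client.
std::vector<StoppedThread>
FilterThreads(const std::vector<StoppedThread> &threads,
              const std::function<bool(const StoppedThread &)> &match) {
  std::vector<StoppedThread> result;
  for (const auto &t : threads)
    if (match(t))
      result.push_back(t);
  return result;
}

// Example: keep only the threads of block (1, 2, *) stopped at a given PC.
// auto hits = FilterThreads(all, [](const StoppedThread &t) {
//   return t.coord.block[0] == 1 && t.coord.block[1] == 2 && t.pc == 0x1234;
// });
```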

As a final note, we’ve decided to use the name accelerator instead of GPU because we are experimenting with non-GPU architectures as well, and making this generic is a good idea anyway. If you can suggest a better name with a convincing reason, that would be very nice as well!

Any feedback or ideas are welcome. I plan to start upstreaming this with the Meta folks soon.

Thanks for starting this discussion, Walter. I appreciate the context. Here are my initial thoughts:

  • The plugin approach sounds promising and will certainly help with upstreaming and reviewing.
  • Separate targets for the CPU and the accelerator are the right call. I really can’t imagine that working out any other way.
  • I’m very excited about address space support. I’d love to hear more. At this point, it sounds limited to the accelerators, but we have other use cases for it in LLDB (like Wasm), so hopefully we can find a way to thread this through all of LLDB if it isn’t already.
  • I’d also love to hear more about the protocol extensions you have or are considering.

I do have a few questions, mostly requests for more details:

  1. Do you have a plan for how you’ll be upstreaming this? The context here is helpful, but for someone who hasn’t been involved in this, it’s not immediately clear what the different steps and deliverables are.
  2. Can you share a bit more about the testing strategy? You mention the mock accelerators. Does that mean that most testing is done by debugging an lldb-server that runs one of those mock plugins? Do you plan on setting up any bots with real hardware?