Enabling ROCm conversions in pre-merge builds?

krzysz00 · January 31, 2022, 4:52pm

The AMDGPU-related passes in MLIR, most notably SerializeToHsaco, have ended up broken in the main repo a few times now - see D108850 ("[LLD] Remove global state in lldCommon") for example. From what I can tell, this is because MLIR_ENABLE_ROCM_CONVERSIONS is set to 0 during the buildbot builds. This value appears to derive from the fact that AMDGPU isn’t in the list of LLVM targets during these builds.
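
For reference, a rough sketch of a configuration that should turn the conversions on. It assumes a build directory inside an llvm-project checkout, and that adding AMDGPU to LLVM_TARGETS_TO_BUILD is what flips MLIR_ENABLE_ROCM_CONVERSIONS to 1 (my reading of the CMake logic, not something confirmed here). The generator, build type, and inclusion of lld are illustrative choices, not the buildbot's actual settings.

```shell
# Run from llvm-project/build (assumed layout).
# "host;AMDGPU" builds the native backend plus AMDGPU; with AMDGPU in the
# target list, the MLIR build is expected to enable the ROCm conversions.
# lld is included on the assumption that SerializeToHsaco needs it to link.
cmake -G Ninja ../llvm \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="mlir;lld" \
  -DLLVM_TARGETS_TO_BUILD="host;AMDGPU"

# Build MLIR and run its regression tests.
ninja check-mlir
```

This only confirms that the passes compile and that the non-GPU tests run; it does not need an AMD card on the build machine.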

Historically, the ROCm conversions had a dependency on HIP being present, which presumably led to them being untestable on LLVM infrastructure. However, since D114114 ("[MLIR][GPU] Make the path to ROCm a runtime option"), this is no longer a concern.

Would it be possible to add these passes to the default testing of new code?

ChristianKuehnel · February 1, 2022, 1:10pm

@mehdi_amini FYI

mehdi_amini · February 2, 2022, 6:49am

I have a few builder configs in staging right now that I need to move to the main buildbot. I’ve been stuck on some Kubernetes stuff for a while and haven’t had the time or courage to dig in deeper.
Any Kubernetes experts in the room?

mehdi_amini · February 2, 2022, 6:51am

Also, I think this is tested by our legacy infra: main builds · MLIR Core

It just does not notify anyone (other than me and a couple of folks), so we catch these breakages but we’re slower to fix them!

krzysz00 · February 2, 2022, 5:07pm

Thanks for the info, @mehdi_amini!

As a note (though this appears to be an lld issue), my hastily-tossed-together infrastructure shows a current breakage - see last night’s check-mlir failure. If that isn’t as public as I thought it was, do let me know.

Said breakage is issue #53475, which I’ll probably spend some time debugging - or at least making a cleaner reproducer for - today.

On that note, the main trouble with actually testing SerializeToHsaco (as opposed to confirming that it compiles) is that it’s generally only invoked in integration tests. Those only run when they detect an AMD card to run on, and the detected card is then passed to the compiler as its -mcpu. Would it make sense to have a test that just runs the compilation for some hard-coded -mcpu and confirms that it at least does something?

mehdi_amini · February 2, 2022, 5:38pm

Ideally we should have a buildbot machine with an AMD GPU to test these (like we have an Nvidia machine). I don’t have access to any, do you think AMD could provide such a machine?

krzysz00 · February 2, 2022, 6:14pm

I know that on my team we’ve made “yeah, we should provide somewhere for LLVM to test” noises, and I think we can find a server or two for that.

The question we never got around to asking with any urgency, however, is “what do y’all need”?

mehdi_amini · February 2, 2022, 6:25pm

The mlir-nvidia builder is a 16-core machine and it seems comfortable right now.
Otherwise it’s nothing really fancy, any Linux box will do, and adding a bot is fairly easy: see "How To Add Your Build Configuration To LLVM Buildbot Infrastructure" in the LLVM documentation.

krzysz00 · February 2, 2022, 6:37pm

That’s the document that kept eluding me, thanks!

krzysz00 · February 3, 2022, 4:52pm

@mehdi_amini How publicly accessible does a buildbot server need to be? That is, will we need to open ports/let external folks ssh in/… or just maintain the connection back and forth with buildmaster?

mehdi_amini · February 3, 2022, 7:49pm

The buildbot can be fully isolated, no need to open any port as far as I know: it opens up a connection to the controller and gets requests from there.
