The AMDGPU-related passes in MLIR, most notably `SerializeToHsaco`, have ended up broken in the main repo a few times now; see D108850 ([LLD] Remove global state in lldCommon) for an example. From what I can tell, this is because `MLIR_ENABLE_ROCM_CONVERSIONS` is set to `0` during the buildbot builds. This value appears to derive from the fact that AMDGPU isn’t in the list of LLVM targets during these builds.
Historically, the ROCm conversions had a dependency on HIP being present, which presumably led to them being untestable on LLVM infrastructure. However, since D114114 ([MLIR][GPU] Make the path to ROCm a runtime option), this is no longer a concern.
Would it be possible to add these passes to the default testing of new code?
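For anyone who wants to check a particular build tree, here is a minimal Python sketch that reads the contents of a `CMakeCache.txt` and reports whether AMDGPU is among the configured LLVM targets. It assumes, as described above, that the ROCm conversions are enabled exactly when AMDGPU is in `LLVM_TARGETS_TO_BUILD`, and it treats a cached value of `all` as including AMDGPU. The helper names `cached_value` and `rocm_conversions_expected` are made up for illustration and are not part of LLVM.

```python
def cached_value(cache_text, key):
    """Return the value stored for `key` in CMakeCache.txt contents, or None."""
    for line in cache_text.splitlines():
        line = line.strip()
        # Skip blank lines and the two comment styles used in the cache file.
        if not line or line.startswith(("#", "//")):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue
        # Cache entries have the form NAME:TYPE=VALUE.
        if name.split(":", 1)[0] == key:
            return value
    return None


def rocm_conversions_expected(cache_text):
    """Guess whether the ROCm conversions would be built (assumption: they
    are enabled exactly when AMDGPU is one of the LLVM targets)."""
    targets = cached_value(cache_text, "LLVM_TARGETS_TO_BUILD") or ""
    names = targets.split(";")
    # Assumption: "all" expands to a target list that contains AMDGPU.
    return "AMDGPU" in names or "all" in names


if __name__ == "__main__":
    sample = "// Semicolon-separated list of targets\nLLVM_TARGETS_TO_BUILD:STRING=X86;NVPTX\n"
    print(rocm_conversions_expected(sample))
```

In practice the text would come from `open("build/CMakeCache.txt").read()`, where the `build` directory name is also an assumption.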
I have a few builder configs in staging right now that I need to move to the main buildbot. I’ve been stuck on some Kubernetes stuff for a while and haven’t had the time or courage to dig in deeper.
Any Kubernetes experts in the room?
Also, I think this is tested by our legacy infra: main builds · MLIR Core
It just does not notify anyone (other than me and a couple of folks), so we catch these breakages but we’re less reactive in fixing them!
Thanks for the info, @mehdi_amini!
As a note (though this appears to be an `lld` issue), my hastily-tossed-together infrastructure shows a current breakage: see last night’s `check-mlir` failure. If that isn’t as public as I thought it was, do let me know.
Said breakage is issue #53475, which I’ll probably spend some time debugging - or at least making a cleaner reproducer for - today.
On that note, the main trouble with actually testing `SerializeToHsaco` (as opposed to confirming that it compiles) is that it’s generally only invoked in integration tests. Those only run when they detect an AMD card to run on, which then gets passed to the compiler as its `-mcpu`. Would it make sense to have a test that just runs the compilation for some hard-coded `-mcpu` and confirms that it at least does something?
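To make "confirms that it at least does something" a bit more concrete, here is a rough Python sketch of such a smoke check: run a command, feed it some input, and require a clean exit plus non-empty output. The real compiler invocation and the hard-coded `-mcpu` value are deliberately left as a parameter, since the exact flags aren’t spelled out in this thread; the stand-in command below only echoes its input so the sketch runs anywhere. The function name `compiles_to_something` is hypothetical.

```python
import subprocess
import sys


def compiles_to_something(cmd, source_text):
    """Run `cmd` with `source_text` on stdin and report whether it exited
    with status 0 and wrote at least one byte to stdout."""
    result = subprocess.run(
        cmd,
        input=source_text.encode(),
        capture_output=True,
        check=False,  # inspect the return code ourselves instead of raising
    )
    return result.returncode == 0 and len(result.stdout) > 0


# Stand-in for the real compiler command (an assumption for illustration):
# a tiny Python process that copies stdin to stdout.
stand_in = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]

if __name__ == "__main__":
    print(compiles_to_something(stand_in, "placeholder input\n"))
```

A check like this only shows that the pass runs and emits something for a fixed target; it says nothing about whether the generated binary is correct, which still needs real hardware.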
Ideally we should have a buildbot machine with an AMD GPU to test these (like we have an Nvidia machine). I don’t have access to any, do you think AMD could provide such a machine?
I know that on my team we’ve made “yeah, we should provide somewhere for LLVM to test” noises, and I think we can find a server or two for that.
The question we never got around to asking with any urgency, however, is “what do y’all need”?
The mlir-nvidia builder is a 16-core machine and it seems comfortable right now.
Otherwise it’s nothing really fancy, any Linux box will do really, and adding a bot is fairly easy: How To Add Your Build Configuration To LLVM Buildbot Infrastructure — LLVM 15.0.0git documentation
@mehdi_amini How publicly accessible does a buildbot server need to be? That is, will we need to open ports/let external folks ssh in/… or just maintain the connection back and forth with buildmaster?
The buildbot can be fully isolated, with no need to open any port as far as I know: it opens up a connection to the controller and gets requests from there.