[RFC] Heterogeneous LLVM-IR Modules

Hi,

Heterogeneous modules seem like an important feature when targeting accelerators.

TL;DR

Let’s allow merging two LLVM-IR modules for different targets (with
compatible data layouts) into a single LLVM-IR module to facilitate
host-device code optimizations.

I think the main question I have is with respect to this limitation on the datalayout: isn’t it too limiting in practice?
I understand that this is much easier to implement in LLVM today, but it may get us into a fairly limited place in terms of what can be supported in the future.
Have you looked into what it would take to have heterogeneous modules that have their own DL?

Wait, what?

Given an offloading programming model of your choice (CUDA, HIP, SYCL,
OpenMP, OpenACC, …), the current pipeline will most likely optimize
the host and the device code in isolation. This is problematic as it
makes everything from simple constant propagation to kernel
splitting/fusion painfully hard. The proposal is to merge host and
device code into a single module during the optimization steps. This
should not induce any cost for people who don't use the functionality.

But how do heterogeneous modules help?

Assuming we have heterogeneous LLVM-IR modules, we can look at
accelerator code optimization as an interprocedural optimization
problem: you basically call the “kernel” but you cannot inline it. So
you know the call site(s) and arguments, can propagate information back
and forth (constants, attributes, …), and can modify the call site as
well as the kernel simultaneously, e.g., to split the kernel or fuse
consecutive kernels. Without heterogeneous LLVM-IR modules we can do all
of this, but it requires a lot more machinery. Given abstract call sites
[0,1] and enabled interprocedural optimizations [2], host-device
optimizations inside a heterogeneous module are really not (much)
different from any other interprocedural optimization.

[0] https://llvm.org/docs/LangRef.html#callback-metadata
[1] https://youtu.be/zfiHaPaoQPc
[2] https://youtu.be/CzWkc_JcfS0
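To make the optimization problem above concrete, here is a small, hypothetical single-source example (the kernel and the `launch_on_device` driver entry point are made up): the host launches the kernel only with constant arguments, and with both sides visible to interprocedural passes, and the launch modeled as an abstract call site [0], those constants could be propagated into the kernel and the branch on `alpha` folded away. Today, with host and device in separate modules, neither side sees the other.

```cpp
#include <cstddef>

// Device side (hypothetical): compiled for the accelerator triple/DL.
extern "C" void kernel(float *out, const float *in, std::size_t n, float alpha) {
  for (std::size_t i = 0; i < n; ++i)
    out[i] = (alpha == 1.0f) ? in[i] : alpha * in[i]; // folds if alpha is known
}

// Host side (hypothetical): stand-in for a driver launch API; it eventually
// "calls back" into `kernel` on the device, but there is no direct call edge.
extern "C" void launch_on_device(void (*entry)(float *, const float *,
                                               std::size_t, float),
                                 float *out, const float *in, std::size_t n,
                                 float alpha);

void host_code(float *out, const float *in) {
  // The only launch site: n and alpha are constants the kernel never sees today.
  launch_on_device(kernel, out, in, /*n=*/1024, /*alpha=*/2.0f);
}
```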

Where are the details?

This is merely a proposal to get feedback. I talked to people before and
got mixed results. I think this can be done in an “opt-in” way that is
non-disruptive and without penalty. I sketched some ideas in [3] but
THIS IS NOT A PROPER PATCH. If there is interest, I will provide more
thoughts on design choices and potential problems. Since there is not
much yet, I was hoping this would be a community effort from the very
beginning :slight_smile:

[3] https://reviews.llvm.org/D84728

But MLIR, …

I imagine MLIR can be used for this and there are probably good reasons
to do so. We might not want to do it only there; for mainly the same
reasons, other things are still being developed at the LLVM-IR level. Feel
free to ask though :slight_smile:

(+1 : MLIR is not intended to be a reason to not improve LLVM!)

[I removed all but the data layout question, that is an important topic]
> TL;DR
>> -----
>>
>> Let's allow to merge to LLVM-IR modules for different targets (with
>> compatible data layouts) into a single LLVM-IR module to facilitate
>> host-device code optimizations.
>>
>
> I think the main question I have is with respect to this limitation on the
> datalayout: isn't it too limiting in practice?
> I understand that this is much easier to implement in LLVM today, but it
> may get us into a fairly limited place in terms of what can be supported in
> the future.
> Have you looked into what would it take to have heterogeneous modules that
> have their own DL?

Let me share some thoughts on the data layout situation, not all of which are
fully matured, but I guess we have to start somewhere:

If we look at the host-device interface there has to be some agreement
on parts of the datalayout, namely on what the data the host sends over
and expects back looks like. If I'm not mistaken, GPUs will match the
host in things like padding, endianness, etc. because you cannot
translate things "on the fly". That said, there might be additional
"address spaces" on either side that the other one is not matching/aware
of. Long story short, I think host & device need to, and in practice do,
agree on the data layout of the address space they use to communicate.

The above is for me a strong hint that we could use address spaces to
identify/distinguish differences when we link the modules. However,
there might be cases where this is not sufficient, e.g., if the
default alloca address space differs. In that case I don't see a reason
not to pull the same "trick" as with the triple. We can specify
additional data layouts, one per device, and if you retrieve the data
layout, or triple, you need to pass a global symbol as an "anchor". For
all intraprocedural passes this should be sufficient as they are only
interested in the DL and triple of the function they look at. For IPOs
we have to distinguish the ones that know about the host-device calls
and the ones that don't. We might have to teach all of them about these
calls but as long as they are callbacks through a driver routine I don't
even think we need to.
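To make the "anchor" idea a bit more concrete, here is a minimal, self-contained sketch with made-up types (this is not the real llvm::Module/GlobalValue API): every triple/DL query names a global symbol, and globals without a device entry fall back to the module-wide defaults.

```cpp
#include <map>
#include <string>

// Illustrative stand-ins only; not the real llvm::Module/GlobalValue classes.
struct GlobalSymbol {
  std::string Name;
};

struct TargetInfo {
  std::string Triple;
  std::string DataLayout;
};

struct HeterogeneousModule {
  TargetInfo HostDefaults;                               // as today
  std::map<const GlobalSymbol *, TargetInfo> DeviceInfo; // per-device overrides

  // The "anchor": every triple/DL query names the global symbol it is about.
  const TargetInfo &getTargetInfo(const GlobalSymbol *Anchor) const {
    auto It = DeviceInfo.find(Anchor);
    return It == DeviceInfo.end() ? HostDefaults : It->second;
  }
};
```

An intraprocedural pass would anchor its queries at the function it works on; an IPO pass that crosses a host-device call would see the triple/DL change when it moves from a host symbol to a device symbol.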

I'm curious if you or others see an immediate problem with both a device
specific DL and triple (optionally) associated with every global symbol.

~ Johannes

You can design APIs that call functions on external hardware that
has a completely different data layout; you just need to properly pack
and unpack the arguments and results. IIUC, is that what you call
"agree on the DL"?

In an LLVM module, with the single-DL requirement, this wouldn't work.
But if we had multiple named DLs, and functions and globals carried
attributes tagging them with those DLs, then you could have multiple DLs
in the same module; as long as control flow from one never reaches the
other (only through specific API calls), it should be "fine". However,
this is hardly well defined and home to unlimited corner cases to handle.
Using address spaces would work for addresses, but other type sizes and
alignments would have to be defined anyway, and then we're back to the
multiple-DL tags scenario.

Given that we're not allowing them to inline or interact, I wonder if
a "simpler" approach would be to allow more than one module per
"compile unit"? Those are some very strong quotes, mind you, but it
would "solve" the DL problem entirely. Since both modules are in
memory, perhaps even passing through different pipelines (CPU, GPU,
FPGA), we can do constant propagation, kernel specialisation and
strong DCE by identifying the contact points, but still treating them
as separate modules. In essence, it would be the same as having them
on the same module, but without having to juggle function attributes
and data layout compatibility issues.

The big question is, obviously, how many things would break if we had
two or more modules live at the same time. Global contexts would have
to be rewritten, but if each module passes through its own optimisation
pipeline, then the hardest part would be building the bridge between
them (call graph and other analyses) and keeping that up to date as all
modules walk through their pipelines, so that passes like constant
propagation can "see" through the module barrier.
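For reference, a minimal sketch of the starting point for this route, using the current C++ API (the module names and triples are just placeholders); the cross-module "bridge" is the part that would have to be built and kept in sync between the pipelines:

```cpp
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include <memory>

int main() {
  llvm::LLVMContext Ctx;
  // Two modules, one per target, side by side in one context.
  auto HostM = std::make_unique<llvm::Module>("host.ll", Ctx);
  auto DevM = std::make_unique<llvm::Module>("device.ll", Ctx);
  HostM->setTargetTriple("x86_64-unknown-linux-gnu");
  DevM->setTargetTriple("nvptx64-nvidia-cuda");
  // Every existing analysis and transform is written against a single
  // Module/Function, so a combined call graph, cross-module constant
  // propagation, etc. would all be new machinery.
  return 0;
}
```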

cheers,
--renato

[I removed all but the data layout question, that is an important topic]

TL;DR


Let’s allow merging two LLVM-IR modules for different targets (with
compatible data layouts) into a single LLVM-IR module to facilitate
host-device code optimizations.

I think the main question I have is with respect to this limitation on the
datalayout: isn’t it too limiting in practice?
I understand that this is much easier to implement in LLVM today, but it
may get us into a fairly limited place in terms of what can be supported in
the future.
Have you looked into what it would take to have heterogeneous modules that
have their own DL?

Let me share some thoughts on the data layout situation, not all of which are
fully matured, but I guess we have to start somewhere:

If we look at the host-device interface there has to be some agreement
on parts of the datalayout, namely on what the data the host sends over
and expects back looks like. If I’m not mistaken, GPUs will match the
host in things like padding, endianness, etc. because you cannot
translate things “on the fly”. That said, there might be additional
“address spaces” on either side that the other one is not matching/aware
of. Long story short, I think host & device need to, and in practice do,
agree on the data layout of the address space they use to communicate.

The above is for me a strong hint that we could use address spaces to
identify/distinguish differences when we link the modules. However,
there might be cases where this is not sufficient, e.g., if the
default alloca address space differs. In that case I don’t see a reason
not to pull the same “trick” as with the triple. We can specify
additional data layouts, one per device, and if you retrieve the data
layout, or triple, you need to pass a global symbol as an “anchor”. For
all intraprocedural passes this should be sufficient as they are only
interested in the DL and triple of the function they look at. For IPOs
we have to distinguish the ones that know about the host-device calls
and the ones that don’t. We might have to teach all of them about these
calls but as long as they are callbacks through a driver routine I don’t
even think we need to.

I’m curious if you or others see an immediate problem with both a device
specific DL and triple (optionally) associated with every global symbol.

Having a triple/DL per global symbol would likely solve everything; I didn’t get from your original email that this was considered.

If I understand correctly what you’re describing, the DL on the Module would be a “default” and we’d need to make the DL/triple APIs on the Module “private” to force queries to go through an API on GlobalValue to get the DL/triple?

>> [I removed all but the data layout question, that is an important topic]
>> > TL;DR
>> >> -----
>> >>
>> >> Let's allow to merge to LLVM-IR modules for different targets (with
>> >> compatible data layouts) into a single LLVM-IR module to facilitate
>> >> host-device code optimizations.
>> >>
>> >
>> > I think the main question I have is with respect to this limitation
>> on the
>> > datalayout: isn't it too limiting in practice?
>> > I understand that this is much easier to implement in LLVM today, but it
>> > may get us into a fairly limited place in terms of what can be
>> supported in
>> > the future.
>> > Have you looked into what would it take to have heterogeneous modules
>> that
>> > have their own DL?
>>
>> Let me share some thoughts on the data layouts situation, not all of
>> which are
>> fully matured but I guess we have to start somewhere:
>>
>> If we look at the host-device interface there has to be some agreement
>> on parts of the datalayout, namely what the data looks like the host
>> sends over and expects back. If I'm not mistaken, GPUs will match the
>> host in things like padding, endianness, etc. because you cannot
>> translate things "on the fly". That said, here might be additional
>> "address spaces" on either side that the other one is not matching/aware
>> of. Long story short, I think host & device need to, and in practice do,
>> agree on the data layout of the address space they use to communicate.
>>
>> The above is for me a strong hint that we could use address spaces to
>> identify/distinguish differences when we link the modules. However,
>> there might be the case that this is not sufficient, e.g., if the
>> default alloca address space differs. In that case I don't see a reason
>> to not pull the same "trick" as with the triple. We can specify
>> additional data layouts, one per device, and if you retrieve the data
>> layout, or triple, you need to pass a global symbol as a "anchor". For
>> all intraprocedural passes this should be sufficient as they are only
>> interested in the DL and triple of the function they look at. For IPOs
>> we have to distinguish the ones that know about the host-device calls
>> and the ones that don't. We might have to teach all of them about these
>> calls but as long as they are callbacks through a driver routine I don't
>> even think we need to.
>>
>> I'm curious if you or others see an immediate problem with both a device
>> specific DL and triple (optionally) associated with every global symbol.
>>
>
> Having a triple/DL per global symbols would likely solve everything, I
> didn't get from your original email that this was considered.
> If I understand correctly what you're describing, the DL on the Module
> would be a "default" and we'd need to make the DL/triple APIs on the Module
> "private" to force queries to go through an API on GlobalValue to get the
> DL/triple?

That is what I tried to describe, yes. The "patch" I posted does this
"conceptually" for the triple. You make them private or require a global
value to be passed as part of the request, same result I guess. The key
is that the DL/triple is a property of the global symbol.

I'll respond to Renato's concerns on this as part of a response to him.

>> Long story short, I think host & device need to, and in practice do,
>> agree on the data layout of the address space they use to communicate.
>
> You can design APIs that call functions into external hardware that
> have completely different data layout, you just need to properly pack
> and unpack the arguments and results. IIUC, that's what you call
> "agree on the DL"?

What I (tried to) describe is that you can pass an array of structs via
a CUDA memcpy (or similar) to the device and then expect it to be
accessible as an array of structs on the other side. I can imagine this
property doesn't hold for *every* programming model, but the question is
whether we need to support the ones that don't have it. FWIW, I don't know
if it is worth building a system that can cope with this property being
missing or if it is better to not let such systems opt in to the
heterogeneous module merging. I guess we would need to list the
programming models for which you cannot reasonably expect the above to
work.
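As a sketch of the property being relied on here (`device_memcpy` is a made-up stand-in for cudaMemcpy, omp_target_memcpy, and friends): the raw byte copy is only meaningful if host and device agree on the size, alignment, and padding of the element type.

```cpp
#include <cstddef>

struct Particle {
  float pos[3]; // 12 bytes
  int   id;     // 4 bytes; no tail padding on common host/GPU ABIs
};

// Hypothetical stand-in for a host-to-device copy API.
extern "C" void device_memcpy(void *devDst, const void *hostSrc,
                              std::size_t bytes);

void upload(Particle *devBuf, const Particle *hostBuf, std::size_t n) {
  // If the device laid out Particle differently (padding, alignment, field
  // order), this byte-wise copy would silently produce garbage; the idea
  // above assumes programming models where the layouts are guaranteed to match.
  static_assert(sizeof(Particle) == 16, "layout both sides are assumed to share");
  device_memcpy(devBuf, hostBuf, n * sizeof(Particle));
}
```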

> In an LLVM module, with the single-DL requirement, this wouldn't work.
> But if we had multiple named DLs and attributes to functions and
> globals tagged with those DLs, then you could have multiple DLs on the
> same module, as long as their control flow never reaches the other
> (only through specific API calls), it should be "fine". However, this
> is hardly well defined and home to unlimited corner cases to handle.
> Using namespaces would work for addresses, but other type sizes and
> alignment would have to be defined anyway, then we're back to the
> multiple-DL tags scenario.

I think that a multi-DL + multi-triple design seems like a good
candidate. I'm not sure about the corner cases you imagine but I guess
that is the nature of corner cases. And, to be fair, we haven't really
talked about many details yet. If we think there is a path forward, we
could come up with restrictions and requirements, and hopefully convince
ourselves and others that it could work, or realize why not :slight_smile:

> Given that we're not allowing them to inline or interact, I wonder if
> a "simpler" approach would be to allow more than one module per
> "compile unit"? Those are some very strong quotes, mind you, but it
> would "solve" the DL problem entirely. Since both modules are in
> memory, perhaps even passing through different pipelines (CPU, GPU,
> FPGA), we can do constant propagation, kernel specialisation and
> strong DCE by identifying the contact points, but still treating them
> as separate modules. In essence, it would be the same as having them
> on the same module, but without having to juggle function attributes
> and data layout compatibility issues.
>
> The big question is, obviously, how many things would break if we had
> two or more modules live at the same time. Global contexts would have
> to be rewritten, but if each module passes on their own optimisation
> pipelines, then the hardest part would be building the bridge between
> them (call graph and other analysis) and keep that up-to-date as all
> modules walk through their pipelines, so that passes like constant
> propagation can "see" through the module barrier.

I am in doubt about the "simpler" part but it's an option. The one
disadvantage I see is that we have to change the way passes work in this
setting versus the single module setting. Or somehow pretend they are in
a single module, at which point the entire separation seems to lose its
appeal. I still believe that callbacks (+IPO) can make optimization of
heterogeneous modules look like the optimization of regular modules, the
same way callbacks blur the line between IPO and IPO of parallel
programs, e.g., across the "transitive call" performed by pthread_create.

~ Johannes

What I (tried to) describe is that you can pass an array of structs via
a CUDA memcpy (or similar) to the device and then expect it to be
accessible as an array of structs on the other side. I can imagine this
property doesn't hold for *every* programming model, but the question is
if we need to support the ones that don't have it. FWIW, I don't know if
it is worth to build up a system that can allow this property to be
missing or if it is better to not allow such systems to opt-in to the
heterogeneous module merging. I guess we would need to list the
programming models for which you cannot reasonably expect the above to
work.

Right, this is the can of worms I think we won't see before it hits
us. My main concern is that allowing for opaque interfaces to be
defined means we'll be able to do almost anything around such simple
constraints, and the code won't be heavily tested around it (because
it's really hard to test those constraints).

For example, one constraint is: functions that cross the DL barrier
(i.e. call functions in another DL) must marshal the arguments in a way
that the size in bytes is exactly what the function expects, given its
DL.

This is somewhat easy to verify, but it's not enough to guarantee that
the alignment of internal elements, structure layout, padding, etc.
make sense on the target. Unless we write code that packs/unpacks, we
cannot guarantee it is what we expect. And writing unpack code on a GPU
may not even be meaningful. And it can change from one GPU family to
another, or one API to another.

Makes sense?

I think that a multi-DL + multi-triple design seems like a good
candidate.

I agree. Multiple-DL is something that comes and goes in the community
and so far the "consensus" has been that data layout is hard enough as
it is. I've always been keen on having it, but not keen on making it
happen (and fixing all the bugs that will come with it). :smiley:

Another problem we haven't even considered is where the triple will
come from and in which form. As you know, triples don't usually mean
anything without further context, and that context isn't present in
the triple or the DL. Those decisions are lowered by the front-end into
snippets of code (pack/unpack, shift/mask, pad/store/pass pointer) or
thunks (EH, default class methods).

Once it's lowered, the DL should be mostly fine because everything
will be lowered anyway. But how will the user identify code from
multiple different front-ends in the same IR module? If we restrict
ourselves to a single front-end, then we'll need one front-end to
rule them all, and that would be counterproductive (and fairly
limited in scope for such a large change).

I fear the infrastructure issues around getting the code into the
module will be more complicated (potentially intractable) than dealing
with the multi-DL module once we have one...

I am in doubt about the "simpler" part but it's an option.

That's an understatement. :slight_smile:

But I think it's important to understand why, if only to make
multiple-DL modules more appealing.

The one disadvantage I see is that we have to change the way passes work in this
setting versus the single module setting.

Passes will already have to change, as they can't look at every
function or every call if those target a different DL. Probably a
simpler change, though.

cheers,
--renato

>> What I (tried to) describe is that you can pass an array of structs via
>> a CUDA memcpy (or similar) to the device and then expect it to be
>> accessible as an array of structs on the other side. I can imagine this
>> property doesn't hold for *every* programming model, but the question is
>> if we need to support the ones that don't have it. FWIW, I don't know if
>> it is worth to build up a system that can allow this property to be
>> missing or if it is better to not allow such systems to opt-in to the
>> heterogeneous module merging. I guess we would need to list the
>> programming models for which you cannot reasonably expect the above to
>> work.
>
> Right, this is the can of worms I think we won't see before it hits
> us. My main concern is that allowing for opaque interfaces to be
> defined means we'll be able to do almost anything around such simple
> constraints, and the code won't be heavily tested around it (because
> it's really hard to test those constraints).
>
> For example, one constraint is: functions that cross the DL barrier
> (ie. call functions in other DL) must marshall the arguments in a way
> that the size in bytes is exactly what the function expects, given its
> DL.
>
> This is somewhat easy to verify, but it's not enough to guarantee that
> the alignment of internal elements, structure layout, padding, etc
> make sense in the target. Unless we write code that pack/unpack, we
> cannot guarantee it is what we expect. And writing unpack code in GPU
> may not even be meaningful. And it can change from one GPU family to
> another, or one API to another.
>
> Makes sense?

Kind of. I get the theoretical concern but I am questioning whether we
need to support that at all. What I am trying to say is that for the
programming models I am aware of, this is not a concern to begin with. The
accelerator actually matches the host data layout. Let's take OpenMP.
The compiler cannot know what your memory actually is because types are,
you know, just hints for the most part. So we need the devices to match
the host data layout wrt. padding, alignment, etc., or we could not copy
an array of structs from one to the other and expect it to work. CUDA,
HIP, SYCL, ... should all be the same. I hope someone corrects me if I
have some misconceptions here :slight_smile:

>> I think that a multi-DL + multi-triple design seems like a good
>> candidate.
>
> I agree. Multiple-DL is something that comes and goes in the community
> and so far the "consensus" has been that data layout is hard enough as
> it is. I've always been keen on having it, but not keen on making it
> happen (and fixing all the bugs that will come with it). :smiley:
>
> Another problem we haven't even considered is where the triple will
> come from and in which form. As you know, triples don't usually mean
> anything without further context, and that context isn't present in
> the triple or the DL. They're lowered from the front-end in snippets
> of code (pack/unpack, shift/mask, pad/store/pass pointer) or thunks
> (EH, default class methods).
>
> Once it's lowered, fine, DL should be mostly fine because everything
> will be lowered anyway. But how will the user identify code from
> multiple different front-ends on the same IR module? If we restrict
> ourselves with a single front-end, then we'll need one front-end to
> rule them all, and that would be counter productive (and fairly
> limited scope for such a large change).
>
> I fear the infrastructure issues around getting the code inside the
> module will be more complicated (potentially intractable) than once we
> have a multi-DL module to deal with...
>
>> I am in doubt about the "simpler" part but it's an option.
>
> That's an understatement. :slight_smile:
>
> But I think it's important to understand why, only if to make
> multiple-DL modules more appealing.

Fair. And I'm open to be convinced this is the right approach after all.

>> The one disadvantage I see is that we have to change the way passes work in this
>> setting versus the single module setting.
>
> Passes will already have to change, as they can't look on every
> function or every call, if they're done to a different DL. Probably a
> simpler change, though.

Again, I'm not so sure. As long as the interface is opaque, e.g., calls
from the host go through a driver API, I doubt there is really a problem.

I imagine it somewhat like how OpenMP looks; here is my conceptual model:
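A sketch of that model, with made-up names for the driver entry points (the shape follows the usual offloading runtimes): the host only ever reaches the kernel through the driver, via a handle, so there is no direct call edge from host code into device code.

```cpp
// Device module: the kernel, compiled for the accelerator triple/DL.
extern "C" void __kernel(int *data, int n) {
  for (int i = 0; i < n; ++i)
    data[i] += 1;
}

// Host module: hypothetical driver API; the kernel is referenced only to
// obtain a handle, never called directly from host code.
extern "C" void *__get_device_handle(void (*kernel)(int *, int));
extern "C" void __driver_launch(void *handle, void *args[], int nargs);

extern "C" void __kernel_wrapper(int *data, int n) {
  void *handle = __get_device_handle(&__kernel);
  void *args[] = {&data, &n};
  __driver_launch(handle, args, 2);
}

void host_code(int *data, int n) {
  __kernel_wrapper(data, n); // the host never calls __kernel directly
}
```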

Let's take OpenMP.
The compiler cannot know what your memory actually is because types are,
you know, just hints for the most part. So we need the devices to match
the host data layout wrt. padding, alignment, etc. or we could not copy
an array of structs from one to the other and expect it to work. CUDA,
HIP, SYCL, ... should all be the same. I hope someone corrects me if I
have some misconceptions here :slight_smile:

All those programming models have already been made to inter-work with
CPUs like that. So, if we take the conscious decision that
accelerators' drivers must implement that transparent layer in order
to benefit from LLVM IR's multi-DL, fine.

I have no stakes in any particular accelerator, but we should make it
clear that they must implement that level of transparency to use this
feature of LLVM IR.

The "important" part is there is no direct call edge between the two
modules.

Right! This makes it a lot simpler. We just need to annotate each
global symbol with the right DL and trust that the lowering was done
properly.

What about optimisation passes? GPU code skips most of the CPU
pipeline so as not to break codegen later on, but AFAIK, this is done by
registering a new pass manager.

We'd need to teach passes (or the pass manager) to not throw
accelerator code into the CPU pipeline and vice-versa.

--renato

Let's take OpenMP.
The compiler cannot know what your memory actually is because types are,
you know, just hints for the most part. So we need the devices to match
the host data layout wrt. padding, alignment, etc. or we could not copy
an array of structs from one to the other and expect it to work. CUDA,
HIP, SYCL, ... should all be the same. I hope someone corrects me if I
have some misconceptions here :slight_smile:

All those programming models have already been made to inter-work with
CPUs like that. So, if we take the conscious decision that
accelerators' drivers must implement that transparent layer in order
to benefit from LLVM IR's multi-DL, fine.

I have no stakes in any particular accelerator, but we should make it
clear that they must implement that level of transparency to use this
feature of LLVM IR.

Yes. Whatever we do, it should be clear what requirements there are
for you to create a multi-target module. We can probably even verify
some of them, like the direct call edge thing.

The "important" part is there is no direct call edge between the two
modules.

Right! This makes it a lot simpler. We just need to annotate each
global symbol with the right DL and trust that the lowering was done
properly.

What about optimisation passes? GPU code skips most of the CPU
pipeline so as not to break codegen later on, but AFAIK, this is done by
registering a new pass manager.

That is an interesting point. We could arguably teach the (new) PM to run
different pipelines for the different devices. FWIW, I'm not even sure
we do that right now, e.g., for CUDA compilation. [long live uniformity!]

We'd need to teach passes (or the pass manager) to not throw
accelerator code into the CPU pipeline and vice-versa.

What do you mean by accelerator code? Intrinsics, vector length,
etc. should be controlled by the triple, so that should be handled.

~ Johannes

I think it's worth taking a step back here and thinking through the problem. The proposed solution makes me nervous because it is quite a significant change to the compiler flow that comes from thinking of heterogeneous optimisation as a fat LTO problem, when to me it feels more like a thin LTO problem.

At the moment, there's an implicit assumption that everything in a Module will flow to the same CodeGen back end. It can make global assumptions about cost models, can inline everything, and so on.

It sounds as if we have a couple of use cases:

  - Analysis flow between modules
  - Transforms that modify two modules

The first case covers the motivating example of constant propagation. This feels like a case where the right approach is something like ThinLTO, where you can collect the fact that a kernel is invoked only with specific constant arguments in the host module and consume that result in the target module.

The second example is what you'd need for things like kernel fusion, where you need to both combine two kernels in the target module and also modify the callers to invoke the single kernel and skip some data flow. For this, you need a kind of pass that can work over things that begin in two modules.

It seems that a less invasive change would be:

  - Use ThinLTO metadata for the first case, extend it as required.
  - Add a new kind of ModuleSetPass that takes a set of Modules and is allowed to modify both.

This avoids any modifications for the common (single-target) case, but should give you the required functionality. Am I missing something?
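As a rough sketch of what the second bullet could look like (entirely hypothetical; no such interface exists in LLVM today):

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/Module.h"

// Hypothetical interface: a pass that runs over a *set* of modules (e.g.,
// {host, device}) and may inspect and modify all of them, while the existing
// single-module pipelines stay untouched.
class ModuleSetPass {
public:
  virtual ~ModuleSetPass() = default;
  virtual bool runOnModuleSet(llvm::ArrayRef<llvm::Module *> Modules) = 0;
};

// Example user: kernel fusion needs to rewrite the device module (combine the
// kernels) and the host module (rewrite the launch sites) in one step.
class KernelFusionPass : public ModuleSetPass {
public:
  bool runOnModuleSet(llvm::ArrayRef<llvm::Module *> Modules) override {
    // ... find launch sites in the host module and kernels in the device
    // module, fuse, and update both ...
    return false; // report whether anything changed
  }
};
```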

David

[off topic] I'm not a fan of the "reply-to-list" default.

Thanks for the feedback! More below.

TL;DR
-----

Let's allow merging two LLVM-IR modules for different targets (with
compatible data layouts) into a single LLVM-IR module to facilitate
host-device code optimizations.

I think it's worth taking a step back here and thinking through the problem. The proposed solution makes me nervous because it is quite a significant change to the compiler flow that comes from thinking of heterogeneous optimisation as a fat LTO problem, when to me it feels more like a thin LTO problem.

At the moment, there's an implicit assumption that everything in a Module will flow to the same CodeGen back end. It can make global assumptions about cost models, can inline everything, and so on.

FWIW, I would expect that we split the module *before* the codegen stage such that the back end doesn't have to deal with heterogeneous models (right now).

I'm not sure about cost models and such though. As far as I know, we don't do global decisions anywhere but I might be wrong. Put differently, I hope we don't do global decisions as it seems quite easy to disturb the result with unrelated code changes.

It sounds as if we have a couple of use cases:

- Analysis flow between modules
- Transforms that modify two modules

Yes! Notably the first bullet is bi-directional and cyclic :wink:

The first case covers the motivating example of constant propagation. This feels like a case where the right approach is something like ThinLTO, where you can collect the fact that a kernel is invoked only with specific constant arguments in the host module and consume that result in the target module.

Except that you can have cyclic dependencies, which makes this problematic again. You might not propagate constants from the device module to the host one, but if memory is only read/written on the device, that is very interesting on the host side. You can avoid memory copies, remove globals, etc. That is just what comes to mind right away. The proposed heterogeneous modules should not limit you to "monolithic LTO", or "thin LTO" for that matter.

The second example is what you'd need for things like kernel fusion, where you need to both combine two kernels in the target module and also modify the callers to invoke the single kernel and skip some data flow. For this, you need a kind of pass that can work over things that begin in two modules.

Right. Splitting, fusing, moving code, etc. all require you to modify both modules at the same time. Even if you only modify one module, you want information from both, either direction.

It seems that a less invasive change would be:

- Use ThinLTO metadata for the first case, extend it as required.
- Add a new kind of ModuleSetPass that takes a set of Modules and is allowed to modify both.

This avoids any modifications for the common (single-target) case, but should give you the required functionality. Am I missing something?

This is similar to what Renato suggested early on. In addition to the "ThinLTO metadata" inefficiencies outlined above, the problem I have with the second part is that it requires writing completely new passes in a different style from anything we have. It is certainly a possibility, but we can probably do it without any changes to the infrastructure.

In addition to the analysis/optimization infrastructure reasons, I would like to point out that this would make our toolchains a lot simpler. We have some embedding of device code in host code right now (on every level), and things like LTO for all offloading models would become much easier if we distributed the heterogeneous modules instead of yet another embedding. I might be biased by the way the "clang offload bundler" is used right now for OpenMP, HIP, etc., but I would very much like to replace that with a "clean" toolchain that performs as much LTO as possible, at least for the accelerator code.

I hope this makes some sense, feel free to ask questions :slight_smile:

~ Johannes

FWIW, I would expect that we split the module *before* the codegen stage
such that the back end doesn't have to deal with heterogeneous models
(right now).

Indeed. Even if the multiple targets are all supported by the same
back-end (ex. different Arm families), the target info decisions are
too ingrained in how we created the back-ends to be easy (or even
possible) to split.

I'm not sure about cost models and such though. As far as I know, we
don't do global decisions anywhere but I might be wrong. Put
differently, I hope we don't do global decisions as it seems quite easy
to disturb the result with unrelated code changes.

Target info (ex. TTI) are dependent on triple + hidden parameters
(passed down from the driver as target options), which are global.

As I said before, having multiple target triples in the source will
not change that, and we'll have to create multiple groups of driver
flags, applicable to different triples. Or we'll need to merge modules
from different front-ends, in which case this looks more and more like
LTO.

This will not be trivial to map and the data layout does not reflect
any of that.

cheers,
--renato

FWIW, I would expect that we split the module *before* the codegen stage
such that the back end doesn't have to deal with heterogeneous models
(right now).

Indeed. Even if the multiple targets are all supported by the same
back-end (ex. different Arm families), the target info decisions are
too ingrained in how we created the back-ends to be easy (or even
possible) to split.

Right, and I don't see the need to generate code "together" :wink:

I'm not sure about cost models and such though. As far as I know, we
don't do global decisions anywhere but I might be wrong. Put
differently, I hope we don't do global decisions as it seems quite easy
to disturb the result with unrelated code changes.

Target info (ex. TTI) are dependent on triple + hidden parameters
(passed down from the driver as target options), which are global.

As I said before, having multiple target triples in the source will
not change that, and we'll have to create multiple groups of driver
flags, applicable to different triples. Or we'll need to merge modules
from different front-ends, in which case this looks more and more like
LTO.

This will not be trivial to map and the data layout does not reflect
any of that.

So in addition to multiple target triples and DLs we would probably want multiple target info objects, correct?

At this point I ask myself if it wouldn't be better to make the target cpu, features, and other "hidden parameters" explicit in the module itself. (I suggested part of that recently anyway [0].) That way we could create the proper target info from the IR, which seems to me like something valuable even in the current single-target setting.

I mean, wouldn't that allow us to make `clang -emit-llvm` followed by `opt` behave more like a single `clang` invocation? If so, that seems desirable :wink:

~ Johannes

[0] https://reviews.llvm.org/D80750 (llvm-link: Add module flag behavior MergeTargetID)

This is still not enough. Other driver flags exist, which have to do
with OS and environment issues (incl. user flags) that are not part of
the target description and can affect optimisation, codegen and even
ABI.

Some of those options apply to some targets and not others. If they
apply to all targets you have, the user might want to apply them to some
but not all, and then how will this work on the cmdline side?

I don't know the extent of what you can combine from all of the
existing global options into IR annotations, but my wild guess is that
it would explode the number of attributes, which is not a good thing.

--renato

>> At this point I ask myself if it wouldn't be better to make the target
>> cpu, features, and other "hidden parameters" explicit in the module itself.
>> (I suggested part of that recently anyway[0].) That way we could create the
>> proper target info from the IR, which seems to me like something
>> valuable even in the current single-target setting.
>
> This is still not enough. Other driver flags exist, which have to do
> with OS and environment issues (incl. user flags) that are not part of
> the target description and can affect optimisation, codegen and even
> ABI.
>
> Some of those options apply to some targets and not others. If they
> apply to all targets you have, the user might want to apply to some
> but not all, and then how will this work at cmdline side?

I can see that we want different command line options per target in the
module. Given that we probably want to allow one pass pipeline per
target, maybe we keep the options but introduce something like a
`--device=N` flag which will apply all following options to the "N'th"
pipeline. That way you could specify things like:
` ... --inline-threshold=1234 --device=2 --inline-threshold=5678`

For TTI and such, the driver would create the appropriate version for
each target and put it in the respective pipeline, as it does now, just
that there are multiple pipelines.

My idea in the last email was to put the relevant driver options
(optionally) into the IR such that you can generate TTI and friends from
the IR alone. As far as I know, this is not possible right now. Note
that this is somewhat unrelated to heterogeneous modules but would
potentially be helpful there. If we would manifest the options though,
you could ask the driver to emit IR with target options embedded, then
use `opt` and friends to work on the result (w/o repeating the flags)
while still being able to create the same TTI the driver would have
created for you in an "end-to-end" run. (I might not express this idea
properly.)

> I don't know the extent of what you can combine from all of the
> existing global options into IR annotations, but my wild guess is that
> it would explode the number of attributes, which is not a good thing.

I mean, you can just put the command line string that set the options in
the first place into the IR, right? That is as long as it initially was, or
maybe I am missing something.

To recap things that might "differ" from the original proposal:
- We want multiple target triples.
- We probably want multiple data layouts.
- We probably want multiple pass pipelines, with different (cmd
line) options and such.
- We might want to make modules self contained wrt. target options
such that you can create TTI and friends w/o repeating driver
options.

~ Johannes

I mean, you can just put the command line string that set the options in
the first place into the IR, right? That is as long as it initially was, or
maybe I am missing something.

Options change with time, and this would make the IR incompatible
across releases without intentionally doing so.

To recap things that might "differ" from the original proposal:
   - We want multiple target triples.
   - We probably want multiple data layouts.
   - We probably want multiple pass pipelines, with different (cmd
     line) options and such.
   - We might want to make modules self contained wrt. target options
     such that you can create TTI and friends w/o repeating driver
     options.

The extent of the separation is what made me suggest that it might be
easier, in the end, to carry multiple modules, from different
front-ends, through multiple pipelines but interacting with each
other.

I guess this is why David made a parallel with LTO, as this ends up
being a multi-device LTO in a sense. I think that will be easier and
much less intrusive than rewriting the global context, target flags,
IR annotation, data layout assumptions, target triple parsing, target
options bundling, etc.

--renato

>> I mean, you can just put the command line string that set the options in
>> the first place into the IR, right? That is as long as it initially was, or maybe I
>> am missing something.
>
> Options change with time, and this would make the IR incompatible
> across releases without intentionally doing so.

You could arguably be forgiving when it comes to the parsing of these, so
you might lose some if you mix IR across releases, but right now you
cannot express this at all. I mean, IR looks as if it captures the
entire state, but not quite. As a use case, the question of how to reproduce
`clang -O3` with opt comes up every month or so on the list. Let's table
this for now as it seems unrelated to this proposal.

>> To recap things that might "differ" from the original proposal:
>> - We want multiple target triples.
>> - We probably want multiple data layouts.
>> - We probably want multiple pass pipelines, with different (cmd
>> line) options and such.
>> - We might want to make modules self contained wrt. target options
>> such that you can create TTI and friends w/o repeating driver
>> options.
>
> The extent of the separation is what made me suggest that it might be
> easier, in the end, to carry multiple modules, from different
> front-ends, through multiple pipelines but interacting with each
> other.
>
> I guess this is why David made a parallel with LTO, as this ends up as
> being a multi-device LTO in a sense. I think that will be easier and
> much less intrusive than rewriting the global context, target flags,
> IR annotation, data layout assumptions, target triple parsing, target
> options bundling, etc.

It is definitely multi-device (link time) optimization. The link
time part is somewhat optional and might be misleading given the
popularity of single-source programming models for accelerators. The
"thinLTO" idea would also not be sufficient for everything we hope to
do; the two-module approach would be, though.

What if we don't rewrite these things but still merge the modules?
Let me explain :wink:

(I use `opt` invocations below as a placeholder for the lack of a better
term but knowing it is not (only) the `opt` tool we talk about.)

The problem is that the `opt` invocation is primed for a single target;
everything (pipeline, TTI, flags, ...) exists only once, right?
I imagine the two-module approach would run two `opt` invocations, one for
each module, which we would synchronize at some point to do cross-module
optimizations. Given that we can run two `opt` invocations and we assume
a pass can work with two modules, that is, two sets of everything, why do
we need the separation? From a tooling perspective I think it makes
things easier to have a single module. That said, it should not preclude
us from running two separate `opt` invocations on it. So we don't rewrite
everything but instead "just" need to duplicate all the information in
the IR such that each `opt` invocation can extract its respective set
of values and run on the respective set of global symbols. This would
reduce the new stuff to more or less what we started with: a device triple
& DL per device, and a way to link a global symbol to a device triple & DL.
It is the two-module approach but with "co-located" modules :wink:
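To illustrate the "respective set of global symbols" part, here is a small sketch; the "heterogeneous-triple" attribute is a made-up convention, nothing sets or reads it today. Each per-device invocation would only collect and touch the globals annotated for its target, with unannotated globals defaulting to the host.

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Hypothetical convention: functions owned by a device carry a string
// attribute naming their triple; unannotated functions belong to the host.
static bool belongsTo(const Function &F, StringRef DeviceTriple) {
  if (F.hasFnAttribute("heterogeneous-triple"))
    return F.getFnAttribute("heterogeneous-triple").getValueAsString() ==
           DeviceTriple;
  return DeviceTriple.empty(); // empty triple means "the host set"
}

// Each `opt`-style invocation collects "its" functions from the shared module
// and runs its own pipeline (with its own TTI, flags, ...) over just that set.
SmallVector<Function *, 16> collectFunctionsFor(Module &M,
                                                StringRef DeviceTriple) {
  SmallVector<Function *, 16> Result;
  for (Function &F : M)
    if (belongsTo(F, DeviceTriple))
      Result.push_back(&F);
  return Result;
}
```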

WDYT?

~ Johannes

P.S. This is really helpful but I won't give up so easily on the idea.
If I do, I have to implement cross module optimizations and I would
rather not :wink:

I think you're being overly optimistic in hoping the "triple+DL"
representation will be enough to emulate multi-target.

It may work for the cases you care about but it will create a host of
corner cases that the community will have to maintain with no tangible
additional benefit to a large portion of it.

But I'd like to hear the opinion of others on the subject before
making up my own mind...

--renato