[GlobalISel] A Proposal for global instruction selection


Hi Eric,

Hi Quentin,

*** Goals ***

The high level goals of the new instruction selector are:

  • Global instruction selector.
  • Fast instruction selector.

Are these separate or the same? It reads like two instruction selectors at the moment.

They are the same, sorry for the confusion. This reads: we want a global and fast instruction selector, where producing code fast and producing good code quality exercise the same basic path in the framework. I.e., producing code fast is a trimmed-down version of producing good code. E.g., for fast, analyses are less precise, fewer passes are run, etc.

Excellent.

  • Shared code path for fast and good instruction selection.

But then I’m not sure about starting here.

  • IR that represents ISA concepts better.
  • More flexible instruction selector.

Some definitions here would be good.

For IR that represents ISA concepts better, this is in opposition to SDISel or LLVM IR. In other words, the target should be able to insert target specific code (e.g., instruction, physical register) at any time without needing some extra cruft to express that (e.g., intrinsic or custom SDNode).

I’m not sure that this represents the concepts any better. Basically it means that you have less and easier target independent handling; I’m unconvinced this is that useful. Perhaps an example might help :slight_smile:

I don’t have an example offhand, but basically, any time we have to create a custom SDNode, that is needless cruft.
Another thing, which I didn’t call out before because it is a strong statement and I don’t want to commit to it for now, is that we can get a better estimate of register pressure and things like choosing addressing modes, since we are at a much lower level and can directly emit the proper memory operation or look at the actual register classes.

Those are the kinds of opportunities I envision from moving to the MachineInstr level.

By more flexible we mean that targets should be able to inject target specific passes between the generic passes or replace those passes with their own.

It’ll be interesting to see how this is going to be developed and how to keep the target independentness of the code generator with this new scheme. I.e. this is basically turning (in my mind) into “every backend for themselves” with very little target independent unification. Outside of special purpose ports I don’t see a lot of need for this, but we’ll see. I think it’s going to take some discipline to avoid the “every backend is a large C++ project that defines everything it needs custom”.

At this point, the idea is to have the standard passes shared (i.e., IRTranslator, Legalizer, RegBankSelect, and Select) and let the targets create their own passes if they want to do more. Then, if we see room for generalization, we can refactor :).

This is what happens right now with the IR passes that are target specific. Sometimes those get factored out like GlobalMerge.

I believe the same could happen with MachineInstr passes.
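Purely as a sketch of the intended shape (all names here are hypothetical, nothing of this exists yet): the shared driver always schedules the standard passes, and a single virtual hook is all a target has to override to inject its own MI passes in between.

#include <string>
#include <utility>
#include <vector>

struct PassPipeline {
  std::vector<std::string> Passes;
  void add(std::string Name) { Passes.push_back(std::move(Name)); }
};

struct TargetPassConfig {
  virtual ~TargetPassConfig() = default;
  // Target hook: empty by default, overridden to inject custom passes.
  virtual void addPreLegalizePasses(PassPipeline &) {}
  void buildGlobalISel(PassPipeline &P) {
    P.add("IRTranslator");    // standard
    addPreLegalizePasses(P);  // e.g. a target-specific combiner
    P.add("Legalizer");       // standard
    P.add("RegBankSelect");   // standard
    P.add("Select");          // standard
  }
};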

Now, regarding the standard passes themselves, I don’t see how the target independentness will be different from our current selector. If your concern is about canonicalization, well, yes, if targets start to mix the generic opcodes with target specific opcodes, we may lose some of it, but, I would say, that is the point!
If targets want to mess with canonicalization, so be it. This is a problem we have with SDISel: sometimes targets fight the canonicalization and this is very hard. Now, that would be much easier :).

Anyhow, yes, this is something we need to keep an eye on, and at some point in the prototype, it will be nice to have quick targeting of the framework for other backends.

  • Easier to maintain/understand framework, in particular legalization.
  • Self contained machine representation, no back links to LLVM IR.
  • No change to LLVM IR.

These sound great. Would be good to get the assumptions of the legalization pass written down more explicitly as you go through this.

Agree.
For now, the assumption is that there are no illegal types, just illegal pairs of operation and type. But yeah, we may need to refine this when we get to the legalization.
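To illustrate that assumption with a minimal sketch (all names hypothetical), a legality query would always take an (operation, type) pair; there is no notion of a type being illegal on its own:

enum Opcode { G_ADD, G_MUL };
enum Type { s32, s64, s128 };
enum Action { Legal, NarrowScalar, Libcall };

// Legality is a property of the pair, never of the type alone.
Action getAction(Opcode Op, Type Ty) {
  if (Op == G_ADD && Ty == s128)
    return NarrowScalar; // e.g. split into two s64 operations
  if (Op == G_MUL && Ty == s64)
    return Libcall;      // e.g. lower to a __muldi3-style runtime call
  return Legal;
}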

Also things like canonicalization, etc. Just something to think about.

Yeah, I just mentioned canonicalization in my previous paragraph, and the bottom line is that I don’t think canonicalization should be required for correctness.

*** Proposed Approach ***

In this section, I describe the approach I plan to pursue in the prototype and the roadmap to get there. The final design will flow out of it.

For this prototype, we purposely exclude any work to improve or use TableGen or…

I’m getting the idea that you really don’t want to work on TableGen? :wink:

Heh, that’s more a pragmatic approach. I don’t want us to spend months improving TableGen before we start working on GlobalISel.
That being said, I think we should push as much as possible into TableGen when we are done with prototyping.

Sure.

** Implications **

As part of the bring-up of the prototype, we need to extend some of the core MachineInstr-level APIs:

  • Need to remember FastMath flags for each MachineInstr.

Not orthogonal to this proposal? I don’t mind lumping it in as being able to do this is probably a good goal for the prototype at least, but it seems like being able to do this is something that could be done incrementally as a separate project?

That’s a good point and yes, it could be done as a separate project. The reason it is here is that if we want to experiment with combines and such in the prototype, this is the kind of information we would need.

Hmm, I thought you were avoiding combine? :slight_smile:

Heh, I figured someone may want to try it during the prototype timeframe :). Though what interests me here is checking how easy it is to propagate this kind of information while going for the self-contained IR.
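As a minimal sketch of what that could look like (a hypothetical API, just to show the kind of information an MI-level combine would consult):

#include <cstdint>

// Hypothetical per-MachineInstr flag bits mirroring the IR fast-math flags.
enum MIFlag : uint32_t {
  FmNoNans  = 1u << 0,
  FmNoInfs  = 1u << 1,
  FmReassoc = 1u << 2,
};

// An MI-level combine would gate FP rewrites on these flags, e.g. only
// reassociate (a + b) + c into a + (b + c) when the instruction allows it.
bool canReassociate(uint32_t Flags) { return (Flags & FmReassoc) != 0; }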

At the end of M1, the prototype will not be able to produce code, since we would only have the beginning of the GlobalISel pipeline. Instead, we will test the IRTranslator on the generic output that is produced from the tested IR.

So this would be targeting Generic MachineInstr?

Yes.

(Better name perhaps?).

Suggestion welcome :).

Yeah. First suggestion: Let’s leave off the r :wink:

Gene.ic :stuck_out_tongue:

Which means that it should be serializable and testable in isolation yes?

Partly. The lowering of the body of the function will be generic, but the ABI lowering will be target specific and unless we create some kind of fake target, the tests need to be bound to one target.

That’s reasonable.

The fake target, or being bound to one target?

* Design Decisions *

  • The IRTranslator is a final class. Its purpose is to move away from the LLVM IR world to the MachineInstr world [final].
  • Lower the ABI as part of the translation process [final].

* Design Questions the Prototype Addresses at the End of M1 *

  • Handling of aggregate types during the translation.
  • Lowering of switches.
  • What about a Module pass for Machine passes?

Could you elaborate a bit more here?

I have quickly mentioned in my reply to Marcello why this may be interesting. Let me rephrase my answer here.
Basically, we would like MachineInstr to be self-contained, i.e., get rid of those back links to LLVM IR. This implies that we would need to lower globals (maybe directly to MC) as part of the translation process. Globals are not attached to a function but to the module; therefore it seems to make sense to introduce a concept of a MachineModulePass.
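A minimal sketch of the concept (hypothetical; today only function-level machine passes exist):

class MachineModule; // would own the per-function MI plus lowered globals

class MachineModulePass {
public:
  virtual ~MachineModulePass() = default;
  // Runs once over the whole module, e.g. to lower globals (maybe
  // directly to MC) so no back link to the LLVM IR module is needed.
  virtual bool runOnMachineModule(MachineModule &MM) = 0;
};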

*nod* I’d like to do something about the AsmPrinter anyhow.

  • Introduce new APIs to have a clearer separation between:
      • Legalization (setOperationAction, etc.)
      • Cost/Combine related (isXXXFree, etc.)
      • Lowering related (LowerFormal, etc.)
  • What is the contract with the backends? Is it still “should be able to select any valid LLVM IR”?

Probably :slight_smile:

As far as the prototype goes, I think you also need to address a few additional things:

a) Calls
Calls are probably the most important part of any new instruction selector and lowering machinery and I think that the design of the call lowering infrastructure is going to be a critical part of evaluating the prototype. You might have meant this earlier when you said Lowering related, but I wanted to make sure to call it out explicitly.

Yes, lowering of calls is definitely going to be evaluated in the prototype for this first milestone, and the “lowering related” stuff was about that :).
(You’re good at deciphering messages ;)).

I try. Anyhow, glad to hear about calls.

b) Testing
It’s been covered a bit before, but being able to serialize the various IR constructs and use them for testing is important. In particular, I worry about the existing MIR code, as I and a few others have tried to use it for testcases and failed. I’m very interested in whatever ideas you have here; all of mine are much more invasive than I think we’d like.

Honestly I haven’t used the MIR testing infrastructure yet, but yes, my impression was that it is not really… mature. I would love to have some serialization mechanism for the MI that really works so that we can write those testcases more easily.
As of now, I haven’t looked into it, so I cannot share any ideas. I’ve discussed it a bit with Matthias and he thinks that we might not be that far away from having MIR testing usable modulo bug fixes.

It would be helpful if you could file PRs for the cases where MIR was not working for you so that we can look into it at some point.

My hope is that someone could look into it before we actually need proper MI testing in place.

(Hidden message: if you are willing to work on the MIR testing or any other mechanism that would allow us to do MI serialization/deserialization, please come forward, we need you!! :D)

Indeed, for the translation part the MIR testing is not critical, since we do have the LLVM IR around.
Then, if we get rid of the LLVM IR back links, serialization should become easier and maybe MIR testing could be leveraged. That being said, it may be possible that we need to start from scratch, while taking into account what we learned from the MIR testing.

Pretty much agree with this. I didn’t file a bug because I wasn’t sure what to say other than “this serialization wasn’t useful for making test cases”. Maybe you’ll find it more so and we can get some best practices out of it.

Fingers crossed!

Thanks for the additional feedback, I think it really helps to call all that out!
Q.

  • Shared code path for fast and good instruction selection.

But then I’m not sure about starting here.

  • IR that represents ISA concepts better.
  • More flexible instruction selector.

Some definitions here would be good.

For IR that represents ISA concepts better, this is in opposition to SDISel or LLVM IR. In other words, the target should be able to insert target specific code (e.g., instruction, physical register) at any time without needing some extra cruft to express that (e.g., intrinsic or custom SDNode).

I’m not sure that this represents the concepts any better. Basically it means that you have less and easier target independent handling; I’m unconvinced this is that useful. Perhaps an example might help :slight_smile:

I don’t have an example offhand, but basically, any time we have to create a custom SDNode, that is needless cruft.
Another thing, which I didn’t call out before because it is a strong statement and I don’t want to commit to it for now, is that we can get a better estimate of register pressure and things like choosing addressing modes, since we are at a much lower level and can directly emit the proper memory operation or look at the actual register classes.

Those are the kinds of opportunities I envision from moving to the MachineInstr level.

I guess it’ll depend on how we construct the generic MIR. I’m not seeing it, but I’ll hope for some magic :slight_smile:

  • Easier to maintain/understand framework, in particular legalization.
  • Self contained machine representation, no back links to LLVM IR.
  • No change to LLVM IR.

These sound great. Would be good to get the assumptions of the legalization pass written down more explicitly as you go through this.

Agree.
For now, the assumption is that there are no illegal types, just illegal pairs of operation and type. But yeah, we may need to refine this when we get to the legalization.

Also things like canonicalization, etc. Just something to think about.

Yeah, I just mentioned canonicalization in my previous paragraph, and the bottom line is that I don’t think canonicalization should be required for correctness.

Well, I mean things like “target A wants op * 2, target B wants op << 2” as far as canonicalization goes. I mean it from a “how we look at code generation” level, nothing more. I’m also not too fussed here. We’ll see how it comes out.

Suggestion welcome :).

Yeah. First suggestion: Let’s leave off the r :wink:

Gene.ic :stuck_out_tongue:

Hahahahahaha.

Which means that it should be serializable and testable in isolation yes?

Partly. The lowering of the body of the function will be generic, but the ABI lowering will be target specific and unless we create some kind of fake target, the tests need to be bound to one target.

That’s reasonable.

The fake target, or being bound to one target?

Bound to one target if necessary. I like that we have some code generation tests that are “every target can handle this”, but others are “let’s see what we get on target X with this input”. Both are valuable, but I don’t see a need to construct a fake target for anything.

-eric


Hi,

I haven’t had a chance to read all of this yet, but one minor thing occurred to me during your presentation that I want to mention. At one point you mentioned deleting all the bitcast instructions since they’re equivalent to no-ops, but this isn’t always true.

The http://llvm.org/docs/LangRef.html definition of the bitcast instruction includes this sentence:

The conversion is done as if the value had been stored to memory and read back as type ty2.

For big-endian MSA, this is equivalent to a shuffling of the bits in the register because endianness only changes the byte order within each element. The order of the elements is unaffected by endianness. IIRC, big-endian NEON is the same way.
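To make the “stored and read back” reading concrete, here is a minimal C++ illustration (a sketch of the semantics, not MSA-specific code): the bytes in memory are unchanged by the bitcast, so on a big-endian vector unit the in-register layout of the two types can differ even though the memory image is identical.

#include <cstdint>
#include <cstring>

// <4 x i32> to <2 x i64> as if through memory: a byte-for-byte copy that
// preserves element order; any register-level bit shuffling is up to the
// target.
void bitcast_4xi32_to_2xi64(const uint32_t In[4], uint64_t Out[2]) {
  std::memcpy(Out, In, 4 * sizeof(uint32_t));
}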

Hi Daniel,

Thanks for pointing that out.

Cheers,
-Quentin

By more flexible we mean that targets should be able to inject target specific passes between the generic passes or replace those passes with their own.

It'll be interesting to see how this is going to be developed and how to keep the target independentness of the code generator with this new scheme. I.e. this is basically turning (in my mind) into "every backend for themselves" with very little target independent unification. Outside of special purpose ports I don't see a lot of need for this, but we'll see. I think it's going to take some discipline to avoid the "every backend is a large C++ project that defines everything it needs custom".

At this point, the idea is to have the standard passes shared (i.e., IRTranslator, Legalizer, RegBankSelect, and Select) and let the targets create their own passes if they want to do more. Then, if we see room for generalization, we can refactor :).

To give a concrete example of this:

Currently, SelectionDAG has completely generic infrastructure for expanding unaligned loads and stores into sequences of aligned loads and masks. Unfortunately, the interface is entirely push from the generic side, not pull from the target side. We hit a problem where we needed different handling of unaligned loads and stores based on the address space of the base pointer. We ended up having to duplicate a load of the SelectionDAG logic in the back end. Eventually, extending our CPU to support unaligned loads and stores proved less effort than trying to get SelectionDAG to play nicely with our constraints.

With the new design, I’d imagine that there’d be a generic ExpandUnalignedLoad function in the supporting library that any target could simply use for the places where it makes sense.

That is the idea behind the LegalizationKit.

On MIPS, for example, the load-word-left and load-word-right instructions need some special handling and allow you to generate quite efficient code, but other forms of unaligned load and store may want to be handled generically. Being able to have the targets choose at a fine granularity by explicitly calling into the generic code for the functionality that they need is likely to be a lot better than having to provide a load of predicates up-front and hoping that the ones that the generic infrastructure asks for match the ones that you want (currently, there’s no way to tell SelectionDAG that you want different handling for things with pointers in different address spaces and no way to specify Custom for all address spaces and then call back into the generic behaviour if you want to use it for a subset).
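As a minimal sketch of that pull model (hypothetical names; the point is only who calls whom):

// Generic, reusable expansion living in the supporting library.
void expandUnalignedLoadGeneric(/* MachineInstr &Load */) {
  // ... emit aligned loads, shifts, and masks ...
}

// The target's hook decides, per load, whether to emit its own sequence
// or to explicitly pull in the shared expansion.
void lowerUnalignedLoad(unsigned AddrSpace) {
  if (AddrSpace == 200) { // e.g. a fat-pointer address space
    // custom, target-emitted sequence goes here
    return;
  }
  expandUnalignedLoadGeneric(); // everything else reuses the generic code
}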

There’s also the issue that, because SelectionDAG is a push model, it’s very easy to get into a situation (especially with the set_cc / br_cc families) where you do a transform, SelectionDAG does the inverse transform, then calls back into the back end, which redoes the transform, which SelectionDAG undoes, and so on. I think everyone who has worked on any back end has encountered this at least once (often identified with ‘why is this one test in the test suite running forever?’).

A few other things that come to mind as being easier to generalise in this model:

- Constant island generation (the ARM constant islands pass has been copied around the place a few times)

- The expansions for ll/sc atomics (we currently do this in LLVM IR, which is the wrong place, because the right place is MIR, which doesn’t exist yet)

I’m sure that there are others.

Agreed, for instance, we have the load/store optimizers for both ARM and AArch64.

Thanks for the detailed example.
Q.

Bound to one target if necessary. I like that we have some code generation tests that are “every target can handle this”, but others are “let’s see what we get on target X with this input”. Both are valuable, but I don’t see a need to construct a fake target for anything.

Agreed!

Thanks Eric!
Q.

Hi Quentin,

First, thanks a lot for working on this! This is obviously a really-important problem.

One thought:

+ /// *** This is to support:
+ /// *** Self contained machine representation, no back links to LLVM IR.
+ /// Import the attribute from the IR Function.
+ AttributeSet AttributeSets; ///< Parameter attributes

I fully support better modeling of functions without ties to IR-level functions. This will allow very-late outlining, multiversioning, etc., and there are good use cases for these things. That having been said, I think we should have a narrower scope for severing MI <-> IR ties, because at least one important link will continue to exist: MMOs used to provide access to IR-level alias analysis. This is critical to good instruction scheduling, memory-access merging, etc. and replicating AA at the MI level is not feasible.

-Hal

I really don't mind the "every backend for themselves" approach. The instruction selection pass is about as target-specific as a common pass can get, and the more work the generic code tries to do, the more potential it has to be inflexible. This is not to say that generic code will necessarily be bad, but that a target-centric approach has a better chance of working out, even if it means that more work is required to implement instruction selection for a new target.

As someone mentioned in another email, the canonicalization currently done in the DAG combiner has a tendency to interfere with what individual targets may prefer. One example of it that I remember for Hexagon was that the LLVM IR had a combination of shifts left and right to extract a bitfield from a longer integer. Hexagon has an instruction to do that and it's quite simple to map the shifts into that instruction. The combiner, however, would fold the shifts, leaving only the minimum sequence of operations necessary to get the bitfield. This seems to be better from the generic point of view, but it makes it practically impossible for us to match it to the "extract" instruction, and in practice the code turns out to be worse. This is the only reason why we have the HexagonGenExtract pass---we detect the patterns in the LLVM IR and generate "extract" intrinsics before the combiner mangles them into unrecognizable forms. The same goes for replacing ADD with OR when the bits in the operands do not overlap. We have code that specifically undoes that, since for us, if the original code had an ADD, it is pretty much always better if it remains an ADD.
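To make the bitfield example concrete (widths are illustrative), the two forms compute the same value but have very different shapes for a pattern matcher:

#include <cstdint>

// As written: two shifts, which map directly onto a single extract-style
// instruction (bits 16..23 of x).
uint32_t extract_as_written(uint32_t x) { return (x << 8) >> 24; }

// After the combiner canonicalizes: shift-and-mask, which no longer looks
// like the two-shift pattern the backend wants to match.
uint32_t extract_canonicalized(uint32_t x) { return (x >> 16) & 0xff; }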

There were cases in the past when we had to disable parts of CodeGenPrepare, or else it would happily promote i32 into i64 where it wasn't strictly necessary. i64 is a legal type on Hexagon, but it uses pairs of registers which, in practical terms, means that our register set is cut in half when 64-bit values are used.

On the other hand, having a relatively simple, generic IR makes it easier to simplify code that is no longer subjected to the LLVM IR's constraints (e.g. getelementptr expressed as +/*, etc.). Hexagon has a lot of very specific complex/compound instructions and a given piece of code can be written in many different ways. This makes it harder to optimize code after the specific instructions have been selected. For example, a pass that would try to simplify arithmetic code would need to deal with the dozens of variants of add/multiplication instructions, instead of simply looking at some generic GADD/GMPY.

-Krzysztof

From: "Krzysztof Parzyszek via llvm-dev" <llvm-dev@lists.llvm.org>
To: llvm-dev@lists.llvm.org
Sent: Monday, November 30, 2015 9:30:40 AM
Subject: Re: [llvm-dev] [GlobalISel] A Proposal for global instruction selection

It'll be interesting to see how this is going to be developed and how to keep the target independentness of the code generator with this new scheme. I.e. this is basically turning (in my mind) into "every backend for themselves" with very little target independent unification.

I really don't mind the "every backend for themselves" approach. The instruction selection pass is about as target-specific as a common pass can get, and the more work the generic code tries to do, the more potential it has to be inflexible. This is not to say that generic code will necessarily be bad, but that a target-centric approach has a better chance of working out, even if it means that more work is required to implement instruction selection for a new target.

As someone mentioned in another email, the canonicalization currently done in the DAG combiner has a tendency to interfere with what individual targets may prefer. One example of it that I remember for Hexagon was that the LLVM IR had a combination of shifts left and right to extract a bitfield from a longer integer. Hexagon has an instruction to do that and it's quite simple to map the shifts into that instruction. The combiner, however, would fold the shifts, leaving only the minimum sequence of operations necessary to get the bitfield. This seems to be better from the generic point of view, but it makes it practically impossible for us to match it to the "extract" instruction, and in practice the code turns out to be worse. This is the only reason why we have the HexagonGenExtract pass---we detect the patterns in the LLVM IR and generate "extract" intrinsics before the combiner mangles them into unrecognizable forms.

I sympathize with this, but this is not uniquely a backend consideration. Even though we generally canonicalize these patterns at the IR level in a roughly consistent way, if the backend is not really robust in its matching logic, you'll miss a lot just from input IR differences. There's ~1K lines of code in PPCISelDAGToDAG.cpp (look for BitPermutationSelector) to match these in a robust way, and I don't see any good way around that.

The same goes for replacing ADD with OR when the bits in the operands do not overlap.

This is true for other targets too, at least when the ADD is part of an addressing expression. PowerPC, for example, has code in various places to recognize ORs, in combination with some known-bits information, as surrogates for ADDs. It seems like, globally, we could do a better job here.
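The underlying fact, as a small sketch: if no bit position can be set in both operands, the addition cannot produce a carry, so OR and ADD agree. With known-bits masks (a set bit meaning "known to be zero"), a sufficient test is:

#include <cstdint>

// True when (a | b) == (a + b) is guaranteed: every bit position is known
// zero in at least one operand, so no carries can occur.
bool orIsEquivalentToAdd(uint64_t KnownZeroA, uint64_t KnownZeroB) {
  return (~KnownZeroA & ~KnownZeroB) == 0;
}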

We have code that specifically undoes that, since for us, if the original code had an ADD, it is pretty much always better if it remains an ADD.

There were cases in the past when we had to disable parts of CodeGenPrepare, or else it would happily promote i32 into i64 where it wasn't strictly necessary. i64 is a legal type on Hexagon, but it uses pairs of registers which, in practical terms, means that our register set is cut in half when 64-bit values are used.

This is a common problem, but is not unique to CGP. Parts of the mid-level optimizer (e.g. IndVarSimplify) also do integer promotion in inopportune ways for some targets. Also, CGP has a lot of target hooks to turn off things like this, and this should certainly be optional (in practice, this widening is sometimes information destroying, and thus, not always reversible).

On the other hand, having a relatively simple, generic IR makes it easier to simplify code that is no longer subjected to the LLVM IR's constraints (e.g. getelementptr expressed as +/*, etc.). Hexagon has a lot of very specific complex/compound instructions and a given piece of code can be written in many different ways. This makes it harder to optimize code after the specific instructions have been selected. For example, a pass that would try to simplify arithmetic code would need to deal with the dozens of variants of add/multiplication instructions, instead of simply looking at some generic GADD/GMPY.

I have mixed feelings about this. I agree that sometimes we do too much, and sometimes we make transformations where we should be providing better analysis instead, but I think there is still a lot of value in the common backend optimizations.

In this context, it is worth thinking about the various reasons why these opportunities exist in the first place (not a complete list):

1. The process of lowering GEPs (and some other relatively-higher-level IR constructs) into instructions that represent explicitly the underlying computations being performed (accounting for target capabilities) expose generic peephole/CSE opportunities.

2. The process of introducing target-specific nodes to represent partial behaviors introduces generic peephole/CSE opportunities. What I mean by this is, for example, when a floating-point division or sqrt is lowered to a target-specific reciprocal estimate function plus some Newton iterations, those Newton iterations are generic floating-point adds, multiplies, etc. that can be further optimized (see the sketch after this list).

3. The process of type/operation legalization, especially when operations are split, promoted and/or turned into procedures involving stack loads and stores, presents opportunities not apparent at the IR level.

4. Some generic optimizations, such as store merging, need more-detailed cost information than is available through TTI, and so are done in the backend.
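As a sketch of point 2: one Newton-Raphson step refines a hardware reciprocal estimate using nothing but generic multiplies and subtracts, roughly doubling the number of correct bits, and it is exactly those generic FP operations that remain visible to later combines.

// x1 = x0 * (2 - a * x0), where x0 is the hardware estimate of 1/a.
float refineRecipEstimate(float a, float x0) {
  return x0 * (2.0f - a * x0);
}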

In short, within our current framework, there are many reasons why we have DAGCombine and related code, and I think that while moving specific aspects into the targets might be good overall, leaving all of that to each target is too much. The fact that you can get a reasonable set of backend optimizations for normalish targets is a strong point of LLVM.

Thanks again,
Hal

Hi Hal,

The alias information is a good example of the MI-to-IR back links; thanks for pointing that out.


Honestly, although I understand why we have the MMOs right now, I don’t think this is a clean design and I would rather have an AA working at the MI level or, better, a different way of passing the information to MI.
I don’t have something in mind for how to pass the information if we choose that path, but I think it would be important to get rid of the MI → IR link for aliasing purposes, because we end up with, IMHO, ugly code where a Machine pass patches the IR to fix the alias information. E.g., in the stack coloring pass:

// AA might be used later for instruction scheduling, and we need it to be
// able to deduce the correct aliasing relationships between pointers
// derived from the alloca being remapped and the target of that remapping.
// The only safe way, without directly informing AA about the remapping
// somehow, is to directly update the IR to reflect the change being made
// here.
Instruction *Inst = const_cast<AllocaInst *>(To);
if (From->getType() != To->getType()) {
  BitCastInst *Cast = new BitCastInst(Inst, From->getType());
  Cast->insertAfter(Inst);
  Inst = Cast;
}

Therefore, I would prefer having the alias information expressed as something decoupled from the IR that can be updated.

What do you think?

Cheers,
-Quentin

From: "Quentin Colombet" <qcolombet@apple.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Monday, November 30, 2015 1:34:20 PM
Subject: Re: [llvm-dev] [GlobalISel] A Proposal for global
instruction selection

Hi Hal,

The alias information is a good example of the MI to IR back links,
thanks for pointing that out.

> Hi Quentin,

> First, thanks a lot for working on this! This is obviously a
> really-important problem.

> One thought:

> + /// *** This is to support:

> + /// *** Self contained machine representation, no back links to
> LLVM IR.

> + /// Import the attribute from the IR Function.

> + AttributeSet AttributeSets; ///< Parameter attributes

> I fully support better modeling of functions without ties to
> IR-level
> functions. This will allow very-late outlining, multiversioning,
> etc., and there are good use cases for these things. That having
> been said, I think we should have a narrower scope for severing MI
> <-> IR ties, because at least one important link will continue to
> exist: MMOs used to provide access to IR-level alias analysis. This
> is critical to good instruction scheduling, memory-access merging,
> etc. and replicating AA at the MI level is not feasible.

Honestly, although I understand why we have the MMOs right now, I
don’t think this is a clean design and I would rather have an AA
working at MI level or, better, a different way of passing the
information to MI.
I don’t have something in mind on how to pass the information if we
choose that path, but I think it would be important to get rid of
the MI -> IR link for aliases purposes, because we end up with,
IMHO, ugly code where a Machine pass patches the IR to fix the alias
information. E.g., in the stack coloring pass:

// AA might be used later for instruction scheduling, and we need it
to be
// able to deduce the correct aliasing releationships between
pointers
// derived from the alloca being remapped and the target of that
remapping.
// The only safe way, without directly informing AA about the
remapping
// somehow, is to directly update the IR to reflect the change being
made
// here.
Instruction *Inst = const_cast<AllocaInst *>(To);
if (From->getType() != To->getType()) {
BitCastInst *Cast = new BitCastInst(Inst, From->getType());
Cast->insertAfter(Inst);
Inst = Cast;
}

Therefore, I would prefer having the alias information expressed as
something decoupled from the IR and that could be updated.

What do you think?

I certainly agree that it is ugly, and as the author of the comment you've highlighted above, I would love to have a better solution. The only problem is that I don't have a good idea how such a solution might work; prerecording all N^2 possible aliasing queries is impractical. I don't think that rewriting our current AA to work on MI, or even refactoring the current AA logic to work in terms of abstractions over both IR and MI is really possible, because we need it to function even after MI has dropped out of SSA form and PHI elimination has happened.

That having been said, prerecording query results still seems like the best solution, but it can't be naive (N^2). We have to understand the constraint that the query results will only be used to disambiguate otherwise-undecidable aliasing queries for the purpose of doing merging, scheduling, etc., and maybe within those use cases, we can constrain the problem enough to make prerecording the query results practical.
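One way to picture the constrained prerecording (purely a sketch, with hypothetical IDs handed out to memory operands during translation): only the pairs that later passes are entitled to ask about get recorded, and everything else conservatively degrades to MayAlias.

#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>

enum class AliasResult { NoAlias, MayAlias, MustAlias };
using MemOpID = uint32_t;

std::map<std::pair<MemOpID, MemOpID>, AliasResult> PrecomputedAA;

AliasResult query(MemOpID A, MemOpID B) {
  std::pair<MemOpID, MemOpID> Key = std::minmax(A, B);
  auto It = PrecomputedAA.find(Key);
  // Anything not prerecorded must conservatively be MayAlias.
  return It == PrecomputedAA.end() ? AliasResult::MayAlias : It->second;
}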

The bad news, however, is that prerecording the query results does not remove the comment in stack coloring, but just changes it to discuss operating on some MI-level data structure that is not the IR.

Thanks again,
Hal

Hi David,

I had a quick look at the inttoptr/ptrtoint thing for GlobalISel and, unless I am mistaken, the semantics you want for such instructions do not match what the language reference says.

Indeed, you said that the inttoptr instruction is not a no-op on your architecture, whereas the language reference says:
“The ‘inttoptr‘ instruction converts value to type ty2 by applying either a zero extension or a truncation depending on the size of the integer value. If value is larger than the size of a pointer then a truncation is done. If value is smaller than the size of a pointer then a zero extension is done. If they are the same size, nothing is done (no-op cast).”

The bottom line is that, IMHO, if you rely on inttoptr/ptrtoint instructions to do the conversion from fat pointers to plain integers, you are abusing the IR.
I plan to stick to the LLVM IR semantics for the generic opcodes of GlobalISel, and thus it does not seem useful to have INTTOPTR-like nodes around.

For instance, AArch64 has the TBI feature to deal more efficiently with fat pointers when accessing memory, and the masking operations are explicitly set in the LLVM IR and later combined with the memory accesses.

Anyhow, this is just a heads-up, we will see in due time what we can do here.

Cheers,
-Quentin

Hi Daniel,

I had a quick look at the language reference for bitcast and I have a different reading than what you were pointing out.
Indeed, my takeaway is:
"It is always a no-op cast because no bits change with this conversion."

In other words, deleting all bitcast instructions should be fine.

My understanding of the quote you’ve highlighted is that it tells C programmers that this is like a memcpy, not a cast :).

Cheers,
-Quentin

I believe that this is somewhere where the IR specification needs to evolve. Currently, we have no in-tree architectures where pointers are not integers and so that definition is appropriate. Adding a new pair of IR instructions for integer-to-pointer and pointer-to-integer conversions and not calling them inttoptr / ptrtoint is likely to be far more confusing.

David

I think this would deserve its own RFC email thread.

+1

Q.

FYI, we need something very similar for GC purposes and are going to make a proposal along these lines within the next week or two. We’re not yet to the point of having a “final” proposal, but we’re planning on starting with some initial experimental support, prototyping on ToT, and evolving the spec language as we need to. Details to follow separately. David, you and I should probably talk offline to make sure what we’re thinking about works for you as well.

Hi,

It was a comment by Tim that first made me aware of it (see http://lists.llvm.org/pipermail/llvm-dev/2013-August/064714.html but I think he commented on one of my patches before that).

I asked about it on llvm-dev a couple of weeks later (http://lists.llvm.org/pipermail/llvm-dev/2013-August/064919.html) highlighting the contradiction, and was told that ‘no-op cast’ referred to the lack of math rather than a requirement that zero instructions are used. It’s therefore my understanding that shuffling the bits to preserve the load/store-based definition isn’t considered to be changing the bits.

I think the main thing the current definition is unclear on is whether it refers to the bits in a physical machine register or the bits in the LLVM-IR virtual register. Most of the time these two views are the same but this doesn’t quite work for big-endian MSA/NEON. For example:

%0 = bitcast <4 x i32> <i32 1, i32 2, i32 3, i32 4> to <2 x i64>

%0 = <2 x i64> <i64 (1 << 32) | 2, i64 (3 << 32) | 4>

are equivalent to each other in LLVM-IR terms but the constants are physically laid out in MSA registers as:

0x00000004000000030000000200000001 # <4 x i32> <i32 1, i32 2, i32 3, i32 4>

0x00000003000000040000000100000002 # <2 x i64> <i64 (1 << 32) | 2, i64 (3 << 32) | 4>

and we must therefore shuffle the bits to preserve LLVM-IR’s point of view.