Correct modelling of instructions with types smaller than the register class

Hi Quentin, Hi Amara,

I was following your discussion on D75086 regarding declaring types as legal even if they are smaller than the actual register class (e.g. s16 and gpr32). We are working on a backend which only has 32 and 64-bit registers and we recently had a problem regarding exactly this where we had to declare G_UNMERGE_VALUES and G_MERGE_VALUES with a smaller type of <s32 as legal, even though we don't have a register class that matches and instruction selection implicitly sign extends them to s32. The reasoning was that for example in the case of G_UNMERGE_VALUES, the legalizer widens the source type when widening the destination type, which we don't want (as the big type then doesn't fit our registers anymore). So we decided to only make sure that the bigger type matches a register class and simply allowed smaller types <s32.

What I am gathering now from this discussion is that this is basically an incorrect modelling and only works around another problem. Could you expand on this in more detail?

I was thinking about implementing some custom legalization which uses G_EXTRACT or shifts to get what I want, but G_EXTRACT extracts as many bits as the destination type, so I would end up with the same problem again. Shifts are also not optimal, since we have a target instruction available to extract n bits from a given offset. Is it valid to emit target instructions during legalization, i.e. before instruction selection, or would the best modelling be to just use the shifts and then use some pre-instruction-selection combiner to merge them to the desired target instruction?

Best regards,

Dominik

Hi Dominik,

I’ll do a brief reply here and if you want more information we can talk further :).

Hi Quentin, Hi Amara,

I was following your discussion on D75086 regarding declaring types as legal even if they are smaller than the actual register class (e.g. s16 and gpr32). We are working on a backend which only has 32 and 64-bit registers and we recently had a problem regarding exactly this where we had to declare G_UNMERGE_VALUES and G_MERGE_VALUES with a smaller type of <s32 as legal, even though we don't have a register class that matches and instruction selection implicitly sign extends them to s32. The reasoning was that for example in the case of G_UNMERGE_VALUES, the legalizer widens the source type when widening the destination type, which we don't want (as the big type then doesn't fit our registers anymore). So we decided to only make sure that the bigger type matches a register class and simply allowed smaller types <s32.

What I am gathering now from this discussion is that this is basically an incorrect modelling and only works around another problem. Could you expand on this in more detail?

A few years ago we discussed the possibility and/or desire of allowing registers to be larger than the actual legalized type or whether they should be in sync with the legalized type.

For instance, consider
s8 = G_AND s8, s8

If your target supports `s32 = G_AND` then allowing registers to be larger than legalized type would yield that `s8 = G_AND s8, s8` is legal (and really any type smaller than or equal to s32 would be legal for that instruction).

Then the question was how do we represent that: the relevant bits are the low 8 bits, but the container is 32-bit.
E.g.,
s8(32) = G_AND s8(32), s8(32)

We explored that a little bit and found that it was dangerous to carry undefined bits around (the upper 24 bits in that case) and that there was no real use case for this. Tim (Northover) might remember more details (CC’ed).

Therefore, we decided that the legal types should exactly match what the container (the register) will be. So in theory, if you’re not following that model, you’re in violation of that and that’s where problems arise (and thus D75086 to workaround them).

I was thinking about implementing some custom legalization which uses G_EXTRACT or shifts to get what I want, but G_EXTRACT extracts as many bits as the destination type, so I would end up with the same problem again. Shifts are also not optimal, since we have a target instruction available to extract n bits from a given offset. Is it valid to emit target instructions during legalization,

Yes, that’s completely legal.

The whole idea of GISel is that the backends should be able to insert target specific instructions whenever they want. For instance, we use that extensively in the lowering of calls directly in the IRTranslator, i.e., at the very beginning of GISel.

i.e. before instruction selection, or would the best modelling be to just use the shifts and then use some pre-instruction-selection combiner to merge them to the desired target instruction?

That’s a question I cannot answer directly, it really depends on what are the implementation trade-offs on your end.

I have a personal preference for generic shifts as it is possible that other generic optimizations can get rid of them. Now, if those shifts are too difficult to combine away or if you know that they are not optimizable anyway, you may want to go directly with the target specific instruction.

Just keep in mind that whenever you use a target specific instruction, this is just basically an opaque object that the generic optimizations have to deal with.

Cheers,
-Quentin

Hi Quentin,

thank you for the reply! This clears up a lot of the questions I was having. It seems like we should definitely invest some time in rewriting some of our legalization rules then! I also posted some questions further down below. I would appreciate getting your opinion on them.

Hi Dominik,

I’ll do a brief reply here and if you want more information we can talk further :).

Hi Quentin, Hi Amara,

I was following your discussion on D75086 regarding declaring types as legal even if they are smaller than the actual register class (e.g. s16 and gpr32). We are working on a backend which only has 32 and 64-bit registers and we recently had a problem regarding exactly this where we had to declare G_UNMERGE_VALUES and G_MERGE_VALUES with a smaller type of <s32 as legal, even though we don't have a register class that matches and instruction selection implicitly sign extends them to s32. The reasoning was that for example in the case of G_UNMERGE_VALUES, the legalizer widens the source type when widening the destination type, which we don't want (as the big type then doesn't fit our registers anymore). So we decided to only make sure that the bigger type matches a register class and simply allowed smaller types <s32.

What I am gathering now from this discussion is that this is basically an incorrect modelling and only works around another problem. Could you expand on this in more detail?

A few years ago we discussed the possibility and/or desire of allowing registers to be larger than the actual legalized type or whether they should be in sync with the legalized type.

For instance, consider
s8 = G_AND s8, s8

If your target supports `s32 = G_AND` then allowing registers to be larger than legalized type would yield that `s8 = G_AND s8, s8` is legal (and really any type smaller than or equal to s32 would be legal for that instruction).

Then the question was how do we represent that: the relevant bits are the low 8 bits, but the container is 32-bit.
E.g.,
s8(32) = G_AND s8(32), s8(32)

We explored that a little bit and found that it was dangerous to carry undefined bits around (the upper 24 bits in that case) and that there was no real use case for this. Tim (Northover) might remember more details (CC’ed).

Therefore, we decided that the legal types should exactly match what the container (the register) will be. So in theory, if you’re not following that model, you’re in violation of that and that’s where problems arise (and thus D75086 to workaround them).

This leads me to another question. What would the correct modelling look like for special registers that cannot be read/written, like a carry-bit register? The G_UADDO/UADDE instructions have an s1 carry and our target only supports reading this special register, but not writing to it (at least not directly and without doing some hacks). We found this especially hard to model, since GlobalISel doesn't really have a notion of special registers in generic instructions. In the end we went with an approach of defining a pseudo-register-class, which is not allocatable and copyable and emit a COPY to/from our physical carry bit register. This generally works the way we want and the COPY is eliminated. However it is not eliminated when using O0.

I guess this probably has something to do with not defining an extra register bank and teaching our regbank info that such a carry-bit vreg should be assigned to this regbank? Right now it is simply assigned to the GPR bank, which seems to be the problem for O0.

Nevertheless I still dislike this approach as I don't want to rely on the COPY being eliminated. I would much rather not emit it at all, but this obviously means that I would throw away the carry vreg of G_UADDO/UADDE, potentially changing the semantics of the instruction.

I was thinking about implementing some custom legalization which uses G_EXTRACT or shifts to get what I want, but G_EXTRACT extracts as many bits as the destination type, so I would end up with the same problem again. Shifts are also not optimal, since we have a target instruction available to extract n bits from a given offset. Is it valid to emit target instructions during legalization,

Yes, that’s completely legal.

The whole idea of GISel is that the backends should be able to insert target specific instructions whenever they want. For instance, we use that extensively in the lowering of calls directly in the IRTranslator, i.e., at the very beginning of GISel.

That is very good to know! Thanks for the clarification.

i.e. before instruction selection, or would the best modelling be to just use the shifts and then use some pre-instruction-selection combiner to merge them to the desired target instruction?

That’s a question I cannot answer directly, it really depends on what are the implementation trade-offs on your end.

I have a personal preference for generic shifts as it is possible that other generic optimizations can get rid of them. Now, if those shifts are too difficult to combine away or if you know that they are not optimizable anyway, you may want to go directly with the target specific instruction.

Just keep in mind that whenever you use a target specific instruction, this is just basically an opaque object that the generic optimizations have to deal with.

Hm, good point. It should be a straightforward combine rule, so going with the shifts and the combiner seems to be a better approach at first glance. Thanks!

Best regards,

Dominik

Hi Dominik,

Hi Quentin,

thank you for the reply! This clears up a lot of the questions I was having. It seems like we should definitely invest some time in rewriting some of our legalization rules then! I also posted some questions further down below. I would appreciate getting your opinion on them.

Hi Dominik,

I’ll do a brief reply here and if you want more information we can talk further :).

Hi Quentin, Hi Amara,

I was following your discussion on D75086 regarding declaring types as legal even if they are smaller than the actual register class (e.g. s16 and gpr32). We are working on a backend which only has 32 and 64-bit registers and we recently had a problem regarding exactly this where we had to declare G_UNMERGE_VALUES and G_MERGE_VALUES with a smaller type of <s32 as legal, even though we don’t have a register class that matches and instruction selection implicitly sign extends them to s32. The reasoning was that for example in the case of G_UNMERGE_VALUES, the legalizer widens the source type when widening the destination type, which we don’t want (as the big type then doesn’t fit our registers anymore). So we decided to only make sure that the bigger type matches a register class and simply allowed smaller types <s32.

What I am gathering now from this discussion is that this is basically an incorrect modelling and only works around another problem. Could you expand on this in more detail?

A few years ago we discussed the possibility and/or desire of allowing registers to be larger than the actual legalized type or whether they should be in sync with the legalized type.

For instance, consider
s8 = G_AND s8, s8

If your target supports s32 = G_AND then allowing registers to be larger than legalized type would yield that s8 = G_AND s8, s8 is legal (and really any type smaller than or equal to s32 would be legal for that instruction).

Then the question was how do we represent that: the relevant bits are the low 8 bits, but the container is 32-bit.
E.g.,
s8(32) = G_AND s8(32), s8(32)

We explored that a little bit and found that it was dangerous to carry undefined bits around (the upper 24 bits in that case) and that there was no real use case for this. Tim (Northover) might remember more details (CC’ed).

Therefore, we decided that the legal types should exactly match what the container (the register) will be. So in theory, if you’re not following that model, you’re in violation of that and that’s where problems arise (and thus D75086 to workaround them).

This leads me to another question. What would the correct modelling look like for special registers that cannot be read/written, like a carry-bit register? The G_UADDO/UADDE instructions have an s1 carry and our target only supports reading this special register, but not writing to it (at least not directly and without doing some hacks).

I would expect that you could create a register class and register bank for the special register. That way you have something to map to when you do register bank select.
The register class could be "not allocatable” (see CCR in AArch64 for instance).
Note, that if that class is unallocable, you’ll have to manage manually any overlapping that could arise.
E.g.,
s32 …, s1 carry1 = G_UADDO
s32 …, s1 carry2 = G_UADDO
s32 extended_carry1 = zext s1 carry1 <— carry1 will be destroyed by the second G_UADDO, you’ll have to fix the schedule or copy the value manually.
s32 extended_carry2 = zext s1 carry2
icmp s32 extended_carry1, extended_carry2

Then when you do instruction selection, just map the register to the physical register. (You may want to do the fix-ups I mentioned here.)

We found this especially hard to model, since GlobalISel doesn’t really have a notion of special registers in generic instructions. In the end we went with an approach of defining a pseudo-register-class, which is not allocatable and copyable and emit a COPY to/from our physical carry bit register. This generally works the way we want and the COPY is eliminated. However it is not eliminated when using O0.

I guess this probably has something to do with not defining an extra register bank and teaching our regbank info that such a carry-bit vreg should be assigned to this regbank? Right now it is simply assigned to the GPR bank, which seems to be the problem for O0.

Yeah, I think you’d want a different reg bank here. Like you said this is a problem if the copies don’t get optimized away.

Nevertheless I still dislike this approach as I don’t want to rely on the COPY being eliminated. I would much rather not emit it at all, but this obviously means that I would throw away the carry vreg of G_UADDO/UADDE, potentially changing the semantics of the instruction.

Generally speaking, I think you would need a copy (or zext/sext) to move the value in general purpose land (like in my example with zext). If you don’t have such use, the copy wouldn’t be emitted and there is nothing to do.

To be honest, I think the situation is complicated today because GISel lack the fix-up that I mentioned. I am actually surprised that didn’t come up as an issue yet (x86 is full of these flags). Going forward, I think it should just come with the framework: you assign a physreg selection time, GISel copies it around/reschedule appropriately to preserve the value. There’s a lot of wishful thinking here x).

Cheers,
-Quentin

Hi Quentin,

thank you for the reply! I have a couple more questions that came up when I tried to implement this today. I hope you can help me out with this again!

I would expect that you could create a register class and register bank for the special register. That way you have something to map to when you do register bank select.
The register class could be "not allocatable” (see CCR in AArch64 for instance).
Note, that if that class is unallocable, you’ll have to manage manually any overlapping that could arise.
E.g.,
s32 ..., s1 carry1 = G_UADDO
s32 ..., s1 carry2 = G_UADDO
s32 extended_carry1 = zext s1 carry1 <— carry1 will be destroyed by the second G_UADDO, you’ll have to fix the schedule or copy the value manually.
s32 extended_carry2 = zext s1 carry2
icmp s32 extended_carry1, extended_carry2

So before instruction selection we would need to check if there are any overlaps and if so try to fix them if possible or abort compilation if that is not possible?

Then when you do instruction selection, just map the register to the physical register. (You may want to do the fix-ups I mentioned here.)

What exactly do you mean by "map the register to the physical register"? Do you mean that I should just emit a COPY from/to this physical register and constrain the virtual register to the non-allocatable register class? Because this actually gives me an assertion that the register class must be allocatable. Or do you mean to use the physical register in the target instruction instead of the virtual one (i.e. drop the usage of the virtual register and replace it with the physical register)? The target instruction uses the physical register implicitly by the way.

I also tried to take a look at how AArch64 implements G_UADDO/G_UADDE but it doesn't seem to fully implement selection for those? Maybe I just didn't read the code properly...

I would expect that you could create a register class and register bank for the special register. That way you have something to map to when you do register bank select.

I looked at the CC regbank of AArch64, however I didn't see any mappings using it in RegBankInfo. Is that something that AArch64 simply hasn't implemented or is there something going on that I'm not seeing. I mean, what is the point to have a register bank when it is not used during regbankselect? I would at least expect that for example G_UADDO would use this register bank for the s1 carry out.

Generally speaking, I think you would need a copy (or zext/sext) to move the value in general purpose land (like in my example with zext). If you don’t have such use, the copy wouldn’t be emitted and there is nothing to do.

To be honest, I think the situation is complicated today because GISel lack the fix-up that I mentioned. I am actually surprised that didn’t come up as an issue yet (x86 is full of these flags). Going forward, I think it should just come with the framework: you assign a physreg selection time, GISel copies it around/reschedule appropriately to preserve the value. There’s a lot of wishful thinking here x).

Cheers,
-Quentin

Best regards,

Dominik

Hi Dominik,

Hi Quentin,

thank you for the reply! I have a couple more questions that came up when I tried to implement this today. I hope you can help me out with this again!

I would expect that you could create a register class and register bank for the special register. That way you have something to map to when you do register bank select.
The register class could be "not allocatable” (see CCR in AArch64 for instance).
Note, that if that class is unallocable, you’ll have to manage manually any overlapping that could arise.
E.g.,
s32 …, s1 carry1 = G_UADDO
s32 …, s1 carry2 = G_UADDO
s32 extended_carry1 = zext s1 carry1 <— carry1 will be destroyed by the second G_UADDO, you’ll have to fix the schedule or copy the value manually.
s32 extended_carry2 = zext s1 carry2
icmp s32 extended_carry1, extended_carry2

So before instruction selection we would need to check if there are any overlaps and if so try to fix them if possible or abort compilation if that is not possible?

You could do it after (or during selection maybe), but yeah, you’ll need to fix that eventually. For now, I guess you could just abort if you run in that situation.

Then when you do instruction selection, just map the register to the physical register. (You may want to do the fix-ups I mentioned here.)

What exactly do you mean by “map the register to the physical register”? Do you mean that I should just emit a COPY from/to this physical register and constrain the virtual register to the non-allocatable register class?

Because this actually gives me an assertion that the register class must be allocatable. Or do you mean to use the physical register in the target instruction instead of the virtual one (i.e. drop the usage of the virtual register and replace it with the physical register)?

The latter: replace the virtual register with the physical register directly. That way the only copies left would be to/from general purpose registers, if any. The problem with doing that is that you have to check for potential overlaps manually.

Another option would be to go with the special register class we already talked about, make it allocatable, with just one register. That way if there are overlapping ones, regalloc will take care of them (probably with "ran out of register” if you don’t do anything.)

The target instruction uses the physical register implicitly by the way.

I also tried to take a look at how AArch64 implements G_UADDO/G_UADDE but it doesn’t seem to fully implement selection for those? Maybe I just didn’t read the code properly…

That could be right, it’s been a while since I looked at the code there.

I would expect that you could create a register class and register bank for the special register. That way you have something to map to when you do register bank select.

I looked at the CC regbank of AArch64, however I didn’t see any mappings using it in RegBankInfo.

I see that in AArch64RegisterBanks.td:

/// Conditional register: NZCV.
def CCRegBank : RegisterBank<“CC”, [CCR]>;

But that’s about it.

Is that something that AArch64 simply hasn’t implemented or is there something going on that I’m not seeing. I mean, what is the point to have a register bank when it is not used during regbankselect?

Good point. Again, it’s been a while I haven’t looked at AArch64, so I don’t really know the status here.
@Tim, @Amara?

I would at least expect that for example G_UADDO would use this register bank for the s1 carry out.

That would have been my expectation as well.

Hi Quentin,

thanks again for your time. We were now able to implement this and it seems to be working as intended.

In the end with went with replacing the virtual registers with the physical one and checking for overlaps. We wanted to let the register allocator handle it initially but ended up having a couple more issues than we expected with that, so we now do this handling ourselves.

It would still be very interesting to know why AArch64 isn’t doing something similar and how exactly they are solving this issue, but for now I’m happy that we now have a working implementation for our target.

Thank you and best regards,

Dominik

Hi Dominik,

Sorry I’ve been very busy with some urgent things so haven’t had time to read this thread. I’ll try to get back to you this week or early next.

Cheers,
Amara

Hi Dominik,

Hi Quentin,

thank you for the reply! I have a couple more questions that came up when I tried to implement this today. I hope you can help me out with this again!

I would expect that you could create a register class and register bank for the special register. That way you have something to map to when you do register bank select.
The register class could be "not allocatable” (see CCR in AArch64 for instance).
Note, that if that class is unallocable, you’ll have to manage manually any overlapping that could arise.
E.g.,
s32 ..., s1 carry1 = G_UADDO
s32 ..., s1 carry2 = G_UADDO
s32 extended_carry1 = zext s1 carry1 <— carry1 will be destroyed by the second G_UADDO, you’ll have to fix the schedule or copy the value manually.
s32 extended_carry2 = zext s1 carry2
icmp s32 extended_carry1, extended_carry2

So before instruction selection we would need to check if there are any overlaps and if so try to fix them if possible or abort compilation if that is not possible?

Then when you do instruction selection, just map the register to the physical register. (You may want to do the fix-ups I mentioned here.)

What exactly do you mean by "map the register to the physical register"? Do you mean that I should just emit a COPY from/to this physical register and constrain the virtual register to the non-allocatable register class? Because this actually gives me an assertion that the register class must be allocatable. Or do you mean to use the physical register in the target instruction instead of the virtual one (i.e. drop the usage of the virtual register and replace it with the physical register)? The target instruction uses the physical register implicitly by the way.

I also tried to take a look at how AArch64 implements G_UADDO/G_UADDE but it doesn't seem to fully implement selection for those? Maybe I just didn't read the code properly…

AArch64 doesn’t use the condition register bank, it was added before I joined the project so I’m not entirely sure what the intent was. However, the condition register side effect is something that we explicitly capture in SSA variables by copying every definition into a GPR, for which we do have regbanks for. We do this by using CSET/CSINC instructions on AArch64 which allow us to set a GPR register using the value of the condition register. We do this in every selection of an instruction like G_ICMP or G_UADDO which may write write a flag output.

%c1(s1) = G_ICMP eq, %foo, %bar

…can be selected into
cmp x0, x1
csinc x8, xzr, xzr, eq

Which results in the %c1 value being written to the register x8. Because we select in this way, we don’t have any extra side effects to model before selection time. It’s all in SSA. We then have selection patterns that try to optimize sequences like this if we can depending on the context.

For G_UADDO, you can see in the selector code that we emit an ADDS flag setting instruction, and then extract the carry flag using a CSET into the carry destination virtual register.

Cheers,
Amara

Hi Amara,

thank you for taking the time to reply to this thread, even though our problem has been solved by now. I appreciate it!

Hi Dominik,

Hi Quentin,

thank you for the reply! I have a couple more questions that came up when I tried to implement this today. I hope you can help me out with this again!

I would expect that you could create a register class and register bank for the special register. That way you have something to map to when you do register bank select.
The register class could be "not allocatable” (see CCR in AArch64 for instance).
Note, that if that class is unallocable, you’ll have to manage manually any overlapping that could arise.
E.g.,
s32 ..., s1 carry1 = G_UADDO
s32 ..., s1 carry2 = G_UADDO
s32 extended_carry1 = zext s1 carry1 <— carry1 will be destroyed by the second G_UADDO, you’ll have to fix the schedule or copy the value manually.
s32 extended_carry2 = zext s1 carry2
icmp s32 extended_carry1, extended_carry2

So before instruction selection we would need to check if there are any overlaps and if so try to fix them if possible or abort compilation if that is not possible?

Then when you do instruction selection, just map the register to the physical register. (You may want to do the fix-ups I mentioned here.)

What exactly do you mean by "map the register to the physical register"? Do you mean that I should just emit a COPY from/to this physical register and constrain the virtual register to the non-allocatable register class? Because this actually gives me an assertion that the register class must be allocatable. Or do you mean to use the physical register in the target instruction instead of the virtual one (i.e. drop the usage of the virtual register and replace it with the physical register)? The target instruction uses the physical register implicitly by the way.

I also tried to take a look at how AArch64 implements G_UADDO/G_UADDE but it doesn't seem to fully implement selection for those? Maybe I just didn't read the code properly…

AArch64 doesn’t use the condition register bank, it was added before I joined the project so I’m not entirely sure what the intent was. However, the condition register side effect is something that we explicitly capture in SSA variables by copying every definition into a GPR, for which we do have regbanks for. We do this by using CSET/CSINC instructions on AArch64 which allow us to set a GPR register using the value of the condition register. We do this in every selection of an instruction like G_ICMP or G_UADDO which may write write a flag output.

%c1(s1) = G_ICMP eq, %foo, %bar

…can be selected into
cmp x0, x1
csinc x8, xzr, xzr, eq

Which results in the %c1 value being written to the register x8. Because we select in this way, we don’t have any extra side effects to model before selection time. It’s all in SSA. We then have selection patterns that try to optimize sequences like this if we can depending on the context.

I see! And even if you can't optimize such a sequence it's not a problem because of the CSET/CSINC instructions, correct? This is basically what is missing on our architecture to make this work (nicely). I guess if we would have tried to combine such sequences after selection, we would most likely have been fine for the vast majority of programs. But since we don't have an equivalent to these two instructions on our architecture, we would still have to cover the case where we cannot optimize it and either repair or reject it.

In the end I think we were able to come up with a good solution that is working for us (thanks again Quentin!) but it is always interesting to see how other backends solve similar problems.

Thanks to everyone involved!

Cheers,
Dominik