Intrinsics and dead instruction/code elimination

Hi all,

I'm interested in the impact of representing code via intrinsic functions, as opposed to via dedicated instructions, when it comes to performing dead instruction/code elimination. As a concrete example, let's consider the simple case of the llvm.*.with.overflow.* intrinsics.

If I have some sequence (> 1) of llvm.*.with.overflow.* intrinsics, of the form:

@global = global i1 0

define void @fun(i32 %a, i32 %b) {
entry:
  %res1 = call {i32, i1} @llvm.*.with.overflow.i32(i32 %a, i32 %b)
  %sum1 = extractvalue {i32, i1} %res1, 0
  %obit1 = extractvalue {i32, i1} %res1, 1
  store i1 %obit1, i1* @global
  ...
  %res2 = call {i32, i1} @llvm.*.with.overflow.i32(i32 %a, i32 %b)
  %sum2 = extractvalue {i32, i1} %res2, 0
  %obit2 = extractvalue {i32, i1} %res2, 1
  store i1 %obit2, i1* @global
  ret void
}

then I assume an optimisation pass is able to eliminate the first store of %obit1, since the second store of %obit2 clearly clobbers the global without any intervening load or other access. However, my question is whether representing the code as an intrinsic limits further dead instruction/code elimination. On x86, the intrinsic will produce an arithmetic operation followed by a setcc on the overflow flag. Given that the first store is dead, the first setcc instruction is also dead and can be eliminated. Is LLVM capable of such elimination with intrinsics? Or is the expectation that an optimisation pass replaces llvm.*.with.overflow.* intrinsics with their corresponding non-intrinsic arithmetic operations in order to achieve the same result? Or is there another solution?
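To make the expectation concrete: once the first store is eliminated, %obit1 becomes dead, and if %sum1 also has no other uses then the entire first call is trivially dead, so one would hope to be left with something like this (a sketch, assuming %sum1 has no uses besides the elided code not depending on it):

  %res2 = call {i32, i1} @llvm.*.with.overflow.i32(i32 %a, i32 %b)
  %sum2 = extractvalue {i32, i1} %res2, 0
  %obit2 = extractvalue {i32, i1} %res2, 1
  store i1 %obit2, i1* @global

The harder case, which is the one I'm really asking about, is when %sum1 is still live, so only the flag-extraction half of the first intrinsic is dead.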

Thanks in advance

Intrinsics should be optimized as well as instructions. In this specific case, these intrinsics should be marked readnone, which means that load/store optimization will ignore them. Dead code elimination will delete the intrinsic if it is dead, etc. Are you seeing this fail on a specific testcase?
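Roughly speaking, the declarations look like this (a sketch, using the sadd variant as the concrete example); the attributes are what let the optimizers treat the call as side-effect free:

  declare {i32, i1} @llvm.sadd.with.overflow.i32(i32, i32) nounwind readnone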

-Chris

Hi all,

I'm interested in the impact of representing code via intrinsic functions, as opposed to via dedicated instructions, when it comes to performing dead instruction/code elimination. As a concrete example, let's consider the simple case of the llvm.*.with.overflow.* intrinsics.

If I have some sequence (> 1) of llvm.*.with.overflow.* intrinsics, of the form:

@global = global i1 0

define void @fun(i32 %a, i32 %b) {
entry:
  %res1 = call {i32, i1} @llvm.*.with.overflow.i32(i32 %a, i32 %b)
  %sum1 = extractvalue {i32, i1} %res1, 0
  %obit1 = extractvalue {i32, i1} %res1, 1
  store i1 %obit1, i1* @global
  ...
  %res2 = call {i32, i1} @llvm.*.with.overflow.i32(i32 %a, i32 %b)
  %sum2 = extractvalue {i32, i1} %res2, 0
  %obit2 = extractvalue {i32, i1} %res2, 1
  store i1 %obit2, i1* @global
  ret void
}

then I assume an optimisation pass is able to eliminate the first store of %obit1, since the second store of %obit2 clearly clobbers the global without any intervening load or other access. However, my question is whether representing the code as an intrinsic limits further dead instruction/code elimination. On x86, the intrinsic will produce an arithmetic operation followed by a setcc on the overflow flag. Given that the first store is dead, the first setcc instruction is also dead and can be eliminated. Is LLVM capable of such elimination with intrinsics? Or is the expectation that an optimisation pass replaces llvm.*.with.overflow.* intrinsics with their corresponding non-intrinsic arithmetic operations in order to achieve the same result? Or is there another solution?

Intrinsics should be optimized as well as instructions. In this specific case, these intrinsics should be marked readnone, which means that load/store optimization will ignore them. Dead code elimination will delete the intrinsic if it is dead, etc.

I understand that dead code elimination is able to delete the intrinsic if it is dead. What I'm interested in is whether, despite the entire intrinsic not being dead, anything is able to eliminate the setcc-on-overflow part of the first intrinsic: given that the store of %obit1 is dead, %obit1 is not needed, and thus extracting the overflow bit from the FLAGS register via a setcc instruction is no longer needed. I assume nothing is able to perform such an optimisation on intrinsics, and my guess is that the only option is, as I said, a pass which rewrites the first intrinsic to just its corresponding arithmetic instruction once dead store elimination has removed the first store. Is this the case?
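Concretely, the situation I have in mind is the first call surviving in a form like this (a sketch; after the dead store and the dead extractvalue are removed, only element 0 of the result has uses):

  %res1 = call {i32, i1} @llvm.*.with.overflow.i32(i32 %a, i32 %b)
  %sum1 = extractvalue {i32, i1} %res1, 0
  ; element 1 of %res1 has no remaining uses

and the question is whether instruction selection is then able to emit only the arithmetic instruction, leaving the overflow flag unread rather than materialising it with a setcc.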

Are you seeing this fail on a specific testcase?

No, I don't yet have code for the particular testcases I have in mind; I'm just exploring the limitations of intrinsics relative to instructions.

I believe so, but you should write a testcase and see for yourself.

-Chris

I'll certainly do that before I start implementing additional intrinsics, since the llvm.*.with.overflow.* intrinsics are a simple subset of the overall optimisations I'm interested in exploring. The reason I ask whether the setcc-on-overflow part of the first intrinsic can be eliminated directly, rather than indirectly via a pass which replaces llvm.*.with.overflow.* intrinsics with their corresponding non-intrinsic arithmetic operations, is that I'm interested in an intrinsic of the form llvm.*.with.carry.overflow.zero.sign.*, which enables the extraction of the CF, OF, ZF, and SF values from the FLAGS register, where any combination of the extractions may not be needed, since in the general case most will be clobbered. If the indirect approach to elimination is the only approach, this means I need to write all 14 subsets of the intrinsic. At that point it's probably worth instead considering just writing some instructions and supporting a limited set of passes.
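For concreteness, the full variant of the intrinsic I have in mind would look something like this (purely hypothetical, not existing IR, shown for the add case):

  ; returns {result, CF, OF, ZF, SF}
  declare {i32, i1, i1, i1, i1} @llvm.add.with.carry.overflow.zero.sign.i32(i32, i32)

and the 14 subset variants would each drop some non-empty combination of the four flag results.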

Chris, all,

I finally got around to implementing this scenario a while back in what I was doing at the time. It is, as you believed, Chris, the case that the machine-code-generation layers include optimisations which remove the individually dead setcc operations on x86.

What interests me at this point, however, is to think in broader terms about the approach which has been adopted to deal with what boils down to implicit data dependencies. LLVM has adopted the approach that is almost universally seen traditionally, wherein, at the high level, these dependencies are tied together into a unified, indistinguishable unit (in this case an intrinsic), and then, in the lower-level back end, special casing is applied to optimise and reduce it. I have seen very few cases of the alternative, but still wonder whether creating virtual data dependencies at the high level, and thus allowing more general and possibly simpler/earlier optimisations in the front end rather than special cases in each back end, has value.

I leave this as an open question to ponder, as I do.