for (int i = 0; i < 2; i++) {
foo = ballot(true); // ballot 1
if (threadID /* ID of the thread within a wavefront/warp */ % 2 == 0) continue;
bar = ballot(true); // ballot 2
}
versus:
int i = 0;
while (true) {
do {
if (i == 2) break;
foo = ballot(true); // ballot 1
i++;
} while (threadID % 2 == 0);
if (i == 2) break;
bar = ballot(true); // ballot 2
i++;
}
I think you can remove the second “i++”, otherwise we can increment “i” twice
if threadID % 2 != 0, but I see the issue. Using the equivalence classes would
prevent this transform, since we have re-arranged the control flow in that way,
What do you mean? Note that the transforms that lead from the first
example to the second are actually desirable if there aren’t any
ballots or other convergent operations, so banning them entirely is a
no-go. Otherwise, note that the ballot here can be nested arbitrarily
deep, which means that jump forwarding/tail duplication has to be
aware of the entire loop unless we have some kind of merge intrinsic.
Yes, if there weren’t any calls to a ballot, then the transform would
be legal. What I was saying in the beginning was that ballot() would
have some special rules attached to it. It is of course undesirable to
have flags to enforce correctness, but that is the point we are at
right now.
Also, I should say that while interesting, control equivalence classes
can be part of the implementation, but not the specification. We
need
to define the semantics of the IR – that is, how a program
compiled from any given IR is allowed to do when executed (and not
what it looks like when compiled) – first, and then which
transforms are allowed/not allowed will fall out of that. We can’t
start by listing disallowed transforms, because then when someone
comes along and writes a new optimization, it might be technically
“allowed” even though it breaks something in practice.
I think you misunderstand me if you think I was listing disallowed
transforms. My question was if it is possible to have multi-threaded
semantics in the source language, but in the compiler have a
single-threaded view, where some properties of the CFG would determine
what is legal and not for some functions with a special flag. I agree
it is more desirable to have the the semantics specified in the
IR. However, I am exploring this from a practical point of view, since
ballot() is very rare compared to all code that is being compiled for
all targets. These intrinsics would always have to be considered when
writing a pass. They seem to me harder to think about, and
test, for someone who is working on a single-threaded target, compared
to looking at a flag. If we allow ourselves to look at which
transforms that might violate these properties, we could would free up
the rest of the compiler (and developers) to not have to worry about
these things. Intrinsics would have to be maintained throughout the
entire compilation process in every pass.
The only way to
conclusively prove that transforms will never break code that’s
supposed to work (except for bugs in the transforms) is to define the
semantics of the IR, and then to make sure that it refines the
semantics of the source language. This is why the LangRef almost never
talks about allowed/disallowed transforms except as examples to
explain some given semantics, and if you don’t follow that rule, your
patch will probably be rejected.
I’m trying to figure out if we are in the “almost never” territory
here or not.
Now, to define a semantics for ballot(), we need to define what it’s
allowed to return for any given execution of any given program, and to
do that, we need to define which threads must be active together when
it’s reached, which in turns means we need to define when control flow
can re-converge somehow. Any proposal for how to handle ballot() must
start here.
I’m not sure if using these rules will be easier or harder than dealing with
intrinsics. One problem is that the rules for single-threaded code might seem
arbitrary, and it would be hard to reason about them in a larger context.
What happens to new control flow created by transforms, and what will guide
the insertion of intrinsics in the new code? Code structurization is one example
were this could happen.
Hopefully, it’s clear from the above how this all falls out. The
frontend for e.g. GLSL or SPIR-V would have to insert the merge
intrinsics to preserve the semantics of the source language. Any
transforms must refine the semantics of the IR, although I can’t think
of a scenario where that would involve emitting any new merge
intrinsics. Usually, structurized IR’s have their own semantics about
when control flow converges, so a code structurizer should respect the
original semantics. AMDGPU has its own code structurizer that runs
very late in the pipeline (although there are plans to remove it), so
we might have to change that to make it respect the merge intrinsic
intrinsics, and then we’ll correctly implement them “for free”.
Any transform that re-arranges control flow would potentially have to
know about the properties of ballot(), and the rules with respect to
the CFG (and maybe consider the target) to know where to insert the
intrinsics. I had the impression that the control flow convergence was
in part specified by what the target architecture can handle. One of
the more complicated cases would be linearization where the control
flow is completely rewritten, and is encoded in a variable that says
which basic block is the next one to execute. Another case is DCE,
where a ballot() could be eliminated, and it would potentially have to
remove a number of intrinsics to enable later optimizations (unless it
would affect performance?), so the intrinsics will require some
non-local updates.
Would they only be present if ballot and similar functions are used, or do they
have to be present everywhere?
They’d only have to be present when ballot or other convergent
functions are called, since otherwise it doesn’t matter when control
flow re-converges. However, we may want to keep them around for
performance reasons (usually destroying convergence points is bad for
performance).
So we might have them without a ballot(), which would seem to make it
more difficult for structurizers or other transforms to maintain the
semantics and insert intrinsics.
How are uniform branches handled? Do they affect the convergence model?
We may be able to remove convergence points if branches are uniform.
In Nicolai’s proposal, I believe we’d want to remove a merge intrinsic
when all the branches that it post-dominates that aren’t also
post-dominated by some other merge intrinsic are uniform.
I couldn’t quite understand the last sentence, but I assume the
conditions would prevent removing convergence points that help
performance. Post-domination might not be adequate if there are loops
btw.
for (int i = 0; i < 2; i++) {
foo = ballot(true); // ballot 1
if (threadID /* ID of the thread within a wavefront/warp */ % 2 == 0) continue;
bar = ballot(true); // ballot 2
}
versus:
int i = 0;
while (true) {
do {
if (i == 2) break;
foo = ballot(true); // ballot 1
i++;
} while (threadID % 2 == 0);
if (i == 2) break;
bar = ballot(true); // ballot 2
i++;
}
I think you can remove the second “i++”, otherwise we can increment “i” twice
if threadID % 2 != 0, but I see the issue. Using the equivalence classes would
prevent this transform, since we have re-arranged the control flow in that way,
What do you mean? Note that the transforms that lead from the first
example to the second are actually desirable if there aren’t any
ballots or other convergent operations, so banning them entirely is a
no-go. Otherwise, note that the ballot here can be nested arbitrarily
deep, which means that jump forwarding/tail duplication has to be
aware of the entire loop unless we have some kind of merge intrinsic.
Also, I should say that while interesting, control equivalence classes
can be part of the implementation, but not the specification. We
need to define the semantics of the IR – that is, how a program
compiled from any given IR is allowed to do when executed (and not
what it looks like when compiled) – first, and then which
transforms are allowed/not allowed will fall out of that. We can’t
start by listing disallowed transforms, because then when someone
comes along and writes a new optimization, it might be technically
“allowed” even though it breaks something in practice. The only way to
conclusively prove that transforms will never break code that’s
supposed to work (except for bugs in the transforms) is to define the
semantics of the IR, and then to make sure that it refines the
semantics of the source language. This is why the LangRef almost never
talks about allowed/disallowed transforms except as examples to
explain some given semantics, and if you don’t follow that rule, your
patch will probably be rejected.
Now, to define a semantics for ballot(), we need to define what it’s
allowed to return for any given execution of any given program, and to
do that, we need to define which threads must be active together when
it’s reached, which in turns means we need to define when control flow
can re-converge somehow. Any proposal for how to handle ballot() must
start here.
I’m not sure if using these rules will be easier or harder than dealing with
intrinsics. One problem is that the rules for single-threaded code might seem
arbitrary, and it would be hard to reason about them in a larger context.
Nicolai’s proposal solves this by having the frontend emit a merge intrinsic
before the i++ is emitted. This prevents the jump forwarding from occurring.
I was thinking about getting through the single-thread view and the issues with
that first, then I will think more about the multi-thread and explicit convergence.
If we are done with the single-thread stuff for now these are the question that I
have been thinking about with the multi-threaded view:
What happens to new control flow created by transforms, and what will guide
the insertion of intrinsics in the new code? Code structurization is one example
were this could happen.
Hopefully, it’s clear from the above how this all falls out. The
frontend for e.g. GLSL or SPIR-V would have to insert the merge
intrinsics to preserve the semantics of the source language. Any
transforms must refine the semantics of the IR, although I can’t think
of a scenario where that would involve emitting any new merge
intrinsics. Usually, structurized IR’s have their own semantics about
when control flow converges, so a code structurizer should respect the
original semantics. AMDGPU has its own code structurizer that runs
very late in the pipeline (although there are plans to remove it), so
we might have to change that to make it respect the merge intrinsic
intrinsics, and then we’ll correctly implement them “for free”.
Would they only be present if ballot and similar functions are used, or do they
have to be present everywhere?
They’d only have to be present when ballot or other convergent
functions are called, since otherwise it doesn’t matter when control
flow re-converges. However, we may want to keep them around for
performance reasons (usually destroying convergence points is bad for
performance).
How are uniform branches handled? Do they affect the convergence model?
We may be able to remove convergence points if branches are uniform.
In Nicolai’s proposal, I believe we’d want to remove a merge intrinsic
when all the branches that it post-dominates that aren’t also
post-dominated by some other merge intrinsic are uniform.