Disable Short-Circuit Evaluation?

Is there any way to disable short-circuit evaluation of expressions in Clang/LLVM?

Let’s say I have C code like the following:

bool validX = get_group_id(0) > 32;

int globalIndexY0 = get_group_id(1)*186 + 6*get_local_id(1) + 0 + 1;
bool valid0 = validX && globalIndexY0 >= 4 && globalIndexY0 < 3910;

int globalIndexY1 = get_group_id(1)*186 + 6*get_local_id(1) + 1 + 1;
bool valid1 = validX && globalIndexY1 >= 4 && globalIndexY1 < 3910;

int globalIndexY2 = get_group_id(1)*186 + 6*get_local_id(1) + 2 + 1;
bool valid2 = validX && globalIndexY2 >= 4 && globalIndexY2 < 3910;

Clang, even at -O0, is performing short-circuit evaluation of these expressions, resulting in a fair number of branch instructions being generated. For most targets, this is a beneficial optimization. However, for my target (PTX), it would be most beneficial to actually evaluate the entire expression and remove the unneeded branches. Is this possible with current Clang/LLVM?

The short-circuit nature of && and || is required by the C specification. If you don't want them to short-circuit, use & and | instead.
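
To make the difference concrete, here is a minimal sketch in C. The names probe, evaluations_logical and evaluations_bitwise are hypothetical, invented for this illustration only. The helper counts how many operands are actually evaluated under each operator.

```c
#include <stdbool.h>

/* Hypothetical counter: number of times probe() has been evaluated. */
static int calls;

/* Hypothetical helper: records that it was evaluated, then returns v. */
static bool probe(bool v) {
    calls++;
    return v;
}

/* Logical AND: the right operand is skipped once the left one is false. */
int evaluations_logical(void) {
    calls = 0;
    bool r = probe(false) && probe(true);
    (void)r;
    return calls;
}

/* Bitwise AND: both operands are always evaluated (in unspecified order). */
int evaluations_bitwise(void) {
    calls = 0;
    bool r = probe(false) & probe(true);
    (void)r;
    return calls;
}
```

Here evaluations_logical() returns 1 and evaluations_bitwise() returns 2. Keep in mind that & only yields the same truth value as && when both operands are already exactly 0 or 1, and that it removes the guarantee that the right operand is skipped.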

Justin Holewinski <justin.holewinski@gmail.com> writes:

int globalIndexY2 = get_group_id(1)*186 + 6*get_local_id(1) + 2 + 1;
bool valid2 = validX && globalIndexY2 >= 4 && globalIndexY2 < 3910;

Clang, even at -O0, is performing short-circuit evaluation of these
expressions, resulting in a fair number of branch instructions being
generated.

It has to. This is the semantics of C. Short-circuiting is used to
defend against all sorts of undefined behavior in real code.

For most targets, this is a beneficial optimization.

For all targets. If the code doesn't work, it's pretty useless. :-)

However, for my target (PTX), it would be most beneficial to actually
evaluate the entire expression and remove the unneeded branches. Is
this possible with current Clang/LLVM?

So for PTX what you want is if-conversion. I believe there is a pass
that does this in the ARM codegen. Of course the PTX backend will have
to support mask bits. I don't know if it does currently.

                                -Dave

A compilable testcase:

#include <stdbool.h>  /* bool is not a built-in type in plain C */

extern int get_group_id (int);
extern int get_local_id (int);

extern void check (bool, bool, bool);

void
foo (void)
{
   bool validX = get_group_id (0) > 32;

   int globalIndexY0 = get_group_id (1) * 186 + 6 * get_local_id (1) + 0 + 1;
   bool valid0 = validX && globalIndexY0 >= 4 && globalIndexY0 < 3910;

   int globalIndexY1 = get_group_id (1) * 186 + 6 * get_local_id (1) + 1 + 1;
   bool valid1 = validX && globalIndexY1 >= 4 && globalIndexY1 < 3910;

   int globalIndexY2 = get_group_id (1) * 186 + 6 * get_local_id (1) + 2 + 1;
   bool valid2 = validX && globalIndexY2 >= 4 && globalIndexY2 < 3910;

   check (valid0, valid1, valid2);
}

More precisely, && and || introduce sequence points [1] (though in C++ they may
not do so if the respective operator is overloaded).

[1] http://en.wikipedia.org/wiki/Sequence_point

2011/10/10 Konstantin Tokarev <annulen@yandex.ru>

10.10.2011, 18:29, “David A. Greene” <greened@obbligato.org>:

Justin Holewinski <justin.holewinski@gmail.com> writes:

int globalIndexY2 = get_group_id(1)*186 + 6*get_local_id(1) + 2 + 1;
bool valid2 = validX && globalIndexY2 >= 4 && globalIndexY2 < 3910;

Clang, even at -O0, is performing short-circuit evaluation of these
expressions, resulting in a fair number of branch instructions being
generated.

It has to. This is the semantics of C. Short-circuiting is used to
defend against all sorts of undefined behavior in real code.

More precisely, && and || are sequence points (though in C++ they may not be
sequence points if respective operator is overloaded)

Sequence points don’t really come into play in a meaningful way in code like that given above. In the example given, each operand is free of side effects, and none of the later operands depends on the earlier ones (no “ptr != NULL && *ptr == value” sorts of things).
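
For contrast, this is the kind of dependent expression where short-circuiting does matter. It is only a sketch, and points_at is a hypothetical helper name that does not appear in the original code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true only if ptr is non-null and *ptr equals value. */
bool points_at(const int *ptr, int value) {
    /* The right operand is evaluated only when ptr != NULL holds,
       so the dereference never sees a null pointer. */
    return ptr != NULL && *ptr == value;
}
```

Replacing && with & here would dereference ptr even when it is NULL, which is undefined behavior. None of the valid0/valid1/valid2 expressions has this shape.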

Noting that boolean values are always exactly 0 or 1, this statement:
bool valid2 = validX && globalIndexY2 >= 4 && globalIndexY2 < 3910;

is equivalent to this statement:
bool valid2 = validX & globalIndexY2 >= 4 & globalIndexY2 < 3910;

assuming no operator overloading or anything else crazy going on.
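
A quick way to convince yourself of the equivalence is to compare the two forms over a range of inputs. The sketch below does that; valid_logical, valid_bitwise and count_mismatches are hypothetical names for this illustration, and the parentheses in the bitwise form are only for readability (the relational operators already bind tighter than &).

```c
#include <stdbool.h>

/* Short-circuit form, as written in the original code. */
bool valid_logical(bool validX, int y) {
    return validX && y >= 4 && y < 3910;
}

/* Branch-free form: every operand is evaluated, each one is 0 or 1. */
bool valid_bitwise(bool validX, int y) {
    return validX & (y >= 4) & (y < 3910);
}

/* Counts the inputs in [lo, hi] on which the two forms disagree. */
int count_mismatches(int lo, int hi) {
    int mismatches = 0;
    for (int y = lo; y <= hi; y++) {
        for (int x = 0; x <= 1; x++) {
            if (valid_logical(x != 0, y) != valid_bitwise(x != 0, y)) {
                mismatches++;
            }
        }
    }
    return mismatches;
}
```

For example, count_mismatches(-10, 4000) returns 0, covering both boundaries of the range test. This only checks the result, not which form is cheaper on a given target.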

There may be some non-trivial work involved in converting the >= and < tests into boolean values, depending on the architecture, which can seriously affect the value of such a transformation. It is also worth noting that globalIndexY2 is known to be in a register from the previous line of code, so a short-circuit branch avoids neither a load instruction nor a possible cache miss. It really boils down to the cost of the extra instructions versus the cost of the eliminated branches.
