The following is a pared-down version of real production code to illustrate the problem.
int foo_optimized(int num)
{
    const auto arr = new int[num];
    int ret = 0;
    for (int i = 0; i < num; ++i) {
        arr[i] = i;
        ret += arr[i];
    }
    delete[] arr;
    return ret;
}
int foo_missed(int num)
{
    const auto arr = new int[num];
    int ret = 0;
    for (int i = 0; i < num; ++i) arr[i] = i;
    for (int i = 0; i < num; ++i) ret += arr[i];
    delete[] arr;
    return ret;
}
foo_optimized() and foo_missed() perform the same work, yet for foo_missed() clang fails to deduce that the allocation is pointless and remove it.
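For reference, here is roughly what I believe clang reduces foo_optimized() to; the function name is mine, purely for illustration:

// My understanding of what clang effectively produces for foo_optimized():
// the allocation and the array disappear, leaving only the accumulation
// (which clang may fold further into a closed-form expression).
int foo_optimized_equivalent(int num)
{
    int ret = 0;
    for (int i = 0; i < num; ++i)
        ret += i;
    return ret;
}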
In fact, it looks like the optimizer completely gives up on even basic optimizations in the presence of the two separate loops. If we change the functions to
int foo_optimized(int num)
{
    const auto arr = new int[num];
    int ret = 0;
    for (int i = 0; i < num; ++i) arr[i] = 0;
    for (int i = 0; i < num; ++i) {
        arr[i] = i;
        ret += arr[i];
    }
    delete[] arr;
    return ret;
}
int foo_missed(int num)
{
    const auto arr = new int[num];
    int ret = 0;
    for (int i = 0; i < num; ++i) arr[i] = 0;
    for (int i = 0; i < num; ++i) arr[i] = i;
    for (int i = 0; i < num; ++i) ret += arr[i];
    delete[] arr;
    return ret;
}
clang fails to delete the for (int i = 0; i < num; ++i) arr[i] = 0 loop in foo_missed(), even though it ostensibly understands the pattern well enough to replace it with a call to memset()! Note that it correctly optimizes out the corresponding loop in foo_optimized().
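To be explicit, the transformation I expected for foo_missed() is simply dropping the zeroing loop, since every element is overwritten before it is read; the function below is my own hand-written illustration of that result:

// Hand-written version of foo_missed() with the dead zeroing loop removed;
// this is the result I expected the optimizer to reach on its own.
int foo_missed_expected(int num)
{
    const auto arr = new int[num];
    int ret = 0;
    for (int i = 0; i < num; ++i) arr[i] = i;
    for (int i = 0; i < num; ++i) ret += arr[i];
    delete[] arr;
    return ret;
}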
I was surprised that such a seemingly simple transformation was enough to defeat the optimizer.
Is there a compiler flag that users can enable to allow clang to optimize both forms equally? Even at the cost of additional compile time?
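For context, the fully reduced form I would hope both versions eventually compile down to is just the closed-form sum sketched below. This is my own hand-written target (ignoring signed-overflow concerns, which the original loops share), not anything clang documents:

// Hypothetical hand-optimized target: sum of 0..num-1 with no allocation.
int foo_closed_form(int num)
{
    return num > 0 ? num * (num - 1) / 2 : 0;
}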