[icFuzz] Help needed with analyzing randomly generated tests that fail on clang 3.4 trunk

Hi,

I just submitted a bug report with a package containing 107 small test cases that fail on the latest LLVM/clang 3.4 main trunk (184563). Included are test sources, compilation commands, test input files, and results at -O0 and -O2 when applicable.

http://llvm.org/bugs/show_bug.cgi?id=16431

These tests have been automatically generated by an internal tool at Intel, the Intel Compiler fuzzer, icFuzz. The tests are typically very small. For example, for the following simple loop (test t5702) on MacOS X, clang at -O2 generates a binary that crashes:

// Test Loop Interchange
for (j = 2; j < 76; j++) {
    for (jm = 1; jm < 30; jm++) {
        h[j-1][jm-1] = j + 83;
    }
}
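
For reference, a self-contained harness in the same spirit might look like the sketch below. The declarations, array bounds, and checksum loop are illustrative assumptions (the actual test sources are in the bug attachment); the point is only to show how a binary built at -O0 and one built at -O2 can be diffed on their printed output:

#include <stdio.h>

/* Bounds chosen to match the loop above; purely hypothetical. */
static unsigned int h[76][30];

int main(void) {
    unsigned int j, jm, sum = 0;

    /* Test Loop Interchange (kernel as posted above) */
    for (j = 2; j < 76; j++) {
        for (jm = 1; jm < 30; jm++) {
            h[j-1][jm-1] = j + 83;
        }
    }

    /* Emit one deterministic value; the -O0 and -O2 binaries should agree. */
    for (j = 0; j < 76; j++) {
        for (jm = 0; jm < 30; jm++) {
            sum += h[j][jm];
        }
    }
    printf("%u\n", sum);
    return 0;
}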

The tests are put into two categories:

  • tests that have different runtime outputs when compiled at -O0 and -O2 (this category also includes runtime crashes)

  • tests that cause infinite loops in the Clang optimizer

Many of these failing tests could be due to the same bug, so the number of distinct root causes is expected to be much smaller.

Any help with triaging these bugs would be highly appreciated.

Thanks,

-moh

> The tests are put into two categories
> - tests that have different runtime outputs when compiled at -O0 and -O2
>   (this category also includes runtime crashes)

Are these tests generated in a manner such that they have a very low
probability of using undefined behaviour?

Nick

The tests by design are syntactically correct, semantically correct, and have deterministic output.

-moh

Hi,
I wanted to believe you had a randomized code generator that could cover the entire
valid input space (smiles ;-)). But the test cited was not generated by a fully randomized
code generator; it is far too structured. Even so, it could have been generated by a
partially randomized code generator that uses valid code idioms with some aspects
randomized. Interesting to think about.

I wrote a fully randomized Java bytecode sequence generator once. It covered the entire
valid input space. I used it to test a Java hardware accelerator chip that had just been
taped out. It found quite a few bugs, four of which were fatal to the chip.

Writing such a tool for a general-purpose compiler would be quite a bit more
difficult, but it is doable. Imagine releasing compilers that are bug-free,
at least for a given language and a particular platform. It's doable, and I wonder
why corporations don't invest in it.

enjoy,
Karen

Hi Karen,

Thanks much for your comment and for sharing your experience. icFuzz has a core that is "really" random, but it does not cover the entire C space. The tool was designed from scratch to be extensible, and comes with a couple of extensions that target some of the optimizations that optimizing compilers typically do: CSE, loop interchange, vectorization, etc. Even in the case of extensions, other than the structure of the extension itself, most of the details are highly parametric and configurable.

The chances of a totally random test meeting all the criteria for certain optimizations to kick in are lower than with guided test generation. I actually have x-y charts, where x is the number of generated tests and y is the number of fails, for both completely random tests and random+guided tests, and the curve for guided tests is significantly above that of random tests.

Also, the test generator is restricted to the "unsigned int" type, where the C++ semantics are precisely defined even in the case of overflow, and features that have implementation-dependent behavior are excluded by design (e.g., expressions with side effects during their evaluation, out-of-bounds array accesses, etc.).
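
As a purely illustrative aside (this is not icFuzz output), the reason unsigned int keeps even overflowing expressions usable for differential testing is that the result is defined modulo 2^32, so -O0 and -O2 must agree on it:

#include <stdio.h>

int main(void) {
    unsigned int a = 4000000000u;   /* near UINT_MAX */
    unsigned int b = 1000000000u;
    /* Unsigned arithmetic wraps: (a + b) mod 2^32 == 705032704.
       The analogous signed overflow would be undefined behavior and
       therefore useless as a test oracle. */
    unsigned int c = a + b;
    printf("%u\n", c);
    return 0;
}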

But you are right in that the generated code has some structure ;-)

Cheers,
-moh

Hi Mohammad,
Thanks for clarifying. Enjoyed reading the details.

I understand your point of view. And objectives. They are quite reasonable. And it
is the natural progression to take a fully randomized engine and add in configurable
constraints. I coded in a number of such constraints for my random generator. One
could configure any subset of bytecodes for use in sequence generation. And one could
also define the weights for each valid bytecode, which all carried the same probability
by default. And I also allowed one to inject a starting sequence of bytecodes, so that
the randomized bytecodes would begin with the DUT (device under test) in a specified
state. With those capabilities, one could do tightly focused unit testing with the generator.

Using such a tool in unconstrained full random mode is a pure mathematical brute force
method. It is completely unintuitive to conventional QA and debugging methodologies.
The generated code sequences represent pure chaos and don't direct the DUT to do
anything 'useful'. In my work, I used very long randomized sequences of 50,000 fully
random bytecodes per test run. The intent was to perturb system state with each new
random input token, because every bug requires a specific state transition to occur, or
maybe a specific sequence of consecutive state transitions to occur. You need fully
automated failure detection and sufficient computing resources to realize full
coverage in a given time frame. And if everything is done properly, you can find all
the bugs in an analytically rigorous process. In my experience, the test setup can be
the most challenging part of such a system.

Good luck with your work. I believe in the efficacy of randomized testing.

enjoy,
Karen

Hi Moh,

Thanks for this. I’m really glad to see the work you’re doing in this area and believe it will be extremely helpful in improving the quality of the compiler.

-Jim


> Any help with triaging these bugs would be highly appreciated.

I've gone through all of the miscompile cases, used bugpoint to reduce them, and opened individual PRs for several distinct bugs. So far we have: PR16455 (loop vectorizer), PR16457 (sccp), PR16460 (instcombine). Thanks again for doing this! Do you plan on repeating this testing on a regular basis? Can it be automated?

-Hal

Great job, Hal!

Sure. I'd be happy to run icFuzz and report the fails once these bugs are fixed, and thereafter whenever people want new runs. Obviously, this can be automated, but the problem is that icFuzz is not currently open sourced. Once there's a bug in the compiler, there's really no limit on the number of failing tests that can be generated, so it's more productive to run the generator after the previously reported bugs are fixed.

We've also seen cases where the results of "clang -O2" are different on Mac vs. Linux/Windows.

Just let me know when you want a new run.

Cheers,
-moh

> Great job, Hal!
>
> Sure. I'd be happy to run icFuzz and report the fails once these bugs
> are fixed, and thereafter whenever people want new runs. Obviously,
> this can be automated, but the problem is that icFuzz is not
> currently open sourced.

I would be happy to see this open sourced, but I think that we can work something out regardless.

Also, once we get the current set of things resolved, I think it would be useful to test running with:

- -O3, LTO (-O4 or -flto)
- -fslp-vectorize, -fslp-vectorize-aggressive (which are actually separate optimizations)
- -ffast-math (if you can do floating point with tolerances, or at least -ffinite-math-only), -fno-math-errno

(and there are obviously a whole bunch of non-default code-generation and target options)

Is it feasible to set up runs with different flags?

> Once there's a bug in the compiler, there's really no limit on the
> number of failing tests that can be generated, so it's more productive
> to run the generator after the previously reported bugs are fixed.

Agreed.

> We've also seen cases where the results of "clang -O2" are different
> on Mac vs. Linux/Windows.

I recall an issue related to default settings for FP, and differences with libm implementation. Are there non-floating-point cases?

> Just let me know when you want a new run.

Will do!

-Hal

> Is it feasible to set up runs with different flags?

Absolutely. In fact, a common way of using the tool is to compare two different settings of two different compilers (e.g., MSVC -O0 vs. ICC -O3, etc.). I am currently working with the Mozilla folks on Emscripten/asm.js. So the way I now use icFuzz is comparing "clang -O2" vs. "emcc -O2" (which compiles C++ to JavaScript) and running the generated JavaScript code to produce the final output :-)

> Are there non-floating-point cases?

No. Currently, icFuzz tests are restricted to the "unsigned int" type to avoid FP rounding errors and to be absolutely free of false positives.

I meant "yes", the tests are non-floating-point.

I meant "yes", the tests are non-floating-point.

If we stick to strict IEEE mode and stay away from sin, cos, etc., these should be just as deterministic as unsigned integers. It would be quite useful to have floating-point tests as well.
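
For the -ffast-math case mentioned earlier, exact output diffing would no longer be reliable, so a tolerance check would be needed. A minimal sketch of such a check (an illustration of the idea, not icFuzz code; the reference and tolerance values here are made up):

#include <math.h>
#include <stdio.h>

/* Compare two results within a relative tolerance, with an absolute
   fallback (made-up threshold) for values near zero. */
static int nearly_equal(double x, double y, double rel_tol) {
    double diff = fabs(x - y);
    double scale = fmax(fabs(x), fabs(y));
    return diff <= rel_tol * scale || diff < 1e-12;
}

int main(void) {
    double reference = 1.2345678901234;  /* e.g., recorded from a -O0 run */
    double optimized = 1.2345678901233;  /* e.g., produced by a -ffast-math run */
    printf("%s\n", nearly_equal(reference, optimized, 1e-9) ? "PASS" : "FAIL");
    return 0;
}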

-Hal

Mohammad,

Can you please re-run these now? I know that the original loop-vectorizer bugs causing the miscompiles have been fixed, and the others seem to have been resolved as well.

Thanks again,
Hal

Hal,

I ran the failing tests from the attachment to bug 16431 on the latest clang trunk (version 3.4, trunk 187225).
http://llvm.org/bugs/show_bug.cgi?id=16431

The following tests still fail:
- Tests in diff: t10236 t12206 t2581 t6734 t7788 t7820 t8069 t9982
- All tests in InfLoopInClang: t19193 t22300 t25903 t27872 t33143 t8543

Meanwhile, I'll launch a new run of icFuzz and will post the results later.

-moh

Hal,

Just posted a package containing 214 small tests showing bugs in the latest Clang (3.4 trunk 187225) on MacOS X when compiled at -O2.
http://llvm.org/bugs/show_bug.cgi?id=16431

These are new tests different from the previously posted ones, but their root causes could be the same as before or could actually be new bugs.

Cheers,
-moh

Great, thanks! I'll go through them.

-Hal

From: "Hal Finkel" <hfinkel@anl.gov>
To: "Mohammad R Haghighat" <mohammad.r.haghighat@intel.com>
Cc: llvmdev@cs.uiuc.edu
Sent: Sunday, July 28, 2013 11:01:14 PM
Subject: Re: [LLVMdev] [icFuzz] Help needed with analyzing randomly generated tests that fail on clang 3.4 trunk

> Hal,
>
> Just posted a package containing 214 small tests showing bugs in
> the
> latest Clang (3.4 trunk 187225) on MacOS X when compiled at -O2.
> http://llvm.org/bugs/show_bug.cgi?id=16431
>
> These are new tests different from the previously posted ones, but
> their root causes could be the same as before or could actually be
> new bugs.

Can you rerun icFuzz on the current trunk? At least most of the failures in the last batch were caused by an alias analysis bug that has now been fixed (thanks Arnold!).

Thanks again,
Hal

Hi Hal,

Just submitted 27 failing tests on clang version 3.5, trunk 199158.
http://llvm.org/bugs/show_bug.cgi?id=16431

I expect that these failures correspond to 2+ unique bugs.

Cheers,
-moh

From: "Mohammad R Haghighat" <mohammad.r.haghighat@intel.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: llvmdev@cs.uiuc.edu
Sent: Friday, January 17, 2014 11:47:01 AM
Subject: RE: [LLVMdev] [icFuzz] Help needed with analyzing randomly generated tests that fail on clang 3.4 trunk

Hi Hal,

Just submitted 27 failing tests on clang version 3.5, trunk 199158.
http://llvm.org/bugs/show_bug.cgi?id=16431

I expect that these failures correspond to 2+ unique bugs.

Great, thanks! I'll work on reducing them in the next few days.

-Hal