Adding a new External Suite to test-suite

Hi Johannes,

> I'd also like to know what the intention here is. What is tested and how?

We have a few uses for these benchmarks in the technical report:
http://lac.dcc.ufmg.br/pubs/TechReports/LaC_TechReport012020.pdf, but
since then we have come up with other applications. All these programs
produce object files without external dependencies. We have been using
them to train a predictive compiler that reduces code size (the
technical report has more about that). In addition, you can use them
to compare compilation time, for instance, as Michael had asked. We
have also used these benchmarks in two studies:

1) http://cuda.dcc.ufmg.br/angha/chordAnalysis
2) http://cuda.dcc.ufmg.br/angha/staticProperties

A few other applications that I know about (outside our research
group) include:

* Comparing the size of code produced by three HLS tools: Intel HLS,
Vivado and LegUp.
* Testing the Ultimate Buchi Automizer, to see which kinds of C
constructs it handles
* Comparing compilation time of gcc vs clang

A few other studies that I would like to carry out:

* Checking the runtime of different C parsers that we have.
* Trying to infer, empirically, the complexity of compiler analyses
and optimizations (a rough sketch of this last idea follows below).
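
Just to make that last idea concrete: a measurement along these lines (a rough, untested sketch, not an actual setup; it times clang itself, uses lines of code as a crude size metric, and clang could stand in for any pass or tool one wants to measure) could give a first empirical estimate of the growth rate:

```python
# Rough sketch: time `clang -O2 -c` on single-function C files of increasing
# size and fit a power law t ~ c * n^k on a log-log scale. The fitted exponent
# k is a crude empirical estimate of how compile time grows with input size.
import glob
import math
import subprocess
import sys
import time

def compile_time(c_file):
    start = time.perf_counter()
    subprocess.run(["clang", "-O2", "-c", c_file, "-o", "/dev/null"],
                   check=True, capture_output=True)
    return time.perf_counter() - start

def fitted_exponent(samples):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n, _ in samples]
    ys = [math.log(t) for _, t in samples]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

if __name__ == "__main__":
    samples = []
    for c_file in glob.glob(sys.argv[1] + "/*.c"):
        with open(c_file, errors="replace") as f:
            size = sum(1 for _ in f)          # proxy for input size: lines of code
        samples.append((size, compile_time(c_file)))
    print(f"empirical complexity exponent k ~= {fitted_exponent(samples):.2f}")
```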

> Looking at a few of these, it seems there is not much you can do, as it is little code with a lot of unknown function calls and global symbols.

Most of the programs are small (an average of 63 LLVM bytecodes, with a
standard deviation of 97); however, among these 1M C functions, we have
a few large ones with more than 40K bytecodes.

Regards,

Fernando

Hi Fernando,

> Hi Johannes,
>
>> I'd also like to know what the intention here is. What is tested and how?
>
> We have a few uses for these benchmarks in the technical report:
> http://lac.dcc.ufmg.br/pubs/TechReports/LaC_TechReport012020.pdf, but
> since then we have come up with other applications. All these programs
> produce object files without external dependencies. We have been using
> them to train a predictive compiler that reduces code size (the
> technical report has more about that). In addition, you can use them
> to compare compilation time, for instance, as Michael had asked. We
> have also used these benchmarks in two studies:
>
> 1) http://cuda.dcc.ufmg.br/angha/chordAnalysis
> 2) http://cuda.dcc.ufmg.br/angha/staticProperties
>
> A few other applications that I know about (outside our research
> group) include:
>
> * Comparing the size of code produced by three HLS tools: Intel HLS,
> Vivado and LegUp.
> * Testing the Ultimate Buchi Automizer, to see which kinds of C
> constructs it handles
> * Comparing compilation time of gcc vs clang
>
> A few other studies that I would like to carry out:
>
> * Checking the runtime of different C parsers that we have.
> * Trying to infer, empirically, the complexity of compiler analyses
> and optimizations.

All the use cases sound reasonable, but why do we need these kinds of "weird files" to do this?

I mean, why would you train or measure something on single-definition translation units and not on the original ones, potentially one function at a time?

To me this looks like a really good way to skew the input data set, e.g., you don't ever see a call that can be inlined or for which inter-procedural reasoning is performed. As a consequence each function is way smaller than it would be in a real run, with all the consequences on the results obtained from such benchmarks. Again, why can't we take the original programs instead?

>> Looking at a few of these, it seems there is not much you can do, as it is little code with a lot of unknown function calls and global symbols.
>
> Most of the programs are small (an average of 63 LLVM bytecodes, with a
> standard deviation of 97); however, among these 1M C functions, we have
> a few large ones with more than 40K bytecodes.

How many duplicates are there among the small functions? I mean, there are close to 1M functions of such a small size (and with similar pro- and epilogues).

Cheers,

Johannes

Hi Johannes,

> All the use cases sound reasonable, but why do we need these kinds of "weird files" to do this?
>
> I mean, why would you train or measure something on single-definition translation units and not on the original ones, potentially one function at a time?

I think that's the fundamental question :slight_smile: The short answer is that it
is hard to compile the files from open-source repositories
automatically. The weird files that you mentioned appear due to the
type inference that we run on them. Let me give you some data and tell
you the whole story.

One of the benchmark collections distributed on our website consists
of 529,498 C functions and their respective LLVM bytecodes. To build
it, we extracted 698,449 functions from the C files that we downloaded
from open-source repositories, with sizes varying from one line to
45,263 lines of code (the largest comes from Radare2's assembler).
Thus, we produced an initial code base of 698,449 C files, each file
containing a single function.

We ran Psyche-C (http://cuda.dcc.ufmg.br/psyche-c/) with a timeout of
30 seconds per function on this code base. Psyche-C was able to
reconstruct the dependencies of 529,498 functions, thus ensuring their
compilation. Here, compilation consists in generating an object file
out of the function.

Out of the 698,449 functions, 31,935 were directly compilable as-is,
that is, without type inference. To check which functions are directly
compilable, we invoke clang on the whole original C file where they
appear; in case of success, we count as compilable every function with
a body within that file. Hence, without type inference, we could ensure
compilation of 4.6% of the programs. With type inference, we could
ensure compilation of 75.8% of all the programs. Failures to
reconstruct types were mostly due to macros that were not syntactically
valid in C without preprocessing. Only 3,666 functions could not be
reconstructed within the allotted 30-second time slot.
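
To make the flow concrete, here is a simplified sketch of the per-function decision (a rough illustration, not our exact scripts; the Psyche-C step is only a placeholder here, since its real driver and options are not shown):

```python
# Simplified sketch of the per-function flow described above (not the exact
# scripts). `reconstruct_declarations` is only a placeholder for the Psyche-C
# type-inference step; its real command-line driver is not shown here.
import subprocess

TIMEOUT = 30  # seconds allotted to type inference for each function

def compiles(c_file):
    """True if clang can produce an object file from the file as-is."""
    result = subprocess.run(["clang", "-c", c_file, "-o", "/dev/null"],
                            capture_output=True)
    return result.returncode == 0

def reconstruct_declarations(c_file, out_file, timeout):
    """Placeholder: run Psyche-C on the snippet and write a self-contained,
    compilable version of it (original body plus inferred declarations)."""
    raise NotImplementedError("invoke the Psyche-C driver here")

def classify(single_function_file):
    if compiles(single_function_file):            # the directly compilable 4.6%
        return "compilable as-is"
    reconstructed = single_function_file + ".reconstructed.c"
    try:
        reconstruct_declarations(single_function_file, reconstructed, TIMEOUT)
    except Exception:                             # timeout or inference failure
        return "not reconstructed"
    return "compilable after inference" if compiles(reconstructed) else "failed"
```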

So, without type inference, we can automatically compile only about 5%
of the functions that we download, even considering all the
dependencies in the C files where these functions exist. Nevertheless,
given that we can download millions of functions, 5% is already enough
to give us a non-negligible number of benchmarks. However, these
compilable functions tend to be very small. The median number of LLVM
bytecodes is seven (in contrast with more than 60 once we use type
inference). These functions are unlikely to contain features such as
arrays of structs, type casts, recursive types, double pointer
dereferences, etc.

> To me this looks like a really good way to skew the input data set, e.g., you don't ever see a call that can be inlined or for which inter-procedural reasoning is performed. As a consequence each function is way smaller than it would be in a real run, with all the consequences on the results obtained from such benchmarks. Again, why can't we take the original programs instead?

Well, in the end, just using the compilable functions leads to poor
predictions. For instance, using these compilable functions, YaCoS
(it's the framework that we have been using) reduces the size of
MiBench's Bitcount by 10%, whereas using AnghaBench, it achieves
16.9%. In Susan, the naturally compilable functions lead to an
increase of code size (5.4%), whereas AnghaBench reduces size by 1.7%.
Although we can find benchmarks in MiBench where the naturally
compilable functions lead to better code reduction, these gains tend
to be very close to those obtained by AnghaBench, and seldom occur.

About inlining, you are right: there will be no inlining. To get
around this problem, we also have a database of 15K whole files, which
contains files with multiple functions. The programs are available
here: http://cuda.dcc.ufmg.br/angha/files/suites/angha_wholefiles_all_15k.tar.gz

Regards,

Fernando

Hi Johannes,

>
>> All the use cases sound reasonable, but why do we need these kinds of "weird files" to do this?
>>
>> I mean, why would you train or measure something on single-definition
>> translation units and not on the original ones, potentially one
>> function at a time?
>
> I think that's the fundamental question :slight_smile: The short answer is that it
> is hard to compile the files from open-source repositories
> automatically. The weird files that you mentioned appear due to the
> type inference that we run on them. Let me give you some data and tell
> you the whole story.
>
> One of the benchmark collections distributed on our website consists
> of 529,498 C functions and their respective LLVM bytecodes. To build
> it, we extracted 698,449 functions from the C files that we downloaded
> from open-source repositories, with sizes varying from one line to
> 45,263 lines of code (the largest comes from Radare2's assembler).
> Thus, we produced an initial code base of 698,449 C files, each file
> containing a single function.

So there are a lot of functions, most of them small. This didn't really
answer my questions and instead explained how you got here. Why do we
want functions in isolation to begin with? How many unique functions are
there after normalization? How do we measure/account for the selection
bias of this entire scheme (e.g., the 30-second timeout, systematic
problems with the inference, shortcomings of the extractor or other
tools, ...)?
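
To make the normalization question concrete: even a crude check along these lines (a sketch on my side; it only strips comments and whitespace, so it will still miss many duplicates) would give a first data point:

```python
# Rough sketch: count how many of the single-function .c files remain distinct
# after a very crude normalization (drop comments, collapse whitespace).
# Renaming identifiers or comparing at the IR level would catch more duplicates.
import glob
import hashlib
import re
import sys

def normalize(source):
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.S)  # block comments
    source = re.sub(r"//[^\n]*", " ", source)                # line comments
    return re.sub(r"\s+", " ", source).strip()               # whitespace

def main(root):
    files = glob.glob(root + "/**/*.c", recursive=True)
    hashes = set()
    for path in files:
        with open(path, errors="replace") as f:
            hashes.add(hashlib.sha256(normalize(f.read()).encode()).hexdigest())
    print(f"{len(files)} files, {len(hashes)} distinct after normalization")

if __name__ == "__main__":
    main(sys.argv[1])
```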

> We ran Psyche-C (http://cuda.dcc.ufmg.br/psyche-c/) with a timeout of
> 30 seconds per function on this code base. Psyche-C was able to
> reconstruct the dependencies of 529,498 functions, thus ensuring their
> compilation. Here, compilation consists in generating an object file
> out of the function.

TBH, I'm not even sure what this means. What dependencies are you
talking about here?

> Out of the 698,449 functions, 31,935 were directly compilable as-is,
> that is, without type inference. To check which functions are directly
> compilable, we invoke clang on the whole original C file where they
> appear; in case of success, we count as compilable every function with
> a body within that file. Hence, without type inference, we could
> ensure compilation of 4.6% of the programs. With type inference, we
> could ensure compilation of 75.8% of all the programs. Failures to
> reconstruct types were mostly due to macros that were not
> syntactically valid in C without preprocessing. Only 3,666 functions
> could not be reconstructed within the allotted 30-second time slot.
>
> So, without type inference, we can automatically compile only about 5%
> of the functions that we download, even considering all the
> dependencies in the C files where these functions exist. Nevertheless,
> given that we can download millions of functions, 5% is already enough
> to give us a non-negligible number of benchmarks. However, these
> compilable functions tend to be very small. The median number of LLVM
> bytecodes is seven (in contrast with more than 60 once we use type
> inference). These functions are unlikely to contain features such as
> arrays of structs, type casts, recursive types, double pointer
> dereferences, etc.
>
>> To me this looks like a really good way to skew the input data set,
>> e.g., you don't ever see a call that can be inlined or for which
>> inter-procedural reasoning is performed. As a consequence each
>> function is way smaller than it would be in a real run, with all the
>> consequences on the results obtained from such benchmarks. Again, why
>> can't we take the original programs instead?
>
> Well, in the end, just using the compilable functions leads to poor
> predictions. For instance, using these compilable functions, YaCoS
> (it's the framework that we have been using) reduces the size of
> MiBench's Bitcount by 10%, whereas using AnghaBench, it achieves
> 16.9%. In Susan, the naturally compilable functions lead to an
> increase of code size (5.4%), whereas AnghaBench reduces size by 1.7%.
> Although we can find benchmarks in MiBench where the naturally
> compilable functions lead to better code reduction, these gains tend
> to be very close to those obtained by AnghaBench, and seldom occur.

You are making my point here. You say that using even "simpler" functions that
"compile" by default might not be the best way to learn from code and
that using "more complex" functions can really help. Now make the same
argument but with different cutoffs for "simple" and "complex", e.g.,
these single functions with type inference applied are simple and real
code is complex. Sure, you need more "machinery" to compile real code,
but you already use (your research) machinery (Psyche-C) in the first
scenario.

One could also argue that the code came initially with build recipes we
should not ignore. With CMake we can generate compile commands, you can
use a wrapper to catch the invocations of the compiler, ... why
shouldn't we then be able to compile the code (potentially after
preprocessing with the intercepted options)?
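
Concretely, for CMake-based projects something along these lines (an untested sketch that relies only on the standard compile_commands.json format) would let us replay the original compiler invocations one translation unit at a time:

```python
# Rough sketch (untested): replay the original build recipe one translation
# unit at a time, using the compilation database CMake can emit with
#   cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON ...
# Each entry records the directory, the exact compiler command, and the file.
import json
import shlex
import subprocess
import sys

def replay(compile_commands_path):
    with open(compile_commands_path) as f:
        entries = json.load(f)
    ok = 0
    for entry in entries:
        # Entries carry either a single "command" string or an "arguments" list.
        args = entry.get("arguments") or shlex.split(entry["command"])
        result = subprocess.run(args, cwd=entry["directory"], capture_output=True)
        ok += result.returncode == 0
    print(f"{ok}/{len(entries)} translation units compiled with original flags")

if __name__ == "__main__":
    replay(sys.argv[1])  # path to compile_commands.json
```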

> About inlining, you are right: there will be no inlining. To get
> around this problem, we also have a database of 15K whole files, which
> contains files with multiple functions. The programs are available
> here: http://cuda.dcc.ufmg.br/angha/files/suites/angha_wholefiles_all_15k.tar.gz

What is the benefit of the 700k single-function collection over this
one? Are these original source files or modified ones?

Cheers,
Johannes

Hi Johannes,

> One could also argue that the code came initially with build recipes we should not ignore. With CMake we can generate compile commands, you can use a wrapper to catch the invocations of the compiler, ... why shouldn't we then be able to compile the code (potentially after preprocessing with the intercepted options)?

That is a good idea. I suppose one could combine the parsing of build
files with heuristics to automatically compile benchmarks mined from
open-source repositories without type inference. It would be a new
research project, though. There has been some recent interest in the
automatic generation of benchmarks for predictive compilers, and there
are today several different approaches to generate them. The
state-of-the-art published papers on the subject are from Edinburgh
(Cummins, Leather, et al.), e.g.:

* Compiler fuzzing through deep learning. ISSTA 2018: 95-105
* End-to-End Deep Learning of Optimization Heuristics. PACT 2017: 219-232
* Synthesizing benchmarks for predictive modeling. CGO 2017: 86-99

About your other questions, I will answer them to you in private. If
others are interested, let me know, and I will copy the answers into
the thread.

Regards,

Fernando