GSoC on LLVM usability?

Hello all,

My name is Thibault Raffaillac, degree student at KTH, Stockholm, Sweden (in double-degree partnership with Ecole Centrale Marseille, France).
I am currently carrying my degree thesis on the usability of compilers, and as such would be very much interested in contributing to LLVM through a Google Summer of Code.
The task I would like to propose is implementing the work from my thesis: a user-friendly feedback of the optimizations performed (http://www.csc.kth.se/~traf/traf-sketch.pdf).
If this can be of interest (or unclear), please tell me so that I develop further the project application.

Best regards,
Thibault (http://www.csc.kth.se/~traf/)

Bump!

I realize my previous message may have been a bit abrupt, without much proper
explanations, though it is a serious application!

Currently, the feedback of LLVM regarding the optimizations performed is an
optional list of optimization names, through the detailed pass statistics.
Though of high relevance to a user familiar with optimizing compilers, for
unfamiliar users it might be overwhelming: one will have to “google” every
single name to learn about the different techniques.

Even for most expert users, which have not had a glimpse at LLVM source code,
it is fairly hard to guest whether some specific sequence of instructions will
be optimized as expected. docs/Passes.html clearly states the behaviour of each
optimization, but nothing guarantees that it will behave as expected. There
should be a closer feedback to was is actually done.

I am thinking about a user-friendly feedback of the optimizations performed.
It would not be meant as an exhaustive feedback as the detailed pass
statistics, but rather pick one or two optimizations done during the previous
compilation, and explain them.

As an example:
cc --feedback test.c
test.c:xx:x: info: Value Propagation: All operands being constant, ‘2560’ was directly assigned to ‘a’
test.c:xx:x: info: Value Propagation: Impossible to proceed here, because…
test.c:xx:x: info: MemCpy Optimization: The consecutive sequence of 6 stores was merged in a single memcpy
test.c:xx:x: info: Combine redundant instructions: As integers are stored in binary format, ‘* 8’ was replaced by ‘<< 3’

As a difference with the pass statistics, here the messages would form a set,
and the system would remember those already displayed and decrease their
frequency of occurence between compilations. All messages would cite the pass
name for search in the docs, explain what triggered the optimization, and
describe the consequence. Actually I think the messages could be a little
lengthier than the example (2-3 lines) to make proper explanations.

As for the work plan, it would consist in:
_ Enumerating all possible messages in the messages set.
_ Implementing a function receiving feedback from each optimization unit and
choosing whether to display it: info_printf(enum INFO_INDEX, const char*, …);
_ Write a formatting guide for adding messages in the set.

I would very much appreciate you feedback on this proposal!

Best regards,
Thibault (http://www.csc.kth.se/~traf/)

Hi Thibault,

I think in general this looks very interesting and helpful, but making this actually work seems to be difficult. Here my two major concerns:

1. Relating LLVM-IR to the source code is difficult

Especially when optimizations are enabled relating LLVM-IR to the original source code is difficult. How do you plan to give the user feedback that he can relate to the code he has written?
How do you describe transformations that are only possible due to previous transformations and that possibly do not have a related piece of original source code?

2. There will be a lot of messages

LLVM performs a lot of transformations even if you just look at instcombine. Which transformations are you planning to describe and how will you make sure the user is not flooded by messages?

Cheers
Tobi

Hi Tobias,

Thank you for replying!

1. Relating LLVM-IR to the source code is difficult

To relate LLVM-IR to the source code, I am simply considering using the
debug information which remains valid along optimizations and analysis.
Regarding transformations which might not relate to any original source code, I
think this is to be solved on a per-case basis, but will hopefully be greatly
eased by the debugging information format (at least if a transformation is
impossible to explain it will be a valuable feedback on the debug format!).
Also, some might simply not need to be exposed to the user, as they do not
refer to a standard compilation technique. This is why my proposal involves
enumerating the optimizations which will yield a message to the user, then
compose these messages.

2. There will be a lot of messages

About the potential huge amount of messages, I would assign a "type"
(represented by enum INFO_INDEX in the prototype for info_printf) to each
message. The compiler would remember which types previously appeared, and favor
the types which appeared the least. It would also handle simple dependencies,
like not introducing Loop-Closed SSA before SSA itself. As you may be concerned
about performance, the system would not store all messages before selecting the
ones to display, but rather display them or cast them out as they come.

(3). making this actually work seems to be difficult

As for the difficulty to implement this work, I have tried to propose the
simplest reasonable approach:
_ The necessary code to write (the heuristic selecting types to display) is
  contained in a single independent function.
_ I do not intend to browse and update every optimizing component. The original
  programmers are best to explain their own creations, so I will just compose
  detailed sample messages and provide usability insights.

You might think that this task underestimated in terms of coding, but the
actual difficulty will be to compose the messages. Indeed, for example how can
one give a glimpse of SSA by relating directly to the user code? In that case I
would make a reassignment trigger the message:
test.c:xx:x: info: Static Single Assignment: Reassignment of 'a' is considered as a new variable, consequently a store operation was saved by discarding its previous value"

The point is not to make an exhaustive description of the optimization
concerned, but rather to foster the user curiosity who will "google" the term
we put at the head of the message.

A revised work plan could be:
_ Enumerating all possible messages in the messages set.
_ Implementing a function receiving feedback from each optimization unit and
  choosing whether to display it: info_printf(enum INFO_INDEX, const char*, ...);
_ Updating one or two optimization blocks to display such feedback, to test
  info_printf
_ Write a formatting guide for adding messages in the set.

Regards,
Thibault