[GSoC] Machine learning and compiler optimizations: using inter-procedural analysis to select optimizations

Dear all,

My name is Konstantin Sidorov, and I am a graduate student in Mathematics at Moscow Institute of Physics and Technology.

I would like to work on a project “Machine learning and compiler optimizations: using inter-procedural analysis to select optimizations” during the Google Summer of Code 2021.

I have an extensive background relevant to this project; in particular:

  • I have already participated in GSoC before, in 2017, with the mlpack organization on the project “Augmented RNNs”: https://summerofcode.withgoogle.com/archive/2017/projects/4583913502539776/
  • In 2019 I graduated from the Yandex School of Data Analysis, a two-year program in data analysis run by Yandex (the leading Russian search engine); more information on the curriculum can be found at https://yandexdataschool.com/.
  • I have also been working as a software engineer at Adeptik since July 2018, where I have predominantly worked on applied combinatorial optimization problems, such as vehicle routing and supply chain modeling. In particular, I have experience with both metaheuristic algorithms (e.g., local search and genetic algorithms) and more “traditional” mathematical modeling (e.g., linear programming and constraint programming).

I would like to discuss this project in more detail. While it is hard to commit to an exact plan at this stage, I already have two questions concerning this project:

(1) I have set up an LLVM dev environment, but I am unsure what to do next. Could you advise me on any simple (and, preferably, relevant) tasks to work on?
(2) Could you suggest any learning materials to improve the understanding of “low-level” concepts? (E.g., CPU concepts such as caching and SIMD)

Best regards,
Konstantin Sidorov

Hi Konstantin,

Let me try to help with the last two questions:

  1. From what I understand, you don’t have much experience with LLVM (or compilers in general? If so, I can recommend more material). Given that, I’d start with these two videos:
    a) https://www.youtube.com/watch?v=m8G_S5LwlTo
    b) https://www.youtube.com/watch?v=3QQuhL-dSys

These cover fundamental concepts and you’ll have to know most of them no matter what you do in LLVM.
The next step depends on what you like. I’d say you could either play with an already existing pass or create your own.

Personally, I would go with the second, although it seems more difficult (it’s not). First, because already existing passes
are production quality, which means they contain a lot of material that is not exactly educational and would not be good for a beginner IMO
(so, for example, if I were to start with an already existing pass, I’d start with one implemented for a course or something similar).
Second, because by creating your own pass, you will see much faster how everything fits together.

Now, the problem with creating your own pass is that most relevant tutorials out there either use the old pass manager
or build an out-of-tree pass (or both). The result is that you have to deal with a pile of things that are irrelevant (at least for now).

So, I’ll go ahead and recommend something else: just delete the code of an already existing pass and start over :) In particular,
find the ::run function and just delete what it does and call your own functions. For example, take this [4] and replace it with e.g. this
https://pastebin.com/PhMLvMH7

You can now call “your” pass by passing -passes="loop-distribute" to the opt tool.

This should give you enough knowledge of the infrastructure to, e.g., program a dominators pass [5]. (If you don’t know about dominators,
it’s important that you learn about them; however, Wikipedia is not a good source. I’d pick a book like “Engineering a Compiler”.)
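
For a flavor of what such a pass computes, here is a toy, LLVM-free sketch of the classic iterative dominator-set algorithm (the CFG representation and node numbering are made up for illustration; production implementations such as LLVM's use faster idom-tree algorithms):

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <vector>

// Toy iterative dominator computation on a CFG given as predecessor lists.
// Data-flow equations: dom(entry) = {entry};
//                      dom(n)     = {n} U intersection of dom(p) over preds p.
// Iterate until a fixed point is reached.
std::vector<std::set<int>> dominators(const std::vector<std::vector<int>> &preds,
                                      int entry) {
  int n = preds.size();
  std::set<int> all;
  for (int i = 0; i < n; ++i)
    all.insert(i);
  std::vector<std::set<int>> dom(n, all); // start optimistic: everyone dominates
  dom[entry] = {entry};
  bool changed = true;
  while (changed) {
    changed = false;
    for (int b = 0; b < n; ++b) {
      if (b == entry)
        continue;
      std::set<int> d = all;
      for (int p : preds[b]) {
        std::set<int> tmp;
        std::set_intersection(d.begin(), d.end(), dom[p].begin(), dom[p].end(),
                              std::inserter(tmp, tmp.begin()));
        d = std::move(tmp);
      }
      d.insert(b); // a block always dominates itself
      if (d != dom[b]) {
        dom[b] = d;
        changed = true;
      }
    }
  }
  return dom;
}
```

For a diamond CFG 0→{1,2}→3, this yields dom(3) = {0, 3}: neither branch block dominates the join block.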

  2. I would start with these two:

a) CppCon 2016: Timur Doumler “Want fast C++? Know your hardware!” [1]
b) Writing cache friendly C++ - Jonathan Müller - Meeting C++ 2018 [2]

If you want to go more in depth, you can watch this course [3] from CMU on computer architecture; I think it’s amazing. You can cherry-pick topics if you want.
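
A tiny, self-contained example of the effect those talks revolve around (the sizes below are arbitrary): both functions compute the same sum over a row-major matrix, but one walks memory sequentially while the other strides by a full row on every step.

```cpp
#include <cassert>
#include <vector>

// Sum a rows x cols matrix stored row-major, walking consecutive addresses.
// This traversal matches the memory layout, so each fetched cache line is
// fully used before the next one is needed.
double sumRowMajor(const std::vector<double> &m, int rows, int cols) {
  double s = 0;
  for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c)
      s += m[r * cols + c]; // consecutive addresses: cache friendly
  return s;
}

// Same sum, but column by column: every access jumps by `cols` doubles,
// so on large matrices each access may touch a different cache line.
// Typically several times slower despite identical arithmetic.
double sumColMajor(const std::vector<double> &m, int rows, int cols) {
  double s = 0;
  for (int c = 0; c < cols; ++c)
    for (int r = 0; r < rows; ++r)
      s += m[r * cols + c]; // stride of `cols` doubles: cache hostile
  return s;
}
```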

Hope this helps, let me know if there are other questions.

- Stefanos

[1] https://www.youtube.com/watch?v=BP6NxVxDQIs
[2] https://www.youtube.com/watch?v=Nz9SiF0QVKY
[3] https://www.youtube.com/playlist?list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq
[4] https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Scalar/LoopDistribute.cpp#L1043
[5] https://en.wikipedia.org/wiki/Dominator_(graph_theory)#Algorithms

On Tue, Jan 19, 2021 at 5:05 AM, Сидоров, Константин Сергеевич via llvm-dev <llvm-dev@lists.llvm.org> wrote:

Hello Konstantin,

if it helps, you may consider joining the ‘ML Guided Compiler Optimizations’ working group - next meeting is Feb 5.

Dear all,

I would like to continue the discussion of the GSoC project I mentioned in my previous email. Now that I know my way around the LLVM codebase, I would like to propose a first draft of the plan:

  • Improving heuristics for existing passes – to start, I propose working on MLInlineAdvisor (as far as I understand, the ML infrastructure is already in place in this class, so it seems a good place to start) and afterwards switching to other passes (e.g., LoopVectorizationPlanner seems to be a good candidate for this approach, and the LoopRotate class contains a profitability heuristic that could also be studied more deeply).
  • Machine learning models to select the optimizations – to the best of my understanding, the key concept here is the pass manager, but I don’t quite understand the technical details of deciding which optimizations to select. For this reason, I would like to discuss this part more thoroughly.

If the project mentors are reading this mailing list and are interested, can we start the discussion here?

By the way – I would like to thank Stefanos for his comprehensive response to my previous questions, which helped me get started :)

Looking forward to a further discussion,
Konstantin Sidorov

On Tue, Jan 19, 2021 at 7:04 AM, Сидоров, Константин Сергеевич <sidorov.ks@phystech.edu> wrote:

Hi Konstantin,

First of all, I’m glad that my response helped! And also that you’re still into this project. Don’t hesitate to ask any other questions.

Again, not really an ML expert here, but I just saw the reference to LoopRotate. You probably mean this: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Scalar/LoopRotation.cpp#L52
I just wanted to mention that, although I have not done any benchmarks, I speculate that you won’t get any substantial benefit from working on this. Waiting on others’ opinions too, but I think your time will
be better spent elsewhere.
(Btw, if it’s not clear what this header duplication thing referenced there is, let me know.)

Good luck in your GSoC application and project!

Best,
Stefanos

On Sat, Feb 6, 2021 at 12:35 PM, Сидоров, Константин Сергеевич via llvm-dev <llvm-dev@lists.llvm.org> wrote:

Hi Konstantin,

I didn't find the time to write new GSoC projects for 2021 yet but
if you are interested we could probably set one up in this area. I
also CC'ed Mircea who might be interested in this too, maybe as
(co-)mentor.

We could look at loop transformations such as unrolling and fusion,
similar to the inliner work. Best case, we can distill a heuristic
out of a model we learned. We could also look at pass selection and
ordering. We started last year and I was hoping to continue. You
might want to watch https://youtu.be/TSMputNvHlk?t=617 and
https://youtu.be/nxfew3hsMFM?t=1435 .

In case you're interested in a runtime topic, I really would love to
have a predictor for grid/block/thread block size for (OpenMP) GPU
kernels. We are having real trouble on that end.

I also would like to look at ML use in testing and CI.

Let me know what area sounds most interesting to you and we can
take it from there.

~ Johannes


Hi,

We are working on an ML model that can predict the profitability of offloading a kernel to the GPU. I feel that this problem is closely related; it has the same challenges in terms of feature engineering and data preparation as the one we are handling in our work.

Abid

Hello Johannes,

I guess working on the loop transformations is a good starting point – firstly, it is similar to already existing code, and secondly, this problem doesn’t look too hard (especially compared with the other ideas). To the best of my understanding, it corresponds to refactoring LoopUnrollAnalyzer and, if I understood you correctly, MacroFusion, in a similar way to what has been done with the InlineAdvisor.

As for the next step, I think that knowledge distillation is a promising idea – in fact, we can experiment with the approaches from [1], which can yield a nice inference speed-up in those models.

I think working on some kind of unified pipeline for pass selection and ordering is also an interesting idea – off the top of my head, a viable approach is to treat pass scheduling as a single-player game and run a Monte-Carlo tree search to maximize some objective function. For example, [2] uses this kind of approach to learn to solve vertex cover and max-cut, while [3] employs it to search for molecule designs with specified properties. See also [4] for a survey of RL methods (including MCTS) for combinatorial problems.
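
To make the single-player-game framing concrete, here is a heavily simplified sketch. Everything in it is made up for illustration: the “passes” are just integers and the objective is a toy stand-in for measured performance. It also does flat Monte-Carlo rollouts rather than full MCTS (no tree, no UCB selection), but the shape of the search is the same:

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Made-up stand-in for a measured objective: rewards every place where
// "pass 0" (think: a canonicalization) immediately precedes "pass 1"
// (think: a pass that benefits from canonical form).
static double toyObjective(const std::vector<int> &schedule) {
  double reward = 0;
  for (std::size_t i = 0; i + 1 < schedule.size(); ++i)
    if (schedule[i] == 0 && schedule[i + 1] == 1)
      reward += 1.0;
  return reward;
}

// Flat Monte-Carlo rollout search: at each step, score every candidate
// next pass by the average objective of random schedule completions, then
// commit to the best-scoring one. Full MCTS would add a search tree and
// UCB-based selection on top of this skeleton.
std::vector<int> rolloutSearch(int numPasses, int scheduleLen, int rollouts,
                               unsigned seed = 42) {
  std::mt19937 rng(seed);
  std::uniform_int_distribution<int> pick(0, numPasses - 1);
  std::vector<int> schedule;
  while ((int)schedule.size() < scheduleLen) {
    int bestMove = 0;
    double bestAvg = -1.0;
    for (int move = 0; move < numPasses; ++move) {
      double total = 0;
      for (int r = 0; r < rollouts; ++r) {
        std::vector<int> sim = schedule;
        sim.push_back(move);
        while ((int)sim.size() < scheduleLen)
          sim.push_back(pick(rng)); // random completion of the schedule
        total += toyObjective(sim);
      }
      double avg = total / rollouts;
      if (avg > bestAvg) {
        bestAvg = avg;
        bestMove = move;
      }
    }
    schedule.push_back(bestMove);
  }
  return schedule;
}
```

With two passes and this objective, the search settles on the alternating schedule 0, 1, 0, 1, since that maximizes the number of rewarded adjacencies.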

In case you're interested in a runtime topic, I really would love to
have a predictor for grid/block/thread block size for (OpenMP) GPU
kernels. We are having real trouble on that end.

I’m afraid I didn’t quite understand this one – could you elaborate a bit more on this topic?

Best regards,
Konstantin Sidorov

[1] https://github.com/lhyfst/knowledge-distillation-papers#recommended-papers
[2] Abe, Kenshin et al. Solving NP-Hard Problems on Graphs with Extended AlphaGo Zero. https://arxiv.org/abs/1905.11623
[3] Kajita, S., Kinjo, T. & Nishi, T. Autonomous molecular design by Monte-Carlo tree search and rapid evaluations using molecular dynamics simulations. Commun Phys 3, 77 (2020). https://doi.org/10.1038/s42005-020-0338-y
[4] Mazyavkina, Nina et al. Reinforcement Learning for Combinatorial Optimization: A Survey. https://arxiv.org/abs/2003.03600

сб, 6 февр. 2021 г. в 23:28, Johannes Doerfert <johannesdoerfert@gmail.com>:

Hi Konstantin,

Hello Johannes,

I guess working on the loop transformations is a good starting point –
firstly, it is similar to already existing code, and secondly, this
problem doesn't look too hard (especially compared with the other ideas).
To the best of my understanding, it corresponds to refactoring
`LoopUnrollAnalyzer` and, if I understood you correctly, `MacroFusion`, in
a similar way to what has been done with the `InlineAdvisor`.

That sounds good to me. As you said, we'd have a nice "template"
to guide such exploration. Given that GSoC is short this year that
sounds certainly better than a big standalone project.

As for the next step, I think that knowledge distillation is a promising
idea – in fact, we can experiment with the approaches from [1], which can
yield a nice inference speed-up in those models.

What I meant was that it would be nice if we not only come up with
an ML advisor but also an improved non-ML predictor based on the
insights we can extract from the model. I'm not sure if that came
across right.
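
A minimal sketch of that idea, with everything in it (the feature, the data, the one-threshold rule shape) made up for illustration: harvest (feature, decision) pairs from the learned advisor, then search for the single threshold that best reproduces its decisions. The winning threshold is exactly the kind of distilled, inspectable non-ML heuristic meant above.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Distill a black-box advisor into a one-feature threshold rule: try every
// observed feature value as a candidate threshold and keep the one that
// agrees with the model's decisions most often. The samples are assumed to
// be (feature, model_said_yes) pairs harvested by running the learned
// advisor over a training corpus.
double distillThreshold(const std::vector<std::pair<double, bool>> &samples) {
  double bestThreshold = 0.0;
  int bestAgreement = -1;
  for (const auto &candidate : samples) {
    double t = candidate.first; // rule: predict "yes" iff feature <= t
    int agreement = 0;
    for (const auto &s : samples)
      if ((s.first <= t) == s.second)
        ++agreement;
    if (agreement > bestAgreement) {
      bestAgreement = agreement;
      bestThreshold = t;
    }
  }
  return bestThreshold;
}
```

For instance, if the advisor effectively says "yes" for callee sizes up to 10, the recovered threshold of 10 could be folded back into a plain, human-readable heuristic.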

I think working on some kind of unified pipeline for pass selection and
ordering is also an interesting idea – off the top of my head, a viable
approach here is to consider a pass scheduling as a single-player game and
running a Monte-Carlo tree search to maximize some objective function. For
example, in [2] this kind of approach is used for learning to solve vertex
cover and max-cut, while [3] employs this approach for searching for the
molecule design with the specified properties. See also [4] for a survey of
RL methods (including MCTS) for combinatorial problems.

The caveat is, we actually don't want to learn arbitrary pass
pipelines because they are simply not practical. That said, we
could define/learn building blocks which we then combine. I would
however start small here as well, e.g., take a canonicalization
pass, e.g., simplifycfg, and learn when in the pipeline we should
run it, potentially based on IR features.

In case you're interested in a runtime topic, I really would love to
have a predictor for grid/block/thread block size for (OpenMP) GPU
kernels. We are having real trouble on that end.

I'm afraid I didn't quite understand this one – could you elaborate a bit
more on this topic?

So the problem is as follows:

Given a GPU kernel, how many thread blocks and threads per thread
block should be used? This can heavily impact performance, and so far
we use semi-arbitrary numbers in the OpenMP GPU offloading runtime.

We would use IR information, PTX information, GPU information, and
kernel parameters, to learn and eventually make a decision.
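
To illustrate the shape of the decision being replaced (the limits and the rounding rule below are made-up placeholders, not what the OpenMP runtime actually does): today this is a hard-coded heuristic along these lines, and the project would learn the decision from IR, PTX, and kernel features instead.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for the launch-configuration decision: given the
// number of work items, pick a thread-block size (rounded up to a full
// warp) and enough blocks to cover all items. A learned predictor would
// replace these fixed rules with a decision based on IR information, PTX
// information, GPU information, and kernel parameters.
struct LaunchConfig {
  long blocks;
  int threadsPerBlock;
};

LaunchConfig pickLaunchConfig(long numItems, int maxThreadsPerBlock = 1024,
                              int warpSize = 32) {
  // Use at most maxThreadsPerBlock threads, rounded up to a warp multiple.
  long wanted = std::min<long>(numItems, maxThreadsPerBlock);
  int threads = static_cast<int>((wanted + warpSize - 1) / warpSize) * warpSize;
  // Enough blocks to cover every work item (ceiling division).
  long blocks = (numItems + threads - 1) / threads;
  return {blocks, threads};
}
```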

Does that make more sense?

~ Johannes

P.S. Given that we are defining a project via email, I will not write
it up for the open projects page. Instead, I expect you to simply apply
to GSoC with some description we come up with here.

Hi Konstantin,

I didn’t find the time to write new GSoC projects for 2021 yet but
if you are interested we could probably set one up in this area. I
also CC’ed Mircea who might be interested in this too, maybe as
(co-)mentor.

Sorry for being slow to answer this - happy to help if needed!

A suggestion for a potential project: bringing this work to LLVM in a manner similar to what we did for the inliner - i.e. transparently integrating a general-purpose learned policy in the compiler.