Potential Google Summer of Code Applicant

Hi, my name is Patrick Edwards, and I’m currently a CS major at Kent State University. I have always been interested in doing work with compilers and LLVM seems to be a perfect fit for me to learn more over the summer, and also contribute to open-source projects at the same time. However, while browsing through the project ideas, the only ideas I found accessible were the code reduction and compile with/benchmark the LLVM compiler. I would really love to help LLVM, as I have used C++ in quite a few of my classes and side programs, not to mention learning other oddball languages as I wanted. If possible, could someone please point me in the right direction to contribute to LLVM in the best way I can?

Thank you in advance,
Patrick Edwards, potential GSoC applicant

Hi Patrick,

besides the open project pages of the LLVM itself [1], there are also the ideas list of the subprojects:

- The clang open projects list [2]
- The SAFECode open projects list [3]
- The Polly todo list [4]

As you already realized, not all of them are suiteable for summer of code, but some of them definitely are.

A topic I personally would think might be an interesting summer of code project is a clang based gnu 'indent' replacement. Here the idea from the clang open projects page:

"Use clang libraries to implement better versions of existing tools: Clang is built as a set of libraries, which means that it is possible to implement capabilities similar to other source language tools, improving them in various ways. Three examples are distcc, the delta testcase reduction tool, and the "indent" source reformatting tool. distcc can be improved to scale better and be more efficient. Delta could be faster and more efficient at reducing C-family programs if built on the clang preprocessor, indent could do proper formatting for complex C++ features, and it would be straight-forward to extend a clang-based implementation to handle simple structural rules like those in the LLVM coding standards."

I am not an expert in clang itself, but on the clang mailing list, you there are plenty of people who are. I copied Manuel, as I think he was already planning to work on something like this.

Cheers
Tobi

[1] http://llvm.org/OpenProjects.html
[2] http://clang.llvm.org/OpenProjects.html
[3] SAFECode: Open Projects
[4] http://polly.llvm.org/todo.html

+cfe-dev, +chandlerc

Hi, my name is Patrick Edwards, and I'm currently a CS major at Kent
State University. I have always been interested in doing work with
compilers and LLVM seems to be a perfect fit for me to learn more over
the summer, and also contribute to open-source projects at the same
time. However, while browsing through the project ideas, the only ideas
I found accessible were the code reduction and compile with/benchmark
the LLVM compiler. I would really love to help LLVM, as I have used C++
in quite a few of my classes and side programs, not to mention learning
other oddball languages as I wanted. If possible, could someone please
point me in the right direction to contribute to LLVM in the best way I
can?

Thank you in advance,
Patrick Edwards, potential GSoC applicant

Hi Patrick,

besides the open project pages of the LLVM itself [1], there are also the
ideas list of the subprojects:

- The clang open projects list [2]
- The SAFECode open projects list [3]
- The Polly todo list [4]

As you already realized, not all of them are suiteable for summer of code,
but some of them definitely are.

A topic I personally would think might be an interesting summer of code
project is a clang based gnu 'indent' replacement. Here the idea from the
clang open projects page:

"Use clang libraries to implement better versions of existing tools: Clang
is built as a set of libraries, which means that it is possible to implement
capabilities similar to other source language tools, improving them in
various ways. Three examples are distcc, the delta testcase reduction tool,
and the "indent" source reformatting tool. distcc can be improved to scale
better and be more efficient. Delta could be faster and more efficient at
reducing C-family programs if built on the clang preprocessor, indent could
do proper formatting for complex C++ features, and it would be
straight-forward to extend a clang-based implementation to handle simple
structural rules like those in the LLVM coding standards."

I am not an expert in clang itself, but on the clang mailing list, you there
are plenty of people who are. I copied Manuel, as I think he was already
planning to work on something like this.

While it is a topic where we'll want to see major stuff going on over
the coming year, I'm not sure that the underlying infrastructure will
meet the GSoC timeline - we're currently in the process of getting the
basics into mainline, and I think it would be a pain to now start
writing larger amounts of codes on something that we'll need to change
later.

Cheers,
/Manuel

Ah OK. I would have loved to have such a tool, but as a non-clang expert, I can obviously not judge if it is suited. Maybe you are aware of other projects suitable for GSoC,

Tobi

Ah OK. I would have loved to have such a tool, but as a non-clang
expert, I can obviously not judge if it is suited. Maybe you are aware
of other projects suitable for GSoC,

But still many things in "Use clang libraries to implement better
versions of existing tools" are doable as it seems to me. E.g. delta
replacement based on clang.

Yes, thanks Anton for clarifying: my comment was only in reference to
the "indent" formatting tool.

Hi, my name is Patrick Edwards, and I’m currently a CS major at Kent State University. I have always been interested in doing work with compilers and LLVM seems to be a perfect fit for me to learn more over the summer, and also contribute to open-source projects at the same time. However, while browsing through the project ideas, the only ideas I found accessible were the code reduction and compile with/benchmark the LLVM compiler. I would really love to help LLVM, as I have used C++ in quite a few of my classes and side programs, not to mention learning other oddball languages as I wanted. If possible, could someone please point me in the right direction to contribute to LLVM in the best way I can?

Here are two random ideas I’ve had:

  1. Run tests to see how well Google’s Go language works with Dragonegg. Fix bugs that you find. Measure performance of Dragonegg-compiled Go programs compiled over GCC-compiled and 6g compiled Go programs.

  2. Create a Clang front-end for Google Go. This would enable static analysis to be built for Google Go just as they are built for C/C++/Objective-C now.

If you haven’t already, you might want to investigate the Open Projects pages of LLVM sub-projects such as compiler-rt, SAFECode, polly, klee, etc.

– John T.

Ah OK. I would have loved to have such a tool, but as a non-clang
expert, I can obviously not judge if it is suited. Maybe you are aware
of other projects suitable for GSoC,

But still many things in "Use clang libraries to implement better
versions of existing tools" are doable as it seems to me. E.g. delta
replacement based on clang.

+1 as a would-be user of a tool like that.

Ah OK. I would have loved to have such a tool, but as a non-clang
expert, I can obviously not judge if it is suited. Maybe you are aware
of other projects suitable for GSoC,

But still many things in "Use clang libraries to implement better
versions of existing tools" are doable as it seems to me. E.g. delta
replacement based on clang.

+1 as a would-be user of a tool like that.

Actually John Regehr & others have implemented something like this
called CReduce ( GitHub - csmith-project/creduce: C-Reduce, a C and C++ program reducer ). It's
still a little limited in its C++ transformation abilities but seems
to provide a fairly extensible foundation & is already proving quite
useful. I'm not sure whether there's sufficient meat in there for a
GSoC project to improve CReduce's abilities in C++ (or other
dimensions) - but there's certainly some open work there, it seems.

- David

Hi David et al.,

We could certainly use help in writing reducer passes for C++. My student Yang Chen has just (in the past 2 weeks or so) launched into doing these, but there's definitely enough work for more people.

I would certainly be willing to help supervise reducer work if it got approved as a GSoC project.

If, in some ideal world, it became possible to free up Yang to work on other things, I would ask if he wanted to work on a random C++ program generator (C++smith or whatever). My sense is this would be pretty useful for the LLVM and GCC commnities.

John

I would really like to see someone work on LLVM’s garbage collection support - it hasn’t been updated in 4 years, and while there’s been a lot of talk about ways that it could be improved, there’s been no action.

That is *sooo* true! :slight_smile: I'm one of the authors of an LLVM backend for Erlang (ErLLVM [1]); we have tested and measured our backend and noticed that with the current GC infrastructure we see 20-40% performance degradation (because of the loads/stores on the stack for all gcroots). It is clear to me and the rest of the team that with this infrastructure the LLVM might not be suitable for languages with explicit garbage collection, like Erlang. I've also studied the way the Vmkit project handles GC and they seem to face the same deficiency too.

offtopic: I am working on an email (more like an RFC) with all the details and patches to the LLVM project in order to support our Erlang backend. I hope I will be able to send it by next week. Note, that we have already talked with the Ericsson/OTP team about integrating our work in a future release of Erlang/OTP (as a new HiPE backend).

Yiannis

[1]: http://erllvm.softlab.ntua.gr

Hi,

I'm currently working for the next 6 months, but I would be very interested in looking into this. Are there any discussions in this mailing list that would be useful in finding out more information?

Regards

Michael Thorpe
Internet Services Developer
Netcraft Ltd

Sorting through all of the discussions would be difficult, as the ideas have morphed over the years. Also, some of the discussion took place offline at various LLVM dev conferences.

I can summarize the main points here:

The biggest improvement in GC would be to allow SSA values to be declared as GC roots - currently only alloca values, that is, values in memory, can be GC roots. This means that frontends need to copy traceable values to memory before calling any function that might trigger a collection. This makes the resulting IR complex and inefficient.

Chris’s proposal is that we add metadata to the Pointer type (perhaps using the address space field) to indicate that a pointer is a GC root. Note that this would not be a replacement for the existing llvm.gcroot intrinsic, but rather an addition to it - that is, you could choose to use either the Pointer attribute or the llvm.gcroot intrinsic. The reason for this is that for some languages, there are non-pointer GC roots (my language has discriminated unions that sometimes contains pointers and sometimes not.) These frontends would need to continue to use the existing intrinsics, but fortunately such cases are fairly rare and this is not a terrible burden. For most languages, the Pointer attribute would be a much easier way to deal with roots, since the “root-ness” of a value is simply a property of it’s type, and so gets carried along with the value automatically.

Thus, in this new scheme, Pointer types which were marked as roots could be traced regardless of whether they lived within a memory location or a register. These could be used to generate either stack maps or register maps. Note that these maps are language-specific and are implemented via a language-specific plugin, so you would need to update the current API to deal with items in registers. (An alternative plan is to allow the frontend to request via a flag that roots be copied to local variables at each safe point, so that register maps would not be needed. This alternate plan might be a good first step, and then move on to the more difficult problem of register maps if the first step is successful.)

The reason why this is difficult is that the presence of garbage collection roots has a major impact on the optimizer and backend code generators - certain optimizations can cause the stack map to be incorrect, so these optimizations must be prevented or compensated for. However, an advantage is that variables which are optimized away no longer need to be included in stack maps - something that is not possible with the current approach.

One other limitation of the Pointer approach over the existing llvm.gcroot system, is that the latter allows complex metadata to be associated with each root. This is useful in languages that don’t use tagged objects, that is, the type of every object it known at compile time. However, creating a metadata pointer for every Pointer type would be expensive, so the Pointer roots would only be used for languages which use tagged objects - which is fortunately most languages that use GC.

An even more ambitious plan would be to allow structs in SSA values to be declared as roots, which would be useful for languages like mine. We wouldn’t use register maps for these, since a struct might get “sliced” into multiple registers. Instead, the code generator would automatically spill the struct value to memory during a safe point, and otherwise treat the struct like the existing llvm.gcroot() intrinsic does. Note that this is a much harder problem, and would not be needed for languages like Java or Python where objects are always passed by reference. I wouldn’t expect an initial implementation to attempt to tackle this harder problem.

That would be the biggest improvement that I can think of. There are a few other minor improvements that I would also like to see:

One would be a set of intrinsics that would allow efficient iteration of stack frames in an efficient manner. The existing LLVM stack frame intrinsics are inefficient and cannot be relied upon in many cases. Basically you’d want 3 intrinsics, which would work on all supported platforms: The first would return an opaque reference to the current stack frame; The second would take a stack frame as it’s argument and return a pointer to it’s parent stack frame; And the third would take a stack frame argument and return a base address for the local variables of that frame. The lanuguage specific runtime would then use this base address, along with the generated stack maps, to access all of the stack roots for that frame.

An example of how this stack walking is done can be seen here: http://code.google.com/p/tart/source/browse/trunk/runtime/lib/gc_common.cpp#155 However, this code only works on x86 - the intrinsics that I envision would work on a much wider set of backend targets.

Note that these items are just a tiny part of a complete collector, however, the design of LLVM is that each language is supposed to implement its own collector, and LLVM only supplies the parts that need to be integrated into the code generator.

I can also suggest ways to test the new features without having to build a complete garbage collector. For example, one can create a trivial stack walker that merely counts the number of non-null root pointers, and write various unit tests that verify that the results are as expected.

I realize that this was written in a hurry, and may not have been entirely clear. If there are any questions, critiques, etc., I’d be happy to respond to them. I’d really like it if LLVM’s garbage collection support didn’t continue to languish…

Just a thought, but it would it make sense for garbage collection to be some sort of minimal debug information for potentially optimized code. Store just enough debug information to reconstruct call stacks and know where gc-roots are. Perhaps an approach like this could minimize the work required as it is shared between gc-support and debug information support. From what I understand, DWARF exception handling is similar in that it makes use of similar information to understand where things are during unwinding.

Actually, I’m pretty happy with the way that LLVM handles this aspect of garbage collection now. LLVM does not itself generate any data related to garbage collection - all it does is supply a plugin interface that lets your code know where on the stack the roots are. Your code is responsible for generating any static data structures that would be read by your garbage collector. So if you wanted to generate something that is DWARF-like, you could certainly do so. Note that you probably want something that is optimized for speed, and decoding DWARF DIEs is fairly involved. In general I’ve found that it’s not all that difficult to generate this data, and the size of the data isn’t all that large, especially if you take advantage of the fact that a lot of functions can share the same static data, since they have identical stack layouts.

Note that tracking stack roots is the only thing that LLVM does for you in the way of garbage collection - your compiler’s runtime is responsible for managing the heap, implementing tracing, and all of the other tasks related to garbage collection. And this is fine, for several reasons: First, because different languages are going to have different strategies for garbage collection, and second, because the tracking of stack roots is the only place where the compiler really needs to be involved (well, that and read/write barriers if you are doing an advanced collector.) Everything else is just library code.

So I have no complaints about the output of LLVM’s GC support. It’s the input side where I think serious effort is needed. The method for telling LLVM which variables are roots and which are not is clunky and has a tacked on feel.

The way things work now is that there is an intrinsic function called llvm.gcroot which is a “do-nothing” function - it takes one argument (which must be a pointer that was created by an Alloca instruction) and simply returns that value unchanged. However, the presence of this function call is used as a “marker” by the various code generators and other backend passes, letting them know that this variable should be treated as a root. The advantage of doing it this way is that it avoids having to store a bit in the value itself (because bits are precious). The downside, however, is that you can’t easily propagate the root-ness attribute from one value to another - so if you load that value into a register / SSA value, the root-ness of the value is lost. In order to get around that problem, the frontend has to do some fairly convoluted logic - which could all be avoided if we had a better way to track roots.

In virtually all modern languages, garbage collection is driven by type - so if you have a pointer to an Object, you know it’s garbage collectable, and you know how to trace it. And because types travel along with their values, so that copying from one variable to another preserves the type information, it’s easy for the garbage collection information to go along for the ride. Furthermore, it doesn’t matter whether the value is in memory or is in a register, the type tells you all you need to know.

So it seems obvious that instead of marking specific values as garbage collectible, we should instead be marking types.

(Of course, there are occasionally instances where a given type may be conditionally garbage collectible (such as Java’s “permgen” objects which live in permanent memory). But that’s an edge case that can easily be handled by a number of techniques. Telling the compiler that a given type is garbage collectable in no way compels the runtime library to actually trace it.)

You are probably correct about consuming the debug information at run-time, I think where my idea might be useful is only having one set of details about where stuff ends up after optimization passes get run. In this respect the debug information should have enough information to tell you where a particular value ended up at each point in the program and I believe it could accurately tell you its type too which could help with the propagation issues you mention. From their, a compiler specific post processor could build the required run-time tables. If an approach like this where feasible, it would allow the GC code to share bug fixes and test coverage with debug information which I imagine gets a good amount of attention.

Jim Grosbach <grosbach@apple.com> writes: