llvmlab (phased buildmaster) is in production mode!

Hello LLVM Dev and Clang Dev!

David Dean and I just finished bringing up a new build master on lab.llvm.org, llvmlab. It is located at http://lab.llvm.org:8013 and reports in #llvm under the username llvmlab.

llvmlab is different from the current buildbot-based continuous integration systems LLVM uses; llvmlab is a phased-builder-based system. The high-level details of the phased builder system are as follows:

  1. Builders are grouped together into phases as sub-builders. Each phase is itself a builder that triggers its grouped ``sub-builders'' and, assuming all of the sub-builders succeed, triggers its successor phase if one exists. This creates a gating effect.

  2. All phases are gated together in a linear fashion, so that if an earlier phase does not succeed, no later phases run. The key idea: if we know there is a fundamental issue with the compiler, why try to build 20 compilers when one quick build is all that is needed to establish that fact? Likewise, if we cannot build a compiler successfully, why try to do LNT performance runs? This gets rid of pointless work, stops excessive emails from being sent out for one bad commit, and reduces cycle time, especially when certain builds take significantly longer than others to fail.

  3. Later phases do broader, longer-lasting testing than earlier phases. Thus the 4 phases we currently have are:

a. Phase 1 (sanity): Phase 1 is a quick non-bootstrapped, non-LTO compiler build, to check the ``basic sanity'' of the code base and build process. This generally takes 15-20 minutes to complete.
b. Phase 2 (living on): Phase 2 builds bootstrapped compilers for the different configurations that can be used for ``living on'', i.e., good enough for common compilation tasks. This is meant to cycle in up to an hour.
c. Phase 3 (tree health): Phase 3 runs performance tests, i.e., LNT (which are not live yet) as well as other compiler builds which take a longer amount of time (a full clang build with LTO enabled for instance).
d. Phase 4 (validation): Phase 4 runs longer-running validation tests. Currently we have nothing in phase 4, but I am sure that will change = p.

  4. Builders in later phases rely on outputs from earlier phases. If we are doing performance runs, why should we compile a new compiler for each performance run? That is duplicated work! Instead, the phased build system stores ``artifacts'', i.e., built compilers, from earlier phases and uses them as the compilers for builds in later phases. Thus we could have 30 different Phase 3 LNT performance runs with different configurations, all using the same compiler artifacts built in Phase 2. This significantly deduplicates work, yielding a decreased cycle time.
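The gating and artifact-reuse behavior described above can be sketched in a few lines of Python. This is only an illustration of the scheme (the function and artifact names are hypothetical), not the actual zorg/buildbot configuration:

```python
# Sketch of phased gating: phases run in order, each phase runs its
# sub-builders, the first failure stops all later phases, and artifacts
# (e.g. built compilers) from earlier phases are passed forward so later
# phases never rebuild them. Names here are illustrative only.

def run_phases(phases):
    """phases is a list of (name, [sub_builders]); a sub-builder is a
    callable taking the artifact dict and returning (ok, new_artifacts)."""
    artifacts = {}
    for name, sub_builders in phases:
        for build in sub_builders:
            ok, new_artifacts = build(artifacts)
            if not ok:
                return name, artifacts  # gate: later phases never run
            if new_artifacts:
                artifacts.update(new_artifacts)
    return None, artifacts


# Example: phase 1 produces a stage-1 clang; phase 2 fails, so phase 3
# (which would have reused that compiler) is never triggered.
phase1 = [lambda a: (True, {"clang-stage1": "/path/to/clang"})]
phase2 = [lambda a: (False, None)]
phase3 = [lambda a: (True, None)]

failed_at, artifacts = run_phases([("sanity", phase1),
                                   ("living on", phase2),
                                   ("tree health", phase3)])
print(failed_at)                    # "living on"
print("clang-stage1" in artifacts)  # True
```

Note how the stage-1 compiler built in the first phase survives in the artifact dict even though a later phase failed; a real implementation would store such artifacts on a shared server rather than in memory.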

As time moves on we will be moving more and more builders to llvmlab including LNT performance builders and a builder which runs the libcxx test suite.

Michael

Most of the selections, like "console" for example, do not work when I click on them.

Cool! This is great news.

I feel like this information should be in our documentation somewhere. Could you start a new file ContinuousIntegration.rst and use this content to seed it? This new page would also be a good place to mention some LLVM idiosyncrasies, like smooshlab being Apple-internal but still reporting via IRC; these things have not had a good home yet. AFAIK our continuous integration infrastructure is currently mostly community wisdom, and besides a small mention of some of the reporting bots in index.rst there is no documentation describing it.

– Sean Silva

While most of this sounds great, this one really doesn't.

The sanity tests should be able to run *much* faster than 15-20 minutes.
Can we prioritize getting an incremental rebuild bot as the sanity phase on
reasonably fast hardware? I think it's important to get through the sanity
phase in 1-5 minutes so that phase 2 doesn't get so many commits piled
up on it when someone checks in code with both a miscompile and a tiny
build break (for example).

-Chandler

Smooshlab is going away so the LLVM idiosyncrasy that you mention will be going away soon = p. The whole idea behind bringing up this infrastructure is so that everything that is Apple-internal but COULD be public is public and is on the phased builders. Anything we can't bring out will (of course) stay internal. Also I agree about the documentation.

Michael

This is a known issue. I have actually screwed with this before; it is just a matter of getting it into zorg in the right manner.

Agreed. I think getting the sanity time down further is possible and very important. IIRC gribozavr has a ninja cmake bot that does clean builds in < 10 minutes. I think that that is a first step (bringing me to your question).

Incremental builds are quicker but less robust than clean builds. The nice thing about always doing clean builds is that it simplifies things by ruling out any build failures due to a dirty build directory (but maybe I am paranoid). If we want to do incremental builds (which, note, I am not averse to btw), we should at least set up some way to clean the build directory after a failure caused by a dirty build directory that does not involve sshing into the machine. Perhaps if a builder fails a number of times in a row, clean it?
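The "clean after several consecutive failures" policy suggested above could look something like the following sketch. The helper names and the threshold of 3 are assumptions for illustration; a real buildbot setup would wire this into its build steps differently:

```python
# Sketch of a clean-after-N-failures policy for incremental builders.
# MAX_DIRTY_FAILURES and the helper names are hypothetical.

MAX_DIRTY_FAILURES = 3

def should_clean(consecutive_failures, max_failures=MAX_DIRTY_FAILURES):
    """Start the next build from a clean directory once the failure
    streak suggests a dirty build directory may be the cause."""
    return consecutive_failures >= max_failures

def next_build(consecutive_failures, run_build, clean_dir):
    """Run one build; clean first if the streak is long enough.
    Returns the updated consecutive-failure count."""
    if should_clean(consecutive_failures):
        clean_dir()              # wipe the build directory
        consecutive_failures = 0
    ok = run_build()             # incremental build
    return 0 if ok else consecutive_failures + 1
```

The trade-off is latency: with a threshold of 3, a failure genuinely caused by a dirty directory produces a few red builds before the automatic clean fixes it, whereas cleaning after every failure (as proposed later in this thread) keeps the bot quieter at the cost of slower failure turnaround.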

Also, hardware contributions are always welcome = p.

Michael

3. Later phases do broader, longer lasting testing than earlier phases.
Thus the 4 phases we currently have are:

a. Phase 1 (sanity): Phase 1 is a quick non-bootstrapped,
non-lto compiler build, to check the ``basic sanity'' of the code base and
build process. This generally takes 15-20 minutes to complete.

While most of this sounds great, this one really doesn't.

The sanity tests should be able to run *much* faster than 15-20 minutes.

Agreed. I think getting the sanity time down further is possible and very
important. IIRC gribozavr has a ninja cmake bot that does clean builds in <
10 minutes. I think that that is a first step (bringing me to your
question).

Can we prioritize getting an incremental rebuild bot as the sanity phase
on reasonably fast hardware?

Incremental builds are quicker but less robust than clean builds. The nice
thing about always doing clean builds is that it simplifies things by
ruling out any build failures due to a dirty build directory (but maybe I
am paranoid). If we want to do incremental builds (which, note, I am not
averse to btw), we should at least set up some way to clean the build
directory after a failure caused by a dirty build directory that does not
involve sshing into the machine. Perhaps if a builder fails a number of
times in a row, clean it?

It's common to have buttons on the builder page that force the next build
to be clean. Usually they have some kind of auth mechanism. Maybe you can
get by with something simple.

Also, hardware contributions are always welcome = p.

Can you do add clean-after-error behavior? I.e. default to incremental
build, but do a clean rebuild after a failing build?

Joerg

(I said the same thing in my last email = p, except a bit more conservative: do a clean build after 3 failures or something along those lines).

Michael

(I said the same thing in my last email = p, except a bit more conservative: do a clean build after 3 failures or something along those lines).

Right, but the difference is significant:

Clean-on-failure would mean we would never see a buildbot failure from a bad incremental rebuild, rather than getting a couple of builds' worth of noise.

It has the disadvantage that we’d increase latency of any failure result, unfortunately.

Personally I think that any incremental build failure should be a bug that we fix, which should let the bot go green without ever having to force-clean it. Is there a reason this isn't possible or desirable?