LLVM Infrastructure

Folks,

I wanted to understand what the Foundation's plans are for infrastructure.

According to the announcement [1], the foundation is responsible for
overseeing all LLVM tools, websites and general infrastructure.

Although some work was done already, I feel most of it was to
remediate the worst immediate problems. I believe some investigation
is necessary, and should be shared in the right forum. It might not be this
list, but I'll take the risk of being wrong here, rather than start
another giant thread on the main lists.

Basically, I'm proposing a three step process for all our infrastructure:

1. Investigate the problems and solutions we have, potentially asking
on the main lists.
2. Propose a set of changes (publicly or privately), from cheapest to
most expensive, and make the cut on what we can afford.
3. Draw a plan, and when publicly visible changes are due, make sure
the proposal and schedule are fair, and do it.

For example, changing the web server from cloud to cloud makes no
difference, so it doesn't need to be a public process, but changing
our repository provider may affect a lot of internal processes, and
people will react badly if nothing is shared with them beforehand, so
we need some exposure beforehand.

The areas I think we must improve:

  A. Code repositories

The SVN server is reasonably unstable, having outages that affect all buildbots.

There was a migration earlier this year; I don't know if the repo was
involved, but we still had an outage a few weeks ago. I don't know
how the cost of hosting our own "stable enough" repository compares
with paying some other SVN hosting company to do that for us.

The benefit of using code hosting companies is that they have a
larger, distributed and more stable infrastructure specifically
tailored to code hosting, which is something that would be very
expensive for us to do. But we may get away with slightly more than
what we have today and not pay too much.

In my view, this requires a deeper investigation, with a list of prices
versus features across various providers, to make an informed proposal.
It'll also require community involvement, as this changes the core of
what we do.

Same goes for the Git repo, which is a lot better than SVN, but could
reduce costs if we moved to a FOSS friendly host.

  B. Buildbots

Our current build master is *very* slow. Buildbot itself is slow, I
know, but with the number of people we have looking at those bots, it
can take several seconds to a minute to get a page back. We may need
an individual server (cloud instance) for this, and scale as
necessary.

Also, the current master is on version 0.8.5, which besides being
ancient, has several drawbacks:
* It doesn't support newer SQLAlchemy, and it has trouble with newer
buildslaves that use it.
* It doesn't support submitting patches to a particular build, making
pre-commit buildbots impossible.
* The new versions have better support for Windows builders

Plus a huge list of changes around SVN/Git actions, authentication,
RSS feeds, stability, interaction with other systems (Gerrit, GitHub,
etc).

But a migration to a newer build master would probably require a large
scale Zorg refactoring.

I don't know the costs of the migration, nor do I know the costs of the
current solution. We need a clear picture and maybe propose a few ways
out of stagnation:
- Start a new master elsewhere, slowly move the bots towards it
- Refresh the master in compatibility mode, slowly move slaves, switch
- Deprecate buildbots and move everyone to Jenkins?

Whatever works for everyone.

  C. Bugzilla

Bugzilla is great, I'm one of the weirdos that actually likes it. But
our Bugzilla is also severely outdated, and the internal organisation
is disconnected from how the project evolved over the years.

I think we should update our Bugzilla purely in the interest of bug
fixes and security issues, as I don't know anything that we may want
from a newer version. Some other people might...

Also, if we're going to be writing scripts to automate creating bugs
and scanning through Bugzilla web services, it may be a bit more
loaded than it is now, and I don't think it's great as it is.

Since this is mostly a web service anyway, upgrading the version
should bear very little community impact, so it's more about the cost
of a new server (cloud instance) and the migration itself than
anything else.

  D. Phabricator

Phab is a good tool, but I believe we have a copy that has been
modified to suit our needs and thus diverged from whatever it is
upstream. I think it's ok to modify our tools, but very little effort
has been put into finishing the modifications, for example,
understanding inline email replies. This is not a trivial task, but
far too many people have had this problem in the main Phab, and I
remember reading that, due to our changes, it may not be easy to
upgrade.

I personally think that local modifications without an effort to
upstream them are against the goals of FOSS communities and we shouldn't be
promoting them ourselves. All in all, I think this deserves at least a
report on how good/bad it is, and how we should improve the bad things
(i.e. lack of updates).

  E. Others

I believe that the email and web pages infrastructure is now good
enough and fit for purpose, but it also needs to be part of the
overall plan (below). If I'm not mistaken, this is the server that was
just upgraded, so that shows some progress, which is highly welcomed.

  Z. Update Plan

In the end, I think we need to think about how much work to put into
updating the infrastructure, from moving to new servers, to updating
software, to changing into new solutions. Since this is something that
can disrupt the community at large (either updating or not), this
should also be shared with the larger community, so that we have a
clear picture of what to expect and what to request, if serious
problems arise.

Until now, we've been very relaxed with using tools, like when
Chandler introduced Phabricator. I think that's a perfectly valid way
of introducing new concepts and tools, but not so much in maintaining
them. But the more we depend on the tools we add, the more care we
need to put in keeping them available, fast, up-to-date and secure.

With all that in mind, my question is: what is the Foundation's plan
for the maintenance and improvement of LLVM's core infrastructure, and
how can the rest of the community help in defining and implementing
those plans?

cheers,
--renato

[1] The LLVM Foundation - The LLVM Project Blog

Just my 2 cents: the recent outage of the repositories was caused by
DDoS targeted towards viewvc. There were no outages at the time
(almost 2 months) when viewvc was disabled. Recently viewvc was
re-enabled with a bunch of tweaks, and we also have some whole /16
networks banned to stop the DDoS.

> Just my 2 cents: the recent outage of the repositories was caused by
> DDoS targeted towards viewvc. There were no outages at the time
> (almost 2 months) when viewvc was disabled.

There was certainly an outage somewhere between SVN and Buildbots a
few weeks ago, as I discussed it with Diana, and she only joined
Linaro last month.

Though, the buildbot logs are all gone now, and I can't show it to
you. Which also reminds me that we could have some cloud MRTG service.

> Recently viewvc was re-enabled with a bunch of tweaks, and we also have
> some whole /16 networks banned to stop the DDoS.

Do we have any idea why these DDoSs are targeting our SVN web
interface and nothing else? Didn't we have a similar problem with
Baidu's extremely impolite crawler?

Btw, viewvc seems pretty stable now, thanks!

cheers,
--renato

> Do we have any idea why these DDoSs are targeting our SVN web
> interface and nothing else? Didn't we have a similar problem with
> Baidu's extremely impolite crawler?

It's exactly Baidu, which follows / mirrors all the links, including
the "annotate" view for all past revisions. The latter is extremely I/O
bound and therefore slow.
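A minimal sketch of how a front end could spot those expensive "annotate" requests before they reach the repository, so that they can be throttled or refused for crawlers. The query parameter names (`annotate=REV` and `view=annotate`) and the example paths are assumptions about how the viewvc URLs look, not something confirmed in this thread, and `is_annotate_request` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlsplit


def is_annotate_request(url: str) -> bool:
    """Return True if the URL appears to ask for an annotate (blame) view.

    Assumption: annotate views are requested either with an
    `annotate=REV` query parameter or with `view=annotate`.
    """
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return "annotate" in query or "annotate" in query.get("view", [])


if __name__ == "__main__":
    # Hypothetical request paths, for illustration only.
    for path in (
        "/viewvc/llvm-project/foo.cpp?annotate=1234",
        "/viewvc/llvm-project/foo.cpp?view=annotate",
        "/viewvc/llvm-project/foo.cpp?view=log",
    ):
        print(path, is_annotate_request(path))
```

Requests flagged this way could be rate limited per client rather than blocked outright, so that humans can still use the annotate view.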

Any idea where the traffic is from? That sounds more likely to be a
confused robot than an intentional attack.

It's Baidu bot - 202.46.32.0/19
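For anyone wanting to check whether a given client address falls inside a banned range like the one above, here is a small sketch using Python's standard `ipaddress` module. Only the 202.46.32.0/19 network comes from this thread; the `BANNED_NETWORKS` list and the `is_banned` helper are hypothetical names for illustration, and this is not how the bans on the server are actually implemented.

```python
import ipaddress

# Networks to block. 202.46.32.0/19 is the range quoted in this thread;
# keeping the bans in a Python list is an assumption for illustration.
BANNED_NETWORKS = [ipaddress.ip_network("202.46.32.0/19")]


def is_banned(client_ip: str) -> bool:
    """Return True if client_ip falls inside any banned network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BANNED_NETWORKS)


if __name__ == "__main__":
    # A /19 starting at 202.46.32.0 covers 202.46.32.0 to 202.46.63.255.
    for ip in ("202.46.32.1", "202.46.63.255", "202.46.64.0"):
        print(ip, is_banned(ip))
```

The same check could be run over web server logs to estimate how much of the traffic comes from the banned range before deciding whether a wider or narrower ban is needed.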

This sounds interesting:

https://www.pingdom.com/

--renato

> I wanted to understand what the Foundation's plans are for infrastructure.

Hi Renato,

I’m confused by this mail. On the one hand, you claim to be asking about the foundation’s plans (which seems perfectly on topic for this list), but OTOH you are spending a lot of words talking about what you think the plan should be (something that would be more appropriate for a dev list). For example:

> A. Code repositories
>
> The SVN server is reasonably unstable, having outages that affect all buildbots.

I think that we as a community should strongly consider moving LLVM’s hosting to github. That would surely be a controversial topic, but could bring a lot of advantages to the project. This is something that can *only* be discussed on llvmdev though, discussing it here would exclude tons of people who really should be involved.

-Chris

> I’m confused by this mail. On the one hand, you claim to be asking about the foundation’s plans (which seems perfectly on topic for this list), but OTOH you are spending a lot of words talking about what you think the plan should be (something that would be more appropriate for a dev list). For example:

Hi Chris,

This is a meta email. I'm spending some time laying out what kinds of
issues I think are important and what kinds of discussions I think
should take place, here or there, for each one of them.

So far, every time I asked about the foundation "what are the plans
for infrastructure", I got vague answers like "it's ok" or "we're
migrating the servers".

I can see how my vague question could be interpreted as just a check
up, so this time I decided to be very thorough. Maybe I did too
much... :(

> > A. Code repositories
> >
> > The SVN server is reasonably unstable, having outages that affect all buildbots.
>
> I think that we as a community should strongly consider moving LLVM’s hosting to github. That would surely be a controversial topic, but could bring a lot of advantages to the project.

I'd love to see this discussion on the dev list! But I don't have the
specific details on the current costs / stability.

I don't think this is a matter of preference, but about costs and
integration with everyone else's infrastructure.

> This is something that can *only* be discussed on llvmdev though, discussing it here would exclude tons of people who really should be involved.

Indeed, as I said early in the email:

> For example, changing the web server from cloud to cloud makes no
> difference, so it doesn't need to be a public process, but changing
> our repository provider may affect a lot of internal processes, and
> people will react badly if nothing is shared with them beforehand, so
> we need some exposure beforehand.

All my comments on this thread were to help the foundation understand
what issues I'd like to see discussed, here or there, about
infrastructure.

We still have a lot of infrastructure problems and decisions to take,
and the way companies (like Linaro) progress with LLVM validation
depends a lot on where the rest of the community is going.

But instead of me sending one email per subject in this thread to the
dev list, and starting a huge number of concurrent threads, I thought
I'd ask the foundation to provide their view and numbers on such a
discussion.

After all, I don't know how the budget will be spent, nor do I know who
will be taking care of the issues, or how many servers we have and
where they are. That is something only the foundation can bring to the
table, and it's something that I wouldn't start any discussion
without.

If you're happy to provide me with all the details (current status,
yearly budget), I can start those discussions on the dev list. But I'm
equally happy for the foundation to start that on its own, and thus,
this email.

cheers,
--renato

For what it's worth, the 2016 LLVM Foundation budget does have $4750
for AWS costs http://llvm.org/foundation/documents/other/2016-LLVMFoundation-Outlook-Budget.pdf
though it's not totally clear how much of the current infrastructure
that covers. Thanks to the Foundation for publishing that document,
it's a great move for transparency.

Chris, I think Renato was trying to engage in the process referred to
on page 2 of the budget and plans document. I know when I ask a small,
overworked team about something I'm tempted to put in a lot of detail
to be clear I'm interested in contributing to the process rather than
just hassling for a status update. I can't see in to Renato's mind,
but this sort of thinking may well motivate sections A-Z of Renato's
email :)

It would be helpful to know how Foundation Board members see the split
between llvm-dev vs the llvm-foundation list working. My assumption
would be that dev-facing changes (e.g. a move to github or a change in
development process) make most sense on llvm-dev, while discussions of
one cloud vs another or whether a server is paid for or hosting is
donated by a company (i.e. how a service is provided) may make more
sense on llvm-foundation. Does that match your understanding?

Best,

Alex

> For what it's worth, the 2016 LLVM Foundation budget does have $4750
> for AWS costs http://llvm.org/foundation/documents/other/2016-LLVMFoundation-Outlook-Budget.pdf
> though it's not totally clear how much of the current infrastructure
> that covers. Thanks to the Foundation for publishing that document,
> it's a great move for transparency.

Indeed, it is. But not all can be covered in such a short document.
For instance, it states:

"Last year, we moved the LLVM mailing lists to a new email server. We
will finish up the infrastructure changes by moving all LLVM services.
A full review will be done to address any outstanding infrastructure
concerns from the community."

I'm just trying to have the "full review" somewhere public.

> Chris, I think Renato was trying to engage in the process referred to
> on page 2 of the budget and plans document. I know when I ask a small,
> overworked team about something I'm tempted to put in a lot of detail
> to be clear I'm interested in contributing to the process rather than
> just hassling for a status update. I can't see in to Renato's mind,
> but this sort of thinking may well motivate sections A-Z of Renato's
> email :)

Spot-on. :)

> It would be helpful to know how Foundation Board members see the split
> between llvm-dev vs the llvm-foundation list working. My assumption
> would be that dev-facing changes (e.g. a move to github or a change in
> development process) make most sense on llvm-dev, while discussions of
> one cloud vs another or whether a server is paid for or hosting is
> donated by a company (i.e. how a service is provided) may make more
> sense on llvm-foundation. Does that match your understanding?

+1.

cheers,
--renato

Sorry for the top post. I just didn’t want this email to seem like it’s being ignored.

I have a document I’m working on (well updating it) that describes the upgrade plan. We absolutely have to get off of llvm.org in the very short term. There are many software issues that make it harder to gather information about the performance of the machine. We also have to get off for security reasons. So we must move the existing infrastructure to the AWS machine that we have reserved for this purpose (which is allocated in the budget). This machine should be more than adequate for our needs, but more analysis should be done once we have moved.

Now, you bring up a lot of good topics for discussion. My plan is not to change anything except the hardware/machine location. I think everything you bring up is something that can/should be discussed going forward and get community input on.

This mailing list is exactly the place to discuss what the plan is. It’s not the place to have fully fleshed-out discussions about whether we should use Bugzilla or Jenkins, etc. I don’t think that was necessarily your intention but it came across that way.

Thanks,
Tanya

> Now, you bring up a lot of good topics for discussion. My plan is not to change anything except the hardware/machine location. I think everything you bring up is something that can/should be discussed going forward and get community input on.

Ok, so we'll migrate what we have, to give us time to have a proper
discussion without worrying about stability, and without potentially
taking rushed decisions.

It may cost a bit more overall, but we'll probably reach a better
place than if we rush things. Makes sense.

> This mailing list is exactly the place to discuss what the plan is. It’s not the place to have fully fleshed-out discussions about whether we should use Bugzilla or Jenkins, etc.

Right, so should the document you're writing be exposed earlier, so we
can have an idea on time frames and short-term plans? It'd help to
know when we can start the discussions about GitHub, Bugzilla and
Jenkins.

And if that's something you want to start yourself, after you're done
with the current plans, it'd be another reason to get the docs out
earlier, even if unfinished, so we know when to expect the discussions
to happen, and plan around it.

cheers,
--renato

Yes, that's right. There are two reasons for this:

1) we all want all the dev-facing policies and technical issues to be resolved by the community, where we already have a structure in place to review such things.
2) the foundation has one employee (Tanya), who is part time, so it has to pick and choose carefully what it tackles.

Most things can be done by the community, the foundation should only step in when there is no ability (e.g. legal issues) or apparent will (e.g. the website overhaul) to do it.

-Chris