Perf is dead again... :(

The LNT Perf reporting website has been down for a few days now. Should I
disable my perf bot to avoid noise while that gets fixed?

cheers,
--renato

For this very reason my perf bots do not send out emails.

Galina reported these issues a couple of days ago, but unfortunately Daniel, who has access, has not yet responded.

My impression is that LNT has trouble with the large amount of data we
have now managed to collect. So to debug this, we would need someone who can take a closer look at LNT on llvm.org.

Daniel, could you do this?

If Daniel is not available, maybe Anton knows whether some other person could get access to llvm.org to look into these issues. I suggested Chris, but if you have time, Renato, maybe you could give it a shot as well?

Cheers,
Tobias

I could, just need access to the machine.

cheers,
--renato

I can look into it if I get access.

Anton, any chance we can get Renato or Chris access to llvm.org to fix this?

Tobias

I realized that we never finalized the switchover to Postgres. Instead, the default database was still SQLite (which has grown huge) and it was shadow importing into the PostgreSQL database.

I have now switched it over to only run against Postgres, which I suspect will eliminate the failures we were seeing. Please let me know if you notice any problems. It seems like the switch already gives a big improvement to the response time of the perf homepage.

I’m not sure what to do w.r.t. access to the machine, I think the best solution is to try and move LNT off of llvm.org to a machine we don’t need to be as careful with.

FYI, I attached the LNT log in case anyone wants to look at the errors and see if we can fix the SQLite implementation to not fail on them.

 - Daniel

lnt.log (33.3 KB)

I realized that we never finalized the switchover to Postgres, instead the
default database was still SQLite (which has grown huge) and it was shadow
importing into the PostgreSQL database.

I have now switched it over to only run against Postgres, which I suspect
will eliminate the failures we were seeing. Please let me know if you
notice any problems. It seems like the switch already gives a big
improvement to the response time of the perf homepage.

Great. Thanks for looking into this.

I'm not sure what to do w.r.t. access to the machine, I think the best
solution is to try and move LNT off of llvm.org to a machine we don't need
to be as careful with.

The only issue that regularly came up with LNT was stability. Maybe this is now solved and LNT will no longer need so much maintenance.

FYI, I attached the LNT log in case anyone wants to look at the errors and
see if we can fix the SQLite implementation to not fail on them.

The SQLite db seems to be locked (possibly due to a larger import). By moving to Postgres, we may have "fixed" this issue.

Cheers,
Tobias

Bad news, I just got an error 500 again. Could you possibly share a copy of the logs, Daniel?

Tobias

The error on SQLite was:

database is locked u'SELECT "StatusKind"."ID" AS "StatusKind_ID",
"StatusKind"."Name" AS "StatusKind_Name" \nFROM "StatusKind" \nWHERE
"StatusKind"."ID" = ?\n LIMIT ? OFFSET ?'

Which is understandable, since SQLite is not meant for production
environments. On a shared machine, swap or high load could force
timeouts, and misbehaving applications might not have released their
locks.
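
To make the failure mode concrete, here is a minimal sketch that reproduces the same "database is locked" message with Python's standard `sqlite3` module. It is only an illustration, not LNT's code: the file name, the cut-down `StatusKind` table, and the short timeout are assumptions, and for simplicity the second connection is another writer rather than the `SELECT` from the log.

```python
import os
import sqlite3
import tempfile


def demo_locked_error(timeout_seconds=0.1):
    """Return the error a second writer sees while another holds the write lock."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "lnt_demo.db")  # hypothetical file name

        # isolation_level=None means autocommit mode, so the BEGIN and
        # ROLLBACK statements below are issued explicitly.
        writer = sqlite3.connect(path, isolation_level=None)
        writer.execute(
            "CREATE TABLE StatusKind (ID INTEGER PRIMARY KEY, Name TEXT)"
        )
        # Take the write lock and keep it, like a long-running import would.
        writer.execute("BEGIN IMMEDIATE")
        writer.execute("INSERT INTO StatusKind (Name) VALUES ('PASS')")

        # A second connection waits at most timeout_seconds for the lock.
        other = sqlite3.connect(
            path, timeout=timeout_seconds, isolation_level=None
        )
        message = None
        try:
            other.execute("BEGIN IMMEDIATE")
        except sqlite3.OperationalError as exc:
            message = str(exc)
        finally:
            other.close()
            writer.execute("ROLLBACK")
            writer.close()
        return message


if __name__ == "__main__":
    print(demo_locked_error())
```

A longer `timeout` only makes the second connection wait longer before giving up, so it would hide short imports but not long ones.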

Is this happening again on PostgreSQL? Is it the same error? Are you
sure the web application is indeed connecting to PostgreSQL?

I agree this should be on a separate machine. Do we use some sort of
cloud server for that? Can we get another instance under the LLVM
Foundation's umbrella? I think it's time we focus on the quality of
our services as a whole, and centralise the administration by giving
some people access to all boards, so that we can cover each other in
cases like these.

Tanya, do we have plans for something like that?

cheers,
--renato

I'm not sure what to do w.r.t. access to the machine, I think the best
solution is to try and move LNT off of llvm.org to a machine we don't need
to be as careful with.

Just a thought. Would it make sense to put LNT server into a Docker
[1] container so it's portable and then we can move it over to any
(Linux based) host we like easily and reliably?

I've been playing around with Docker lately (I really like it) so I'd
be happy to hack something together for you to try out. I don't have
much experience with LNT though and I don't know how to implement
database fault tolerance with it. I presume we would just have a separate
container for the database but I'm not sure how the replication would
be done.

A possible home for these Docker containers could be Google Compute
Engine [2]. Google do make use of LLVM so I wonder if they would be
willing to provide free cloud hosting services for the LLVM project.
There are of course many other cloud platform providers (e.g. Amazon
EC2, Digital Ocean, Tutum...) but I'm not sure if they would be
willing to provide free compute resources (and support) for us.

I would hope (I don't know for sure) that these services would allow
multiple users to manage the containers so that we could have multiple
people able to manage them rather than having the single point of
failure like we do now.

But maybe this is too ambitious...

Thanks,
Dan.

[1] https://www.docker.com/
[2] https://cloud.google.com/compute/docs/containers

I think this is a great idea. We should at least try, while the other
is in "production", and if it proves stable, we switch.

cheers,
--renato

Proving 'stable' is not that easy. In fact LNT works flawlessly for me at home.

I am afraid the issue only arises with the huge database we have after several years of imports, combined with parallel accesses from both LNT submissions and users browsing LNT.

Having said this, anything that gives us an LNT instance that can be maintained/debugged easily, and that we can point our LNT testers to, will simplify debugging of these production LNT issues.

Cheers,
Tobias

Proving 'stable' is not that easy. In fact LNT works flawlessly for me at
home.

It's the volume, indeed.

Having said this, anything that gives us an LNT instance that can be
maintained/debugged easily, and that we can point our LNT testers to, will
simplify debugging of these production LNT issues.

Yes, maintainability and stability are more important than speed.
Though, it'd be good to have database and web separate, so we can
scale them differently, and as needed. Docker or cloud instances look
like the right way to go, to me.

cheers,
--renato

Yes, maintainability and stability are more important than speed.
Though, it'd be good to have database and web separate, so we can
scale them differently, and as needed. Docker or cloud instances look
like the right way to go, to me.

I'm currently looking into this. It seems to be possible to have
containers linked together [1] so my plan is to have a "LNT server"
container (just the python stuff) and the "database" container
separate and then link them together when launching the containers.
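
As a rough sketch of how the "LNT server" container could find the linked "database" container: as far as I understand the linking docs [1], a link injects environment variables named after the link alias and the exposed port into the receiving container. The alias `db`, the user `lnt`, and the database name `lnt` below are placeholders, and whether LNT would take its database URL this way is an assumption on my part.

```python
import os


def database_url_from_link(env, alias="db", port=5432, user="lnt", database="lnt"):
    """Build a PostgreSQL URL from the variables a container link provides.

    Assumes the link-style naming <ALIAS>_PORT_<port>_TCP_ADDR and
    <ALIAS>_PORT_<port>_TCP_PORT. The alias, user and database names are
    hypothetical placeholders.
    """
    prefix = f"{alias.upper()}_PORT_{port}_TCP"
    host = env[f"{prefix}_ADDR"]
    tcp_port = env[f"{prefix}_PORT"]
    return f"postgresql://{user}@{host}:{tcp_port}/{database}"


if __name__ == "__main__":
    # Inside a linked container this would be os.environ; a made-up
    # address is used as a fallback so the sketch also runs elsewhere.
    example_env = {
        "DB_PORT_5432_TCP_ADDR": "172.17.0.5",
        "DB_PORT_5432_TCP_PORT": "5432",
    }
    env = os.environ if "DB_PORT_5432_TCP_ADDR" in os.environ else example_env
    print(database_url_from_link(env))
```

Keeping the lookup in one small function means the web container never hard-codes the database address, which is the point of keeping the two containers separate.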

Is there a way for me to obtain a copy of the current postgres
database so I can test how well this works?

[1] https://docs.docker.com/userguide/dockerlinks/

This was part of the motivation for my cloud LNT instance.

The Heroku cloud I am running on uses a Postgres cluster, not a single machine. It is well tuned for large databases.

If it is maintained and well tuned, maybe we can just point llvm.org/perf there and submit to this instance?

Cheers,
Tobias

The cloud instance would need funding to run at the scale the llvm.org server runs at. It can only accommodate about 5k submissions worth of data at its current size. I assume that would be exhausted pretty fast with a bunch of machines submitting several times per day.
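
A back-of-the-envelope sketch of that estimate: only the 5k figure comes from the instance's current size, while the machine count and submission rate below are made-up numbers for illustration.

```python
def days_until_full(capacity_submissions, machines, submissions_per_machine_per_day):
    """Estimate how many days of submissions fit into a fixed capacity."""
    daily = machines * submissions_per_machine_per_day
    if daily <= 0:
        raise ValueError("need a positive number of submissions per day")
    return capacity_submissions / daily


if __name__ == "__main__":
    # 5000 is the capacity mentioned above; 20 machines submitting
    # 4 times per day each is a hypothetical load.
    print(days_until_full(5000, 20, 4))  # 62.5
```

With those assumed numbers the instance would fill up in roughly two months.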

Hi Chris,

The cloud instance would need funding to run at the scale the llvm.org
server runs at. It can only accommodate about 5k submissions worth of data
at its current size. I assume that would be exhausted pretty fast with a
bunch of machines submitting several times per day.

I haven't used Heroku but if it's working and they provide access to a
postgres cluster (is this [1] what you're using?) then that sounds
very handy.

Given that the problem is database reliability, Docker isn't really
going to solve this problem on its own. I will probably still make a
container but it's going to be for a separate project of mine so that
will be orthogonal to this thread.

So I guess two questions come to mind:

* Is Heroku the right choice (we don't want to be locked into
Heroku)? Given the progress you've made it sounds suitable.
* Who should fund whatever host is used to host the LNT
infrastructure? Given the commercial interest in LLVM I hope that this
will be straightforward.

Presumably this is what the LLVM foundation [2] is for? Should we
start a new thread CC'ing the foundation directors on the subject of
finding a host for the LNT infrastructure?

[1] https://www.heroku.com/postgres
[2] http://blog.llvm.org/2014/04/the-llvm-foundation.html

Thanks,
Dan.

FWIW, if you can use google's cloud offerings, I can likely fund it. This
isn't about only being willing to fund our platform vs. some other
platform, or which platform is better. It's just that we're already well
set up to handle this when it's on top of GCE and related services -- this
is how we are currently running the phabricator instance and we're working
on setting up bots here as well.

But I have no idea if GCE and related bits are even useful for what LNT
needs. Just let me know if they are, and I'll try to get things rolling.

I don't see any conflicts, there. If we can use it, I can't see why we
shouldn't. The admin cost will be a lot less, anyway.

If it ends up not being well suited, at least we tried. It'd also give you
guys more info on what to do in those cases to make them better, since
that's a standard web application.

cheers,
--renato