Buildbot insights

Hi,

I maintain the daily LLVM snapshots for Fedora. To get a better understanding of the status of the builds, I wanted to visualize “things” in Grafana using Postgres as a data source. I had never done anything like that before, but I was able to feed the JSON data from our build system into our database as SQL statements.

Then a colleague asked if I could do a similar thing for the LLVM buildbots. That turned out to be straightforward and analogous to my previous work. All in all, I’ve collected and stored more than 2,500,000 rows of denormalized build information for the staging and production buildbots.

This is a visualization of the daily build failures of each buildbot builder over the last few days.

The SQL query for this is plain and simple:

SELECT
  $__timeGroup("build_started_at", '24h'),
  count(*) as "value",
  builder_name as "metric"
FROM buildbot_build_logs
WHERE
  $__timeFilter(build_started_at)
  AND build_complete = true
  AND buildbot_instance in (${buildbot_instance:sqlstring})
  AND build_results <> 0
GROUP BY "time", builder_name
ORDER BY 1

I would like to get some feedback on this to find out if there’s more insight that you guys are interested in.

To try all of this out on your own Linux machine, run the following (note: this is a work in progress):

git clone https://github.com/kwk/llvm-snapshot-monitoring.git
cd llvm-snapshot-monitoring
# Build the container images
make build
# Start the services (opens ports: 3000 for grafana, 5342 for postgres, and 8080 for adminer)
make start
# Hit ctrl-c once you see the logs flying through.
# Now pre-fill the database with buildbot data already prepared in the git repo.
make load-buildbot-logs 
# Open grafana and use admin/admin as credentials to view a dashboard.
xdg-open http://localhost:3000

I look forward to reading your suggestions or requests for insights.

Regards,
Konrad


This is really cool. It seems like we could use this to identify buildbots that fail often and need to be disabled. I’m trying to figure out the best visualization to help do that. I think maybe showing the failure count over the course of a week or a month rather than a day might help.

The raw failure count may skew the data towards faster buildbots, so maybe showing failure percentage (i.e. number of fails / number of runs) would be better too.
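To make the percentage idea concrete, here is a minimal sketch, not the dashboard's actual query. It reuses the table and column names from the query in the first post (`buildbot_build_logs`, `builder_name`, `build_started_at`, `build_complete`, `build_results`), but runs against an in-memory SQLite database with made-up rows so it can be tried without the real data. The builder names and timestamps are purely illustrative.

```python
import sqlite3

# In-memory stand-in for the Postgres table; column names follow the
# query in the first post, the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE buildbot_build_logs (
        builder_name     TEXT,
        build_started_at TEXT,
        build_complete   INTEGER,
        build_results    INTEGER
    )
""")

# A fast builder: 10 completed runs in one week, 2 of them failed.
rows = [
    ("fast-builder", f"2022-03-{7 + i // 2:02d} 0{i}:00:00", 1, 1 if i < 2 else 0)
    for i in range(10)
]
# A slow builder: 2 completed runs in the same week, 1 failed,
# plus one build that is still running and must be ignored.
rows += [
    ("slow-builder", "2022-03-08 12:00:00", 1, 1),
    ("slow-builder", "2022-03-10 12:00:00", 1, 0),
    ("slow-builder", "2022-03-11 12:00:00", 0, None),
]
conn.executemany("INSERT INTO buildbot_build_logs VALUES (?, ?, ?, ?)", rows)

# Failures divided by runs, per builder and per week.
# In Postgres one would likely write
#   COUNT(*) FILTER (WHERE build_results <> 0)
# and date_trunc('week', build_started_at) instead of the SQLite forms.
query = """
    SELECT
        builder_name,
        strftime('%Y-%W', build_started_at) AS week,
        COUNT(*) AS runs,
        SUM(build_results <> 0) AS failures,
        ROUND(100.0 * SUM(build_results <> 0) / COUNT(*), 1) AS failure_pct
    FROM buildbot_build_logs
    WHERE build_complete = 1
    GROUP BY builder_name, week
    ORDER BY failure_pct DESC
"""
failure_rates = conn.execute(query).fetchall()
for row in failure_rates:
    print(row)
```

With this sample data the fast builder has the higher raw failure count (2 versus 1), while the slow builder has the higher failure percentage (50% versus 20%), which is exactly the skew described above.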

I’d be interested in other kinds of metrics as well, for example:

  • how long after the commit the build was completed?
  • how many commits are bundled together in a single build?
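As a rough illustration of how these two metrics could be derived, here is a small sketch. It assumes each stored build record carries its completion time and the timestamps of the commits it picked up. The `Build` class and its field names are hypothetical and do not reflect the actual schema in the database.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Build:
    """Hypothetical build record; field names are assumptions."""
    builder_name: str
    completed_at: datetime
    commit_times: List[datetime]  # one timestamp per commit in this build


def commits_per_build(build: Build) -> int:
    """Number of commits bundled together in a single build."""
    return len(build.commit_times)


def commit_to_completion(build: Build) -> List[timedelta]:
    """Delay between each commit and the completion of the build."""
    return [build.completed_at - t for t in build.commit_times]


# Invented example: three commits land 20 minutes apart and are all
# picked up by one build that completes at 11:00.
example = Build(
    builder_name="example-builder",
    completed_at=datetime(2022, 3, 7, 11, 0),
    commit_times=[
        datetime(2022, 3, 7, 10, 0),
        datetime(2022, 3, 7, 10, 20),
        datetime(2022, 3, 7, 10, 40),
    ],
)

bundled = commits_per_build(example)
delays = commit_to_completion(example)
worst_delay = max(delays)  # how long the oldest commit waited for a result

print(bundled, [d.total_seconds() / 60 for d in delays], worst_delay)
```

Whether the worst, the median, or the per-commit delay is the most useful number to plot is an open design choice; the sketch only shows that all of them fall out of the same two pieces of data.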

@mehdi_amini thank you for this input. I’m actively working on this and will let you know when it’s done :)