Sequential ID Git hook

Now that we seem to be converging on an acceptable Git model, there
was only one remaining doubt: how the trigger to update a
sequential ID will work. I've been in contact with GitHub folks, and
this is in line with their suggestions...

Given the nature of our project's repository structure, triggers in
each repository can't just update their own sequential ID (like
Gerrit), because we want a single ordered sequence for the whole
project, not just for each component. But it's clear to me that we
have to do something similar to Gerrit, as this has been proven to
work on a larger infrastructure.

Adding an incremental "Change-ID" to the commit message should
suffice, in the same way we have for SVN revisions now, if we can
guarantee that:

1. The ID will be unique across *all* projects
2. Earlier pushes will get lower IDs than later ones

Other things are not important:

3. We don't need the ID space to be complete (i.e., we can jump from
123 to 125 if some error happens)
4. We don't need an ID for every "commit", but for every push. A
multi-commit push is a single feature, and treating it as one will
help buildbots build the whole set as one change. Reverts should also
be done in one go.

What's left for the near future:

5. We don't yet handle multi-repository patch-sets. A way to
implement this is via manual Change-ID manipulation (explained below).
Not hard, but not a priority.

  Design decisions

This could be a pre/post-commit trigger on each repository that
receives an ID from somewhere (TBD) and updates the commit message.
When the umbrella project synchronises, it'll already have the
sequential number in. In this case, the umbrella project is not
necessary for anything other than bisect, buildbots and releases.

I personally believe that having the trigger in the umbrella project
will be harder to implement and more error prone.

The server has to have some kind of locking mechanism. Web services
normally spawn dozens of "listeners", so multiple concurrent pushes
will all get a response; the serialisation has to happen further down
the stack, behind the web server.

Therefore, the lock for the unique increment ID has to be elsewhere.
The easiest thing I can think of is a SQL database with auto-increment
ID. Example:

Initially:

  create table LLVM_ID (
    id int not null primary key auto_increment,
    repository varchar(255) not null,
    hash varchar(64) not null
  );

  alter table LLVM_ID auto_increment = 300000;

On every request:

  insert into LLVM_ID (repository, hash) values ('$repo_name', '$hash');
  select last_insert_id();  -- returned to the caller

and then print the "last insert id" back to the user in the body of
the page, so the hook can update the Change-ID in the commit message.
The repo/hash info is more for logging, debugging and conflict
resolution purposes.
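To make the allocation step concrete, here's a sketch of the service
core, using Python's sqlite3 as a stand-in for the SQL database above
(the proposal assumes MySQL-style auto_increment/last_insert_id; the
seeding of sqlite_sequence below mimics "auto_increment = 300000"):

```python
import sqlite3

def make_db():
    # In-memory stand-in for the proposal's SQL database.
    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE LLVM_ID (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            repository TEXT NOT NULL,
            hash TEXT NOT NULL
        )""")
    # Seed the counter so IDs continue from the last SVN revision,
    # mimicking "alter table LLVM_ID auto_increment = 300000".
    db.execute("INSERT INTO sqlite_sequence (name, seq) "
               "VALUES ('LLVM_ID', 299999)")
    return db

def allocate_id(db, repo, commit_hash):
    # One INSERT per push; the auto-increment id is the Change-ID
    # returned to the hook. repo/hash are kept for logging/debugging.
    cur = db.execute(
        "INSERT INTO LLVM_ID (repository, hash) VALUES (?, ?)",
        (repo, commit_hash))
    db.commit()
    return cur.lastrowid

db = make_db()
print(allocate_id(db, "llvm", "deadbeef"))   # 300000
print(allocate_id(db, "clang", "cafebabe"))  # 300001
```

The web service would just wrap allocate_id() and print the result in
the response body; the database's own locking serialises concurrent
inserts, which is the whole point of putting the counter there.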

We also must limit the web server to only accept connections from
GitHub's servers, to avoid abuse. Other repos in GitHub could still
abuse, and we can go further if it becomes a problem, but given point
(3) above, we may fix that only if it does happen.

This solution doesn't scale to multiple servers, nor does it help with
BCP (business continuity) planning. Given the size of our needs, that's
not relevant.

  Problems

If the server goes down, given point (3), we may not be able to
reproduce locally the same sequence the server would have produced,
meaning SVN-based bisects and releases would not be possible during
down time. But Git bisect and everything else would still work.

Furthermore, even if a local script can't reproduce exactly what the
server would do, it can still make the history linear for bisect
purposes, fixing the local problem. I can't see a situation in which
we need the sequence for any other purpose.

Upstream and downstream releases can easily wait a day or two in the
unlucky situation that the server goes down at the exact time a
release is being branched.

Migrations and backups also work well, and if we use some cloud
server, we can easily take snapshots every week or so, migrate images
across the world, etc. We don't need duplication, read-only scaling,
multi-master, etc., since only the web service will be writing/reading
from it.

All in all, a "robust enough" solution for our needs.

  Bundle commits

Just FYI, here's a proposal that appeared in the "commit message
format" round of emails a few months ago, and that can work well for
bundling commits together, but will need more complicated SQL
handling.

The current proposal is to have one ID per push. This is easy by using
auto_increment. But if we want to have one ID per multiple pushes, on
different repositories, we'll need to have the same ID on two or more
"repo/hash" pairs.

On the commit level, the developer adds a temporary hash, possibly
generated by a local script in 'utils'. Example:

  Commit-ID: 68bd83f69b0609942a0c7dc409fd3428

This ID will have to be the same on both (say) LLVM and Clang commits.

The script will then take that hash and generate an ID; if it
receives two or more pushes with the same hash, it'll return the *same*
ID, say 123456, in which case the Git hooks on all projects will
update the commit message by replacing the original Commit-ID with:

  Commit-ID: 123456

To avoid hash clashes in the future, the server script can refuse
existing hashes that are more than a few hours old and return an
error, in which case the developer generates a new hash, updates all
commit messages and re-pushes.

If there is no Commit-ID, or if it's empty, we just insert a new row
with an empty key, get the auto-increment ID and return. Meaning,
empty Commit-IDs won't "match" any other.
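The hook-side rewrite described above is mechanical; a sketch of it
(the real hook would apply this via commit --amend or filter-branch
before the message is finalised):

```python
import re

def rewrite_change_id(message, new_id):
    # Replace a temporary Commit-ID hash with the ID the server
    # returned. Messages without a Commit-ID line are left alone;
    # the server treats those as non-matching and just allocates
    # a fresh ID.
    pattern = re.compile(r"^Commit-ID:.*$", re.MULTILINE)
    if pattern.search(message):
        return pattern.sub("Commit-ID: %s" % new_id, message)
    return message

msg = "Fix a bug\n\nCommit-ID: 68bd83f69b0609942a0c7dc409fd3428\n"
print(rewrite_change_id(msg, 123456))
# Fix a bug
#
# Commit-ID: 123456
```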

To solve this on the server side, a few ways are possible:

A. We stop using primary key auto_increment, handle the increment in
the script and use SQL transactions.

This would be feasible, but more complex and error prone. I suggest we
go down that route only if keeping the repo/hash information is really
important.

B. We ditch keeping record of repo/hash and just re-use the ID, but
record the original string, so we can match later.

This keeps it simple and will work for our purposes, but we'll lose
the ability to debug problems if they happen in the future.

C. We improve the SQL design to have two tables:

LLVM_ID:
   * ID: int PK auto
   * Key: varchar null

LLVM_PUSH:
   * LLVM_ID: int FK (LLVM_ID:ID)
   * Repo: varchar not null
   * Push: varchar not null

Every new push updates both tables and returns the ID. Pushes with the
same Key re-use the ID, update only LLVM_PUSH, and return the same ID.

This is slightly more complicated, and we'll need to code scripts to
gather the information (for logging and debugging), but it gives us
both benefits (debug + auto_increment) in one package. As a start, I'd
recommend we take this route even before the script supports it; it
may be simple enough that we add support for it right from the
beginning.
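A sketch of option C's allocation logic, again using sqlite3 as a
stand-in, with the table and column names from the design above:

```python
import sqlite3

def make_db():
    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE LLVM_ID (
                      id INTEGER PRIMARY KEY AUTOINCREMENT,
                      key TEXT NULL)""")
    db.execute("""CREATE TABLE LLVM_PUSH (
                      llvm_id INTEGER NOT NULL REFERENCES LLVM_ID(id),
                      repo TEXT NOT NULL,
                      push TEXT NOT NULL)""")
    return db

def allocate(db, key, repo, push_hash):
    # Re-use the ID for pushes carrying the same (non-empty) Commit-ID
    # key; empty keys never match, so each such push gets a fresh ID.
    row = None
    if key:
        row = db.execute("SELECT id FROM LLVM_ID WHERE key = ?",
                         (key,)).fetchone()
    if row:
        llvm_id = row[0]
    else:
        llvm_id = db.execute("INSERT INTO LLVM_ID (key) VALUES (?)",
                             (key or None,)).lastrowid
    # LLVM_PUSH always gets a row, so repo/hash are kept for debugging.
    db.execute("INSERT INTO LLVM_PUSH (llvm_id, repo, push) "
               "VALUES (?, ?, ?)", (llvm_id, repo, push_hash))
    db.commit()
    return llvm_id

db = make_db()
a = allocate(db, "68bd83f69b0609942a0c7dc409fd3428", "llvm", "hash-llvm")
b = allocate(db, "68bd83f69b0609942a0c7dc409fd3428", "clang", "hash-clang")
print(a == b)  # True: same Commit-ID key, same sequential ID
```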

I vote for option C.

  Deployment

I recommend we code this, set up a server, and let it run for a while
on our current mirrors *before* we do the move. A simple plan is to:

* Develop the server and hooks, and set them running without updating
the commit message.
* Follow the logs, and make sure everything is sane.
* Change the hook to start updating the commit message.
* Follow the commit messages, and move some buildbots to track GitHub
(SVN still master).
* When all bots are live tracking GitHub and all developers have
moved, we flip.

Sounds good?

cheers,
--renato

Given the nature of our project's repository structure, triggers in
each repository can't just update their own sequential ID (like
Gerrit), because we want a single ordered sequence for the whole
project, not just for each component. But it's clear to me that we
have to do something similar to Gerrit, as this has been proven to
work on a larger infrastructure.

I'm assuming that pushes to submodules will result in a (nearly)
immediate commit/push to the umbrella repo to update it with the new
submodule head. Otherwise, checking out the umbrella repo won't get you
the latest submodule updates.

Since updates to the umbrella project are needed to synchronize it for
updates to sub-modules, it seems to me that if you want an ID that
applies to all projects, that it would have to be coordinated relative
to the umbrella project.

  Design decisions

This could be a pre/post-commit trigger on each repository that
receives an ID from somewhere (TBD) and updates the commit message.
When the umbrella project synchronises, it'll already have the
sequential number in. In this case, the umbrella project is not
necessary for anything other than bisect, buildbots and releases.

I recommend using git tag rather than updating the commit message
itself. Tags are more versatile.

I personally believe that having the trigger in the umbrella project
will be harder to implement and more error prone.

Relative to a SQL database and a server, I think managing the ID from
the umbrella repository would be much simpler and more reliable.

Managing IDs from a repo using git metadata is pretty simple. Here's
an example script that creates a repo and allocates a push tag in
conjunction with a sequence of commits (here I'm simulating pushes of
individual commits rather than using git hooks, for simplicity). I'm
not a git expert, so there may be better ways of doing this, but I
don't know of any problems with this approach.

#!/bin/sh
set -e

rm -rf repo

# Create a repo.
mkdir repo
cd repo
git init

# Create a well-known object to hang the counter note on.
PUSH_OBJ=$(echo "push ID" | git hash-object -w --stdin)
echo "PUSH_OBJ: $PUSH_OBJ"

# Initialize the push ID to 0.
git notes add -m 0 "$PUSH_OBJ"

# Simulate some commits and pushes.
for i in 1 2 3; do
   echo "$i" > "file$i"
   git add "file$i"
   git commit -m "Added file$i" "file$i"
   PUSH_TAG=$(git notes show "$PUSH_OBJ")
   PUSH_TAG=$((PUSH_TAG+1))
   git notes add -f -m "$PUSH_TAG" "$PUSH_OBJ"
   git tag -m "push-$PUSH_TAG" "push-$PUSH_TAG"
done

# List commits with push tags.
git log --decorate=full

# list commits with push tags
git log --decorate=full

Running the above shows a git log with the tags:

commit a4ca4a0b54d5fb61a2dacbab5732d00cf8216029 (HEAD, tag:
refs/tags/push-3, refs/heads/master)
...
     Added file3

commit e98e2669569d5cfb15bf4cd1f268507873bcd63f (tag: refs/tags/push-2)
...
     Added file2

commit 5c7f29107838b4af91fe6fa5c2fc5e3769b87bef (tag: refs/tags/push-1)
...
     Added file1

The above script is not transaction safe because it runs commands
individually. In a real deployment, git hooks would be used and would
rely on push locks to synchronize updates. Those hooks could also
distribute ID updates to the submodules to keep them synchronized.

Tom.

I don’t think we should do any of that. It’s too complicated – and I don’t see the reason to even do it.

There’s a need for the “llvm-project” repository – that’s been discussed plenty – but where does the need for a separate “id” that must be pushed into all of the sub-projects come from? This is the first I’ve heard of that as a thing that needs to be done.

There was a previous discussion about putting a sequential ID in the “llvm-project” repo commit messages (although even that I’d say is unnecessary), but not anywhere else.

Agreed, the llvm-project repository can completely take on the role of the
SQL database in Renato's proposal.

Chromium created a "git-number" extension that assigns sequential ids to
commits in the obvious way, and that provided some continuity with the
"git-svn-id:" footers in commit messages. I'm not sure their extension is
particularly reusable, though:
https://chromium.googlesource.com/chromium/tools/depot_tools.git/+/master/git_number.py

I think for LLVM, whatever process updates the umbrella repo should add the
sequential IDs to the commit message, and that will help provide continuity
across the git/svn bridge.

Hum, doing it in a separate server was suggested by the GitHub folks,
so I just assumed they can't do that in the umbrella project for some
reason.

I'm all for using the umbrella if we can, I just thought we couldn't... :(

Can someone try the suggested tag style? Are we sure we can guarantee
atomicity in there? I know SQL can. :)

I know that changing the commit message works because of Gerrit and
our current SVN integration, I don't know how much adding one tag for
each push will work in Git over time.

cheers,
--renato

I assumed we want ids for the umbrella repository to ease bisection and having something to print as a version identifier, but do we really need them for the other repositories?

I also still don’t see why git rev-list --count --all does not work. Sure, the count is only per branch, but why would we ever need continuous numbering between, say, a 3.8.XX revision and a 3.9.XX branch…

  • Matthias

From: cfe-dev [mailto:cfe-dev-bounces@lists.llvm.org] On Behalf Of Renato
Golin via cfe-dev
Sent: Thursday, June 30, 2016 9:49 AM
To: Reid Kleckner
Cc: LLVM Dev; llvm-foundation@lists.llvm.org; Clang Dev; LLDB Dev
Subject: Re: [cfe-dev] [lldb-dev] [llvm-dev] Sequential ID Git hook

> Agreed, the llvm-project repository can completely take on the role of
the
> SQL database in Renato's proposal.

Hum, doing it in a separate server was suggested by the GitHub folks,
so I just assumed they can't do that in the umbrella project for some
reason.

I'm all for using the umbrella if we can, I just thought we couldn't... :(

Can someone try the suggested tag style? Are we sure we can guarantee
atomicity in there? I know SQL can. :)

I know that changing the commit message works because of Gerrit and
our current SVN integration, I don't know how much adding one tag for
each push will work in Git over time.

We were using tags for a while in our own SVN->git conversion internally.
(git branch is pushed to SVN and the SVN r-number used to create a tag.)
They are convenient for some things, but each tag adds a new (if small)
file to .git/refs/tags, and I don't know that it really scales well when you
are talking about (long term) hundreds of thousands of them. That was
not what tags were designed for.

We've since stopped creating the tags, and gotten used to not having
them. We do the 'rev-list --count' trick which mainly gets recorded as
one component of the version number, and it has been working for us.

I think having the number in the commit log (even if it's just for the
superproject) would be preferable. You can use 'git log --grep' to
find a particular rev if you need to.
--paulr

(talking about lots of tags)

I don’t know that it really scales well when you
are talking about (long term) hundreds of thousands of them.

I can say from experience that it does not scale well. After some time, everyone would start feeling the pain.

Jim Rowan
jmr@codeaurora.org
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by the Linux Foundation

Does that work for sub modules inside the umbrella project?

How can you trigger a hook in the umbrella project for commits inside the sub modules?

Enough people complained about the sequential number, and I’m trying to solve that problem. I particularly have no use for that at all.

We can still do the bundling pushes with the same id by adding metadata to the commit message and training our CI, which is orthogonal to this issue.

Cheers,
Renato

We’ve since stopped creating the tags, and gotten used to not having
them. We do the ‘rev-list --count’ trick which mainly gets recorded as
one component of the version number, and it has been working for us.

Does that work for sub modules inside the umbrella project?

How can you trigger a hook in the umbrella project for commits inside the sub modules?

First: this is purely about generating sequential revision numbers; it does not help with setting up a server hook to update the submodule references in the meta repository. The point I am trying to make here is that we only need to solve the problem of updating the submodule references, and that generating sequential ID numbers as an alternative to git hashes is no problem.

As far as I can see we need the following operations for sequential ID numbers and they are all easy enough to perform with git on the client side:

  1. Produce the revision number for the current checkout (to use in
tool --version output). You can put something like

     echo "#define VERSION $(git rev-list --count HEAD)" > version.h

in your build system.

  2. Convert a git hash to a revision number:
     git rev-list --count $HASH

  3. Convert a revision number $NUM to a git hash:
     git rev-list HEAD | tail -n $NUM | head -n 1

  • Matthias

We've since stopped creating the tags, and gotten used to not having
them. We do the 'rev-list --count' trick which mainly gets recorded as
one component of the version number, and it has been working for us.

Does that work for sub modules inside the umbrella project?

I don't know, we don't use submodules, we use subtree merges to create
an omnibus branch. If it *does* work in the parent project, that would
be cool (and might motivate us to reorganize our internal branches that
way, but not this year).
--paulr

I think for each commit on a subproject we would create one commit on the meta project (which switches the submodule reference to that commit on the subproject). That also implies that the ID number (which is just the length of the history) in the meta project increases with each commit in a subproject.

- Matthias

We were using tags for a while in our own SVN->git conversion internally.
(git branch is pushed to SVN and the SVN r-number used to create a tag.)
They are convenient for some things, but each tag adds a new (if small)
file to .git/refs/tags, and I don't know that it really scales well when you
are talking about (long term) hundreds of thousands of them. That was
not what tags were designed for.

We're using tags in this manner for our internal repos and LLVM/Clang
mirrors and haven't experienced any problems. We're at ~50k tags for
our most used repo, so not quite at hundreds of thousands yet.

When I look in .git/refs/tags of one of my repos, I do *not* see 50k
files; I see ~400. I'm not sure what causes some to appear here and
others not.

I don't see how this use of tags is not representative of what tags were
designed for. They are designed to label a commit. That seems to match
well what is desired here.

We've since stopped creating the tags, and gotten used to not having
them. We do the 'rev-list --count' trick which mainly gets recorded as
one component of the version number, and it has been working for us.

As I understand it, 'git rev-list --count HEAD' requires walking the
entire commit history. Perhaps the performance is actually ok in
practice, but I would be concerned about scaling with this approach as well:

$ time git rev-list --count HEAD
115968

real 0m1.170s
user 0m1.100s
sys 0m0.064s

I think having the number in the commit log (even if it's just for the
superproject) would be preferable. You can use 'git log --grep' to
find a particular rev if you need to.

Grepping every commit doesn't seem like the most scalable option either.
I did a quick test on a large repo. First, a grep for an identifier:

$ time git log --grep <id>
...
real 0m1.450s
user 0m1.340s
sys 0m0.092s

Then I did the same for the associated push tag:

$ time git log -n 1 <tag>
...
real 0m0.048s
user 0m0.024s
sys 0m0.016s

Tom.

Can you elaborate on this? As I mentioned in another email, we're at
~50k tags in one repo and not having any problems. I can't see why git
would fundamentally have scaling or performance issues in conjunction
with lots of tags. Perhaps some UI interfaces were failing to scale well?

Tom.

My issue with using tags like this is that they pollute the tag
namespace and will quickly swamp what I consider to be the important
ones ("release-X"). OK, so we've got "git tag -l 'release*'" but
that's pretty ugly.

Tim.

Hi Tom,

What if we had a mixed mode?

Say we create a tag for each release, and we know the last release's
tag number (245234); then we just need to count how many commits have
landed "since the last tag", which will always be the last release.

This will have all the tags we want, and scale forever.

Is rev-list able to do that?
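The arithmetic of the mixed mode is just base + distance; a sketch
(the tag name and numbers here are hypothetical, assuming release tags
carry their base number, e.g. "release-245234"):

```python
def sequential_id(release_tag, commits_since_tag):
    # The release tag encodes the sequential ID at branch time;
    # commits landed since then just add to it.
    base = int(release_tag.rsplit("-", 1)[1])
    return base + commits_since_tag

# E.g. 120 commits after the tag "release-245234":
print(sequential_id("release-245234", 120))  # 245354
```

The commit count itself would come from something like counting
revisions between the last release tag and HEAD, which is exactly the
per-range counting rev-list does.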

--renato

Thousands of tags would cause problems, even with packed-refs.
For example, both git-fetch and git-push query thousands of local tags
against the remote.

I was playing with 200k tags. ;) I wish git had a "thin tag" that
didn't affect remotes.

Thousands of tags would cause problems, even with packed-refs.
For example, both git-fetch and git-push query thousands of local tags
against the remote.

Can you qualify these statements with some data? Preferably a script
that can demonstrate scaling issues? As I mentioned previously, we're
at ~50k tags and, as far as I can tell, we aren't experiencing any
problems associated with them.

I was playing with 200k tags. ;) I wish git had a "thin tag" that
didn't affect remotes.

You can opt out of pulling tags by passing '--no-tags' to git pull/fetch
or by configuring 'remote.<name>.tagopt' via git config.

Tom.

I just want to point out another alternative that I often use in my projects (example from a buildbot repo I had handy):

$ git describe

v0.9.0b8-579-ge06cac6

The format is “TAG-N-gREV”, where TAG is the closest reachable git tag in the past (“v0.9.0b8”), N is the relative number of revisions past that tag (579), and REV is the short revision to make it unique and easy to locate (“e06cac6”, the ‘g’ is a literal character that prefixes the revision). In other words, that “-579-” represents a monotonically increasing value relative to the named tag and might serve your purposes.
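The format can be unpacked mechanically; a sketch of a parser for it
(note the tag itself may contain dashes, so the -N-gREV suffix has to
be matched from the right):

```python
import re

def parse_describe(s):
    # Split "TAG-N-gREV" into (tag, distance, short rev). When HEAD
    # sits exactly on a tag, "git describe" prints just the tag name.
    m = re.match(r"^(?P<tag>.+)-(?P<n>\d+)-g(?P<rev>[0-9a-f]+)$", s)
    if not m:
        return (s, 0, None)
    return (m.group("tag"), int(m.group("n")), m.group("rev"))

print(parse_describe("v0.9.0b8-579-ge06cac6"))
# ('v0.9.0b8', 579, 'e06cac6')
```

The middle field is the monotonically increasing count relative to the
named tag that the email above suggests using.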

Jared