Sequential ID Git hook

“git describe --tags” also works for lightweight tags.

On Mon, Jul 4, 2016 at 2:48, Jared Grubb via cfe-dev <cfe-dev@lists.llvm.org> wrote:

But it doesn't work if there are no tags, I just tested on LLVM and I get:

$ git describe
fatal: No names found, cannot describe anything.

Should be easy enough to create the tags on each branching point, though.

"describe" also seems to be available at least since Git 1.9, so it
should be pretty safe.

And since tagging *every* commit doesn't scale for long ranges, and
anything else will need scripting on the client side, I think we can
get rid *completely* of any server side hook, and let the client side
scripts deal with the output of "git describe".

Or am I just too optimistic?

cheers,
--renato
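
A client-side script that consumes the output of "git describe" could be as small as the sketch below. It assumes the default output format, "<tag>-<n>-g<abbrev>", or just "<tag>" when the commit itself is tagged. The names Described and parse_describe, and the tag name used in the usage note, are hypothetical.

```python
import re
from typing import NamedTuple


class Described(NamedTuple):
    tag: str       # nearest tag reachable from the commit
    distance: int  # number of commits on top of that tag
    abbrev: str    # abbreviated hash, "" when the commit itself is tagged


# Default `git describe` output: "<tag>-<n>-g<abbrev>".
_DESCRIBE_RE = re.compile(r"^(?P<tag>.+)-(?P<n>\d+)-g(?P<sha>[0-9a-f]+)$")


def parse_describe(text: str) -> Described:
    """Split one line of `git describe` output into its parts."""
    text = text.strip()
    match = _DESCRIBE_RE.match(text)
    if match is None:
        # No "-<n>-g<abbrev>" suffix: the commit sits exactly on the tag.
        return Described(text, 0, "")
    return Described(match.group("tag"), int(match.group("n")), match.group("sha"))
```

For example, parse_describe("release_38-1520-g1a2b3c4") yields the tag "release_38", a distance of 1520 and the abbreviated hash "1a2b3c4". Running "git describe --long" always prints the suffix, which avoids ambiguity when a tag name itself ends in something that looks like one.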

One of the nice features of GitHub is that it provides a download link to grab tarballs for any specific version. These are easy to use with other workflows (for example, the FreeBSD ports collection has infrastructure for grabbing a release and building it). It would be a shame if you needed a full git clone to get the revision information, as that would mean that anyone who built from a tarball would lose it.

David

What doesn’t scale about tagging every commit?

True, every tag creates a small file on disk, but then so does every commit.

If you’re worried about lots of files in a directory then you can put tags in nested directories by putting one or more /'s in the tag name. So you can hide all the commit tags in, say .git/refs/tags/commits and put release tags in the root tags directory, or another subdirectory. i.e. “git tag commit/23456 HEAD”. Things such as shell command autocomplete (e.g. git checkout) deal intelligently with the tags directory structure, so you’re not overwhelmed by 10000 commit tags when you just want to see the 40 release tags.
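
As a rough illustration of the namespacing idea, the sketch below separates per-commit tags from everything else by prefix. The function name split_tags and the "commit/" prefix are assumptions taken from the example tag above, not an agreed convention.

```python
def split_tags(tag_names, commit_prefix="commit/"):
    """Partition tag names into (per-commit tags, all other tags)."""
    commit_tags, other_tags = [], []
    for name in tag_names:
        if name.startswith(commit_prefix):
            commit_tags.append(name)
        else:
            other_tags.append(name)
    return commit_tags, other_tags
```

Git can do the same filtering itself: "git tag --list 'commit/*'" prints only the tags under that prefix.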

Also the files are only a short term thing anyway. When a “git gc” or “git pack-refs” runs, the tag refs are collected into the single packed-refs file, much as objects are collected into pack files, and the .git/refs/tags directory is cleared out.

Both Jim and Takumi have reported problems with thousands of tags.
Even though neither of them responded to your enquiries for additional
data, we can't assume there isn't any.

Furthermore, "git describe" seems to be the "mixed mode" I asked
about, and it's already in git since an old version, so I'm not sure
why we'd even need to create one tag per commit anyway.

People should be using Git for bisects, in which case it works out of
the box. The incremental version was mostly to tag the build with
something meaningful, which "git describe" is.

Even if you want to use the result of describe to bisect like SVN, it
works because our history is linear (and you can count the number of
commits between A and B, and even store a local list of the hashes in
between).
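
A minimal sketch of that local list, assuming a strictly linear history whose hashes are supplied oldest first (for instance from "git rev-list --reverse --first-parent HEAD"). The function names are hypothetical.

```python
def number_commits(hashes_oldest_first):
    """Map each hash to its 1-based position in a linear history."""
    return {h: i for i, h in enumerate(hashes_oldest_first, start=1)}


def commits_between(index, a, b):
    """Number of commits after `a`, up to and including `b`."""
    return index[b] - index[a]
```

For a single pair of commits, "git rev-list --count A..B" gives the same number without building the list.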

I really can't see why we would need to tag every commit.

cheers,
--renato

What doesn't scale about tagging every commit?

Both Jim and Takumi have reported problems with thousands of tags.
Even though neither of them responded to your enquiries for additional
data, we can't assume there isn't any.

Agreed. Adding a tag to every commit (especially in something with as many commits as LLVM/clang) would be a nightmare for anyone that uses the pretty forms of git-log (eg, "git log --graph --abbrev-commit --pretty=oneline --decorate --color") or GUI-based programs for Git. I imagine the dropdown menus on GitHub wouldn't be fun to use either.

Very few operations search for commit objects by reading every single commit file. Most operations that read commit objects already know what they are looking for based on their hash. Plus, over time commit objects are packed into well indexed archive files, so the total number of commits stored in the filesystem never becomes an issue.

On the other hand, there are many commonly used git commands that might load and parse the entire set of references in order to function. git describe, git log (--decorate|--all), git fetch, git push, …

Quick re-cap.

After a few rounds, not only has the "external server" proposal been
obliterated as totally unnecessary, but the idea that we may even need
a hook at all is now challenged.

Jared's idea to use "git describe" is in line with previous proposals
to use rev-list --count and to do so only up to the previous tag, but
all in one nice and standard little feature.

There were concerns about applying one tag per commit, but most of them
offered weak evidence. However, if "describe" can cover all our needs,
there is no point in even discussing tags.

Just for reference, GitHub *does* have an SVN interface [1], and you
can already checkout a specific revision with "svn checkout -r NNN
repo", which *is already* using "git rev-list --count".

This means that, for SVN based bisects, using GitHub will make it
*completely transparent* for SVN users. You can also base your
releases off an SVN view of the Git repo.

So, to clear up this discussion and finish my proposal to move to
GitHub, my final questions, only to those that *want* SVN
compatibility:

1. Is there anything in the SVN view of GitHub that *doesn't* work for
you? (ie. same as using "rev-list --count")

2. If so, can "git describe" solve the problem?

3. If not, please describe, in detail, why <<your alternative>>
would be the *only* way forward.

I'll let this sit for a few days, and if no one has any serious issue,
I'll write up the final proposal and start the voting process with the
Foundation.

cheers,
--renato

[1] Announcing SVN Support - The GitHub Blog

Note that GitHub (currently, at least) doesn’t export submodules sensibly with their svn version. I don’t intend to use the svn thing (the only time that I have used it in anger was to replace a project that moved to GitHub with an svn:external that referred to the GitHub repo so people could easily find the new location), but that would cause problems if anyone wants to do an svn bisect.

I think it would help to have a description of how to bisect for a clang or lldb (or some other subproject) regression. For downstream users, it would also be nice if tools like git-imerge let you merge clang and llvm together, though that’s a nice-to-have feature that we currently lack so shouldn’t in any way block the migration.

David

Note that GitHub (currently, at least) doesn’t export submodules sensibly with their svn version.

SVN users can continue using the projects directly, they *just* need
to change the SVN repository location of every project to GitHub.

It can't get simpler than that.

I think it would help to have a description of how to bisect for a clang or lldb (or some other subproject) regression.

There is plenty of good documentation
(Git - Book) that teaches everything one needs to
know about git (but was afraid to ask).

Using the umbrella project, the sub-modules will make it trivial to
bisect. Using SVN view for individual projects will make it
*identical* to bisect as it was.

For downstream users, it would also be nice if tools like git-imerge let you merge clang and llvm together, though that’s a nice-to-have feature that we currently lack so shouldn’t in any way block the migration.

Git imerge is an amazing tool, but it's not production quality yet, I
think. Though, this is really an orthogonal issue, since SVN-bound
people can still use the SVN view and merge their own patches
downstream.

cheers,
--renato

You don't see many tags because tags (both heavy and lightweight) are
transferred from the file system into a pack file whenever a "git repack"
or "git gc" happens.

This is not clear to me.
How is the umbrella repository updated?

Sorry, I meant no hooks for updating sequential ids. We still need a hook to update the umbrella project.

Cheers,
Renato

Sequential IDs are important for LNT and llvmlab bisection tool.

LNT uses the “order” to capture the measured software changes. LNT does make the assumption that orders are unique, so if an ID was the same on two branches, LNT would assume that it is the same change. If you never want to compare data between branches, storing each branch in a different database solves that problem, but sometimes you do want to directly compare runs in two branches.

With both llvmlab and LNT, once you get to a range of IDs, it needs to be easy to find out what commits or commit range those IDs map to. When given a regression between 123 and 225, I need the list of commits, and I don’t want to grep the log for those numbers. Ideally it should also be easy for those tools to link to a revision on a web UI like viewvc.
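
If the IDs were commit counts on a linear history, mapping an ID range back to commits would be plain arithmetic. The sketch below assumes that the tip of the branch has ID tip_count (as reported by "git rev-list --count --first-parent <tip>") and that the commit with ID n is therefore <tip>~(tip_count - n). The function name and the default branch name are hypothetical.

```python
def id_range_to_revspec(tip_count, first_id, last_id, tip="master"):
    """Build a Git revision range covering IDs first_id+1 .. last_id.

    A..B in Git excludes A and includes B, which matches "the regression
    appeared somewhere after first_id, at or before last_id".
    """
    if not (0 < first_id <= last_id <= tip_count):
        raise ValueError("IDs must satisfy 0 < first_id <= last_id <= tip_count")
    return f"{tip}~{tip_count - first_id}..{tip}~{tip_count - last_id}"
```

With a tip at ID 300, the range 123 to 225 becomes "master~177..master~75", which can be passed straight to "git log" to list the 102 candidate commits.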

Making no comment on how easy or hard it will actually be, doesn't it
just have to be possible? We're talking about some python function
someone will write once and forget about pretty much forever
afterwards, aren't we?

Tim.

I could see wanting to compare data from master and a release branch. If that means sequential IDs need to work across branches, then we’re back to needing a fancier solution than ‘rev-list --count’.

--paulr

That's as easy as: git rev-list --count hash, relying on the fact that
our history is linear, you just have to do basic arithmetic (on the
umbrella project) to get the final sequence.

All of it can be done on the client, not the server.

cheers,
--renato

How would you do this in SVN anyway?

Branch commits are inter-twined with trunk commits, and comparing them
numerically doesn't yield the results you expect.

At least in Git, the history is tied up via "parent" and not via
sequential IDs, so you can actually walk the path.

Sequential numbers are only meaningful for linear histories. Branches,
whether on Git or Svn break that promise.

If we make LNT work with Git "as Git", then all problems are solved.
And meanwhile, we get to work with LNT "as SVN" via rev-list --count.

cheers,
--renato

I could see wanting to compare data from master and a release branch. If
that means sequential IDs need to work across branches, then we're back to
needing a fancier solution than 'rev-list --count'.

How would you do this in SVN anyway?

Branch commits are inter-twined with trunk commits, and comparing them
numerically doesn't yield the results you expect.

If I give the revision that corresponds to some 3.8 branch commit and compare to another in master, I think I get exactly the result I expect.
I may not understand your point here...

At least in Git, the history is tied up via "parent" and not via
sequential IDs, so you can actually walk the path.

Sequential numbers are only meaningful for linear histories. Branches,
whether on Git or Svn break that promise.

SVN has monotonically increasing IDs that are unique across branches.

If we make LNT work with Git "as Git", then all problems are solved.
And meanwhile, we get to work with LNT "as SVN" via rev-list --count.

You missed the point that in a single instance of LNT a revision number has to be unique.
The rev-list thing won't provide this across branches.
A rev-list count number won't identify a revision, you need the tuple (branch, count), which is less easy or less compatible with existing systems.
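
A small sketch of the collision, using made-up branch names and results: two runs with the same count on different branches overwrite each other when keyed by count alone, but stay distinct when keyed by the (branch, count) tuple. The Order type and index_runs function are hypothetical and not part of LNT.

```python
from typing import NamedTuple


class Order(NamedTuple):
    branch: str
    count: int  # e.g. the output of `git rev-list --count <branch>`


def index_runs(runs):
    """Index (branch, count, result) triples by count and by Order."""
    by_count, by_order = {}, {}
    for branch, count, result in runs:
        by_count[count] = result                  # collides across branches
        by_order[Order(branch, count)] = result   # unique per branch
    return by_count, by_order
```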

With SVN the IDs are unique, so r123 implies a branch. svn log --revision=123:234 gives the right change list when svn is directed at that branch. So right now, I can get the change list in one command easily without a script.

llvmlab bisect already encodes branch information in the build name.

If LNT is the only holdout I suggest we update the LNT model to natively handle git. The order model has other problems besides this, and I think it would make LNT more useful to capture changes in a richer way. Something along the lines of registering repos with LNT, and having orders generated based on unique combinations of hashes. That would also allow us to track changes in the CI drivers, LNT and the test-suite as first class entities in the change list. That would also allow us to show change messages in the LNT interface. Basically internalizing the sequential ID as an order ID, but then usefully representing that in the interface. This sort of change to the guts of how LNT works is weeks of work, but I think it matches LNT’s goal of tracking performance changes as software evolves.
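
One way to picture orders generated from unique combinations of hashes is the sketch below. It is purely illustrative: the OrderRegistry class, its method and the repository names are assumptions, and nothing here reflects LNT's actual schema or API.

```python
class OrderRegistry:
    """Hand out a sequential order ID per unique combination of hashes."""

    def __init__(self):
        self._ids = {}

    def order_for(self, revisions):
        """revisions: mapping of registered repo name to commit hash."""
        # Sort by repo name so the same combination always forms the same key.
        key = tuple(sorted(revisions.items()))
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1
        return self._ids[key]
```

The same combination of hashes always maps to the same order ID, while a change in any registered repository (the compiler, the test-suite, the CI drivers) produces a new one, in the sequence in which combinations are first seen.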