SVN dump seed file (was: svnsync of llvm tree)

Hi folks,

in a rather old thread on this list titled "svnsync of llvm tree"
<http://comments.gmane.org/gmane.comp.compilers.llvm.devel/42523> we
noticed that an svnsync would fail due to a few particularly big commits
that apparently caused OOM conditions on the server. The error and the
revision number were consistent for different people.

That seems to be fixed now. I succeeded in pulling a full "clone" of the
SVN repository.

Back then it was noted that an SVN dump seed file could make it easier
for people to start svnsync-ing the LLVM source tree. Now I'm hoping to
convince you to provide a seed file for a reason beyond just working
around that old error.

To my knowledge, when using svnsync (just as with a plain svn checkout),
one actually transfers approximately the svndump file size of the
resulting repository. Uncompressed that happens to be almost 15
Gigabytes, even though the resulting SVN repository is only roughly 3.8
Gigabytes on disk. For this experiment I dumped the revision range
0:230000 to a file just to have a "clean" range.

After compressing the dump with the lzma utility I had a file of 419
Megabytes.

If there is interest, I can upload the compressed dump (or make it
available for download), including a PGP signature on the files, and so
on for you to make available on the official servers. I'll even add
detailed steps on how to get to a repository again from there and how to
keep it up-to-date. And yes, I adjusted the repo UUID to match the
remote one (which is extremely useful: it lets you keep a local
synchronized repo and 'svn relocate' a working copy between upstream and
the mirror depending on availability or mood).
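For the record, a rough sketch of what those steps might look like (the
URLs, paths, and UUID are placeholders; the revision-0 property trick
for resuming svnsync from a seeded repository is one common approach,
not the only one):

```shell
# Sketch: restore a mirror from the seed dump and resume svnsync from it.
# <upstream-url> and <upstream-uuid> stand in for the real values.
lzma -dk llvm.svndump.lzma
svnadmin create /srv/llvm-mirror
svnadmin load /srv/llvm-mirror < llvm.svndump

# svnsync stores its state in revision properties, so allow revprop changes:
printf '#!/bin/sh\nexit 0\n' > /srv/llvm-mirror/hooks/pre-revprop-change
chmod +x /srv/llvm-mirror/hooks/pre-revprop-change

# Make the mirror's UUID match upstream so 'svn relocate' works:
svnadmin setuuid /srv/llvm-mirror <upstream-uuid>

# Tell svnsync where it left off (instead of 'svnsync init', which
# insists on an empty repository):
svn propset --revprop -r0 svn:sync-from-url  <upstream-url>  file:///srv/llvm-mirror
svn propset --revprop -r0 svn:sync-from-uuid <upstream-uuid> file:///srv/llvm-mirror
svn propset --revprop -r0 svn:sync-last-merged-rev 230000    file:///srv/llvm-mirror

svnsync sync file:///srv/llvm-mirror
```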

All I really did was:

  svnadmin dump $(pwd) -r 0:230000 > llvm.svndump

and then ran "lzma -k9e" on the resulting dump file (the compression
took more than 2 hours).

The main point is that anyone trying to start synchronizing now will
have to transfer ~15 GiB of data to get to the current point. That can
be cut to ~420 MiB by providing a seed file, in the described case for
the revision range 0:230000 (and additional chunks such as 230000:250000
could be added later, or the base seed file could be updated accordingly).
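Such a follow-up chunk could be produced along the same lines as the
base dump; the only twist would be `--incremental`, so the first
revision of the chunk is emitted as a delta rather than a full tree
(exact range boundaries are my guess and would be adjusted as needed):

```shell
# Hypothetical follow-up chunk on top of the 0:230000 base seed.
# --incremental makes r230001 a delta that loads cleanly onto a
# repository already containing r230000.
svnadmin dump /path/to/repo -r 230001:250000 --incremental > llvm-230001-250000.svndump
lzma -k9e llvm-230001-250000.svndump

# On the receiving side the chunk loads on top of the seeded mirror:
svnadmin load /srv/llvm-mirror < llvm-230001-250000.svndump
```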

Hope someone in charge reads and considers this.

With best regards,

Oliver

PS: feel free to contact me off-list about that, too.

Hi,

I think it would be easier to understand why you want this if you had a use case for having an svnsync clone. Aside from backing up the repository, it seems like a fairly useless thing: you can't do local commits and then upstream them and you can't do

If you want the complete history of the repository, then a git clone of the git-svn mirror will give you this very cheaply and with the added bonus that you can then commit to the local copy and still push things upstream (and merge changes from upstream). A fresh clone of the llvm and clang git mirrors transfers about 310MB for LLVM and about 190MB for Clang.

What do you want to do with the svnsync copy?

David

Hi,

> I think it would be easier to understand why you want this if you had
> a use case for having an svnsync clone. Aside from backing up the
> repository, it seems like a fairly useless thing: you can't do local
> commits and then upstream them and you can't do

well more or less for backup purposes, yes.

Local commits are not necessary, say, if I rarely commit. Because I
could 'svn relocate' my working copy to point to the upstream repo
before committing. And to reinforce that I really don't want to commit
to my local "clone" I can install a hook script preventing me from doing
that.
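Such a hook could be as small as this (a sketch; the file location and
arguments follow the standard Subversion pre-commit hook contract, the
wording of the message is mine):

```shell
# Write a pre-commit hook that rejects every commit, keeping the local
# mirror read-only for everything except svnsync itself.
# In a real mirror this lives at <repo>/hooks/pre-commit.
mkdir -p hooks
cat > hooks/pre-commit <<'EOF'
#!/bin/sh
# Subversion passes: $1 = repository path, $2 = transaction name.
echo "This repository is a read-only svnsync mirror." >&2
echo "Relocate your working copy to the upstream URL before committing." >&2
exit 1
EOF
chmod +x hooks/pre-commit
```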

A very realistic use case is that people in a big organization could use
the same local copy to checkout and the upstream repo to commit.
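The checkout-locally/commit-upstream dance I have in mind would look
something like this (both URLs are placeholders):

```shell
# Everyday work happens against the fast local mirror:
svn checkout http://svn.corp.example/llvm-mirror/llvm/trunk llvm
cd llvm

# Only for committing, flip the working copy to upstream and back
# (svn >= 1.7 syntax; older clients use 'svn switch --relocate'):
svn relocate http://svn.corp.example/llvm-mirror <upstream-url>
svn commit -m "..."
svn relocate <upstream-url> http://svn.corp.example/llvm-mirror
```

This only works because the mirror keeps the upstream UUID; with a
different UUID the client would refuse to relocate.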

SVN over HTTP isn't exactly the most efficient protocol there is, so if
I can skip thousands of revisions via SVN over HTTP and get going more
quickly, that helps a lot.

> If you want the complete history of the repository, then a git clone
> of the git-svn mirror will give you this very cheaply and with the
> added bonus that you can then commit to the local copy and still push
> things upstream (and merge changes from upstream). A fresh clone of
> the llvm and clang git mirrors transfers about 310MB for LLVM and
> about 190MB for Clang.

And the desire for getting that in SVN form mainly comes from the
imperfect representation of the history in the git-svn mirror (as you
will also find in some recent and older threads).

Besides, the llvm-project SVN repo contains *everything*, whereas the
Git mirrors provide only a handful of per-project slices, as far as I
know.

I know git-svn kind of works, but it's a crutch. Of course it does
provide the means for cooperation between SVN and Git users, so that
aspect is good.

> What do you want to do with the svnsync copy?

Personally I wanted to keep an updated SVN copy and work on providing a
better (continuously updated) Git and Mercurial representation of the
repository (or repositories, need to look into that).

I have gathered experience with repository conversion (using
reposurgeon) and thought I could apply it to something more complex,
too. I also learned that there are quite a few "Subversionisms" that can
cause problems in the conversion; that is likely one of the reasons the
git-svn mirrors are not quite perfect.

This could be quite useful if and when LLVM decides to move to any other
version control system that supports git-fast-import streams. For this,
of course, one needs a true-to-the-bit copy of the repo and not
something like the imperfect git-svn mirror :-)

Also, within our company I'd like to use it in the same way I described
above (it's not the first repo I've done this with, but it's one of the
bigger ones). On the local network the transfer even of large amounts of
data is a lot faster. I cannot say I ever enjoyed the speed of SVN much,
but at least on the local net it becomes less of a nuisance.

Of course it's easy to put me on the spot about use *cases* when I only
have one or two for myself, but there are probably more use cases out
there due to some of the inherent weaknesses of SVN compared to
distributed version control systems.

The mailing list thread back then proved there was demand for it.
Creating a dump is a one-time effort. It can (but doesn't have to) be
redone every few tens of thousands of revisions to provide additional chunks on
top of the initial seed file or even create a fresh base seed file. And
it saves traffic.

I cannot offer more reasons than that, sorry. Perhaps someone else can.

Oliver