Bugzilla migration is stopped again

Dear All,

I hate to say this, but the migration has been stopped again. It now seems
that GitHub does not rewrite issue references properly during the
transfer (sic!). Let me show exactly what the problem is:

Consider two issues: A and B, where A will reference B and B will
reference A. In our case this is used to model various relations like
"duplicates / is duplicated by", "blocks / is blocked by", "depends on
/ required by". So, in the bz archive, A will reference B as #B and B
will reference A as #A.

Now, let's migrate A. The references will be rewritten: #B in A becomes
bz-archive#B, and #A in B becomes llvm-project#A. However, after the
migration of B only one reference is rewritten (llvm-project#A => #A in
B). The bz-archive#B link in issue A is not rewritten, and therefore a
dangling reference appears.
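
To make the failure concrete, here is a minimal sketch that mimics the rewriting behaviour described above. It is only a model of what was observed, not GitHub's actual implementation: the `transfer` helper, the repo names `bz-archive` and `llvm-project`, and the issue numbers are all assumptions made for illustration.

```python
import re

def transfer(number, src, dst, src_name, dst_name, new_number):
    """Move issue `number` from `src` to `dst`, imitating the observed rewriting."""
    body = src.pop(number)
    # References that were local to the source repo become cross-repo references.
    body = re.sub(r"(?<![\w/-])#(\d+)", rf"{src_name}#\1", body)
    # References that already pointed at the destination repo become local.
    body = re.sub(rf"{re.escape(dst_name)}#(\d+)", r"#\1", body)
    dst[new_number] = body
    # Issues still in the source repo that mention the moved issue are updated.
    for n in list(src):
        src[n] = re.sub(rf"(?<![\w/-])#{number}\b", f"{dst_name}#{new_number}", src[n])
    # Missing step (the reported problem): issues already in `dst` that mention
    # `src_name#number` are left untouched.

archive = {1: "blocks #2", 2: "depends on #1"}   # issues A and B
project = {}

transfer(1, archive, project, "bz-archive", "llvm-project", 101)
transfer(2, archive, project, "bz-archive", "llvm-project", 102)

print(project[102])  # "depends on #101": rewritten correctly
print(project[101])  # "blocks bz-archive#2": dangling, issue 2 no longer lives there
```

The dangling link only shows up once both sides of a reference pair have been moved, which is why a test that moves one side alone looks fine.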

For us this means that we will lose all links to duplicate issues, and
(more importantly!) to linked issues in the meta bugs.

I informed GitHub about the bug and I am waiting for their answer.

This may be a dumb question, but could this just be an issue of forward references (i.e. issue A references B, but B has not been transferred yet so doesn’t exist)?
If so, could the transfer be split into a two step process:

  1. Open all issues with summaries only

  2. Populate description, comments, labels, etc.

Please note that I have no idea how any of this GitHub or Bugzilla stuff works, so this suggestion may be completely absurd.

Hello

> This may be a dumb question, but could this just be an issue of forward references (i.e. issue A references B, but B has not been transferred yet so doesn't exist)?
> If so, could the transfer be split into a two step process:

It cannot, sorry

How many issues with circular dependencies are we talking about? Few enough for us to fix by hand on an as-needed basis? At present, the lack of a bug tracker for 10 days is starting to matter more than 100% correctness of the data. My experience from multiple migrations to JIRA systems from various legacy bug trackers is that this is an iterative process: at some point you say “That’s good enough” and conclude that historical issues aren’t looked at often enough to worry about, as long as users can get back to the legacy system to cross-reference if necessary.

Once you open this up, most issues we tackle will be new issues created in GitHub.

Time to move forward. Ideally we should have done these kinds of migrations during the planning phase, but we didn’t, and that’s a lesson learnt. Let’s finish up the migration as is and move on. In my view it looks good enough to use, and you’ve done a good job, but let’s not drag this out for the 1% of bugs we might not look at much!

My 2p worth

MyDeveloperDay

To clarify, will those bz-archive#B references just not look as nice, or do they not work at all? It was my understanding that bz-archive#B will redirect to llvm-project#B2, in which case this doesn’t seem overly problematic.

Regards,
Nikita

> How many issues with circular dependencies are we talking about? Few enough for us to fix by hand on an as-needed basis? At present, the lack of a bug tracker for 10 days is starting to matter more than 100% correctness of the data.

This affects:
  - All issues closed due to being duplicates
  - All meta bugs (including release-tracking ones)

Maybe something else, but IMO the meta-bugs case is quite severe.
Essentially we will lose all links from the meta-bug to the
"downstream" bugs, or vice versa, or a mixture, depending on the relative
order of migration of the meta-bug and its dependees. I will be meeting
with GitHub folks on Monday to see what the solutions are.

> Time to move forward. Ideally we should have done these kinds of migrations during the planning phase, but we didn’t, and that’s a lesson learnt. Let’s finish up the migration as is and move on. In my view it looks good enough to use, and you’ve done a good job, but let’s not drag this out for the 1% of bugs we might not look at much!

Well. The thing is: we checked and found lots of issues and introduced
many workarounds. I must admit that checking cyclic references after
the migration was not on my checklist, and I spotted this issue by
accident. Ordinary references are migrated properly (both to source
code and to other issues), and this was checked. There was an assumption
that basic GitHub functionality would simply work. This was a mistake.

> To clarify, will those bz-archive#B references just not look as nice, or do they not work at all? It was my understanding that bz-archive#B will redirect to llvm-project#B2, in which case this doesn't seem overly problematic.

They will be just text. No redirect.

Is it really impossible to just completely remove all the current
issues and PRs in a repository and reset the counter, so that none of
this remapping is necessary in the first place?

Alternatively, is it really impossible to, instead of moving issues,
ask github to just move the releases into that new repository and then
swap those two repositories (forks, stars, clones, etc.)?

I think all these problems are only because of the remapping, which
will be problematic regardless, because the in-source mentions aren't
getting rewritten, so there *will* be confusion regardless of whether
github succeeds in moving issues.

Roman

> Is it really impossible to just completely remove all the current
> issues and PRs in a repository and reset the counter, so that none of
> this remapping is necessary in the first place?

I asked this question many times at different levels. As far as I was
told, yes, it is impossible: the bulk import can only go into an empty
repo. If you know of another way to do it, please let us know.

> Alternatively, is it really impossible to, instead of moving issues,
> ask github to just move the releases into that new repository and then
> swap those two repositories (forks, stars, clones, etc.)?

This is what we asked as well! The answer was "there is no way".
Maybe there is a way, but it would require significant
engineering effort on their side (e.g. additional development), so
our request was refused.

> I think all these problems are only because of the remapping, which
> will be problematic regardless, because the in-source mentions aren't
> getting rewritten, so there *will* be confusion regardless of whether
> github succeeds in moving issues.

Right. Do you have an idea how we can move forward?

Once the issues are imported into a clean llvm-project-NEW repository,
push tags into it, manually* recreate github releases - why do their dates
matter? - by manually* re-uploading all the manually uploaded assets,
then lock down the old llvm-project, rename it to llvm-project-obsolete,
mirror its branches into the new repo, and finally rename the new
llvm-project-NEW to llvm-project. And delete llvm-project-obsolete.
As far as git is concerned, the llvm-project repo is by then exactly
identical to the old one.

The only casualties will be unimportant things:
github stars, github forks, github release dates;
but if github can't be bothered to help with those,
it will serve as a forever reminder to the users that github is unreliable,
and false dependence should not be created on replaceable unreliable things.

* Surely it can be automated.

--
With best regards, Anton Korobeynikov
Department of Statistical Modelling, Saint Petersburg State University

Roman

> The only casualties will be unimportant things:
> github stars, github forks, github release dates;
> but if github can't be bothered to help with those,
> it will serve as a forever reminder to the users that github is unreliable,
> and false dependence should not be created on replaceable unreliable things.

Thanks for your opinion. However, this was previously discussed and it
was decided that release dates do matter, as do forks. The latter
are even more important for downstream users.

Certainly, if the community re-decides that these are unimportant
things, we can push the existing code into a blank archive fairly
quickly.

> The only casualties will be unimportant things:
> github stars, github forks, github release dates;
> but if github can't be bothered to help with those,
> it will serve as a forever reminder to the users that github is unreliable,
> and false dependence should not be created on replaceable unreliable things.
Thanks for your opinion.

Yep, it is indeed *just* *my* opinion, formed by observing the situation.

Roman

This would seem like a sensible approach to me.
I have worked with many different graph databases, and it is quite
normal to load the nodes or entities first, and then add the
relationships or links second.
Maybe one can add all the bugs first, without any relationships/links.
Then build up a map of GitHub IDs vs Bugzilla IDs, and use that
map to add all the relationships afterwards.
Or, alternatively, use your current method, and then scan over
everything at the end to add in any relationships that got missed,
using the above approach.
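
The "nodes first, links second" idea can be sketched in a few lines. This is a purely in-memory illustration under assumed data shapes (the `bugs` dictionary layout and the `two_pass_import` name are hypothetical); it says nothing about whether a hosted tracker's API makes the second pass cheap or even possible.

```python
def two_pass_import(bugs):
    """bugs: dict of bugzilla_id -> {"summary": str, "links": [bugzilla_id, ...]}"""
    issues = {}   # stand-in for the target tracker: new_id -> record
    id_map = {}   # bugzilla_id -> new_id

    # Pass 1: create every issue without links and record the ID it was given.
    for new_id, bz_id in enumerate(sorted(bugs), start=1):
        issues[new_id] = {"summary": bugs[bz_id]["summary"], "links": []}
        id_map[bz_id] = new_id

    # Pass 2: every target now exists, so each link can be resolved via the map.
    for bz_id, bug in bugs.items():
        issues[id_map[bz_id]]["links"] = [
            id_map[target] for target in bug["links"] if target in id_map
        ]
    return issues, id_map

bugs = {
    10: {"summary": "A", "links": [20]},
    20: {"summary": "B", "links": [10]},
}
issues, id_map = two_pass_import(bugs)
print(issues[1]["links"], issues[2]["links"])  # [2] [1]
```

Because no link is written until every issue exists, forward and circular references are handled the same way.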

Kind Regards

James

Hello,

> Maybe one can add all the bugs first, without any relationships/links.
> Then build up a map of GitHub IDs vs Bugzilla IDs, and use that
> map to add all the relationships afterwards.
> Or, alternatively, use your current method, and then scan over
> everything at the end to add in any relationships that got missed,
> using the above approach.

There are a few things that you're missing here, unfortunately:

0. Everything is API rate-limited.
1. Every change triggers notifications.
2. Every change updates the "last modified" timestamp.

And no, GitHub is not a graph database that you can use however you like.

Though, patches are always welcome :wink:

Please, test the above claim this week, on a blank repo. Let’s actually find out whether it works, instead of relying on “Surely…”.

At this point I’m offering my own technical assistance, just to get the thing done and stop getting these emails every day. Send me your Bugzilla export script; I’ll test it out this week on a blank repo, with the goal of mirroring a 100-bug subset of the LLVM Bugzilla publicly visible in https://github.com/Quuxplusone/LLVMBugzillaTest/ by EOW.

(Credentials: I was SRE at Mixpanel for ~3 years and performed several 100GB cluster migrations with zero downtime. I have seen the “We’ll do it live!” attitude be successful, but I have also seen it fail spectacularly. The alternative “plan it carefully, write down your deploy plan, test what can be tested ahead of time, do a practice run, then do it live” approach usually works better. At this point it looks like Anton’s initial pass at “We’ll do it live!” clearly was not successful, in the sense that if it were successful the repo would have been migrated circa Thanksgiving weekend. So this is the giant-honking-red-flashing-light alert that it’s time to shift from “We’ll do it live!” to “Let’s make a deploy plan.”)

Respectfully, yet frustrated with the never-ending email thread,
–Arthur

Dear Arthur,

> Respectfully, yet frustrated with the never-ending email thread,

I understand your frustration, and please rest assured that my own
frustration is certainly no less than yours. I'm also very exhausted
at the moment, as things are beyond my control. The constant
pushing from this and similar emails does not help in resolving the
situation. I have to note that your accusations in the "we'll do
it live" section are not accurate in many respects: the fact that you
have not seen the outcome of test imports does not mean that there
were none. I would go further: it means that they were successful,
as nothing triggered excessive notifications (we made such a mistake
once; you can even find the reports of this in the MLs). For your
information: the last "dry run import", which gated the migration, was
the 14th full (52k issues) attempt. In all previous runs issues were
found, and either a workaround was prepared or they were reported to GitHub.

Now let's proceed from the emotions to the real things. I do
appreciate your willingness to provide the help. Please see my notes
below.

> I'll test it out this week on a blank repo, with the goal of mirroring a 100-bug subset of the LLVM Bugzilla publicly visible in https://github.com/Quuxplusone/LLVMBugzillaTest/ by EOW.

First of all, a proof of concept should certainly use 10k issues on a
non-blank repo. The repo should already have closed issues / pull requests
in order to represent the real llvm-project repo.

Now down to details:
0. I would suggest that you not use the GitHub API. YMMV, but in our
experience the API is rate limited, and many things are outside your
control, including:
  - ids
  - timestamps
  - notifications
1. The real migration starts from a local GitLab instance, into which you
import all Bugzilla issues. You can certainly skip this step in your
own experiments and proceed directly to step 2, but doing it will allow
you to check the outcome of the import. The script we used can be
found at https://github.com/llvm/bugzilla2gitlab/tree/llvm

2. Then you need to prepare a dump that can be consumed by the GitHub
Enterprise Migration API:
https://docs.github.com/en/rest/reference/migrations
We are using gitlab-to-github scripts provided by GitHub. I'm not sure
I can share them, as they are not public. I will ask GH support
engineers on Monday and get back to you.

3. After the dump is prepared, you need to upload it via the GitHub
Enterprise Migration API. Note that import is only possible into an empty
repo (the repo is essentially created by the import). If the import
fails, you need to ask GitHub engineers whether the error is real or
whether it can be ignored. If the error is real, then you will likely
need to restart from scratch: it is possible to resume, but practice
shows that this might create duplicate comments.

4. After the import has finished, check the results: the number of objects
(issues, comments, attachments) that were imported. If any
objects failed to import, then you need to figure out which ones and
what to do. Your options are to ignore them or to restart the import.
Here is the checklist I'm using for the content:
https://docs.google.com/document/d/1G6DZ6AxzSaOlrtTxoxtqYKnD4Myv40QfKK4wj54y8ms/edit

5. At this point one should have something similar to
https://github.com/llvm/llvm-bugzilla-archive

6. In order to transfer issues from the archive to the live repo there
are two options:
  - Use the rate-limited GitHub API
  - Ask GitHub folks
The first variant triggers notifications to everyone mentioned in,
assigned to, or commenting on the issue. There is no way to silence these
notifications.
In our case we are relying on GitHub support engineers, who do
this migration step for us. There is no API, no script, nothing that
is within our control. We did several test migrations from the dry-run
repo to another repo (and this is how we found all the bugs wrt issue
transfer in the past). As I already said, circular reference
rewriting was not included in my original checklist: I expected
that this feature "just works", and the problem was only spotted later.

Hope this helps. Should you have more questions, I will certainly be
happy to help you. I'm interested in finishing this 2+ year project
more than anyone else.

Is “cancel the migration permanently” an option? From what I can tell,
GitHub doesn’t seem like a very reliable choice compared to Bugzilla.
Could someone else (perhaps Mozilla?) help host the Bugzilla instance?

Hello

As far as I know, so far no one has volunteered to fix our existing
Bugzilla instance (e.g. integrate it with some other authentication
providers). Not sure about alternatives.

In the recent threads there were opinions that permanent abandonment
is not an option, so...

Trying to think of a solution that might help while we wait for github.

  1. Can’t we calculate in advance the eventual ID of each issue? Can’t we determine that bugzilla PR12345 = GH12345 + (some offset caused by previous issues in GH)? (Assuming we always import in the same order, oldest first, one at a time.)
  2. Could you build up this mapping a priori?
  3. Could you then spin through the Bugzilla issues prior to migration (assuming they are in some form you can manipulate: JSON, XML, TXT, etc.), programmatically changing the links to what they ultimately will be before doing the migration?
  4. Then import that into GitHub (the new bugzilla-archive)?
  5. Then copy those issues from one repo to another?
  6. Assuming all ducks are lined up correctly, wouldn’t these broken links then point to the correct issues?

Is that something that might be worth a try? Or do you do this already and GH is messing it up?

MyDeveloperDay.

Hello

> 1) Can't we calculate in advance the eventual ID of each issue? Can't we determine that bugzilla PR12345 = GH12345 + (some offset caused by previous issues in GH)? (Assuming we always import in the same order, oldest first, one at a time.)

The thing is... it's not a simple "+" here. Things are a bit more
complex, as we have gaps in Bugzilla IDs as well: some
issues were removed due to spam or GDPR requests. So we'd need to
track these, but this is doable, provided that the ID mapping
done by GitHub is predictable. I... cannot be 100% sure, as the
final transfer is done not by me but by GitHub support engineers
(in order not to trigger notifications on all 52k+ issues).
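
As a rough illustration of why a fixed offset is not enough, the sketch below predicts issue numbers from the list of surviving Bugzilla IDs. It rests on an assumption the thread does not confirm: that issues are transferred oldest first and receive consecutive numbers. The function name and the sample IDs are hypothetical.

```python
def predict_issue_numbers(bugzilla_ids, first_free_number):
    """Map each surviving Bugzilla ID to the number it would receive, assuming
    oldest-first transfer and consecutive numbering starting at
    `first_free_number`."""
    return {
        bz_id: first_free_number + position
        for position, bz_id in enumerate(sorted(bugzilla_ids))
    }

# Bugs 3, 5 and 6 were deleted (e.g. spam), so the ID sequence has gaps.
surviving = [1, 2, 4, 7]
mapping = predict_issue_numbers(surviving, first_free_number=100)
print(mapping)  # {1: 100, 2: 101, 4: 102, 7: 103}
```

With a naive offset of 99, bug 4 would be expected at 103, while under these assumptions it lands at 102; every gap shifts all later numbers by one.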

> Is that something that might be worth a try? Or do you do this already and GH is messing it up?

The latter, essentially. The references were properly built, but
towards the original archive repo. It is assumed that GH will rewrite
them during the transfer. This is standard functionality, and I was
assured that it works properly, that it was tested, deployed, and has
worked for many years, etc. etc. etc. Now we are caught halfway, as we
have already migrated ~13k issues to the main LLVM repo. As I said, I
spotted the problem by chance: checking circular link rewriting was not
on my checklist. When I checked link rewriting during the test
migration I essentially checked "one way" and everything was ok (one
needs to migrate both sides of the reference in order to see the
problem, and apparently there were no such pairs among the 100 test
issues we migrated).

I'm meeting with GitHub folks today (morning Pacific Time) to discuss
the options. One option is to proceed with the transfer and rewrite
the stale links afterwards. But I'm wondering whether there is a way to fix
the issue on the GH side, and what the ETA would be.
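
For the "rewrite the stale links afterwards" option, the text transformation itself is small. The sketch below is only that transformation, under two assumptions: that stale references appear in the text as `llvm/llvm-bugzilla-archive#N`, and that a complete mapping from archive issue numbers to new numbers is available. It does not cover how the edited text would be written back without triggering notifications or timestamp updates.

```python
import re

def rewrite_stale_links(text, archive_to_new,
                        archive_repo="llvm/llvm-bugzilla-archive"):
    """Replace `archive_repo#N` with `#M` wherever the mapping knows N."""
    pattern = re.compile(rf"{re.escape(archive_repo)}#(\d+)")

    def replace(match):
        new_number = archive_to_new.get(int(match.group(1)))
        # Leave the reference untouched if the target was not transferred.
        return f"#{new_number}" if new_number is not None else match.group(0)

    return pattern.sub(replace, text)

mapping = {2: 102}  # hypothetical: archive issue 2 became issue 102
body = "Blocks llvm/llvm-bugzilla-archive#2 and llvm/llvm-bugzilla-archive#9"
print(rewrite_stale_links(body, mapping))
# Blocks #102 and llvm/llvm-bugzilla-archive#9
```

References to issues missing from the mapping are deliberately kept as they are, so a partial mapping cannot introduce wrong links.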