bit code file incompatibility due to debug info changes

I just filed this as pr17982, but I'd like to call attention to it, since it seems like something we should try to fix in the 3.4 release.

Here's the description from the PR:

Sometime over the last year, and I'm fairly certain it has been since we released 3.3, the debug info metadata format has changed in incompatible ways. If you try to use an older bit code file with the current compiler, it is not uncommon for the compiler to crash. This goes against our goals for keeping backward compatibility for bit code files and is especially problematic for LTO since you can't reliably mix objects from different versions of clang.

At a minimum, it seems like we need a version number in the debug info metadata so we can detect this situation and avoid crashing.

I don't have a test case but we should be able to come up with one if needed.

I agree that this really needs to be fixed before 3.4...

Sometime over the last year, and I'm fairly certain it has been since we
released 3.3, the debug info metadata format has changed in incompatible
ways. If you try to use an older bit code file with the current compiler,
it is not uncommon for the compiler to crash. This goes against our goals
for keeping backward compatibility for bit code files and is especially
problematic for LTO since you can't reliably mix objects from different
versions of clang.

I don't think that you need LTO for this to be a problem. The bitcode
really has to be readable by future versions of LLVM.

At a minimum, it seems like we need a version number in the debug info
metadata so we can detect this situation and avoid crashing.

Or to put it in the terms of the IR: we need to autoupgrade the debug info
metadata just like we do intrinsics. With debug info this might (at the
worst) involve dropping old metadata.

I don't have a test case but we should be able to come up with one if
needed.

I think doing so would be useful.

Hi Bob,

I just filed this as pr17982, but I'd like to call attention to it, since it seems like something we should try to fix in the 3.4 release.

Here's the description from the PR:

Sometime over the last year, and I'm fairly certain it has been since we released 3.3, the debug info metadata format has changed in incompatible ways. If you try to use an older bit code file with the current compiler, it is not uncommon for the compiler to crash. This goes against our goals for keeping backward compatibility for bit code files and is especially problematic for LTO since you can't reliably mix objects from different versions of clang.

At a minimum, it seems like we need a version number in the debug info metadata so we can detect this situation and avoid crashing.

This is not going to be fixed.

Metadata formats are not guaranteed to be stable and this has been an
explicit design factor. The versioning support was removed many months
ago and the debug info format has been moving quickly. This has been
stated many times on list, in discussion, and with patches.

http://llvm.org/viewvc/llvm-project?rev=176838&view=rev

is the revision where the versioning, that didn't work anyhow, was
finally removed back in March.

Once the verifier works better it can discard debug metadata that
fails verification if we'd like to go that route.

-eric

At a minimum, it seems like we need a version number in the debug info
metadata so we can detect this situation and avoid crashing.

Or to put it in the terms of the IR: we need to autoupgrade the debug info
metadata just like we do intrinsics. With debug info this might (at the
worst) involve dropping old metadata.

The verifier is probably the best route here. I think if we fail debug
info verification we can just strip it and continue.

-eric

I'm slightly worried about relying on verification based "autoupgrade"
(where by upgrade I mean discard old cruft since it doesn't impact
correctness) due to compile time costs, but if the verification is fast
enough to be "always on", then cool.

Agreed, it's not really that fast either. The problem with versions is
that it makes small changes that only require a bit of modification to
the testcase to be pretty heinous upgrade paths and a lot of work -
you can see that from the work any of us have had to do when changing
the format. We'd like to keep the testcases and upgrade those as well
because otherwise we're just dropping testcases like mad constantly.
It's a lot of work, but the format has been changing frequently enough
that we'd be on version 1000 or something by now and it wouldn't
really accomplish much. The verifier seems to be the best route for
that. It isn't too fast, but then reading in bitcode files from disk
isn't particularly speedy either.

-eric

It depends a bit, also, on what kind of guarantees we need to offer. If the
guarantee when reading IR from disk is "will not crash" then there's
nothing for it but to run full debug info verification.

On the other hand, if we can assume that some specific metadata implies the
correctness of some other metadata, then all we need to do is check a known
debug info version number. If it's the current number, do nothing (assume
the rest of the debug info is up to date and well formed), otherwise do all
the work to find the debug info and dump it (no need to verify it - we just
have to do the work to find it, so we don't go following bad links later on
- that should be as easy as dropping the llvm.dbg.cu named node and
removing all debug intrinsics and the instruction metadata line
references). But this latter scheme isn't robust against arbitrary metadata
(that could, coincidentally, have the right version number and arbitrary
metadata that breaks all our debug info metadata assumptions)

If the latter is sufficient for everyone's needs/principles, great.

- David

This makes sense to me, but I see Eric's fundamental concern with upgrading
test cases.

One other possible idea: we could artificially force coarseness on metadata
while things are still iterating rapidly by just having a version number
that rotates every "release". (Even non-open-source-releases could be
handled by letting anyone, any time they need to rev it, update what the
current version is and update every test case to reference that version.)
If the version is nicely factored, it should at least be an obvious diff if
still huge.

Eric, why did you move the PR back to resolved/won’t-fix again after I reopened it? We really need to do something here. I’m open to discussing any of the possible solutions that have been offered but just letting the compiler crash does not seem acceptable.

For the same reason that you asked me in two threads: because it's
splitting the discussion up and making it more complicated. Let's keep
it to one thread and then open a bug at the end when/if we have
something that we can take action on.

-eric

That could work, it'd be a simple script to say "update all of the
debug info testcases to move to the new version" between versions but
it probably wouldn't be too bad. Part of the branch cut.

-eric

OK. That is fine with me. Saying "WONTFIX" seems to be a bit different than "let's carry on the discussion on the mailing list until we figure out what to do", though. I had intended to use that PR to track the fact that we have an issue here that I think needs to be addressed before the 3.4 release. I don't care where the discussion happens.

In the bug report, Rafael commented that the compiler should never crash, even if the debug info metadata is old. I certainly agree, and we should fix that. I don’t think that really solves this problem, at least not by itself. For one thing, if the metadata format is really changing so rapidly, how do we ensure that we have fixed all the problems and that we don’t regress? Is there a plan for testing the behavior with a wide variety of old debug info?

What is the user experience in the case where the debug info is out of date? Not crashing is a pretty low bar. Might we end up emitting DWARF that is incorrect? If we’re just going to drop the old debug info, I would expect a warning diagnostic to report that.

In the bug report, Rafael commented that the compiler should never crash, even if the debug info metadata is old. I certainly agree, and we should fix that. I don’t think that really solves this problem, at least not by itself. For one thing, if the metadata format is really changing so rapidly, how do we ensure that we have fixed all the problems and that we don’t regress? Is there a plan for testing the behavior with a wide variety of old debug info?

Not a bit. The idea behind the verifier is that if it sees something
unexpected it can do something about it whether it's emitting a
diagnostic (though we don't emit diagnostics from the back end), or
just removing the metadata in memory ala strip (not on disk - that
really would be unfriendly).

What is the user experience in the case where the debug info is out of date? Not crashing is a pretty low bar. Might we end up emitting DWARF that is incorrect? If we’re just going to drop the old debug info, I would expect a warning diagnostic to report that.

A warning is about the best you can expect. The metadata is basically
locked with assumptions on fields throughout the code. There's really
no way to continue reliably.

-eric

In the bug report, Rafael commented that the compiler should never crash, even if the debug info metadata is old. I certainly agree, and we should fix that. I don’t think that really solves this problem, at least not by itself. For one thing, if the metadata format is really changing so rapidly, how do we ensure that we have fixed all the problems and that we don’t regress? Is there a plan for testing the behavior with a wide variety of old debug info?

Not a bit. The idea behind the verifier is that if it sees something
unexpected it can do something about it whether it's emitting a
diagnostic (though we don't emit diagnostics from the back end), or
just removing the metadata in memory ala strip (not on disk - that
really would be unfriendly).

I understand the idea behind the verifier. I'm asking about testing. If we're relying on the verifier for correctness, then we need to test that on a regular basis. I know we have a good number of tests with valid metadata. How do you propose that we test to make sure the verifier catches all invalid metadata? This seems like a weak point of the proposal to rely on the verifier rather than a version number.

What is the user experience in the case where the debug info is out of date? Not crashing is a pretty low bar. Might we end up emitting DWARF that is incorrect? If we’re just going to drop the old debug info, I would expect a warning diagnostic to report that.

A warning is about the best you can expect. The metadata is basically
locked with assumptions on fields throughout the code. There's really
no way to continue reliably.

-eric

This really is a pretty horrible user experience, even with a warning.

I'd just like to comment that we're actually running into this problem at Apple. It is not an idle concern. We had a static library built with LTO using an older version of clang. When linking that with a recent version of clang, the compiler crashed. Obviously we need to fix the crash. Even if we don't crash, emitting incorrect debug info or silently removing the debug info is only somewhat better. Stripping the debug info and providing a warning would be a reasonable short-term workaround. I don't think we can get away with doing anything less than that. For the longer term, perhaps we ought to revisit the decision to not auto-upgrade debug info metadata, at least once the format stabilizes. I had not realized the implications of that for LTO at the time it was originally discussed.

In the bug report, Rafael commented that the compiler should never crash, even if the debug info metadata is old. I certainly agree, and we should fix that. I don’t think that really solves this problem, at least not by itself. For one thing, if the metadata format is really changing so rapidly, how do we ensure that we have fixed all the problems and that we don’t regress? Is there a plan for testing the behavior with a wide variety of old debug info?

Not a bit. The idea behind the verifier is that if it sees something
unexpected it can do something about it whether it's emitting a
diagnostic (though we don't emit diagnostics from the back end), or
just removing the metadata in memory ala strip (not on disk - that
really would be unfriendly).

I understand the idea behind the verifier. I'm asking about testing. If we're relying on the verifier for correctness, then we need to test that on a regular basis. I know we have a good number of tests with valid metadata. How do you propose that we test to make sure the verifier catches all invalid metadata? This seems like a weak point of the proposal to rely on the verifier rather than a version number.

Every time we change it check it in as an additional "don't die" test?
Right now there are a lot of asserts, moving those to being failures
would probably work. How would you like to version it? Other than at
bitcode load time I don't want to do anything with the version number
in the debug info. The previous versioning scheme became quickly
unmaintainable and I don't want to go back to that either.

What is the user experience in the case where the debug info is out of date? Not crashing is a pretty low bar. Might we end up emitting DWARF that is incorrect? If we’re just going to drop the old debug info, I would expect a warning diagnostic to report that.

A warning is about the best you can expect. The metadata is basically
locked with assumptions on fields throughout the code. There's really
no way to continue reliably.

-eric

This really is a pretty horrible user experience, even with a warning.

I'd just like to comment that we're actually running into this problem at Apple. It is not an idle concern. We had a static library built with LTO using an older version of clang. When linking that with a recent version of clang, the compiler crashed. Obviously we need to fix the crash. Even if we don't crash, emitting incorrect debug info or silently removing the debug info is only somewhat better. Stripping the debug info and providing a warning would be a reasonable short-term workaround. I don't think we can get away with doing anything less than that. For the longer term, perhaps we ought to revisit the decision to not auto-upgrade debug info metadata, at least once the format stabilizes. I had not realized the implications of that for LTO at the time it was originally discussed.

I don't think we're really disagreeing with each other here. It's not
a good experience and what we've said previously is that you really
can't rely upon the metadata being compatible between versions -
including and especially LTO. In the future when we're not trying to
change the various formats and needing to expand things so that we can
cover more debug information then I think finally locking down the
format will be a good thing - that time just isn't now. If you have
some specific ideas in mind for this I'm very glad to discuss them
with you.

-eric

I don't have a strong opinion on how we fix it, but I think it's really important to do something and get it done for the 3.4 release.

Here's what I would do:

- implement Quentin's previous proposal for back-end diagnostics (and add a diagnostic handler to Apple's linker), so that we can warn about out-of-date debug info

- add a version number to the debug info metadata

- have the bit code reader strip out any debug info that is missing the version number or has a different version than the current value

I don't have a strong opinion on how we fix it, but I think it's really
important to do something and get it done for the 3.4 release.

I agree with everything here being a reasonable step for the 3.4 release
except:

- implement Quentin's previous proposal for back-end diagnostics (and add
a diagnostic handler to Apple's linker), so that we can warn about
out-of-date debug info

I don't think this is important for the release. I don't think open source
people have any expectation of stability here, and while it is a
nice-to-have, it doesn't seem necessary, and seems clearly like a
significant extension that shouldn't be cherry-picked late into the release.

OK, that’s fair. As long as we can get it on trunk soon, we can cherry-pick it to an internal release branch.

Two questions:

  1. Can we reopen the bugzilla PR that I filed and use that to cover the additional of a version number and code to ignore the debug info when the version number does not match? Eric has already closed it twice, so I’m asking this time instead of just reopening it.

  2. Who is going to do this work?