cast<X> is broken, implications, and proposal to address

It appears that for roughly the last three months, cast<X> has not been asserting that X actually isa<X>. This is very unfortunate as the entire, explicitly documented in our developer guide, purpose of cast<Ty>(X) is to provide a checked cast.

I have a patch which restores the asserts into cast, but as you might expect, we have some breakage in tree. This email lays out a proposal for how to handle this.

Current Status

As of now, the patch linked above does not build cleanly on ToT. We get template instantiation errors. Getting us into a buildable state is easy, I plan to land the require changes shortly after posting this.

Unfortunately, this reveals that we have significant breakage in tree where changes have landed which falsely cast to unrelated types. These are likely all programming errors which developers and reviewers assumed would be caught by cast. (This is how I noticed this issue myself.)

Here are the current failures in ninja check for core LLVM and clang.

Failed Tests (23):
Clang :: Modules/prune.m ā† this is a flaky test, ignore it
LLVM :: Analysis/BasicAA/intrinsics.ll
LLVM :: Analysis/TypeBasedAliasAnalysis/intrinsics.ll
LLVM :: CodeGen/AMDGPU/rewrite-out-arguments-address-space.ll
LLVM :: CodeGen/AMDGPU/rewrite-out-arguments.ll
LLVM :: Transforms/GVN/masked-load-store-vn-crash.ll
LLVM :: Transforms/GVN/masked-load-store.ll
LLVM :: Transforms/GuardWidening/basic.ll
LLVM :: Transforms/GuardWidening/basic_widenable_condition_guards.ll
LLVM :: Transforms/GuardWidening/loop-schedule.ll
LLVM :: Transforms/GuardWidening/range-check-merging.ll
LLVM-Unit :: Transforms/Vectorize/./VectorizeTests/26/38
LLVM-Unit :: Transforms/Vectorize/./VectorizeTests/27/38
LLVM-Unit :: Transforms/Vectorize/./VectorizeTests/28/38
LLVM-Unit :: Transforms/Vectorize/./VectorizeTests/29/38
LLVM-Unit :: Transforms/Vectorize/./VectorizeTests/30/38
LLVM-Unit :: Transforms/Vectorize/./VectorizeTests/31/38
LLVM-Unit :: Transforms/Vectorize/./VectorizeTests/32/38
LLVM-Unit :: Transforms/Vectorize/./VectorizeTests/33/38
LLVM-Unit :: Transforms/Vectorize/./VectorizeTests/34/38
LLVM-Unit :: Transforms/Vectorize/./VectorizeTests/35/38
LLVM-Unit :: Transforms/Vectorize/./VectorizeTests/36/38
LLVM-Unit :: Transforms/Vectorize/./VectorizeTests/37/38

Proposed Approach

I propose, and plan to move forward with unless objections are raised, the following steps.

First, aggressively revert changes or fix forward if thatā€™s easiest to resolve each of the above failures. (I could greatly use help with this, see below.)

Second, land the above linked patch. However, unlikely your typical patch, I plan to explicitly ask in the commit message that it not be reverted.

Once his lands, I expect to find further cast violations in less common configurations. Critically, I do not plan to revert when this happens, and instead will push the owners of those configurations to fix forward. We will be knowingly trading the stability of core LLVM and our ability to be confident in ongoing development against build failures for lessor used sub-projects. I think this is the right tradeoff in this case.

Timeline wise, I hope to accomplish steps one and two above in the next couple of days.

How can I help?

We urgently need folks knowledgeable of the respective areas to investigate each failure, fix forward or revert bad changes.

If you have a configuration which is anything other than default LLVM + clang that you care about, please apply the above patch and run tests. You will probably find problems which need to be fixed. Please either fix them yourself, or reply to this thread with your findings.

FAQ

Q: How did this happen?

This was done accidentally in āš™ D123901 [LLVM][Casting.h] Update dyn_cast machinery to provide more control over how the casting is performed., which was a fairly major reworking of the casting infrastructure to support new MLIR inspired use cases.

Q: The change above broke my downstream project, will you revert?

No.

Q: The change above broke this LLVM subproject, will you revert?

Probably not. This will be case by case, and may depend on the severity of the breakage. However, as noted in the writeup above we have a tradeoff here between leaving core LLVM exposed to further bad changes landing in tree, and the stability of each sub-project. We strongly encourage each sub-project to fix forward urgently.

Philip

5 Likes

Plan seems fine. I think that, per normal development policy, we should ensure buildbots donā€™t stay red for a significant amount of time. Iā€™m okay with the strong preference to fix forward, though.

If youā€™re asking people to look into failures, bisected commits would be helpful.

Thank you for catching and bringing this up! There was meant to be an assert in the default CastInfo implementation so that clients can avoid the cost of ā€˜isaā€™ when appropriate to do so, but clearly that didnā€™t make it in.

+1 for landing the patch to re add the asserts! I would generally prefer fixing forward as well, and Iā€™ll help out where possible.

Iā€™m surprised that there was no testcase that caught this. Considering the intent of cast<Ty>(X) is to provide a checked cast, Iā€™d expect a test that verifies the asserts. (Or something)

If possible, we should probably add a test, and maybe defensively add similar tests to other types that might drop important asserts in the event of a refactor/rework.

1 Like

@efriedma-quic I hope to not leave bots red, but if we end up in a situation where a bot is failing, and I canā€™t easily reproduce that failure, I do plan to leave that to the bot owner and respective community. So, this goes a bit beyond just preferring to fix forward.

On the bisection point, unfortunately, this is a bit hard to bisect. Since a) the asserts werenā€™t in tree, and b) we have multiple triggering commits, the scripting to automate this would be quite involved. If someone wants to volunteer for this, I welcome it, but I suspect just doing this manually will be much faster.

@barrelshifter I completely agree on the testing point. Not my immediate focus, but if no one else gets to it first, I will probably come back to this after all of the other fallout is resolved.

1 Like

Status Update

As of now, ToT appears to build cleanly (make check) with cast asserts added. Iā€™m about to do a clang build with the fixed assert enabled stage1, but if anyone has other sub-projects they want to build, now would be a good time. I have only built and tested core LLVM and clang.

Assuming no further issues are found before, I plan to commit the asserts first thing tomorrow morning. My hope is to catch the builders at a relatively quiet point so we get quick feedback.

Getting here was faster than expected.

The fixes landed so far are:
4f5648adc588e23e2bf52e189030c792a6f1586a
781de11f42a863f0328cdc6889f6fd48787ddd72
8a0cd23326050a7b999efa2cf455880772ce9211
89c4b29e8d35ec352019d828e546bea3850403df
b0c9a71be017594b9376f1bdaa01a8f7e18f83ef

Several of these are potentially nasty miscompies; if anyone has been chasing something weird (e.g. memory corruption), Iā€™d advise a rebase and see if the problem is now fixed.

3 Likes

Itā€™s a bit difficult:

  • testing for asserts behavior is a bit frowned upon (or used to be, I had hard time getting through code reviews in the past)
  • Weā€™d need some targeted C++ unit-test to do this (I think) and this isnā€™t in LLVM habits to fully covered its internal APIs through C++ unit-tests.

That said Iā€™m in favor of more tests in general!

1 Like

llvm::Error is tested to assert-fail (using gunit death tests) to ensure its programmer checks are working - I think the cast issue could/should be tested in the same way

(testing ā€œwhat happens if this invariant is brokenā€ behavior for utilities is, I think, incorrect (their behavior is undefined) - but for utilities designed to catch programmer errors, they should be tested in this way to ensure those guardrails are functioning correctly)

4 Likes

(testing ā€œwhat happens if this invariant is brokenā€ behavior for utilities is, I think, incorrect (their behavior is undefined) - but for utilities designed to catch programmer errors, they should be tested in this way to ensure those guardrails are functioning correctly)

Yeah I think this is the distinct difference here between testing asserts in general.

The assert in cast<X> is documented expected behaviour, so IMO it makes sense to test it as basically a requirement for cast<X>:

The cast<> operator is a ā€œchecked castā€ operation. It converts a pointer or reference from a base class to a derived class, causing an assertion failure if it is not really an instance of the right type

Status Update

As of about 7:30am PT this morning (roughly 3 hours ago as of writing), asserts have been reintroduced in change f0d2a55d3aa68827ad708a16650bb0b341ca65bb.

I have seen no build failures resulting as of yet. If anyone sees breakage from this, please either respond here or use the commit thread to coordinate.

6 Likes

Thanks a lot @preames !

Added some death test coverage for this in https://github.com/llvm/llvm-project/commit/23d6f31a90a1631e7202a6ae6aac19c8245a8da1

(anyone happen to know if the assertion could be sunk lower so itā€™s not repeated in 4 places? Would reduce the testing too if it was all in one place)

1 Like

@dblaikie Thanks for adding the tests! As for your question, I would encourage us not to move the asserts. The assert in this case is part of the semantic interface of the routine, and having the error produced at an easy to understand point (i.e. not buried in generic implementation logic) seems useful from a user interface perspective.