As documented in the programmers manual, wherever possible we want to use recoverable errors that (if necessary) are presented to the user in an appropriate way. However, we don’t always have the infrastructure available to do this and so there are certain places in the backend where there’s not much option but to report_fatal_error. By default this prints a backtrace and a suggestion to file a bug report, behaviour which is controlled by the GenCrashDialog parameter. In many cases, this isn’t appropriate and so it is better to set it to false (i.e. the fact the erroneous condition was encountered isn’t a bug - it’s just a situation where compilation can’t continue).
In D150674. I tried to convert various report_fatal_error callsitse to disable the crash dialog where it wasn’t appropriate, which triggered a discussion about whether the default should change (with @MaskRay and @arsenm in favour of changing that default). I see the advantages of changing the default, but would also be concerned about the impact and churn in doing so. I’m raising this RFC as requested but honestly I’m fairly on the fence (it feels like either way, it’s easy to produce a backtrace when it’s not appropriate or fail to do so when you should have), so I’d look for others with stronger views on the matter to make their case more strongly.
There are two cases that report_fatal_error is used:
Error handling in a backend that’s not useful/practical to handle in a better way (handling hand-written MIR that breaks invariants or something). These are (usually) not a bug and shouldn’t emit a crash report
assert-even-on-no-asserts behaviour - ie, panic and crash.
We should generally avoid adding cases like (2) anyway, so changing the default probably makes sense. The problem with that is of course that any existing instances of (2) might start looking like (1) - should we be concerned about that?
edit: cases of (1) aren’t necessarily all in backends
I may be out of date here, but my understanding of the behavior is that the stacktrace isn’t tied to report_fatal_error at all - report_fatal_error simply runs the fatal_error_handler, so it is up to the tool to decide whether to install one or not.
The issue here is probably confusion around when and who installs the fatal error handler - my understanding is that InitLLVM installs a sigabort handler which prints a stacktrace, but that should only be used by LLVM compiler-hacker tools (and perhaps llvm.org tools like clang), not by libraries or 3rd party tools.
I’m failing to understand when an ICE is “desirable” in a way that isn’t a bug and shouldn’t be treated as such?
(Edit: of course the exact message to the user should be tweaked based on the product integration as Chris mentions…)
Similarly, in the case where there’s no handler installed, the gen_crash_diag parameter decides whether we abort() or exit(1). Currently report_fatal_error defaults to the crash diag behaviour, but you can explicitly ask for gen_crash_diag = false.
The question here is whether we should change the default, since in many cases a backtrace isn’t helpful when report_fatal_error is used.
I think if some using an end-user tool like clang sees an error message that doesn’t provide enough context to act on it (no location information), we should consider it a bug. We want the bug report so we can improve the error handling.
Similarly, if third-party software using LLVM as a library triggers a fatal error, they probably want their bug report.
In almost all cases, report_fatal_error is not producing error messages with sufficient context, so I think asking for a bug report is the right default.
@clattner: I think @bogner’s response summarises the situation correctly.
@mehdi_amini and @efriedma-quic: Maybe we’re talking past each other here, but I think almost every instance of report_fatal_error is unfortunate (to the point I disagree that it makes sense to encourage a bug report in each instance as @efriedma-quic suggests - if we want to know cases where error handling is substandard, we can just grep for report_fatal_error). But due to historic design decisions and the interfaces available in LLVM today, in some cases it’s just not feasible to do something more meaningful than flagging an illegal situation occurred and exiting. It would be great to refactor and improve LLVM so that isn’t the case, but I think we also need to recognise the reality of where we are today.
Maybe you’re looking only at the LLVM backend, but I would hope that none of the instance of report_fatal_error in MLIR for example is subject to be trigger on user-input (they are meant to fire to catch issue for the people “assembling the compiler” and not for people “using the compiler”), and if they trigger: it is a bug!
We’re talking about the default here I believe, and whenever I’m using report_fatal_error, I really mean that this “should not happen” (it’s stronger than assertions…) and to me this is just the better default behavior: this cannot be a blessed way to do error handling, regardless of the history in LLVM backends…
If you work on some piece of LLVM that should have better error reporting, I can only encourage you to refactor it as such, or be explicit about the non-standard intent.
Thanks for that perspective Medhi - I think perhaps it hasn’t been considered much that in newer part of the LLVM monorepo, the current default may make more sense.
Honestly I don’t strongly disagree here - as I said in the first post I’m starting the RFC by request, but others involved in the relevant review threads had stronger views on the merits of changing the default.
@maskray@arsenm - does the fact that the current default makes more sense for e.g. MLIR change your view at all?
It’s something that would obviously have benefit to LLVM, but reworking APIs affecting all backends and multiple frontend tools in order to try to bubble errors up on a meaningful way can be a non-trivial project. Maybe we should consider listing it on the LLVM open projects page - though I’m not sure how accessible it would be to newcomers.
For what it’s worth, I agree that the current report_fatal_error default is quite often wrong at least in LLVM – basically, whenever it is possible to trigger the error condition via opt or llc, it shouldn’t be using GenCrashDialog=true. People can and will open issues for these, even if it’s obviously their own fault. I have repeatedly “fixed” such issues by flipping to GenCrashDialog=false.
At the same time, I’m not sure whether changing the default makes sense, as there are also many uses of report_fatal_error that do want to report a crash. Those are the cases where report_fatal_error is basically being used as an always-on assertion.
I would probably go for splitting report_fatal_error into two separate functions, such that an explicit choice is made on which to use, rather than going with whatever the default behavior is. For example report_invalid_input/report_invalid (GenCrashDialog=false) and report_bug (GenCrashDialog=true). report_fatal_error could be retained for now, but discouraged.
Two different functions would work here. We could also consider just getting rid of the default and making every caller be explicit about it if naming is hard.
+1 to the idea of two functions. For the areas of LLVM I’ve maintained in the past and continue to review on (primarily the LLVM binutils and their immediate libraries like Object), I’ve rarely, if ever, seen a report_fatal_error that wasn’t indicating a bug if it fired. I was going to chime in on this thread suggesting changing report_fatal_error to not have a default at all, but two functions is probably clearer.
Not only do I think the default should be changed, but that it should not even be an option. The behavior before ~3 years ago was correct and just exited the process and printed the message with no backtrace. I only expect to see a backtrace in an unstructured crash or assert. A recognized error condition is an exit with some information, even if you would consider hitting the error to be a “bug”
I agree with splitting report_fatal_error and that the GenCrashDiaglot default should not be changed. A new function can be added to pass GenCrashDiaglot=false to report_fatal_error.
I see this as creating a lot of churn. Nearly all, if not 100% of existing call sites should not behave as an assert and not produce the backtrace. I think the option should just be removed, a separate function can be introduced for the assert-like case