[RFC] report_fatal_error and the default value of GenCrashDialog

As documented in the programmers manual, wherever possible we want to use recoverable errors that (if necessary) are presented to the user in an appropriate way. However, we don’t always have the infrastructure available to do this and so there are certain places in the backend where there’s not much option but to report_fatal_error. By default this prints a backtrace and a suggestion to file a bug report, behaviour which is controlled by the GenCrashDialog parameter. In many cases, this isn’t appropriate and so it is better to set it to false (i.e. the fact the erroneous condition was encountered isn’t a bug - it’s just a situation where compilation can’t continue).

In D150674. I tried to convert various report_fatal_error callsitse to disable the crash dialog where it wasn’t appropriate, which triggered a discussion about whether the default should change (with @MaskRay and @arsenm in favour of changing that default). I see the advantages of changing the default, but would also be concerned about the impact and churn in doing so. I’m raising this RFC as requested but honestly I’m fairly on the fence (it feels like either way, it’s easy to produce a backtrace when it’s not appropriate or fail to do so when you should have), so I’d look for others with stronger views on the matter to make their case more strongly.

There are two cases that report_fatal_error is used:

  1. Error handling in a backend that’s not useful/practical to handle in a better way (handling hand-written MIR that breaks invariants or something). These are (usually) not a bug and shouldn’t emit a crash report
  2. assert-even-on-no-asserts behaviour - ie, panic and crash.

We should generally avoid adding cases like (2) anyway, so changing the default probably makes sense. The problem with that is of course that any existing instances of (2) might start looking like (1) - should we be concerned about that?

edit: cases of (1) aren’t necessarily all in backends

I may be out of date here, but my understanding of the behavior is that the stacktrace isn’t tied to report_fatal_error at all - report_fatal_error simply runs the fatal_error_handler, so it is up to the tool to decide whether to install one or not.

The issue here is probably confusion around when and who installs the fatal error handler - my understanding is that InitLLVM installs a sigabort handler which prints a stacktrace, but that should only be used by LLVM compiler-hacker tools (and perhaps llvm.org tools like clang), not by libraries or 3rd party tools.

Is this what you’re seeing?

-Chris

1 Like

I’m failing to understand when an ICE is “desirable” in a way that isn’t a bug and shouldn’t be treated as such?
(Edit: of course the exact message to the user should be tweaked based on the product integration as Chris mentions…)

The fatal_error_handler takes a parameter for whether to generate a crash dialog:

  typedef void (*fatal_error_handler_t)(void *user_data,
                                        const char *reason,
                                        bool gen_crash_diag);

Similarly, in the case where there’s no handler installed, the gen_crash_diag parameter decides whether we abort() or exit(1). Currently report_fatal_error defaults to the crash diag behaviour, but you can explicitly ask for gen_crash_diag = false.

The question here is whether we should change the default, since in many cases a backtrace isn’t helpful when report_fatal_error is used.

I think if some using an end-user tool like clang sees an error message that doesn’t provide enough context to act on it (no location information), we should consider it a bug. We want the bug report so we can improve the error handling.

Similarly, if third-party software using LLVM as a library triggers a fatal error, they probably want their bug report.

In almost all cases, report_fatal_error is not producing error messages with sufficient context, so I think asking for a bug report is the right default.

4 Likes

@clattner: I think @bogner’s response summarises the situation correctly.

@mehdi_amini and @efriedma-quic: Maybe we’re talking past each other here, but I think almost every instance of report_fatal_error is unfortunate (to the point I disagree that it makes sense to encourage a bug report in each instance as @efriedma-quic suggests - if we want to know cases where error handling is substandard, we can just grep for report_fatal_error). But due to historic design decisions and the interfaces available in LLVM today, in some cases it’s just not feasible to do something more meaningful than flagging an illegal situation occurred and exiting. It would be great to refactor and improve LLVM so that isn’t the case, but I think we also need to recognise the reality of where we are today.

Maybe you’re looking only at the LLVM backend, but I would hope that none of the instance of report_fatal_error in MLIR for example is subject to be trigger on user-input (they are meant to fire to catch issue for the people “assembling the compiler” and not for people “using the compiler”), and if they trigger: it is a bug!

We’re talking about the default here I believe, and whenever I’m using report_fatal_error, I really mean that this “should not happen” (it’s stronger than assertions…) and to me this is just the better default behavior: this cannot be a blessed way to do error handling, regardless of the history in LLVM backends…

If you work on some piece of LLVM that should have better error reporting, I can only encourage you to refactor it as such, or be explicit about the non-standard intent.

2 Likes

Thanks for that perspective Medhi - I think perhaps it hasn’t been considered much that in newer part of the LLVM monorepo, the current default may make more sense.

Honestly I don’t strongly disagree here - as I said in the first post I’m starting the RFC by request, but others involved in the relevant review threads had stronger views on the merits of changing the default.

@maskray @arsenm - does the fact that the current default makes more sense for e.g. MLIR change your view at all?

It’s something that would obviously have benefit to LLVM, but reworking APIs affecting all backends and multiple frontend tools in order to try to bubble errors up on a meaningful way can be a non-trivial project. Maybe we should consider listing it on the LLVM open projects page - though I’m not sure how accessible it would be to newcomers.

For what it’s worth, I agree that the current report_fatal_error default is quite often wrong at least in LLVM – basically, whenever it is possible to trigger the error condition via opt or llc, it shouldn’t be using GenCrashDialog=true. People can and will open issues for these, even if it’s obviously their own fault. I have repeatedly “fixed” such issues by flipping to GenCrashDialog=false.

At the same time, I’m not sure whether changing the default makes sense, as there are also many uses of report_fatal_error that do want to report a crash. Those are the cases where report_fatal_error is basically being used as an always-on assertion.

I would probably go for splitting report_fatal_error into two separate functions, such that an explicit choice is made on which to use, rather than going with whatever the default behavior is. For example report_invalid_input/report_invalid (GenCrashDialog=false) and report_bug (GenCrashDialog=true). report_fatal_error could be retained for now, but discouraged.

3 Likes

As an interim step one could have a per-component define for what the default is.

Two different functions would work here. We could also consider just getting rid of the default and making every caller be explicit about it if naming is hard.

1 Like

+1 to the idea of two functions. For the areas of LLVM I’ve maintained in the past and continue to review on (primarily the LLVM binutils and their immediate libraries like Object), I’ve rarely, if ever, seen a report_fatal_error that wasn’t indicating a bug if it fired. I was going to chime in on this thread suggesting changing report_fatal_error to not have a default at all, but two functions is probably clearer.

Not only do I think the default should be changed, but that it should not even be an option. The behavior before ~3 years ago was correct and just exited the process and printed the message with no backtrace. I only expect to see a backtrace in an unstructured crash or assert. A recognized error condition is an exit with some information, even if you would consider hitting the error to be a “bug”

Sorry for the late reply.

I agree with splitting report_fatal_error and that the GenCrashDiaglot default should not be changed. A new function can be added to pass GenCrashDiaglot=false to report_fatal_error.

For ⚙ D150674 [RISCV] Set GenCrashDialog=false for various report_fatal_error calls in lib/Target/RISCV , the call sites that indicate user errors can be switched to the new function.

1 Like

I see this as creating a lot of churn. Nearly all, if not 100% of existing call sites should not behave as an assert and not produce the backtrace. I think the option should just be removed, a separate function can be introduced for the assert-like case

Reviving this thread since it came up again in [Support] report_fatal_error: Do not generate crash backtrace by default by optimisan · Pull Request #128495 · llvm/llvm-project · GitHub.

I really think the LLVM usage of report_fatal_error is far more standard, just based on the name of the function: it clearly means report an error to the user and terminate the process.

Using it for “should not happen” is the non-standard intent. That should be some form of assertion. I don’t understand what you mean by “it’s stronger than assertions” – surely that can be handled by having different flavours of assertion function: assert_with_diagnostic or assert_with_backtrace or whatever.

I’m strongly in favour of #128495 or something like it.

2 Likes

Did you mean the opposite? I am not sure I understand well what you mean.

I disagree: LLVM is a library and terminating the process is not an acceptable handling of “reporting an error”, this condition is a library bug.
Hence my quote from earlier:

User-diagnostic of “errors” that aren’t library bug should trigger a diagnostic and percolate up without crashing the process (at the user API boundary, a user should have the choice of handling this however they wish, including aborting if this is what is suitable for them).

I think (it was 2 ago) I meant it is stronger than assertions in that it is not stripped out of the release.

Support is a library used by libraries and tools. report_fatal_error is for low effort tools, and a few library cases where it’s simply not worth propagating the error up to user tools and will only ever be encountered by llvm developers.

There is a spectrum of cases where it’s not feasible or worth it to do this. Not every error is a product of a clear API boundary, or can be be reported in a useful way. For example the SelectionDAG legalization failures are essentially an assert, but also need to be slightly stronger than an assert and always fire in release builds. It would be nontrivial engineering work (since you must produce some code in the end) to produce a valid result of the correct shape along with a user diagnostic. And the diagnostic is ultimately reporting a missing feature or bug in codegen, which is not actionable by the end user.

If this was the case that would be fine, however I don’t believe that description matches the existing practice: git grep report_fatal_error shows ~1856 occurrences of it right now, and a quick scan shows that many of them are deep in library code, far from “low effort tools”.

Two things here:

  • you don’t need to produce a valid result, you could also return an error to the user to signal that the API failed,
  • if really you end up in such a case, you can opt-in in the behavior you want: the fact that such cases exists does not seem like an argument for changing the default.