[RFC] Error handling in LLVM libraries.

Hi Mehdi,

Side question on the related work because I’m curious: did you look for similar generic scheme for error handling offered by other C++ libraries? Maybe the common scheme is to use C++ exceptions but since many folks disable them (hello Google :wink: ) there has been probably many other attempts to address this.

I did look around, but not very hard. Among the people I asked, everyone was either using exceptions, std::error_code, or had turned on the diagnostic that James Knight suggested. I did some preliminary web-searching, but that didn’t turn up anything interesting. If anybody has seen other error handling schemes of note I’d love to hear about them.

On this topic of not paying the price on the non-error code path, it would be nice to not pay for constructing all the lambda captures when there is no error.

If your catchTypedErrors call is under an if-test then the lambdas are in a scope that won’t be entered on the non-error path:

if (auto Err = foo())
if (auto E2 = catchTypedErrors(std::move(Errs), ))
return;

Sure, but I was looking at avoiding to have to do this extra level of manual test, to reduce the “boilerplate” effect. For instance I’d like to write something like (straw man):

TRY(foo())
CATCH(lambda1 here)
CATCH(lambda2 here)

CATCH(lambda3 here)

or (straw man as well):

auto Err = foo();
HANDLE(Err) {
MATCH(MyCustomType) {
// code
}
MATCH(MyOtherCustomType) {
// code
}
DEFAULT {

}
}

I think (though I haven’t tested this) that most lambdas should inline away to next to nothing, so I don’t expect the overhead to be noticeable in either case.

My guess is that is will be very dependent on what the lambdas is actually capturing and how many there are, and if catchTypedErrors is actually inlined.
But I’m sure we would sort this out anyway if it turns out they’re not :wink:

Hi Mehdi,

Separate from the discussion on how we represent errors, we need to decide what errors we want to detect, which we try to diagnose, and which are recoverable. Each of these is a distinct design decision. For instance, I’d be really curious to hear more about your use case in Orc and MCJIT. We’re using MCJIT and I’ve found the error model presented to be perfectly adequate for our purposes. I’ve sometimes wished that MCJIT did a slightly better job diagnosing failures, but I have yet to see a case where I’d actually want the compiler to recover from incorrect input rather than failing with a clean message so that I can fix the bug in the calling code. One thing I’m seriously concerned about is setting false expectations. Unless we are actually going to invest in supporting recovery from (say) malformed input, we should not present an interface which seems to allow that possibility. A hard failure is much preferable to a soft error with the library in a potentially corrupt state. I have seen several software projects head down the road of soft errors with partial recovery and it’s always a disaster. (Your proposal around checked-by-default errors is a good step in this direction, but is not by itself enough.) Philip p.s. I’m purposely not commenting on the representation question because I haven’t read the proposal in enough detail to have a strong opinion.

Hi Philip,

Separate from the discussion on how we represent errors, we need to decide what errors we want to detect, which we try to diagnose, and which are recoverable. Each of these is a distinct design decision.

Absolutely. This proposal should be interpreted with that in mind - it gives us a better toolkit for describing certain kinds of errors, but it should not be shoehorned into situations where it does not fit. Programmatic errors are an obvious example: they should remain hard errors (asserts/report_fatal_error/llvm_unreachable).

On MCJIT/Orc - I’m glad the current situation hasn’t caused you too much pain, but other clients have complained about it. I think the key observation is that neither MCJIT nor ORC assume that the bit code they’re compiling is in-memory. That means that both should be able to report “resource not available” errors (e.g. when JITing bitcode off a network mount that has temporarily dropped out). This problem has become especially acute now that we have library support for remote-JITing: Any call to any lazily compiled function may fail due to resource drop-outs, and we should be able to report that to the user to give them a chance to try to recover and re-try compilation.

On the commitment front: The driving use case for me currently is the llvm object-file tools and LLD, but I know that proper error handling in the JIT is something I want to tackle in not-too-distant future, so having a toolkit to deal with it is a big step in the right direction.

  • Lang.

Oops…

“I think the key observation is that neither MCJIT nor ORC assume that the bit code they’re compiling is in-memory.”

IR is obviously always in memory. This should be “neither MCJIT nor ORC can assume that the objects being linked are in-memory”.

  • Lang.

The main thing I like about the diagnostic system is that it lets us
differentiate two related but independent concepts:

* Giving the human using the program diagnostics about what went wrong.
* Propagating an error to the caller so that the upper library layer
can handle it or pass it up the stack.

For example, when given a broken bitcode file we are able to produce
the diagnostic "Unknown attribute kind (52)" which shows exactly what
we are complaining about. We are able to do that because the
diagnostic is produced close to the spot where the error is detected.

That is way more than what we need to pass to the caller. In fact, the
only cases we need to differentiate are "this is not a bitcode file at
all" and "the file is broken", which we do with the following enum +
std::error_code.

  enum class BitcodeError { InvalidBitcodeSignature = 1, CorruptedBitcode };

So we use the context to produce a diagnostic, but the error doesn't
need to store it. Compare that with what we had before r225562.

Note that with error_code one can define arbitrary error categories.

Other advantages are:

* It is easy to use fatal error in simple programs by just using a
diganostic handler that exits.
* Very debugger friendly. It is easy to put a breakpoint at the diag handler.

Given that distinction, what I agree that is really missing is
infrastructure to make sure std::error_code is handled and for
filtering it.

So, could you achieve your objectives by:

* Moving the diagnostic handling code to Support so that all of llvm can use it.
* Defining TypedError as a wrapper over std::error_code with a
destructor that verifies the error was handled?

Thanks,
Rafael

> Hi Mehdi,
>
>> If you subclass a diagnostic right now, isn’t the RTTI information
>> available to the handler, which can then achieve the same dispatching /
>> custom handling per type of diagnostic?
>> (I’m not advocating the diagnostic system, which I found less convenient
>> to use than what you are proposing)
>
> I have to confess I haven't looked at the diagnostic classes closely.
I'll
> take a look and get back to you on this one. :slight_smile:

The main thing I like about the diagnostic system is that it lets us
differentiate two related but independent concepts:

* Giving the human using the program diagnostics about what went wrong.
* Propagating an error to the caller so that the upper library layer
can handle it or pass it up the stack.

For example, when given a broken bitcode file we are able to produce
the diagnostic "Unknown attribute kind (52)" which shows exactly what
we are complaining about. We are able to do that because the
diagnostic is produced close to the spot where the error is detected.

That is way more than what we need to pass to the caller. In fact, the
only cases we need to differentiate are "this is not a bitcode file at
all" and "the file is broken", which we do with the following enum +
std::error_code.

  enum class BitcodeError { InvalidBitcodeSignature = 1, CorruptedBitcode
};

So we use the context to produce a diagnostic, but the error doesn't
need to store it. Compare that with what we had before r225562.

Note that with error_code one can define arbitrary error categories.

Other advantages are:

* It is easy to use fatal error in simple programs by just using a
diganostic handler that exits.
* Very debugger friendly. It is easy to put a breakpoint at the diag
handler.

Yeah, the ability to breakpoint in the diag handler is a huge win!

Given that distinction, what I agree that is really missing is
infrastructure to make sure std::error_code is handled and for
filtering it.

So, could you achieve your objectives by:

* Moving the diagnostic handling code to Support so that all of llvm can
use it.

Do you mean everything from clang and LLVM? Or just the stuff in LLVM?
The clang stuff would be a lot of work, but we did something similar for
libOption when it started becoming clear that multiple programs would
benefit from it (in particular LLD).

-- Sean Silva

Hi Rafael,

The main thing I like about the diagnostic system is that it lets us
differentiate two related but independent concepts:

  • Giving the human using the program diagnostics about what went wrong.
  • Propagating an error to the caller so that the upper library layer
    can handle it or pass it up the stack.

I don’t think these are really independent. Whether or not you need to emit a diagnostic depends on whether a caller can handle the corresponding error, which isn’t something you know at the point where the error is raised. That’s the idea behind the ‘log’ method on TypedErrorInfo: It lets you transform meaningful information about the problem into a log message after you know whether it can be handled or not. Recoverable errors can be logged (if the client wants to do so), but they don’t have to be.

Using TypedError for diagnostics also means one fewer friction point when composing library code: Rather than having to agree on both the error return type and the diagnostic context type you only have to get agreement on the error return type.

That is way more than what we need to pass to the caller. In fact, the
only cases we need to differentiate are “this is not a bitcode file at
all” and “the file is broken”, which we do with the following enum +
std::error_code.

By contrast, in my system this would be:

class InvalidBitcodeSignature : TypedErrorInfo {};
class CorruptedBitcode : TypedErrorInfo {
public:
CorruptedBitcode(std::string Msg) : Msg(Msg) {}
void log(raw_ostream &OS) const override { OS << Msg; }
private:
std::string Msg;
};

Once you factor out the need to define an error category, I suspect my system actually requires less code, not that either of them require much.

Note that with error_code one can define arbitrary error categories.

The problem with error categories is that they’re limited to static groupings of kinds of errors, but the user might need dynamic information to recover.

Consider someone who wants to use libObject to write an object-file repair tool: A “bad_symbol” error code tells you what went wrong, but not where or why, so it’s not much use in attempting a fix. On the other hand “bad_symbol { idx = 10 }” gives you enough information to pop up a dialog, ask the user what the symbol at index 10 should be, then re-write that symbol table entry.

Other advantages are:

  • It is easy to use fatal error in simple programs by just using a
    diganostic handler that exits.
  • Very debugger friendly. It is easy to put a breakpoint at the diag handler.

This proposal provides the same advantages. I noted the second point in the original RFC. The first is easily implemented using an idiom I’ve seen in llvm-objdump and elsewhere:

void check(TypedError Err) {
if (!Err)
return;

logAllUnhandledErrors(std::move(Err), errs(), “”);
exit(1);
}


TypedErrorOr R = doSomething();
check(R.takeError());
… *R;

Side-note: you can actually go one better and replace this idiom with:

template
T check(TypedErrorOr ValOrErr) {
if (!ValOrErr) {
logAllUnhandledErrors(ValOrErr.takeError(), errs(), “”);
exit(1);
}
return std::move(*ValOrErr);
}


Result R = check(doSomething());

Mind you, this new idiom works equally well for ErrorOr/std::error_code.

So, could you achieve your objectives by:

  • Moving the diagnostic handling code to Support so that all of llvm can use it.
  • Defining TypedError as a wrapper over std::error_code with a
    destructor that verifies the error was handled?

Using TypedError as a wrapper for std::error_code would prevent errors from be silently dropped, which is a huge step forward. (Though TypedError would be a misnomer - we’d need a new name. :slight_smile:

Using a diagnostic handler would allow richer diagnostics from libObject, but it supports a narrower class of error-recovery than my proposal. I’m inclined to favor my proposal as a strictly more general solution.

Cheers,
Lang.

* Moving the diagnostic handling code to Support so that all of llvm can
use it.

Do you mean everything from clang and LLVM? Or just the stuff in LLVM?
The clang stuff would be a lot of work, but we did something similar for
libOption when it started becoming clear that multiple programs would
benefit from it (in particular LLD).

Well, everything *could* use it, but things like clang would probably
stay with the more specialized tools they have.

Cheers,
Rafael

Hi Sean,

Yeah, the ability to breakpoint in the diag handler is a huge win!

I think both schemes support that equally well. In my scheme, if you have an error MyError, you can just write:

(lldb) b MyError

to set a breakpoint on the constructor.

  • Lang.

I don't think these are really independent. Whether or not you need to emit
a diagnostic depends on whether a caller can handle the corresponding error,
which isn't something you know at the point where the error is raised.

But you do in the diag handler. For example, if you are trying to open
multiple files, some of which are bitcode, you know to ignore
InvalidBitcodeSignature.

That's the idea behind the 'log' method on TypedErrorInfo: It lets you
transform meaningful information about the problem into a log message
*after* you know whether it can be handled or not. Recoverable errors can be
logged (if the client wants to do so), but they don't have to be.

...

By contrast, in my system this would be:

class InvalidBitcodeSignature : TypedErrorInfo<InvalidBitcodeSignature> {};
class CorruptedBitcode : TypedErrorInfo<CorruptedBitcode> {
public:
  CorruptedBitcode(std::string Msg) : Msg(Msg) {}
  void log(raw_ostream &OS) const override { OS << Msg; }
private:
  std::string Msg;
};

This highlights why I think it is important to keep diagnostics and
errors separate. In the solution you have there is a need to allocate
a std::string, even if that is never used. If it was always a
std::string it would not be that bad, but the framework seems designed
to allow even more context to be stored, opening the way for fun cases
of error handling interfering with lifetime management.

Consider someone who wants to use libObject to write an object-file repair
tool: A "bad_symbol" error code tells you what went wrong, but not where or
why, so it's not much use in attempting a fix. On the other hand "bad_symbol
{ idx = 10 }" gives you enough information to pop up a dialog, ask the user
what the symbol at index 10 should be, then re-write that symbol table
entry.

Using error results to find deep information you want to patch seems
like a bad idea. A tool that wants to patch symbol indexes should be
reading them directly.

Other advantages are:

* It is easy to use fatal error in simple programs by just using a
diganostic handler that exits.
* Very debugger friendly. It is easy to put a breakpoint at the diag
handler.

This proposal provides the same advantages. I noted the second point in the
original RFC. The first is easily implemented using an idiom I've seen in
llvm-objdump and elsewhere:

void check(TypedError Err) {
  if (!Err)
    return;

  logAllUnhandledErrors(std::move(Err), errs(), "<tool name>");
  exit(1);
}

That is not the same that you get with a diagnostic handler. What you
get is an exit after the error was propagated over the library layer,
instead of an exit at the point where the issue is found.

We use the above code now because lib/Object has no diagnostic handler.

Side-note: you can actually go one better and replace this idiom with:

template <typename T>
T check(TypedErrorOr<T> ValOrErr) {
  if (!ValOrErr) {
    logAllUnhandledErrors(ValOrErr.takeError(), errs(), "<tool name>");
    exit(1);
  }
  return std::move(*ValOrErr);
}

Yes, we do pretty much that in ELF/COFF lld.

Using TypedError as a wrapper for std::error_code would prevent errors from
be silently dropped, which is a huge step forward. (Though TypedError would
be a misnomer - we'd need a new name. :slight_smile:

CheckedError?

Using a diagnostic handler would allow richer diagnostics from libObject,
but it supports a narrower class of error-recovery than my proposal. I'm
inclined to favor my proposal as a strictly more general solution.

I would prefer the diagnostic handler for exactly the same reason :slight_smile:
Generality has a cost, and I don't think we should invest on it until
we have a concrete case where that is needed. Given that I would
suggest going the more incremental way. Add a CheckedError that wraps
error_code and use to make sure the errors are checked. When a better
diagnostic is needed, pass in a diagnostic handler.

If we ever get to the point were we really want to *return* more than
an error_code, it will be a much smaller refactoring to
s/CheckedError/TypedError/.

Cheers,
Rafael

I don’t think these are really independent. Whether or not you need to emit
a diagnostic depends on whether a caller can handle the corresponding error,
which isn’t something you know at the point where the error is raised.

But you do in the diag handler. For example, if you are trying to open
multiple files, some of which are bitcode, you know to ignore
InvalidBitcodeSignature.

That means anticipating the kinds of errors you do/don’t want to recover from at the point at which you insatiate your diagnostic handler. At that point you may not have the information that you need to know whether you want to recover. E.g. What if one path through library code can recover, and another can’t? Worse still, what if the divergence point between these two paths is in library code itself? Now you need to thread two diagnostic handlers up to that point. This scales very poorly.

Simply put: diagnostic handlers aren’t library friendly. It’s tough enough to get all libraries to agree on an error type, without having them all agree on a diagnostic handlers too, and thread them everywhere.

This highlights why I think it is important to keep diagnostics and
errors separate. In the solution you have there is a need to allocate
a std::string, even if that is never used.

Errors are only constructed on the error path. There is no construction on the success path.

If it was always a
std::string it would not be that bad, but the framework seems designed
to allow even more context to be stored, opening the way for fun cases
of error handling interfering with lifetime management.

This is a real problem, but I think the solution is to have a style guideline: You can’t use reference types (in the general sense - references, pointers, etc.) in errors. Errors that would otherwise require references should have their error-message stringified at the point of creation.

Consider someone who wants to use libObject to write an object-file repair
tool: A “bad_symbol” error code tells you what went wrong, but not where or
why, so it’s not much use in attempting a fix. On the other hand “bad_symbol
{ idx = 10 }” gives you enough information to pop up a dialog, ask the user
what the symbol at index 10 should be, then re-write that symbol table
entry.

Using error results to find deep information you want to patch seems
like a bad idea. A tool that wants to patch symbol indexes should be
reading them directly.

I’m not sure I agree, but I don’t have strong feelings on this particular example - it was intended as a straw man. My point is that error recover is useful in general: there is a reason things like exceptions exist. Diagnostic handlers are very poorly suited to general error recovery.

void check(TypedError Err) {
if (!Err)
return;

logAllUnhandledErrors(std::move(Err), errs(), “”);
exit(1);
}

That is not the same that you get with a diagnostic handler. What you
get is an exit after the error was propagated over the library layer,
instead of an exit at the point where the issue is found.

As long as you log the error I’m not sure I see a meaningful difference? If you’re not debugging you shouldn’t be crashing in your library. If you are debugging you want to set a breakpoint on the error constructor instead.

I would prefer the diagnostic handler for exactly the same reason :slight_smile:
Generality has a cost…

Could you explain what you see as the cost in this instance?

The scheme has a higher cost that std::error_code alone when you’re returning an error, but I don’t think optimizing the error case is a sensible thing to do. It has a potentially lower cost on the success path, as you don’t need to take up an argument for the diagnostic handler.

The most serious objection that you’ve raised, to my mind, is the potential lifetime issues if people mistakenly use references in their error types. I’m planning to write additions to llvm’s coding guidelines to cover that, which I think should be sufficient.

On the flip side, compared to a diagnostic handler, I think the scheme I’ve proposed has the following benefits:

  • No need to thread a diagnostic handler type through the world (and consequently no need to agree on its type). Retrofitting rich error logging to an interface that already returns errors requires less work than adding a diagnostic handler.
  • The ability to describe error hierarchies (e.g. the class of all errors that represent malformed objects, which may be further subclassed to describe specific errors).
  • The ability for client code to meaningfully distinguish between error instances, rather than just error types.

, and I don’t think we should invest on it until

we have a concrete case where that is needed.

The error infrastructure investment is largely done, and I’m volunteering to maintain it. Myself and Kevin Enderby are signing up to weave this through libObject and the JIT. Other libraries could be converted as people see fit, with ECError providing an interface between the std::error_code world and TypedError.

Given that I would
suggest going the more incremental way. Add a CheckedError that wraps
error_code and use to make sure the errors are checked. When a better
diagnostic is needed, pass in a diagnostic handler.

Our first use-case for this system is libObject, where we want to use this to enable richer error messages. An incremental approach would involve threading a diagnostic handler through everything (and returning CheckedError), then un-threading it again would involve a lot of churn. I’d prefer to move directly to the scheme I’m proposing.

Cheers,
Lang.

> > I don't think these are really independent. Whether or not you need to
emit
> > a diagnostic depends on whether a caller can handle the corresponding
error,
> > which isn't something you know at the point where the error is raised.

> But you do in the diag handler. For example, if you are trying to open
> multiple files, some of which are bitcode, you know to ignore
> InvalidBitcodeSignature.

That means anticipating the kinds of errors you do/don't want to recover
from at the point at which you insatiate your diagnostic handler. At that
point you may not have the information that you need to know whether you
want to recover. E.g. What if one path through library code can recover,
and another can't?

+1 to this (I had the same thought - stateful handlers would be pretty
tricky, doubly so with \/ this point)

Worse still, what if the divergence point between these two paths is in
library code itself? Now you need to thread two diagnostic handlers up to
that point. This scales very poorly.

Simply put: diagnostic handlers aren't library friendly. It's tough enough
to get all libraries to agree on an error type, without having them all
agree on a diagnostic handlers too, and thread them everywhere.

I think it depends a bit on the library & how generic it is - likely lower
level libraries (like libObject) will want a granular error-based result,
not a diagnostic engine. (the clients will be more widely varied (than,
say, Clang) & have different diagnostic/behavioral needs, etc)

But I think it makes total sense for Clang to use a diagnostic handling
approach for the most part (all the possible ways that C++ can be written
incorrectly essentially amount to the same thing for a consumer of Clang -
code was bad & I need some text (and fixits, notes, etc) to tell the user
why), the IR verifier similarly - and possibly LLD, for example, depending
on how broad the use cases become.

Hi David,

Hi Rafael, Lang

This highlights why I think it is important to keep diagnostics and
errors separate. In the solution you have there is a need to allocate
a std::string, even if that is never used.

Errors are only constructed on the error path. There is no construction on
the success path.

But they are always created, even if it as error the caller wants to
ignore. For example, you will always create a "file foo.o in bar.a is
not a bitcode" message (or copy sufficient information for that to be
created). With a dignostic handler no copying is needed, since the
call happens in the context where the error is found. It is easy to
see us in a position where a lot of context is copied because some
client somewhere might want it.

So I am worried we are coding for the hypothetical and adding
complexity. In particular, if we are going this way I think it is
critical that your patch *removes* the existing diagnostic handlers
(while making sure test/Bitcode/invalid.ll still passes) so that we
don't end up with two overlapping solutions. We were still not even
done converting away from bool+std::string :frowning:

Cheers,
Rafael

>> This highlights why I think it is important to keep diagnostics and
>> errors separate. In the solution you have there is a need to allocate
>> a std::string, even if that is never used.
>
> Errors are only constructed on the error path. There is no construction
on
> the success path.

But they are always created, even if it as error the caller wants to
ignore. For example, you will always create a "file foo.o in bar.a is
not a bitcode" message (or copy sufficient information for that to be
created). With a dignostic handler no copying is needed, since the
call happens in the context where the error is found. It is easy to
see us in a position where a lot of context is copied because some
client somewhere might want it.

So I am worried we are coding for the hypothetical and adding
complexity. In particular, if we are going this way I think it is
critical that your patch *removes* the existing diagnostic handlers
(while making sure test/Bitcode/invalid.ll still passes) so that we
don't end up with two overlapping solutions.

Removes diagnostic handlers in other parts of LLVM (& Clang) - the IR
verifier's diagnostic handling is what you're referring to here ^?

I don't think that would be an improvement - it seems like different
situations call for different tools (as I was saying yesterday on this
thread). In some places a diagnostic handler is the right tool, and in some
places error codes/results/etc are the right tool. We already live in that
world & it seems like a reasonable one (there doesn't seem to be a
fundamental conflict between our bool+std::string or error_code returns and
existing diagnostic handlers - I think they can reasonably coexist in
different parts of the codebase for different use cases)

In which case we do need a plain checked error_code. Right now we use
a diagnostic handler in the BitcodeReader, but it is really easy to
miss propagating an error out. We shouldn't be in a position where we
have to decide to use an diagnostic handler or have checked errors. It
should be possible to have both if it is not desired that the new
error handling replaces the use of diagnostic handling.

Cheers,
Rafael

Sorry for the question out of nowhere, but what happened to ErrorOr<> scheme?

ErrorOr still exists, and generally useful. This proposal introduces a similar class, TypedErrorOr, that replaces the std::error_code with a TypedError.

Any more detailed discussion of ErrorOr usage should probably happen on another thread - this one is already getting long. :slight_smile:

  • Lang.