lld: sigbus error handling

If your system does not support fallocate(2), we use ftruncate(2) to create an output file. ftruncate(2) succeeds even if your disk has less space than the requested size, because it creates a sparse file. If you mmap such a sparse file, you’ll receive a SIGBUS when the disk actually becomes full.

So, lld can die suddenly with SIGBUS when your disk becomes full, and currently we are not doing anything about it. It’s sometimes hard to notice that the crash was caused by the lack of disk space.

I wonder if we should print out a hint (e.g. “Bus error – disk full?”) when we receive a SIGBUS. Any opinions?

Sounds like a good idea to me. Please also include “SIGBUS” in the message. The SIGBUS signal can have various causes, but the disk full cause is both non-obvious and probably common, so the hint seems useful.

– Bob

What about reading back the file's size with stat() before mapping?
.st_blocks should give the "real" size.

BTW, posix_fallocate() might provide better portability and decrease the
likelihood of falling back on ftruncate().

> What about reading back the file's size with stat() before mapping?
> .st_blocks should give the "real" size.

Creating a sparse file is not an error, and a sparse file works fine as
long as your disk has enough space (which is almost always true.) So I
don't know how stat helps.
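
To make the difference between the two sizes concrete, here is a minimal sketch, not lld code. The helper name `measureSparseFile`, the `FileSizes` struct, and the temporary path are made up for illustration, and it assumes `st_blocks` is counted in 512-byte units, as on Linux and the BSDs.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Apparent size versus bytes actually backed by disk blocks.
struct FileSizes {
  off_t Apparent;  // st_size: the length that ftruncate(2) set
  off_t Allocated; // st_blocks * 512: what the filesystem has reserved
};

// Hypothetical helper: creates a temporary file, extends it with
// ftruncate(2), and reports both sizes. Returns false on any failure.
bool measureSparseFile(off_t Size, FileSizes &Out) {
  char Path[] = "/tmp/sparse-demo-XXXXXX";
  int FD = mkstemp(Path);
  if (FD < 0)
    return false;
  unlink(Path); // the file goes away once FD is closed

  struct stat St;
  bool OK = ftruncate(FD, Size) == 0 && fstat(FD, &St) == 0;
  if (OK) {
    Out.Apparent = St.st_size;
    // Assumption: st_blocks is in 512-byte units on this platform.
    Out.Allocated = static_cast<off_t>(St.st_blocks) * 512;
  }
  close(FD);
  return OK;
}
```

On most filesystems `Allocated` comes back far smaller than `Apparent` right after the ftruncate(2). That shows the file is sparse, but it says nothing about whether the disk will have room later, which is the point being made above.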

> BTW, posix_fallocate() might provide better portability and decrease the
> likelihood of falling back on ftruncate().

Yes, but we want to avoid that function because when it falls back, it is
very slow. When fallocate(2) is not available, it actually writes 0 to
every block to make sure that all disk blocks are allocated. Doing that
every time the linker creates a new file is overkill just to guard against
a disk-full situation, which isn't common.
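
For reference, the call under discussion looks roughly like the sketch below. The wrapper name `reserveSpace` and its messages are made up for illustration. One detail worth noting is that posix_fallocate() returns the error number directly instead of setting errno.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <string>

// Hypothetical wrapper: reserve Size bytes for FD up front.
// Returns an empty string on success, otherwise a diagnostic message.
std::string reserveSpace(int FD, off_t Size) {
  // posix_fallocate() reports failure through its return value.
  int Err = posix_fallocate(FD, 0, Size);
  if (Err == 0)
    return std::string();
  if (Err == ENOSPC)
    return "cannot reserve output file: disk full";
  return std::string("cannot reserve output file: ") + std::strerror(Err);
}
```

With this shape, a full disk is reported before anything is mapped or written. The cost concern raised above still applies whenever the C library has to emulate the allocation.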

For Zig we use LLD as a library. So for us it would be better to avoid global state such as SIGBUS (or any other signal handlers), instead returning an error from the link function when linking fails. If lld can encapsulate this signal handling and prevent the application using lld from getting the signal directly, instead carefully handling the signal in LLD itself and translating it into a proper error code or message, this would be reasonable.

Signal handlers and signal masks are inherently process-wide, so there's no
way to encapsulate them within lld functions. So my plan is to rename the
`ExitEarly` parameter of lld::{coff,elf}::link to `IsStandalone`, and to set
a signal handler (in addition to calling exit) only when the argument is
true. Since library users pass false for the parameter, this shouldn't
change the behavior of lld for the library use case.
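
A minimal sketch of that gating, assuming POSIX sigaction(2). The names `configureSignals` and `busHandlerInstalled` are hypothetical and are not part of lld's interface.

```cpp
#include <cassert>
#include <signal.h>

// Hypothetical helper: install Handler for SIGBUS only when running as a
// standalone linker. Library users pass false and keep their own
// process-wide signal state untouched.
bool configureSignals(bool IsStandalone, void (*Handler)(int)) {
  if (!IsStandalone)
    return true;
  struct sigaction SA = {};
  SA.sa_handler = Handler;
  sigemptyset(&SA.sa_mask);
  return sigaction(SIGBUS, &SA, nullptr) == 0;
}

// Reports whether some SIGBUS handler is currently installed.
bool busHandlerInstalled() {
  struct sigaction Cur;
  if (sigaction(SIGBUS, nullptr, &Cur) != 0)
    return false;
  return Cur.sa_handler != SIG_DFL && Cur.sa_handler != SIG_IGN;
}
```

Calling `configureSignals(false, ...)` is a no-op, which is the property that matters for an embedding application.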

This sounds good to me.

And then if an application wants to handle the SIGBUS correctly, it would
have to register this signal handler to report the error?

I could export a function that sets a signal handler as part of the library
interface, but I'm reluctant to do that because you can do the same thing
in a few lines of C code.

Can I ask you a question? I wonder if there's a reason to not call fork
before calling lld's main function.

It's hard to imagine a "correct" handling for a bus error. But reporting
the error seems like a good idea. Keep in mind that there are very few
things you can do from a signal handler. See signal(7) for details.

ian

Exactly. What I was thinking is to call
`write(STDERR_FILENO, "Bus error -- disk full?\n", ...)` and then call
`_exit` when a SIGBUS is raised.
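
As a sketch of that idea, the handler below uses only write(2) and _exit(2), both of which are async-signal-safe. The function names are hypothetical, and the last function exists only to exercise the handler in a forked child.

```cpp
#include <cassert>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

// Handler restricted to async-signal-safe calls.
extern "C" void reportBusError(int) {
  static const char Msg[] = "SIGBUS: bus error -- disk full?\n";
  ssize_t Ignored = write(STDERR_FILENO, Msg, sizeof(Msg) - 1);
  (void)Ignored; // nothing useful can be done if the write fails
  _exit(1);
}

// Hypothetical installer for the handler above.
bool installBusErrorHint() {
  struct sigaction SA = {};
  SA.sa_handler = reportBusError;
  sigemptyset(&SA.sa_mask);
  return sigaction(SIGBUS, &SA, nullptr) == 0;
}

// Demonstration only: raise SIGBUS in a child process and return the
// child's exit status, or -1 if it did not exit normally.
int exitStatusOfChildHitBySigbus() {
  pid_t Pid = fork();
  if (Pid < 0)
    return -1;
  if (Pid == 0) {
    if (!installBusErrorHint())
      _exit(2);
    raise(SIGBUS);
    _exit(3); // only reached if the handler did not run
  }
  int Status = 0;
  if (waitpid(Pid, &Status, 0) != Pid)
    return -1;
  return WIFEXITED(Status) ? WEXITSTATUS(Status) : -1;
}
```

Because the handler exits instead of returning, the process no longer dumps core on SIGBUS, so this trades debuggability for a clearer message.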

> Can I ask you a question? I wonder if there's a reason to not call fork
> before calling lld's main function.

It's starting to look to me like that might be necessary to integrate with
LLD, if I want to handle this SIGBUS error. I'd like to handle out of disk
space gracefully and not crash.

Here are my concerns:

* My frontend is cross-platform, so I would also need to figure out how
this would work on Windows, and eventually on other OSes too. Installation
of my frontend is simpler if there is not more than 1 executable to
distribute.
* This isn't working right now, but I want the errors of linking to
provide meaningful error codes and other metadata in a format the frontend
can associate with its own state and handle/render the errors in its
own way. If this is done via IPC it has to go through a
serialization/deserialization step.
* One of the use cases for my frontend is as a server where it may invoke
the linker repeatedly. I haven't tested the difference in performance or
resource usage yet, but it seems to me that LLD as a library / no forking
would be more efficient.
* I would rather compile against LLVM and clang statically for
performance and ease of installation (at least on Windows). If LLD is a
separate binary it would additionally need LLVM/Clang linked in statically,
and this is a duplicate copy of LLVM/Clang, doubling the size of my
compiler releases. Further the LLVM initialization code can be shared
between my frontend and LLD when linked into the same binary.

Thanks Andrew for writing this up. Your feedback as an actual user of
lld-as-a-library is very valuable to me. I think I understand that you want
to avoid using fork for the various reasons, and I agree with all these
points. Here are my random thoughts.

- As to SIGBUS, we can call posix_fallocate to allocate disk space on any
file system, so that no SIGBUS will be raised later. We can do that only
when "IsStandalone" is false.
- Orthogonal to that, you can execute lld as a separate process, while
keeping your distribution a single binary. You can add a hidden flag to
your command line interface, and when that specific flag is given you can
call lld's main function so that the process acts as lld.
- For a long running server process, I might still want to run lld as a
separate process, so that lld's bug wouldn't crash the entire server
process.
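
A rough sketch of the separate-process idea, assuming a POSIX system with fork(2). `runIsolated`, `LinkOutcome`, and the `LinkFn` callback are hypothetical stand-ins for a call into the linker. The parent learns about a SIGBUS from the child's wait status instead of from a signal handler.

```cpp
#include <cassert>
#include <cerrno>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

enum class LinkOutcome { Success, Failed, BusError, Crashed };

// Hypothetical wrapper: run LinkFn in a child process so that a crash in
// the linker cannot take down the caller. LinkFn returns 0 on success.
LinkOutcome runIsolated(int (*LinkFn)()) {
  pid_t Pid = fork();
  if (Pid < 0)
    return LinkOutcome::Failed;
  if (Pid == 0)
    _exit(LinkFn() == 0 ? 0 : 1);

  int Status = 0;
  while (waitpid(Pid, &Status, 0) < 0)
    if (errno != EINTR)
      return LinkOutcome::Failed;

  if (WIFEXITED(Status))
    return WEXITSTATUS(Status) == 0 ? LinkOutcome::Success
                                    : LinkOutcome::Failed;
  // Death by SIGBUS is the case that may indicate a full disk.
  if (WIFSIGNALED(Status) && WTERMSIG(Status) == SIGBUS)
    return LinkOutcome::BusError;
  return LinkOutcome::Crashed;
}
```

This does not address the Windows and error-metadata concerns raised earlier, and calling fork(2) from a multi-threaded server needs extra care.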

...

> It's starting to look to me like that might be necessary to integrate with
> LLD, if I want to handle this SIGBUS error. I'd like to handle out of disk
> space gracefully and not crash.

It seems to me like command-line-lld should behave this way also. Though I
can appreciate the convenience and benefits of mmap() over write(). And I
would expect the performance impact of posix_fallocate() could be
significant on some filesystems, as Rui indicates. It's too bad there's no
interface to put it in the background and find out its progress other than
"gee, what happens if we touch this page?"

Would it make any sense to amortize the expense over multiple
posix_fallocate() calls? Just posix_fallocate() enough to hit whatever
offset we're using into the mmap plus the size of the write (rounded up to
the next block). I suppose it's a relatively inexpensive call if we've
already allocated enough space. If the posix_fallocate() fails we can
gracefully report the disk space exhaustion.
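
The amortized scheme could be sketched as follows. The class `IncrementalReserver` is hypothetical; it keeps a high-water mark so that repeated calls for already-reserved ranges cost nothing, and it clamps to the file size so that rounding up never grows the file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper that reserves disk blocks lazily, just ahead of the
// region about to be written through the mapping.
class IncrementalReserver {
public:
  IncrementalReserver(int FD, off_t BlockSize, off_t FileSize)
      : FD(FD), BlockSize(BlockSize), FileSize(FileSize) {}

  // Make sure [Offset, Offset + Len) is backed by real blocks.
  // Returns 0 on success or the error number from posix_fallocate().
  int ensure(off_t Offset, off_t Len) {
    off_t End = std::min(roundUp(Offset + Len), FileSize);
    if (End <= Reserved)
      return 0; // already reserved, nothing to do
    int Err = posix_fallocate(FD, Reserved, End - Reserved);
    if (Err == 0)
      Reserved = End;
    return Err;
  }

  off_t reserved() const { return Reserved; }

private:
  off_t roundUp(off_t N) const {
    return (N + BlockSize - 1) / BlockSize * BlockSize;
  }

  int FD;
  off_t BlockSize;
  off_t FileSize;
  off_t Reserved = 0; // everything below this offset is allocated
};
```

A writer would call `ensure(Offset, Len)` before touching the mapped pages and report disk exhaustion if it returns ENOSPC.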

Date: Mon, 23 Oct 2017 15:21:25 -0700
From: Rui Ueyama via llvm-dev <llvm-dev@lists.llvm.org>

> If your system does not support fallocate(2), we use ftruncate(2) to create
> an output file. ftruncate(2) succeeds even if your disk has less space
> than the requested size, because it creates a sparse file. If you mmap such
> a sparse file, you'll receive a SIGBUS when the disk actually becomes full.
>
> So, lld can die suddenly with SIGBUS when your disk becomes full, and
> currently we are not doing anything about it. It's sometimes hard to notice
> that the crash was caused by the lack of disk space.
>
> I wonder if we should print out a hint (e.g. "Bus error -- disk full?")
> when we receive a SIGBUS. Any opinions?

I'm not a huge fan of catching "fatal" signals like this. It tends to
make debugging more difficult as you don't get a core dump anymore.
And since SIGBUS is also generated for unaligned access that is
somewhat annoying.

If you go this route, please realize that:

* Some systems actually generate SIGSEGV in this scenario.

* You can only call functions that are async-signal-safe. High-level
  output functions (stdio, iostream) are almost certainly not
  async-signal-safe.

* You may be able to distinguish a SIGBUS caused by unaligned access
  from the "disk-full" case by looking at siginfo, but beware of
  portability issues there.

* You probably only want to install the handler on systems that lack
  fallocate(2).
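
A sketch of the siginfo approach, assuming the POSIX `si_code` values for SIGBUS. Which code a given kernel reports for the disk-full case is an assumption here and should be checked per platform. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstring>
#include <signal.h>
#include <unistd.h>

// Map the si_code of a SIGBUS to a short description.
const char *describeBusError(int SiCode) {
  switch (SiCode) {
  case BUS_ADRALN:
    return "invalid address alignment";
  case BUS_ADRERR:
    // Assumption: a fault on a mapped file that cannot be backed by disk
    // shows up under this code on the target system.
    return "nonexistent physical address (disk full?)";
  case BUS_OBJERR:
    return "object-specific hardware error";
  default:
    return "unknown cause";
  }
}

// Three-argument handler; uses only write(2) and _exit(2).
extern "C" void onBusError(int, siginfo_t *Info, void *) {
  const char *Msg = describeBusError(Info->si_code);
  size_t Len = 0;
  while (Msg[Len] != '\0')
    ++Len;
  ssize_t Ignored = write(STDERR_FILENO, "SIGBUS: ", 8);
  Ignored = write(STDERR_FILENO, Msg, Len);
  Ignored = write(STDERR_FILENO, "\n", 1);
  (void)Ignored;
  _exit(1);
}

// SA_SIGINFO selects the sa_sigaction form of the handler.
bool installDetailedBusHandler() {
  struct sigaction SA = {};
  SA.sa_sigaction = onBusError;
  SA.sa_flags = SA_SIGINFO;
  sigemptyset(&SA.sa_mask);
  return sigaction(SIGBUS, &SA, nullptr) == 0;
}
```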

I think you present a compelling reason to implement fallocate(2) on
OpenBSD.

Cheers,

Mark

I think we should change the llvm implementation of resize_file to fail
if it cannot allocate the space. That is, it should only use ftruncate
on OS X where apparently HFS allocates space with it.

If resize_file fails, then lld can fail gracefully or use anonymous
memory and a plain write instead of mmap for producing the output.

Cheers,
Rafael

But that would disable mmap IO on systems that don’t support fallocate. I’m not sure if OpenBSD people are for example happy about that.

Only for output. It is hard to guess the preference of others, but if it
was the system I was using I would prefer a more reliable linker until
something like fallocate was provided by the file system.

I remember trying to switch lld to always use an anonymous buffer, and
it was really not that bad.

Cheers,
Rafael

Note that posix_fallocate may return EINVAL if the underlying
filesystem does not support the operation.

Fair. I'll try to do that first before exploring the idea of setting a
signal handler.

And if the performance of always using an anonymous buffer is too low,
we could use an abstraction like:

  #include <cstddef>
  #include <cstdint>
  #include <memory>

  class OutputBuffer {
  public:
        // Preallocate Size bytes for FD if the filesystem supports it and
        // mmap the file. If preallocation fails, fall back to an anonymous
        // memory buffer of Size bytes.
        OutputBuffer(int FD, size_t Size);

        // Return the mmaped or anonymous buffer.
        void *getBuffer();

  private:
        // The anonymous buffer, used only if we could not preallocate
        // the file.
        std::unique_ptr<uint8_t[]> Buffer;
  };
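
One possible way to fill in such an abstraction is sketched below. It is an illustration under stated assumptions, not the lld implementation: the class name `OutputBufferSketch`, the `commit()` and `isMapped()` members, and the use of posix_fallocate() for preallocation are all additions for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <memory>
#include <sys/mman.h>
#include <unistd.h>

class OutputBufferSketch {
public:
  // Try to preallocate and mmap the file; otherwise use heap memory.
  OutputBufferSketch(int FD, size_t Size) : FD(FD), Size(Size) {
    if (posix_fallocate(FD, 0, static_cast<off_t>(Size)) == 0) {
      void *P = mmap(nullptr, Size, PROT_READ | PROT_WRITE, MAP_SHARED, FD, 0);
      if (P != MAP_FAILED) {
        Mapped = static_cast<uint8_t *>(P);
        return;
      }
    }
    Heap.reset(new uint8_t[Size]()); // zero-initialized fallback buffer
  }

  ~OutputBufferSketch() {
    if (Mapped)
      munmap(Mapped, Size);
  }

  OutputBufferSketch(const OutputBufferSketch &) = delete;
  OutputBufferSketch &operator=(const OutputBufferSketch &) = delete;

  uint8_t *getBuffer() { return Mapped ? Mapped : Heap.get(); }
  bool isMapped() const { return Mapped != nullptr; }

  // Flush the contents to the file. Returns false on failure, which is
  // where a full disk surfaces in the fallback path.
  bool commit() {
    if (Mapped)
      return msync(Mapped, Size, MS_SYNC) == 0;
    size_t Done = 0;
    while (Done < Size) {
      ssize_t N = pwrite(FD, Heap.get() + Done, Size - Done,
                         static_cast<off_t>(Done));
      if (N < 0)
        return false; // EINTR retry omitted for brevity
      Done += static_cast<size_t>(N);
    }
    return true;
  }

private:
  int FD;
  size_t Size;
  uint8_t *Mapped = nullptr;
  std::unique_ptr<uint8_t[]> Heap;
};
```

In the mapped path the space is reserved before any page is touched, so a full disk is reported by the constructor's fallback or by `commit()` instead of arriving as a SIGBUS.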

Cheers,
Rafael