RFC: Implementing -fno-delete-null-pointer-checks in clang

Hi,

This is regarding support for -fno-delete-null-pointer-checks in clang (PR
9251).
Linux kernel uses this flag to keep null pointer checks from getting
optimized away. Since clang does not currently support this flag, it often
invites comments from kernel devs that clang is not yet *ready* to compile
Linux kernel.
I have also heard that developers working on firmware, bare-metal tools and
other low level tools often want to use this flag as well.

Therefore, I would like to implement support for this flag (maybe with a
different name), and would like to know the opinions on how it should be
implemented.

Some options/opinions:

1 (From Duncan Sands comment at PR 9251)

This could be implemented by putting pointers in a non-default address

space when this flag is passed.
Any ideas how will this work for languages/targets that actively make use
of non-default address spaces e.g. OpenCL/GPU targets. Also, I believe
that allocas still need to kept in default address space so they need to be
bitcast to the *no-default* address space before use.

2. Use this flag like any other optimization flag. Find and fix all
optimizations in clang/llvm related to null pointer accesses.

Thanks,
Manoj

Background:
https://lkml.org/lkml/2018/4/4/601

From Linus Torvalds:

Note that we explicitly use "-fno-delete-null-pointer-checks" because
we do *not* want NULL pointer check removal even in the case of a bug.

See commit a3ca86aea507 ("Add '-fno-delete-null-pointer-checks' to gcc
CFLAGS") for the reason: we had buggy code that accessed a pointer
before the NULL pointer check, but the bug was "benign" as long as the
compiler didn't actually remove the check.

And note how the buggy code *looked* safe. It was doing the right
check, it was just that the check was hidden and disabled by another
bug.

Removing the NULL pointer check turned a benign bug into a trivially
exploitable one by just mapping user space data at NULL (which avoided
the kernel oops, and then made the kernel use the user value!).

Now, admittedly we have a ton of other security features in place to
avoid these kinds of issues - SMAP helps on the hardware side, and not
allowing users to mmap() NULL in the first place helps with most
distributions, but the point remains: the kernel generally really
doesn't want optimizations that are perhaps allowed by the standard,
but that result in code generation that doesn't match the source code.

The NULL pointer removal is one such thing: don't remove checks in the
kernel based on "standard says". It's ok to do optimizations that are
based on "hardware does the exact same thing", but not on "the
standard says this is undefined so we can remove it".

I'd suggest -mdo-what-i-mean; the whole idea is horribly
underspecified, and basically rips up the LangRef in favour of a
nebulous set of good and bad optimizations (probably as dictated by
the ones that have bitten someone who wrote bad code recently and was
grumpy enough about it to complain at us).

In particular I'm skeptical that the address space solution actually
works. What they actually mean is probably not so much about removing
the icmps, but about the control-flow decisions based on them. And
LLVM might not be aware that any given conditional branch comes from a
tainted comparison.

Cheers.

Tim.

Despite the name, the flag actually has rather straightforward semantics from the compiler's perspective. From the gcc docs for -fdelete-null-pointer-checks: "Assume that programs cannot safely dereference null pointers, and that no code or data element resides at address zero." (-fno-delete-null-pointer-checks is the opposite.)

-Eli

Despite the name, the flag actually has rather straightforward semantics
from the compiler's perspective. From the gcc docs for
-fdelete-null-pointer-checks: "Assume that programs cannot safely
dereference null pointers, and that no code or data element resides at
address zero." (-fno-delete-null-pointer-checks is the opposite.)

Ah, now that's quite a bit more palatable. I really should have read
the docs before spouting "my favourite rant #1". Then my main concern
is that we end up with a sufficiently clear (and documented)
definition that we're not promising more than we intend. I get very
grumpy if I can't tell someone with UB that they're Doing It Wrong.

Of the two options, I still think the second is a non-starter.
Something between the two might be a datalayout flag specifying
addrspace(0) behaviour. It's pretty easy to argue that it'd be good if
code used some kind of
"DataLayout::isPointerIntrinsicallyInvalid(Value *)" for this kind of
thing anyway (rename or relocate at will).

And the name really is terrible, we should change it if at all feasible

Tim.

Despite the name, the flag actually has rather straightforward semantics
from the compiler's perspective. From the gcc docs for
-fdelete-null-pointer-checks: "Assume that programs cannot safely
dereference null pointers, and that no code or data element resides at
address zero." (-fno-delete-null-pointer-checks is the opposite.)

Ah, now that's quite a bit more palatable. I really should have read
the docs before spouting "my favourite rant #1". Then my main concern
is that we end up with a sufficiently clear (and documented)
definition that we're not promising more than we intend. I get very
grumpy if I can't tell someone with UB that they're Doing It Wrong.

Of the two options, I still think the second is a non-starter.
Something between the two might be a datalayout flag specifying
addrspace(0) behaviour.

But then we're back to auditing everything (FWIW, I'm not sure there's
going to be a great alternative to an audit). What behavior would you
like under LTO? We could use a different address space by default
(although, unless you also change the address space used for allocas,
you're going to end up with addressspace casts all over the place).

-Hal

Despite the name, the flag actually has rather straightforward semantics
from the compiler’s perspective. From the gcc docs for
-fdelete-null-pointer-checks: “Assume that programs cannot safely
dereference null pointers, and that no code or data element resides at
address zero.” (-fno-delete-null-pointer-checks is the opposite.)

Ah, now that’s quite a bit more palatable. I really should have read
the docs before spouting “my favourite rant #1”. Then my main concern
is that we end up with a sufficiently clear (and documented)
definition that we’re not promising more than we intend. I get very
grumpy if I can’t tell someone with UB that they’re Doing It Wrong

+1 for implementing the feature.

Of the two options, I still think the second is a non-starter.

Something between the two might be a datalayout flag specifying
addrspace(0) behaviour. It’s pretty easy to argue that it’d be good if
code used some kind of
“DataLayout::isPointerIntrinsicallyInvalid(Value *)” for this kind of
thing anyway (rename or relocate at will)

Whether or not this is put into DataLayout, moving all the null-related addrSpace != 0 checks into an accessor seems like a great idea. Besides this feature request, presumably there’s also other uses of non-zero address spaces for which the pointer 0 could usefully be treated as invalid…

And the name really is terrible, we should change it if at all feasible

Yep. “-fnull-pointer-is-valid” has been suggested before.

-fplacate-linux-kernel-developers ?

Csaba

-std=vaguely-like-c99?

David

Please, lets keep this discussion on topic and productive. The semantics described upstream of (the inverse of) "Assume that programs cannot safely dereference null pointers, and that no code or data element resides at address zero.” is entirely reasonable.

-Chris

I disagree, because it depends on what you mean by dereference. If it is safe to dereference NULL, then that means that it is also safe to hoist the null dereference above the check. For example:

  if (x == NULL)
    return;
  y = *x;

Is safe to transform into:

  y = *x;
  if (x == NULL)
    return;

The load is guaranteed not to trap by the fact that NULL a dereferencable address.

As I understand it, the issue in GCC was that they treated the address calculation as a dereference. The word ‘dereference’ does not appear in the C standard, which talks only about use, and therefore treats *x and &x->y both as uses. This is intentional, because on segmented architectures it is possible that NULL + X will give you an out-of-bounds and unrepresentable pointer, possibly even causing a trap.

A Linux kernel compiled with clang was not vulnerable, because LLVM was not eliding this check, which makes this flag somewhat pointless. LLVM assumed that a reachable load or a store of an pointer guaranteed that it was not NULL, but that a GEP did not.

I believe that the correct fix is to explicitly permit GEPs with a NULL base in the LLVM IR spec. As I understand it, this is already implicitly permitted because code that predates __builtin_offsetof uses a cast of 0 to a struct type and takes the address of a field to get the offset.

Any code that is loading or storing from null prior to a null check probably doesn’t mind if the null check is then elided, because it’s going to trap anyway (supporting C code that allows loads and stores from an address with a bit pattern of 0 when interpreted as an integer is incredibly hard, as our friends at IBM can attest).

David

I disagree, because it depends on what you mean by dereference. If it is safe to dereference NULL, then that means that it is also safe to hoist the null dereference above the check. For example:

Nope. The intent is that NULL is potentially dereferenceable, just as any other address would be. Not that it’s known to always be dereferenceable.

if (x == NULL)
return;
y = *x;

Is safe to transform into:

y = *x;
if (x == NULL)
return;

The load is guaranteed not to trap by the fact that NULL a dereferencable address.

That transform is still not safe with the new flag, because you do not know that address 0 is dereferenceable in that code. You also don’t know that it’s definitely not dereferenceable, however – which is the new behavior.

Any code that is loading or storing from null prior to a null check probably doesn’t mind if the null check is then elided, because it’s going to trap anyway (supporting C code that allows loads and stores from an address with a bit pattern of 0 when interpreted as an integer is incredibly hard, as our friends at IBM can attest).

No – the kernel does mind exactly this, because in some circumstances, dereferencing null does not trap in kernel context. So you end up with both no trap and no check, and a security vulnerability. That’s less true now, with the various other protections e.g. min_mmap_address sysctl, SMAP, and others, but still, the desire is to keep this from happening.

Here is the commit which started using the flag in the kernel:

commit a3ca86aea507904148870946d599e07a340b39bf

Author: Eugene Teo <eteo@redhat.com>

Add ‘-fno-delete-null-pointer-checks’ to gcc CFLAGS

Turning on this flag could prevent the compiler from optimising away

some “useless” checks for null pointers. Such bugs can sometimes become

exploitable at compile time because of the -O2 optimisation.

See http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Optimize-Options.html

An example that clearly shows this ‘problem’ is commit 6bf67672.

static void __devexit agnx_pci_remove(struct pci_dev *pdev)

{

struct ieee80211_hw *dev = pci_get_drvdata(pdev);

  • struct agnx_priv *priv = dev->priv;
  • struct agnx_priv *priv;

AGNX_TRACE;

if (!dev)

return;

  • priv = dev->priv;

By reverting this patch, and compile it with and without

-fno-delete-null-pointer-checks flag, we can see that the check for dev

is compiled away.

call printk #

  • testq %r12, %r12 # dev

  • je .L94 #,

movq %r12, %rdi # dev,

Clearly the ‘fix’ is to stop using dev before it is tested, but building

with -fno-delete-null-pointer-checks flag at least makes it harder to

abuse.

Signed-off-by: Eugene Teo <eugeneteo@kernel.sg>

Acked-by: Eric Paris <eparis@redhat.com>

Acked-by: Wang Cong <amwang@redhat.com>

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Hi,

Turning this on its head. Adding a warning saying something like
"Warning: NULL point check being optimized out at line X" might be
useful.
The developer might spot that and think "Ah! maybe I put that check in
the wrong place!"

Kind Regards

James

Fair warning, the following is a devil's advocate position, but it's also a serious question.

Given the entire point of this flag appears to be bug mitigation, why not frame this as a sanitizer? If we had a hypothetical -fsanitize=dereference which tried to catch dereferenced pointers derived from null, wouldn't that handle the case at hand?

i.e. Why are we focused on *hiding* the bug instead of *finding and fixing* the bug?

Philip

p.s. The above is written with the understanding that the code compiled with this flag doesn’t actually place valid objects at null. I know that is not true for some embedded systems, but that seems distinct from this proposal.

It's called "-fsanitize=null": it catches stuff like "x[3]" where x is null. It's not quite complete; we don't check for arithmetic on a null pointer.

Yes, that would handle the situation in question, but putting implicit null checks all over the place is pretty expensive; I don't think most people would turn that on in production.

-Eli

We had a similar discussion on an internal thread a while back if we can use “-fsanitize=null” where clang
would generate ud2 instruction for null pointer dereferences. Unfortunately, this doesn’t work in kernel context.

Quoting the reply from our kernel team:

"It will not cause a kernel panic: it’s an exception trigger, and it’s

up to the exception handler to decide if it will return (WARN) or not
(BUG). In the referenced function, this is calling WARN_ON() which
will resume execution. (And note that the BUG() implementations are
specifically marked with attribute((noreturn)). "

(again, knowingly playing devil's advocate)

But wouldn't it catch - during development, before release - the bug which motivates the desire for the mitigation? If so, why worry about the mitigation? Is the concern insufficient testing/coverage?

And alternatively, the performance overhead could be likely to be improved if someone wanted to spend some time doing so. Between fully exploiting implicit null checks (support in tree today) and imprecise fault locations(*), I suspect the overhead could be made quite small.

* Terminology: "implicit null check" is simply constructing a load or store which is guaranteed to fault if the preceding null check would have failed. "imprecise fault location" is allowing the failure to happen earlier or later in the execution so long as additional side effect is not visible. If you're okay with earlier side effects not being visible, that opens up a lot of flexibility. Similar, allowing some side effects (but not say, IO) does as well.

Philip

If the kernel can't use -fsanitize-trap, it could use some alternative like "-fsanitize=null -fno-sanitize-recover=null -fsanitize-minimal-runtime". That doesn't seem like a fundamental flaw in the approach.

-Eli