(not) initializing assembly outputs with -ftrivial-auto-var-init

If an asm's constraints claim that the variable is an output, but then don't actually write to it, that's a bug (at least if the value is actually used afterwards). An output-only constraint on inline asm definitely does _not_ mean "pass through the previous value unchanged, if the asm failed to actually write to it". If you need that behavior, it's spelled "+m", not "=m".

We do seem to fail to take advantage of this for memory outputs (again, this is not just for -ftrivial-auto-var-init -- we ought to eliminate manual initialization just the same), which I'd definitely consider a missed-optimization bug.

You mean we assume C code is buggy and asm code is not buggy because
the compiler fails to disprove that there is a bug?
Doing this optimization without -ftrivial-auto-var-init looks
reasonable; compilers do optimizations assuming the absence of bugs
throughout. But -ftrivial-auto-var-init is specifically about assuming
these bugs are everywhere.

Please be more specific about the problem, because your simplified example doesn't actually show an issue. If I write this function:
int foo() {
  int retval;
  asm("# ..." : "=r"(retval));
  return retval;
}
it already does get treated as definitely writing retval, and optimizes away the initialization (whether you explicitly initialize retval, or use -ftrivial-auto-var-init).
Example: https://godbolt.org/z/YYBCXL

This is probably because you're passing retval as a register output.
If you change "=r" to "=m" (https://godbolt.org/z/ulxSgx), it won't be
optimized away.
(I admit I didn't know about the difference)

Hi JF et al.,

In the Linux kernel we often encounter the following pattern:

type op(...) {
  type retval;
  inline asm(... retval ...);
  return retval;
}

, which is used to implement low-level platform-dependent memory operations.

Some of these operations turn out to be very hot, so we probably don't
want to initialize |retval| given that it's always initialized in the
assembly.

However, it's practically impossible to tell that a variable is being
written to by the inline assembly, or to figure out the size of that
write.
Perhaps we could speculatively treat every scalar output of an inline
assembly routine as an initialized value (which is true for the Linux
kernel, but I'm not sure about other users of inline assembly, e.g.
video codecs).

WDYT?

--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
_______________________________________________
cfe-dev mailing list
cfe-dev@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Does kernel asm use "+m" or "=m"?

If asm _must_ write to that variable, then we could improve DSE in
the normal case (-ftrivial-auto-var-init is not enabled). If
-ftrivial-auto-var-init is enabled, then strictly speaking we should not
remove the initialization, because we did not prove that the asm actually
writes. But we may remove the initialization anyway for practical
reasons.

Alex mentioned that in some cases we don't know actual address/size of
asm writes. But we should know it if a local var is passed to the asm,
which should be the case for kernel atomic asm blocks.

Interestingly, -ftrivial-auto-var-init DSE must not be stronger than
non--ftrivial-auto-var-init DSE, unless we are talking about our own
emitted initialization stores, in which case -ftrivial-auto-var-init DSE
may remove them more aggressively than what normal DSE would do: we
don't actually have to _prove_ that the init store is dead.

IMO the auto var init mitigation shouldn’t change the DSE optimization at all. We shouldn’t treat the stores we add any differently. We should just improve DSE and everything benefits (auto var init more so).

But you realize that this "just improve" involves fully understanding
the static and dynamic behavior of arbitrary assembly for any architecture,
without even using the integrated assembler? :wink:

If you want to solve every problem however unlikely, yes. If you narrow what you’re doing to a handful of cases that matter, no.

How can we improve DSE to handle all the main kernel patterns that matter?
Can we? It's still unclear to me. Extending this optimization to
generic DSE and all stores can make it a much harder (unsolvable)
problem...

Right now there's a handful of places in the kernel where we have to
use __attribute__((uninitialized)) just to avoid creating an extra
initializer: https://github.com/google/kmsan/commit/00387943691e6466659daac0312c8c5d8f9420b9
and https://github.com/google/kmsan/commit/2954f1c33a81c6f15c7331876f5b6e2fec0d631f
All those assembly directives use local scalar variables of size
<= 8 bytes as "=qm" outputs, so we can narrow the problem down to "let
DSE remove redundant stores to local scalars that are used as asm()
'm' outputs".
False positives will surely be possible in theory, but hopefully rare in practice.

Right, you only need to teach the optimizer about asm that matters. You don’t need “extending this optimization to generic DSE”. What I’m saying is: this is generic DSE, nothing special about variable auto-init, except we’re making sure it helps variable auto-init a lot. I.e. there’s no `if (VariableAutoInitIsOn)` in LLVM, there’s just some DSE smarts that are likely to kick in a lot more when variable auto-init is on.

It doesn't have to be "if (VariableAutoInitIsOn), turn on DSE"; it
could be just "if this is an assembly output, emit
__attribute__((uninitialized)) for it".

That’s something I would really rather avoid. It’s much better to make DSE more powerful than to play around with how clang generates variable auto-init.

I would still love to know what's the main source of truth for the
semantics of asm() constraints.

I don’t think you can trust programmer-provided constraints, unless you also add diagnostics to warn on incorrect constraints.

But then there's nothing left to trust. We surely don't want to parse the
assembly itself to reason about its behavior, so the constraints are
the only thing that lets us understand whether a variable is going to
be written to.

I think you do want to look into the assembly. Have you tried instrumenting Clang to dump out all assembly strings? What is in those strings?

To some extent I did. I had to solve the same problem for
KernelMemorySanitizer to avoid false positives on values coming from
the assembly.

Here's an incomplete list of problems I decided not to deal with,
ending up with a heuristic based on output constraints and dynamic
address checks:
1. Assembly operates with entities that don't directly map to C
language constructs (registers, segments, flags, program counter).
Aliasing rules also don't work with assembly, and the memory model is
different from that offered by C.
2. Right now Clang doesn't use the integrated assembler to build the
kernel, so it's hard to leverage any of the existing frameworks to
parse the assembly or perform optimizations on it
(also note that the existing opportunities to optimize inline assembly
are quite limited, e.g. Clang is even unable to optimize away a
duplicate "mov %eax, %edx" instruction).
3. One has to solve the problem for every architecture supported by
the compiler.
4. The kernel uses a long tail of weird instructions that are
never used in userspace code.
5. Certain inline assembly calls (e.g. for per-CPU variables) are
designed with Linux kernel implementation details in mind, and don't
make sense outside that. Ad-hoc code to handle them will have to be in
sync with the kernel source.

Instead of reasoning about particular kernel idioms on certain arches,
I'd prefer having all of that inline assembly translated to compiler
builtins with known semantics.
But I'm not sure that's currently possible, because AFAIU the kernel
developers are under the impression that those builtins are slower
than raw assembly (which I can also imagine).

Having said that, I suspect that we can do a good job in 95% of cases
just relying on the constraints, provided that those have strict
semantics we all agree on.
I'm not sure we want to do anything specific about malformed
constraints, as people who write inline assembly always have ways to
shoot themselves in the foot.

We can't start breaking correct user code because it's "hopefully rare in
practice". But we may well occasionally omit our hardening
initialization store if in most cases it is not necessary but we are not
really sure, e.g. not sure exactly what memory an asm block writes.
There is a huge difference complexity-wise between a 100% sound proof
and a best-effort hint. This is what's special about auto-initializing
stores.

I mean, I agree, all other things being equal we'd prefer handling it on
common grounds. But I still don't see all other things being equal here. From
what Alex says, it's not possible to figure out exactly what memory an
asm block writes.

I’m not sure what you mean. If the asm code says “=r” (or “=” anything) and fails to write a value to the variable, then you cannot use that value afterwards. That’s an error.

It’s just like “oops, I didn’t return a value from the function”:

int novalueret(int word) { if (word != 0) return 5; }

int f(int word) {
  int ret;
  ret = novalueret(word);
  return ret;
}

A mechanism to verify the correctness of arbitrary inline asm, and to check that its constraints match, is certainly an interesting project, and one that’s been discussed before. But if that’s even feasible at ALL (which I’m rather skeptical of), it’s going to need to be an entirely separate sanitization/hardening/warning framework.

In the rest of the compiler, if the asm constraints say something, we NEED to trust that the instructions actually do so – having that power is what inline asm is FOR. There is nothing else useful that we can do.

> We can’t start breaking correct user code because "hopefully rare in
> practice”.

I’m not advocating for this.

> But we can well episodically omit our hardening
> initializing store if in most cases it is not necessary but we are not
> really sure, e.g. not sure what exactly memory an asm block writes.

I don’t agree. It’s a bad mitigation if it sometimes goes ¯\_(ツ)_/¯

> There is huge difference complexity-wise between a 100% sound proof
> and a best-effort hint.

Correct, and I don’t think you need a 100% solution for DSE (i.e. you don’t need to understand the semantics of all assembly instructions for all ISAs). You just need to hit the cases that matter (some instructions on some ISAs), and have those cases remain 100% sound.

> This is very special about auto-initializing
> stores.
>
> I mean, I agree, all others being equal we prefer handling it on
> common grounds. But still don't see all others being equal here. From
> what Alex says, it's not possible to figure out what exactly memory an
> asm block writes.

Agreed, and I’m not saying that this needs to happen.

I’ll re-iterate: which asm statements result in extraneous initialization? What instructions are they?

This thread has IMO started going down an unfortunate path.

Normal compiler behavior and optimization passes (such as DSE, and everything else) should not care WHAT is inside the assembly string, but should just trust the asm constraints to properly indicate the behavior of the contained assembly. If the asm constraint says it stores a value (which is what "=m" means), then the usual behavior of the compiler should be to assume that it indeed does so. Doing otherwise starts to get into very scary territory.

The initial problem here is that we do not properly tag memory-outputs of inline asm as definitely being a store to that memory. They should be so-tagged. When we fix that bug, then code like this (compiled with optimizations, but no special hardening flags):

int f() {
  int out = 5;
  asm("# Do nothing, LOL!" : "=m"(out));
  return out;
}
will be compiled down to simply load an uninitialized stack value and return it.

movl -4(%rsp), %eax
ret

That is the correct and desired behavior. (And, implied by this is that with the current implementation of -ftrivial-auto-var-init, its initialization also will be eliminated.)

That said – if we want to implement inline-asm targeted mitigations in certain hardening modes, I’m not saying we cannot do that. It’s just that we need to be clear that it is special hardening behavior.

If an lvalue is passed to "=m", what exactly is written? A single value of
the lvalue's type, as passed? The whole base object? What if it's a
memset-like asm block that writes an array? What if it writes a single
value but at some offset? What if it writes a single value, but the size
of the write does not match the static type? I think I've seen asm
blocks of all these types in the kernel.

The entirety of the named object is replaced. If you want to modify an object, instead of entirely replacing it, you use “+m”.

None of this is anything new or innovative – GCC has had these semantics – and been optimizing based on them – for ages.

E.g., here, all elements of the array are replaced, so the initialization goes away, and the return needs to explicitly add all 4 values written by the inline-asm.
int out[4] = {1,2,3,4};
asm("whatever" : "=m"(out));
return out[0] + out[1] + out[2] + out[3];

Here, only out[1] is touched by the inline asm. The other values are not modified, so all of the initialization can disappear, and the generated code can simply return 8 + out[1].
int out[4] = {1,2,3,4};
asm("whatever" : "=m"(out[1]));
return out[0] + out[1] + out[2] + out[3];

Thanks for taking the time to explain this!
After re-reading the thread, I agree we need to make DSE work with the
cases where the constraints are correct, which will cover most of the
uses of inline assembly in the kernel.
Not dead-store-eliminating certain stores with incorrect constraints
shouldn't be a problem.

It turns out it's quite easy to make DSE work for the particular case of a
no-input-single-output asm() directive by patching
hasAnalyzableMemoryWrite() and getLocForWrite() in
DeadStoreElimination.cpp.
But for directives that read and write more than one memory location
it might require some refactoring.
Vitaly (CCed) is already working on cross-basic-block DSE right now,
so I'd rather not touch the assembly handling before he lands his patches.

Thanks!

What exactly do you mean by a named object? out[1] does not refer to
any named object, right? Nor does *(a+i).

How should a memset-like function be described that writes an unknown
number of elements?

What if we need to pass a pointer to the beginning of an array, and the
asm block writes to the i-th (unknown) element of the array?

What if we pass a pointer to int but actually write 6 or 32 bytes at
that address?

Thanks! This all sounds great. Please CC me on the patches, I’d like to make sure it works well for our code as well (potentially in follow-ups I’d send).

I am asking because I am seeing all of these cases in the kernel code.
So I am trying to understand (1) whether it has formal semantics and what
they are, (2) whether it's precisely analyzable, and (3) whether developers
are aware of and respect these semantics, or whether we need to allocate
another year for fixing incorrect code if we go this route.

James, what do you think about this particular case
(https://godbolt.org/z/Vl2bst)?

=======================
void clear_bit(long nr, volatile unsigned long *addr) {
  asm volatile("lock; btr %1,%0"
               : "+m"(*(volatile long *)addr)
               : "Ir" (nr));
}
unsigned long foo() {
  unsigned long addr[2] = {1,2};
  clear_bit(65, addr);
  return addr[0] + addr[1];
}

The declaration of clear_bit() is taken from the Linux kernel
(https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/bitops.h#L111).
However, it appears to be incorrect: GCC assumes that only addr[0] can
be overwritten by the inline assembly, whereas the call actually
touches addr[1].
Is this the expected behavior?

Yes, this is the expected behavior of the above code. Which is to say: yes, this code is broken.

We can fix the situation by adding the “memory” clobber to the asm()
directive, but maybe there’s a more elegant way to tell the compiler
we’re potentially touching any byte of the array?

There is no “any byte of the array” here, because you passed it a value of type “long”. If you passed the asm a value of array type, it would treat it as touching any byte of the array. But, in this case, there’s really no way of knowing the intended size of the data pointed to by the pointer, so that’s not workable. Given that you’re passing a pointer to unsized data, I would write this instead as simply taking the address and using a memory clobber.

That is:
asm volatile("lock; btr %1,(%0)"
             : /* no outputs */
             : "r"(addr), "Ir" (nr)
             : "memory");

So far it looks weird.
I have a benchmark which shows a -1.5% performance regression with -ftrivial-auto-var-init on trunk Clang.

My DSE prototype, cross-block only, removes an additional 10% of stores compared to the existing DSE. Obviously cross-block is not enough, and we need to do inter-procedural module analysis or even LTO.

However, the problem is the following.
To my surprise, if I completely disable the existing DSE (and mine), I see no difference on benchmarks at all. Same (no difference) with or without -ftrivial-auto-var-init.