Powerpc Linux 'scv' system call ABI proposal take 2

I would like to enable Linux support for the powerpc 'scv' instruction,
as a faster system call instruction.

This requires two things to be defined: Firstly a way to advertise to
userspace that the kernel supports scv, and a way to allocate and advertise
support for individual scv vectors. Secondly, a calling convention ABI
for this new instruction.

Thanks to those who commented last time. Since then I have removed my
answered questions and unpopular alternatives, but you can find them
here:

https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html

Let me try one more time with a wider cc list, and then we'll get something
merged. Any questions or counter-opinions are welcome.

System Call Vectored (scv) ABI

My preference would be that it work just like the i386 AT_SYSINFO
where you just replace "int $128" with "call *%%gs:16" and the kernel
provides a stub in the vdso that performs either scv or the old
mechanism with the same calling convention. Then if the kernel doesn't
provide it (because the kernel is too old) libc would have to provide
its own stub that uses the legacy method and matches the calling
convention of the one the kernel is expected to provide.

Note that any libc that actually makes use of the new functionality is
not going to be able to make clobbers conditional on support for it;
branching around different clobbers is going to defeat any gains vs
always just treating anything clobbered by either method as clobbered.
Likewise, it's not useful to have different error return mechanisms
because the caller just has to branch to support both (or the
kernel-provided stub just has to emulate one for it; that could work
if you really want to change the bad existing convention).

Thoughts?

Rich

Excerpts from Rich Felker's message of April 16, 2020 8:55 am:

System Call Vectored (scv) ABI

The scv instruction is introduced with POWER9 / ISA3, and it comes with an
rfscv counterpart. The benefit of these instructions is performance
(trading slower SRR0/1 for faster LR/CTR registers, and entering the
kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR
updates). The scv instruction has 128 interrupt entry points (not enough
to cover the Linux system call space).

The proposal is to assign scv numbers very conservatively and allocate
them as individual HWCAP features as we add support for more. The zero
vector ('scv 0') will be used for normal system calls, equivalent to 'sc'.

Advertisement

Linux has not enabled FSCR[SCV] yet, so the instruction will cause a
SIGILL in current environments. Linux has defined a HWCAP2 bit
PPC_FEATURE2_SCV for SCV support, but does not set it.

When scv instruction support and the scv 0 vector for system calls are
added, PPC_FEATURE2_SCV will indicate support for these. Other vectors
should not be used without future HWCAP bits indicating support, which is
how we will allocate them. (Should unallocated ones generate SIGILL, or
return -ENOSYS in r3?)
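As a rough illustration of the advertisement scheme above, a userspace check
could look something like the sketch below. It assumes getauxval() from
<sys/auxv.h> (glibc and musl both provide it) and the PPC_FEATURE2_SCV value
from the kernel's powerpc cputable.h (0x00100000). The helper names
hwcap2_has_scv() and kernel_supports_scv0() are made up for this example, and
the bit only carries this meaning on powerpc.

```c
#include <sys/auxv.h>   /* getauxval() */

/* Fallback definitions for toolchains whose headers lack them. */
#ifndef AT_HWCAP2
#define AT_HWCAP2 26
#endif
#ifndef PPC_FEATURE2_SCV
#define PPC_FEATURE2_SCV 0x00100000UL
#endif

/* Hypothetical helper: 1 if this AT_HWCAP2 word advertises scv 0, else 0. */
int hwcap2_has_scv(unsigned long hwcap2)
{
    return (hwcap2 & PPC_FEATURE2_SCV) != 0;
}

/* Hypothetical helper: query the running kernel (meaningful on powerpc only). */
int kernel_supports_scv0(void)
{
    return hwcap2_has_scv(getauxval(AT_HWCAP2));
}
```

Under the proposal, any further vector would need its own, not yet defined,
HWCAP bit and an equivalent check before use.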

Calling convention

The proposal is for scv 0 to provide the standard Linux system call ABI
with the following differences from sc convention[1]:

- LR is to be volatile across scv calls. This is necessary because the
  scv instruction clobbers LR. From previous discussion, this should be
  possible to deal with in GCC clobbers and CFI.

- CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the
  kernel system call exit to avoid restoring the CR register (although
  we probably still would anyway to avoid information leak).

- Error handling: I think the consensus has been to move to using negative
  return value in r3 rather than CR0[SO]=1 to indicate error, which matches
  most other architectures and is closer to a function call.
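To make the difference in error handling concrete, here is a minimal C sketch
of how a libc might normalise the two conventions. It only models the register
values: r3 and the CR0[SO] flag are passed in as plain arguments, and the
function names are hypothetical. The [-4095, -1] error range is the usual
Linux convention for "negative errno" returns.

```c
#include <errno.h>

/* Results in [-MAX_ERRNO, -1] are treated as errors. */
#define MAX_ERRNO 4095

/* 'sc' convention: CR0[SO] set means r3 holds a positive errno value. */
long sc_result(long r3, int cr0_so)
{
    return cr0_so ? -r3 : r3;
}

/* Proposed 'scv 0' convention: r3 already holds -errno on failure. */
long scv_result(long r3)
{
    return r3;
}

/* Common libc-style tail: turn a -errno result into -1 with errno set. */
long syscall_ret(long r)
{
    if ((unsigned long)r >= (unsigned long)-MAX_ERRNO) {
        errno = (int)-r;
        return -1;
    }
    return r;
}
```

With the scv 0 convention the first step disappears, so the value coming back
from the kernel can be fed to the common tail directly.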

The number of scratch registers (r9-r12) at kernel entry seems
sufficient that we don't have any costly spilling; the patch is here[2].

[1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst
[2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html

My preference would be that it work just like the i386 AT_SYSINFO
where you just replace "int $128" with "call *%%gs:16" and the kernel
provides a stub in the vdso that performs either scv or the old
mechanism with the same calling convention. Then if the kernel doesn't
provide it (because the kernel is too old) libc would have to provide
its own stub that uses the legacy method and matches the calling
convention of the one the kernel is expected to provide.

I'm not sure if that's necessary. That's done on x86-32 because they
select different sequences to use based on the CPU they are running on and
whether the host kernel is 32 or 64 bit. Sure, they could in theory have a
bunch of HWCAP bits and select the right sequence in libc as well, I suppose.

Note that any libc that actually makes use of the new functionality is
not going to be able to make clobbers conditional on support for it;
branching around different clobbers is going to defeat any gains vs
always just treating anything clobbered by either method as clobbered.

Well it would have to test HWCAP and patch in or branch to two
completely different sequences including register save/restores yes.
You could have the same asm and matching clobbers to put the sequence
inline and then you could patch the one sc/scv instruction I suppose.

A bit of logic to select between them doesn't defeat gains though;
it's about a 90-cycle improvement, which is a handful of branch mispredicts,
so it really is an improvement. Eventually userspace will stop
supporting the old variant too.

Likewise, it's not useful to have different error return mechanisms
because the caller just has to branch to support both (or the
kernel-provided stub just has to emulate one for it; that could work
if you really want to change the bad existing convention).

Thoughts?

The existing convention has to change somewhat because of the clobbers,
so I thought we could change the error return at the same time. I'm
open to not changing it and using CR0[SO], but others liked the idea.
Pro: it matches sc and vsyscall. Con: it's different from other common
archs. Performance-wise it would really be a wash -- cost of conditional
branch is not the cmp but the mispredict.

Thanks,
Nick

Excerpts from Rich Felker's message of April 16, 2020 8:55 am:
> My preference would be that it work just like the i386 AT_SYSINFO
> where you just replace "int $128" with "call *%%gs:16" and the kernel
> provides a stub in the vdso that performs either scv or the old
> mechanism with the same calling convention. Then if the kernel doesn't
> provide it (because the kernel is too old) libc would have to provide
> its own stub that uses the legacy method and matches the calling
> convention of the one the kernel is expected to provide.

I'm not sure if that's necessary. That's done on x86-32 because they
select different sequences to use based on the CPU they are running on and
whether the host kernel is 32 or 64 bit. Sure, they could in theory have a
bunch of HWCAP bits and select the right sequence in libc as well, I suppose.

It's not just a HWCAP. It's a contract between the kernel and
userspace to support a particular calling convention that's not
exposed except as the public entry point the kernel exports via
AT_SYSINFO.

> Note that any libc that actually makes use of the new functionality is
> not going to be able to make clobbers conditional on support for it;
> branching around different clobbers is going to defeat any gains vs
> always just treating anything clobbered by either method as clobbered.

Well it would have to test HWCAP and patch in or branch to two
completely different sequences including register save/restores yes.
You could have the same asm and matching clobbers to put the sequence
inline and then you could patch the one sc/scv instruction I suppose.

A bit of logic to select between them doesn't defeat gains though;
it's about a 90-cycle improvement, which is a handful of branch mispredicts,
so it really is an improvement. Eventually userspace will stop
supporting the old variant too.

Oh, I didn't mean it would neutralize the benefit of scv. Rather, I
meant it would be worse to do:

  if (hwcap & X) {
    __asm__(... with some clobbers);
  } else {
    __asm__(... with different clobbers);
  }

instead of just

  __asm__("indirect call" ... with common clobbers);

where the indirect call is to an address ideally provided like on
i386, or otherwise initialized to one of two or more code addresses in
libc based on hwcap bits.
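A rough, portable sketch of that indirect-call arrangement follows. All names
are hypothetical, the two stubs are placeholders that return a tag instead of
executing sc or scv, and the kernel-provided entry point is modelled as an
optional function pointer (no such powerpc auxv entry exists today). The one
thing it is meant to show is that every call site uses a single pointer and a
single calling convention, with the choice made once at startup.

```c
#include <stddef.h>

/* Value of the kernel's PPC_FEATURE2_SCV bit in AT_HWCAP2. */
#define SCV_HWCAP2_BIT 0x00100000UL

typedef long (*syscall_entry_t)(long nr, long a, long b, long c);

/* Placeholder for a libc stub that would execute 'sc'. */
long stub_sc(long nr, long a, long b, long c)
{
    (void)nr; (void)a; (void)b; (void)c;
    return 1;
}

/* Placeholder for a libc stub that would execute 'scv 0'. */
long stub_scv(long nr, long a, long b, long c)
{
    (void)nr; (void)a; (void)b; (void)c;
    return 2;
}

/* The single pointer every syscall site calls through. */
syscall_entry_t syscall_entry = stub_sc;

/* Run once at startup: prefer a kernel-provided entry, else pick by hwcap. */
void init_syscall_entry(syscall_entry_t kernel_entry, unsigned long hwcap2)
{
    if (kernel_entry != NULL)
        syscall_entry = kernel_entry;
    else if (hwcap2 & SCV_HWCAP2_BIT)
        syscall_entry = stub_scv;
    else
        syscall_entry = stub_sc;
}
```

A call site then reduces to syscall_entry(nr, a, b, c) with one set of
clobbers, whichever stub was selected.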

> Likewise, it's not useful to have different error return mechanisms
> because the caller just has to branch to support both (or the
> kernel-provided stub just has to emulate one for it; that could work
> if you really want to change the bad existing convention).
>
> Thoughts?

The existing convention has to change somewhat because of the clobbers,
so I thought we could change the error return at the same time. I'm
open to not changing it and using CR0[SO], but others liked the idea.
Pro: it matches sc and vsyscall. Con: it's different from other common
archs. Performance-wise it would really be a wash -- cost of conditional
branch is not the cmp but the mispredict.

If you do the branch on hwcap at each syscall, then you significantly
increase code size of every syscall point, likely turning a bunch of
trivial functions that didn't need stack frames into ones that do. You
also potentially make them need a TOC pointer. Making them all just do
an indirect call unconditionally (with pointer in TLS like i386?) is a
lot more efficient in code size and at least as good for performance.

Rich

Excerpts from Rich Felker's message of April 16, 2020 10:48 am:

Excerpts from Rich Felker's message of April 16, 2020 8:55 am:
> My preference would be that it work just like the i386 AT_SYSINFO
> where you just replace "int $128" with "call *%%gs:16" and the kernel
> provides a stub in the vdso that performs either scv or the old
> mechanism with the same calling convention. Then if the kernel doesn't
> provide it (because the kernel is too old) libc would have to provide
> its own stub that uses the legacy method and matches the calling
> convention of the one the kernel is expected to provide.

I'm not sure if that's necessary. That's done on x86-32 because they
select different sequences to use based on the CPU they are running on and
whether the host kernel is 32 or 64 bit. Sure, they could in theory have a
bunch of HWCAP bits and select the right sequence in libc as well, I suppose.

It's not just a HWCAP. It's a contract between the kernel and
userspace to support a particular calling convention that's not
exposed except as the public entry point the kernel exports via
AT_SYSINFO.

Right.

> Note that any libc that actually makes use of the new functionality is
> not going to be able to make clobbers conditional on support for it;
> branching around different clobbers is going to defeat any gains vs
> always just treating anything clobbered by either method as clobbered.

Well it would have to test HWCAP and patch in or branch to two
completely different sequences including register save/restores yes.
You could have the same asm and matching clobbers to put the sequence
inline and then you could patch the one sc/scv instruction I suppose.

A bit of logic to select between them doesn't defeat gains though;
it's about a 90-cycle improvement, which is a handful of branch mispredicts,
so it really is an improvement. Eventually userspace will stop
supporting the old variant too.

Oh, I didn't mean it would neutralize the benefit of scv. Rather, I
meant it would be worse to do:

  if (hwcap & X) {
    __asm__(... with some clobbers);
  } else {
    __asm__(... with different clobbers);
  }

instead of just

  __asm__("indirect call" ... with common clobbers);

Ah okay. Well that's debatable but if you didn't have an indirect call,
rather a runtime-patched sequence, then yes saving the LR clobber or
whatever wouldn't be worth a branch.

where the indirect call is to an address ideally provided like on
i386, or otherwise initialized to one of two or more code addresses in
libc based on hwcap bits.

Right, I'm just skeptical we need the indirect call or need to provide
it in the vdso. The "clever" reason to add it on x86-32 was because of
the bugs and different combinations needed; that doesn't really apply
to scv 0 and was not necessarily a great choice.

> Likewise, it's not useful to have different error return mechanisms
> because the caller just has to branch to support both (or the
> kernel-provided stub just has to emulate one for it; that could work
> if you really want to change the bad existing convention).
>
> Thoughts?

The existing convention has to change somewhat because of the clobbers,
so I thought we could change the error return at the same time. I'm
open to not changing it and using CR0[SO], but others liked the idea.
Pro: it matches sc and vsyscall. Con: it's different from other common
archs. Performance-wise it would really be a wash -- cost of conditional
branch is not the cmp but the mispredict.

If you do the branch on hwcap at each syscall, then you significantly
increase code size of every syscall point, likely turning a bunch of
trivial functions that didn't need stack frames into ones that do. You
also potentially make them need a TOC pointer. Making them all just do
an indirect call unconditionally (with pointer in TLS like i386?) is a
lot more efficient in code size and at least as good for performance.

I disagree. Doing the long vdso indirect call *necessarily* requires
touching a new icache line, and even a new TLB entry. Indirect branches
also tend to be more costly and/or less accurate to predict than
direct even without spectre (generally fewer indirect predictor entries
than direct, far branches in particular require a lot of bits for
target). And with spectre we're flushing the indirect predictors
on context switch or even disabling indirect prediction or flushing
across privilege domains in the same context.

And finally, the HWCAP test can eventually go away in future. A vdso
call can not.

If you really want to select with an indirect branch rather than
direct conditional, you can do that all within the library.

Thanks,
Nick

>> > Likewise, it's not useful to have different error return mechanisms
>> > because the caller just has to branch to support both (or the
>> > kernel-provided stub just has to emulate one for it; that could work
>> > if you really want to change the bad existing convention).
>> >
>> > Thoughts?
>>
>> The existing convention has to change somewhat because of the clobbers,
>> so I thought we could change the error return at the same time. I'm
>> open to not changing it and using CR0[SO], but others liked the idea.
>> Pro: it matches sc and vsyscall. Con: it's different from other common
>> archs. Performance-wise it would really be a wash -- cost of conditional
>> branch is not the cmp but the mispredict.
>
> If you do the branch on hwcap at each syscall, then you significantly
> increase code size of every syscall point, likely turning a bunch of
> trivial functions that didn't need stack frames into ones that do. You
> also potentially make them need a TOC pointer. Making them all just do
> an indirect call unconditionally (with pointer in TLS like i386?) is a
> lot more efficient in code size and at least as good for performance.

I disagree. Doing the long vdso indirect call *necessarily* requires
touching a new icache line, and even a new TLB entry. Indirect branches

The increase in number of icache lines from the branch at every
syscall point is far greater than the use of a single extra icache
line shared by all syscalls. Not to mention the dcache line to access
__hwcap or whatever, and the icache lines to set up TOC-relative
access to it. (Of course you could put a copy of its value in TLS at a
fixed offset, which would somewhat mitigate both.)

And finally, the HWCAP test can eventually go away in future. A vdso
call can not.

We support nearly arbitrarily old kernels (with limited functionality)
and hardware (with full functionality) and don't intend for that to
change, ever. But indeed glibc might want to eventually drop the
check.

If you really want to select with an indirect branch rather than
direct conditional, you can do that all within the library.

OK. It's a little bit more work if that's not the interface the kernel
will give us, but it's no big deal.

Rich

Excerpts from Rich Felker's message of April 16, 2020 12:35 pm:

>> > Likewise, it's not useful to have different error return mechanisms
>> > because the caller just has to branch to support both (or the
>> > kernel-provided stub just has to emulate one for it; that could work
>> > if you really want to change the bad existing convention).
>> >
>> > Thoughts?
>>
>> The existing convention has to change somewhat because of the clobbers,
>> so I thought we could change the error return at the same time. I'm
>> open to not changing it and using CR0[SO], but others liked the idea.
>> Pro: it matches sc and vsyscall. Con: it's different from other common
>> archs. Performance-wise it would really be a wash -- cost of conditional
>> branch is not the cmp but the mispredict.
>
> If you do the branch on hwcap at each syscall, then you significantly
> increase code size of every syscall point, likely turning a bunch of
> trivial functions that didn't need stack frames into ones that do. You
> also potentially make them need a TOC pointer. Making them all just do
> an indirect call unconditionally (with pointer in TLS like i386?) is a
> lot more efficient in code size and at least as good for performance.

I disagree. Doing the long vdso indirect call *necessarily* requires
touching a new icache line, and even a new TLB entry. Indirect branches

The increase in number of icache lines from the branch at every
syscall point is far greater than the use of a single extra icache
line shared by all syscalls.

That's true, I was thinking of a single function that does the test and
calls syscalls, which might be the fair comparison.

Not to mention the dcache line to access
__hwcap or whatever, and the icache lines to set up TOC-relative
access to it. (Of course you could put a copy of its value in TLS at a
fixed offset, which would somewhat mitigate both.)

And finally, the HWCAP test can eventually go away in future. A vdso
call can not.

We support nearly arbitrarily old kernels (with limited functionality)
and hardware (with full functionality) and don't intend for that to
change, ever. But indeed glibc might want to eventually drop the
check.

Ah, cool. Any build-time flexibility there?

We may or may not be getting a new ABI that will use instructions not
supported by old processors.

https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html

Current ABI continues to work of course and be the default for some
time, but building for new one would give some opportunity to drop
such support for old procs, at least for glibc.

If you really want to select with an indirect branch rather than
direct conditional, you can do that all within the library.

OK. It's a little bit more work if that's not the interface the kernel
will give us, but it's no big deal.

Okay.

Thanks,
Nick

What does "new ABI" entail to you? In the terminology I use with musl,
"new ABI" and "new ISA level" are different things. You can compile
(explicit -march or compiler default) binaries that won't run on older
cpus due to use of new insns etc., but we consider it the same ABI if
you can link code for an older/baseline ISA level with the
newer-ISA-level object files, i.e. if the interface surface for
linkage remains compatible. We also try to avoid gratuitous
proliferation of different ABIs unless there's a strong underlying
need (like addition of softfloat ABIs for archs that usually have FPU,
or vice versa).

In principle the same could be done for kernels except it's a bigger
silent gotcha (possible ENOSYS in places where it shouldn't be able to
happen rather than a trapping SIGILL or similar) and there's rarely
any serious performance or size benefit to dropping support for older
kernels.

Rich

Excerpts from Rich Felker's message of April 16, 2020 1:03 pm:

> Not to mention the dcache line to access
> __hwcap or whatever, and the icache lines to set up TOC-relative
> access to it. (Of course you could put a copy of its value in TLS at a
> fixed offset, which would somewhat mitigate both.)
>
>> And finally, the HWCAP test can eventually go away in future. A vdso
>> call can not.
>
> We support nearly arbitrarily old kernels (with limited functionality)
> and hardware (with full functionality) and don't intend for that to
> change, ever. But indeed glibc might want to eventually drop the
> check.

Ah, cool. Any build-time flexibility there?

We may or may not be getting a new ABI that will use instructions not
supported by old processors.

https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html

Current ABI continues to work of course and be the default for some
time, but building for new one would give some opportunity to drop
such support for old procs, at least for glibc.

What does "new ABI" entail to you? In the terminology I use with musl,
"new ABI" and "new ISA level" are different things. You can compile
(explicit -march or compiler default) binaries that won't run on older
cpus due to use of new insns etc., but we consider it the same ABI if
you can link code for an older/baseline ISA level with the
newer-ISA-level object files, i.e. if the interface surface for
linkage remains compatible. We also try to avoid gratuitous
proliferation of different ABIs unless there's a strong underlying
need (like addition of softfloat ABIs for archs that usually have FPU,
or vice versa).

Yeah it will be a new ABI type that also requires a new ISA level.
As far as I know (and I'm not on the toolchain side) there will be
some call compatibility between the two, so it may be fine to
continue with existing ABI for libc. But it is just something that
comes to mind as a build-time cutover where we might be able to
assume particular features.

In principle the same could be done for kernels except it's a bigger
silent gotcha (possible ENOSYS in places where it shouldn't be able to
happen rather than a trapping SIGILL or similar) and there's rarely
any serious performance or size benefit to dropping support for older
kernels.

Right, I don't think it'd be a huge problem whatever way we go,
compared with the cost of the system call.

Thanks,
Nick

* Rich Felker:

My preference would be that it work just like the i386 AT_SYSINFO
where you just replace "int $128" with "call *%%gs:16" and the kernel
provides a stub in the vdso that performs either scv or the old
mechanism with the same calling convention.

The i386 mechanism has received some criticism because it provides an
effective means to redirect execution flow to anyone who can write to
the TCB. I am not sure if it makes sense to copy it.

* Nicholas Piggin via Libc-alpha <libc-alpha@sourceware.org> [2020-04-16 10:16:54 +1000]:

Well it would have to test HWCAP and patch in or branch to two
completely different sequences including register save/restores yes.
You could have the same asm and matching clobbers to put the sequence
inline and then you could patch the one sc/scv instruction I suppose.

how would that 'patch' work?

there are many reasons why you don't
want libc to write its .text

I would like to enable Linux support for the powerpc 'scv' instruction,
as a faster system call instruction.

This requires two things to be defined: Firstly a way to advertise to
userspace that kernel supports scv, and a way to allocate and advertise
support for individual scv vectors. Secondly, a calling convention ABI
for this new instruction.

Thanks to those who commented last time, since then I have removed my
answered questions and unpopular alternatives but you can find them
here

https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html

Let me try one more with a wider cc list, and then we'll get something
merged. Any questions or counter-opinions are welcome.

System Call Vectored (scv) ABI

The scv instruction is introduced with POWER9 / ISA3, it comes with an
rfscv counter-part. The benefit of these instructions is performance
(trading slower SRR0/1 with faster LR/CTR registers, and entering the
kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR
updates. The scv instruction has 128 interrupt entry points (not enough
to cover the Linux system call space).

The proposal is to assign scv numbers very conservatively and allocate
them as individual HWCAP features as we add support for more. The zero
vector ('scv 0') will be used for normal system calls, equivalent to 'sc'.

Advertisement

Linux has not enabled FSCR[SCV] yet, so the instruction will cause a
SIGILL in current environments. Linux has defined a HWCAP2 bit
PPC_FEATURE2_SCV for SCV support, but does not set it.

When scv instruction support and the scv 0 vector for system calls are
added, PPC_FEATURE2_SCV will indicate support for these. Other vectors
should not be used without future HWCAP bits indicating support, which is
how we will allocate them. (Should unallocated ones generate SIGILL, or
return -ENOSYS in r3?)
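
A minimal sketch of how userspace might test for this advertisement,
assuming glibc's getauxval() and assuming the HWCAP2 bit value 0x00100000
for PPC_FEATURE2_SCV (the value I believe the kernel's powerpc cputable.h
uses; treat it as an assumption). The helper names have_scv() and
kernel_advertises_scv() are hypothetical, not part of the proposal:

```c
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP2 */

/* Assumption: bit value taken from the kernel's powerpc uapi headers. */
#ifndef PPC_FEATURE2_SCV
#define PPC_FEATURE2_SCV 0x00100000UL
#endif

/* Hypothetical helper: nonzero if this HWCAP2 word advertises scv 0. */
static int have_scv(unsigned long hwcap2)
{
    return (hwcap2 & PPC_FEATURE2_SCV) != 0;
}

/* Hypothetical helper: reads HWCAP2 from the auxiliary vector.
 * The bit is only meaningful on powerpc; on other architectures
 * the same bit position has an unrelated meaning. */
static int kernel_advertises_scv(void)
{
    return have_scv(getauxval(AT_HWCAP2));
}
```

A libc would presumably do this once at startup and cache the result,
rather than calling getauxval() at every system call site.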

Calling convention

The proposal is for scv 0 to provide the standard Linux system call ABI
with the following differences from sc convention[1]:

- LR is to be volatile across scv calls. This is necessary because the
  scv instruction clobbers LR. From previous discussion, this should be
  possible to deal with in GCC clobbers and CFI.

- CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the
  kernel system call exit to avoid restoring the CR register (although
  we probably still would anyway to avoid information leak).

- Error handling: I think the consensus has been to move to using negative
  return value in r3 rather than CR0[SO]=1 to indicate error, which matches
  most other architectures and is closer to a function call.
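
To make the difference in error handling concrete, here is a sketch of
the two decodings a libc wrapper would perform. The helper names are
hypothetical, the register contents are passed in as plain arguments, and
the -4095..-1 error range is an assumption borrowed from the usual Linux
convention on other architectures (the proposal itself only says
"negative return value in r3"):

```c
#include <errno.h>

/* Existing 'sc' convention: on failure the kernel sets CR0[SO] and
 * leaves a positive errno value in r3. */
static long decode_sc(long r3, int cr0_so)
{
    if (cr0_so) {
        errno = (int)r3;
        return -1;
    }
    return r3;
}

/* Proposed 'scv 0' convention: failure is a negative errno in r3.
 * Assumption: only -4095..-1 are treated as errors. */
static long decode_scv(long r3)
{
    if ((unsigned long)r3 >= (unsigned long)-4095L) {
        errno = (int)-r3;
        return -1;
    }
    return r3;
}
```

With the second form the caller needs no condition-register test after
the system call instruction, only a comparison on the returned value.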

The number of scratch registers (r9-r12) at kernel entry seems
sufficient that we don't have any costly spilling, patch is here[2].

[1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst
[2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html

My preference would be that it work just like the i386 AT_SYSINFO
where you just replace "int $128" with "call *%%gs:16" and the kernel
provides a stub in the vdso that performs either scv or the old
mechanism with the same calling convention. Then if the kernel doesn't
provide it (because the kernel is too old) libc would have to provide
its own stub that uses the legacy method and matches the calling
convention of the one the kernel is expected to provide.

What about pthread cancellation and the requirement of checking the
cancellable syscall anchors in asynchronous cancellation? My plan is
still to use the musl strategy on glibc (BZ#12683), and for i686 it
requires always using the old int $128 for programs that use cancellation
(static case) or just threads (dynamic mode, which should be more
common on glibc).

Using the i686 strategy of a vDSO bridge symbol would require always
falling back to 'sc' to keep the same cancellation strategy (thus
defeating this optimization in such cases).

Could GCC function multiversioning work here?
https://gcc.gnu.org/wiki/FunctionMultiVersioning

It seems like selecting a runtime version of a function is the sort of
thing you are trying to do.

Jeff

Indeed that's a good point. Do you have ideas for making it equally
efficient without use of a function pointer in the TCB?

Rich

Yes, I assumed it would be the same, ignoring the new syscall
mechanism for cancellable syscalls. While there are some exceptions,
cancellable syscalls are generally not hot paths but things that are
expected to block and to have significant amounts of work to do in
kernelspace, so saving a few tens of cycles is rather pointless.

It's possible to do a branch/multiple versions of the syscall asm for
cancellation but would require extending the cancellation handler to
support checking against multiple independent address ranges or using
some alternate markup of them.

Rich

On glibc it potentially could. This is ifunc-based functionality
though and musl explicitly does not (and will not) support ifunc
because of lots of fundamental problems it entails. But even on glibc
the underlying mechanisms for ifunc are just the same as a normal
indirect call and there's no real reason to prefer implementing it
with ifunc/multiversioning vs directly.

Rich

* Rich Felker:

* Rich Felker:

> My preference would be that it work just like the i386 AT_SYSINFO
> where you just replace "int $128" with "call *%%gs:16" and the kernel
> provides a stub in the vdso that performs either scv or the old
> mechanism with the same calling convention.

The i386 mechanism has received some criticism because it provides an
effective means to redirect execution flow to anyone who can write to
the TCB. I am not sure if it makes sense to copy it.

Indeed that's a good point. Do you have ideas for making it equally
efficient without use of a function pointer in the TCB?

We could add a shared non-writable mapping at a 64K offset from the
thread pointer and store the function pointer or the code there. Then
it would be safe.

However, since this is apparently tied to POWER9 and we already have a
POWER9 multilib, and assuming that we are going to backport the kernel
change, I would tweak the selection criterion for that multilib to
include the new HWCAP2 flag. If a user runs this glibc on a kernel
which does not have support, they will get the baseline (POWER8)
multilib, which still works. This way, outside the dynamic loader, no
run-time dispatch is needed at all. I guess this is not at all the
answer you were looking for. 8-)

If a single binary is needed, I would perhaps follow what Arm did for
-moutline-atomics: lay out the code so that it's easy to execute for
the non-POWER9 case, assuming that POWER9 machines will be better at
predicting things than their predecessors.

Or you could also put the function pointer into a RELRO segment. Then
there's overlap with the __libc_single_threaded discussion, where
people objected to this kind of optimization (although I did not
propose to change the TCB ABI, that would be required for
__libc_single_threaded because it's an external interface).

* Rich Felker:

>> * Rich Felker:
>>
>> > My preference would be that it work just like the i386 AT_SYSINFO
>> > where you just replace "int $128" with "call *%%gs:16" and the kernel
>> > provides a stub in the vdso that performs either scv or the old
>> > mechanism with the same calling convention.
>>
>> The i386 mechanism has received some criticism because it provides an
>> effective means to redirect execution flow to anyone who can write to
>> the TCB. I am not sure if it makes sense to copy it.
>
> Indeed that's a good point. Do you have ideas for making it equally
> efficient without use of a function pointer in the TCB?

We could add a shared non-writable mapping at a 64K offset from the
thread pointer and store the function pointer or the code there. Then
it would be safe.

However, since this is apparently tied to POWER9 and we already have a
POWER9 multilib, and assuming that we are going to backport the kernel
change, I would tweak the selection criterion for that multilib to
include the new HWCAP2 flag. If a user runs this glibc on a kernel
which does not have support, they will get the baseline (POWER8)
multilib, which still works. This way, outside the dynamic loader, no
run-time dispatch is needed at all. I guess this is not at all the
answer you were looking for. 8-)

How does this work with -static? :-)

If a single binary is needed, I would perhaps follow what Arm did for
-moutline-atomics: lay out the code so that it's easy to execute for
the non-POWER9 case, assuming that POWER9 machines will be better at
predicting things than their predecessors.

Or you could also put the function pointer into a RELRO segment. Then
there's overlap with the __libc_single_threaded discussion, where
people objected to this kind of optimization (although I did not
propose to change the TCB ABI, that would be required for
__libc_single_threaded because it's an external interface).

Of course you can use a normal global, but now every call point needs
to set up a TOC pointer (= two entry points and more icache lines for
otherwise trivial functions).

I think my choice would be just making the inline syscall be a single
call insn to an asm source file that out-of-lines the loading of TOC
pointer and call through it or branch based on hwcap so that it's not
repeated all over the place.

Alternatively, it would perhaps work to just put hwcap in the TCB and
branch on it rather than making an indirect call to a function pointer
in the TCB, so that the worst you could do by clobbering it is execute
the wrong syscall insn and thereby get SIGILL.
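
A minimal sketch of the "hwcap in the TCB" idea, assuming a hypothetical
TCB layout and the bit value 0x00100000 for PPC_FEATURE2_SCV. It only
models the selection logic in portable C; a real implementation would
branch between two inline asm sequences instead of returning an enum:

```c
/* Assumed HWCAP2 bit value for scv support. */
#define SKETCH_FEATURE2_SCV 0x00100000UL

/* Hypothetical thread control block with a cached copy of HWCAP2. */
struct tcb_sketch {
    unsigned long hwcap2;
};

enum entry_insn { ENTRY_SC, ENTRY_SCV };

/* Selects the entry instruction by branching on data rather than by
 * calling through a pointer. Overwriting hwcap2 can only flip the
 * choice between the two instructions; it cannot supply a jump target. */
static enum entry_insn pick_entry(const struct tcb_sketch *tcb)
{
    return (tcb->hwcap2 & SKETCH_FEATURE2_SCV) ? ENTRY_SCV : ENTRY_SC;
}
```

The point of the design is that the corruptible state is a flag, not a
code address, so the failure mode is SIGILL rather than control-flow
redirection.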

Rich

The main issue is that, at least for glibc, dynamic linking is far more
common than static linking, and once the program becomes multithreaded
the fallback will always be used.

And besides the cancellation performance issue, a new vDSO bridge mechanism
will still require setting up an extra bridge for the case of an older
kernel. In the scheme you suggested:

  __asm__("indirect call" ... with common clobbers);

The indirect call will go to either the vDSO bridge or a libc-provided
stub that falls back to 'sc' for !PPC_FEATURE2_SCV. I am not sure this is
really a gain against:

   if (hwcap & PPC_FEATURE2_SCV) {
     __asm__(... with some clobbers);
   } else {
     __asm__(... with different clobbers);
   }

Especially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a
TCB member (as we do on glibc) and if we could make the asm clever
enough to not require different clobbers (although not sure if
it would be possible).

>>> My preference would be that it work just like the i386 AT_SYSINFO
>>> where you just replace "int $128" with "call *%%gs:16" and the kernel
>>> provides a stub in the vdso that performs either scv or the old
>>> mechanism with the same calling convention. Then if the kernel doesn't
>>> provide it (because the kernel is too old) libc would have to provide
>>> its own stub that uses the legacy method and matches the calling
>>> convention of the one the kernel is expected to provide.
>>
>> What about pthread cancellation and the requirement of checking the
>> cancellable syscall anchors in asynchronous cancellation? My plan is
>> still to use the musl strategy on glibc (BZ#12683), and for i686 it
>> requires always using the old int $128 for programs that use cancellation
>> (static case) or just threads (dynamic mode, which should be more
>> common on glibc).
>>
>> Using the i686 strategy of a vDSO bridge symbol would require always
>> falling back to 'sc' to keep the same cancellation strategy (thus
>> defeating this optimization in such cases).
>
> Yes, I assumed it would be the same, ignoring the new syscall
> mechanism for cancellable syscalls. While there are some exceptions,
> cancellable syscalls are generally not hot paths but things that are
> expected to block and to have significant amounts of work to do in
> kernelspace, so saving a few tens of cycles is rather pointless.
>
> It's possible to do a branch/multiple versions of the syscall asm for
> cancellation but would require extending the cancellation handler to
> support checking against multiple independent address ranges or using
> some alternate markup of them.

The main issue is that, at least for glibc, dynamic linking is far more
common than static linking, and once the program becomes multithreaded
the fallback will always be used.

I'm not relying on static linking optimizing out the cancellable
version. I'm talking about how cancellable syscalls are pretty much
all "heavy" operations to begin with where a few tens of cycles are in
the realm of "measurement noise" relative to the dominating time
costs.

And besides the cancellation performance issue, a new vDSO bridge mechanism
will still require setting up an extra bridge for the case of an older
kernel. In the scheme you suggested:

  __asm__("indirect call" ... with common clobbers);

The indirect call will go to either the vDSO bridge or a libc-provided
stub that falls back to 'sc' for !PPC_FEATURE2_SCV. I am not sure this is
really a gain against:

   if (hwcap & PPC_FEATURE2_SCV) {
     __asm__(... with some clobbers);
   } else {
     __asm__(... with different clobbers);
   }

If the indirect call can be made roughly as efficiently as the sc
sequence now (which already has some cost due to handling the nasty
error return convention, making the indirect call likely just as small
or smaller), it's O(1) additional code size (and thus icache usage)
rather than O(n) where n is number of syscall points.

Of course it would work just as well (for avoiding O(n) growth) to
have a direct call to out-of-line branch like you suggested.

Especially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a
TCB member (as we do on glibc) and if we could make the asm clever
enough to not require different clobbers (although not sure if
it would be possible).

The easy way not to require different clobbers is just using the union
of the clobbers, no? Does the proposed new method clobber any
call-saved registers that would make it painful (requiring new call
frames to save them in)?
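
As a worked check of the "union of the clobbers" idea, here is a sketch
that models only the registers the proposal lists as differing between
the two conventions (LR and the CR fields). The bit names are
illustrative, GPRs are left out because the proposal does not change
them, and the call-saved set assumes the powerpc64 ELF C ABI, where
CR2-CR4 are the nonvolatile CR fields:

```c
/* One bit per register of interest; names are illustrative only. */
enum {
    REG_LR  = 1u << 0,
    REG_CR0 = 1u << 1,
    REG_CR1 = 1u << 2,
    REG_CR2 = 1u << 3,
    REG_CR3 = 1u << 4,
    REG_CR4 = 1u << 5,
    REG_CR5 = 1u << 6,
    REG_CR6 = 1u << 7,
    REG_CR7 = 1u << 8
};

/* 'sc' reports errors in CR0[SO], so CR0 is volatile there. */
#define CLOBBERS_SC  (REG_CR0)

/* Per the proposal, scv 0 additionally makes LR, CR1 and CR5-CR7
 * volatile; CR0 is assumed to stay volatile as with 'sc'. */
#define CLOBBERS_SCV (REG_CR0 | REG_LR | REG_CR1 | \
                      REG_CR5 | REG_CR6 | REG_CR7)

/* What a single asm statement would declare to cover both paths. */
#define CLOBBERS_UNION (CLOBBERS_SC | CLOBBERS_SCV)

/* CR fields the C ABI requires a callee to preserve (assumption). */
#define CALL_SAVED_CR (REG_CR2 | REG_CR3 | REG_CR4)

/* Nonzero if the union would force saving a call-saved CR field. */
static int union_hits_call_saved(void)
{
    return (CLOBBERS_UNION & CALL_SAVED_CR) != 0;
}
```

Under these assumptions the union is just the scv set and does not touch
CR2-CR4, although declaring LR as clobbered still means the enclosing
function has to save and restore LR if it was otherwise a leaf.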

Rich