Linux powerpc new system call instruction and ABI

Thanks to everyone who has given feedback on the proposed new system
call instruction and ABI, I think it has reached agreement and the
implementation can be merged into Linux.

I have a hacked glibc implementation (that doesn't do all the right
HWCAP detection and misses a few things) that I've tested several things
including some kernel selftests (involving signals and syscalls) with.

System Call Vectored (scv) ABI

The scv instruction causes an interrupt which can enter the kernel with
MSR[EE]=1, thus allowing interrupts to hit at any time. These must not
be taken as normal interrupts, because they come from MSR[PR]=0 context,
and yet the kernel stack is not yet set up and r13 is not set to the
PACA).

Treat this as a soft-masked interrupt regardless of the soft masked
state. This does not affect behaviour yet, because currently all
interrupts are taken with MSR[EE]=0.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

Add support for the scv instruction on POWER9 and later CPUs.

For now this implements the zeroth scv vector 'scv 0', as identical to
'sc' system calls, with the exception that lr is not preserved, nor are
volatile cr registers, and error is not indicated with CR0[SO], but by
returning a negative errno.

rfscv is implemented to return from scv type system calls. It can not be
used to return from sc system calls because those are defined to
preserve lr.

getpid syscall throughput on POWER9 is improved by 26% (428 to 318
cycles), largely due to reducing mtmsr and mtspr.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

Hi!

Calling convention
------------------
The proposal is for scv 0 to provide the standard Linux system call ABI
with the following differences from sc convention[1]:

- lr is to be volatile across scv calls. This is necessary because the
  scv instruction clobbers lr. From previous discussion, this should be
  possible to deal with in GCC clobbers and CFI.

- cr1 and cr5-cr7 are volatile. This matches the C ABI and would allow the
  kernel system call exit to avoid restoring the volatile cr registers
  (although we probably still would anyway to avoid information leaks).

- Error handling: The consensus among kernel, glibc, and musl is to move to
  using negative return values in r3 rather than CR0[SO]=1 to indicate error,
  which matches most other architectures, and is closer to a function call.

What about cr0 then? Will it be volatile as well (exactly like for
function calls)?

Notes
-----
- r0,r4-r8 are documented as volatile in the ABI, but the kernel patch as
  submitted currently preserves them. This is to leave room for deciding
  which way to go with these.

The kernel has to set it to *something* that doesn't leak information :wink:

Segher

Excerpts from Segher Boessenkool's message of June 12, 2020 7:02 am:

Hi!

Calling convention
------------------
The proposal is for scv 0 to provide the standard Linux system call ABI
with the following differences from sc convention[1]:

- lr is to be volatile across scv calls. This is necessary because the
  scv instruction clobbers lr. From previous discussion, this should be
  possible to deal with in GCC clobbers and CFI.

- cr1 and cr5-cr7 are volatile. This matches the C ABI and would allow the
  kernel system call exit to avoid restoring the volatile cr registers
  (although we probably still would anyway to avoid information leaks).

- Error handling: The consensus among kernel, glibc, and musl is to move to
  using negative return values in r3 rather than CR0[SO]=1 to indicate error,
  which matches most other architectures, and is closer to a function call.

What about cr0 then? Will it be volatile as well (exactly like for
function calls)?

Yes, same as for sc (except for SO bit). Which is a bit unclear in this
section.

Notes
-----
- r0,r4-r8 are documented as volatile in the ABI, but the kernel patch as
  submitted currently preserves them. This is to leave room for deciding
  which way to go with these.

The kernel has to set it to *something* that doesn't leak information :wink:

For "sc" system calls these were defined as volatile (and used to just
leak information), so now we just zero them.

Thanks,
Nick

Nicholas Piggin <npiggin@gmail.com> writes:

diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index 2a39c716c343..b2bdc4de1292 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -257,6 +257,7 @@
#define PPC_INST_MFVSRD 0x7c000066
#define PPC_INST_MTVSRD 0x7c000166
#define PPC_INST_SC 0x44000002
+#define PPC_INST_SCV 0x44000001

...

@@ -411,6 +412,7 @@

...

+#define __PPC_LEV(l) (((l) & 0x7f) << 5)

These conflicted and didn't seem to be used so I dropped them.

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 5abe98216dc2..161bfccbc309 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -3378,6 +3382,16 @@ int emulate_step(struct pt_regs *regs, struct ppc_inst instr)
     regs->msr = MSR_KERNEL;
     return 1;

+ case SYSCALL_VECTORED_0: /* scv 0 */
+ regs->gpr[9] = regs->gpr[13];
+ regs->gpr[10] = MSR_KERNEL;
+ regs->gpr[11] = regs->nip + 4;
+ regs->gpr[12] = regs->msr & MSR_MASK;
+ regs->gpr[13] = (unsigned long) get_paca();
+ regs->nip = (unsigned long) &system_call_vectored_emulate;
+ regs->msr = MSR_KERNEL;
+ return 1;
+

This broke the ppc64e build:

  ld: arch/powerpc/lib/sstep.o:(.toc+0x0): undefined reference to `system_call_vectored_emulate'
  make[1]: *** [/home/michael/linux/Makefile:1139: vmlinux] Error 1

I wrapped it in #ifdef CONFIG_PPC64_BOOK3S.

cheers

Michael Ellerman <mpe@ellerman.id.au> a écrit :

Nicholas Piggin <npiggin@gmail.com> writes:

diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index 2a39c716c343..b2bdc4de1292 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -257,6 +257,7 @@
#define PPC_INST_MFVSRD 0x7c000066
#define PPC_INST_MTVSRD 0x7c000166
#define PPC_INST_SC 0x44000002
+#define PPC_INST_SCV 0x44000001

...

@@ -411,6 +412,7 @@

...

+#define __PPC_LEV(l) (((l) & 0x7f) << 5)

These conflicted and didn't seem to be used so I dropped them.

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 5abe98216dc2..161bfccbc309 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -3378,6 +3382,16 @@ int emulate_step(struct pt_regs *regs, struct ppc_inst instr)
     regs->msr = MSR_KERNEL;
     return 1;

+ case SYSCALL_VECTORED_0: /* scv 0 */
+ regs->gpr[9] = regs->gpr[13];
+ regs->gpr[10] = MSR_KERNEL;
+ regs->gpr[11] = regs->nip + 4;
+ regs->gpr[12] = regs->msr & MSR_MASK;
+ regs->gpr[13] = (unsigned long) get_paca();
+ regs->nip = (unsigned long) &system_call_vectored_emulate;
+ regs->msr = MSR_KERNEL;
+ return 1;
+

This broke the ppc64e build:

  ld: arch/powerpc/lib/sstep.o:(.toc+0x0): undefined reference to `system_call_vectored_emulate'
  make[1]: *** [/home/michael/linux/Makefile:1139: vmlinux] Error 1

I wrapped it in #ifdef CONFIG_PPC64_BOOK3S.

You mean CONFIG_PPC_BOOK3S_64 ?

Christophe

Christophe Leroy <christophe.leroy@csgroup.eu> writes:

Michael Ellerman <mpe@ellerman.id.au> a écrit :

Nicholas Piggin <npiggin@gmail.com> writes:

diff --git a/arch/powerpc/include/asm/ppc-opcode.h
b/arch/powerpc/include/asm/ppc-opcode.h
index 2a39c716c343..b2bdc4de1292 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -257,6 +257,7 @@
#define PPC_INST_MFVSRD 0x7c000066
#define PPC_INST_MTVSRD 0x7c000166
#define PPC_INST_SC 0x44000002
+#define PPC_INST_SCV 0x44000001

...

@@ -411,6 +412,7 @@

...

+#define __PPC_LEV(l) (((l) & 0x7f) << 5)

These conflicted and didn't seem to be used so I dropped them.

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 5abe98216dc2..161bfccbc309 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -3378,6 +3382,16 @@ int emulate_step(struct pt_regs *regs,
struct ppc_inst instr)
     regs->msr = MSR_KERNEL;
     return 1;

+ case SYSCALL_VECTORED_0: /* scv 0 */
+ regs->gpr[9] = regs->gpr[13];
+ regs->gpr[10] = MSR_KERNEL;
+ regs->gpr[11] = regs->nip + 4;
+ regs->gpr[12] = regs->msr & MSR_MASK;
+ regs->gpr[13] = (unsigned long) get_paca();
+ regs->nip = (unsigned long) &system_call_vectored_emulate;
+ regs->msr = MSR_KERNEL;
+ return 1;
+

This broke the ppc64e build:

  ld: arch/powerpc/lib/sstep.o:(.toc+0x0): undefined reference to
`system_call_vectored_emulate'
  make[1]: *** [/home/michael/linux/Makefile:1139: vmlinux] Error 1

I wrapped it in #ifdef CONFIG_PPC64_BOOK3S.

You mean CONFIG_PPC_BOOK3S_64 ?

I hope so ...

#### ## ####.

Will send a fixup. Thanks for noticing.

cheers

Applied to powerpc/next.

[1/2] powerpc/64s/exception: treat NIA below __end_interrupts as soft-masked
      https://git.kernel.org/powerpc/c/b2dc2977cba48990df45e0a96150663d4f342700
[2/2] powerpc/64s: system call support for scv/rfscv instructions
      https://git.kernel.org/powerpc/c/7fa95f9adaee7e5cbb195d3359741120829e488b

cheers

Hi,

[...]

- Error handling: The consensus among kernel, glibc, and musl is to move to
  using negative return values in r3 rather than CR0[SO]=1 to indicate error,
  which matches most other architectures, and is closer to a function call.

Apparently, the patchset merged by commit v5.9-rc1~100^2~164 was
incomplete: all functions defined in arch/powerpc/include/asm/ptrace.h and
arch/powerpc/include/asm/syscall.h that use ccr are broken when scv is used.
This includes syscall_get_error() and all its users including
PTRACE_GET_SYSCALL_INFO API, which in turn makes strace unusable
when scv is used.

See also https://bugzilla.redhat.com/1929836

Excerpts from Dmitry V. Levin's message of May 19, 2021 9:13 am:

Hi,

[...]

- Error handling: The consensus among kernel, glibc, and musl is to move to
  using negative return values in r3 rather than CR0[SO]=1 to indicate error,
  which matches most other architectures, and is closer to a function call.

Apparently, the patchset merged by commit v5.9-rc1~100^2~164 was
incomplete: all functions defined in arch/powerpc/include/asm/ptrace.h and
arch/powerpc/include/asm/syscall.h that use ccr are broken when scv is used.
This includes syscall_get_error() and all its users including
PTRACE_GET_SYSCALL_INFO API, which in turn makes strace unusable
when scv is used.

See also https://bugzilla.redhat.com/1929836

I see, thanks. Using latest strace from github.com, the attached kernel
patch makes strace -k check results a lot greener.

Some of the remaining failing tests look like this (I didn't look at all
of them yet):

signal(SIGUSR1, 0xfacefeeddeadbeef) = 0 (SIG_DFL)
write(1, "signal(SIGUSR1, 0xfacefeeddeadbe"..., 50signal(SIGUSR1, 0xfacefeeddeadbeef) = 0 (SIG_DFL)
) = 50
signal(SIGUSR1, SIG_IGN) = 0xfacefeeddeadbeef
write(2, "errno2name.c:461: unknown errno "..., 41errno2name.c:461: unknown errno 559038737) = 41
write(2, ": Unknown error 559038737\n", 26: Unknown error 559038737
) = 26
exit_group(1) = ?

I think the problem is glibc testing for -ve, but it should be comparing
against -4095 (+cc Matheus)

  #define RET_SCV \
      cmpdi r3,0; \
      bgelr+; \
      neg r3,r3;

With this patch, I think the ptrace ABI should mostly be fixed. I think
a problem remains with applications that look at system call return
registers directly and have powerpc specific error cases. Those probably
will just need to be updated unfortunately. Michael thought it might be
possible to return an indication via ptrace somehow that the syscall is
using a new ABI, so such apps can be updated to test for it. I don't
know how that would be done.

Thanks,
Nick

Excerpts from Nicholas Piggin's message of May 19, 2021 12:50 pm:

Excerpts from Dmitry V. Levin's message of May 19, 2021 9:13 am:

Hi,

[...]

- Error handling: The consensus among kernel, glibc, and musl is to move to
  using negative return values in r3 rather than CR0[SO]=1 to indicate error,
  which matches most other architectures, and is closer to a function call.

Apparently, the patchset merged by commit v5.9-rc1~100^2~164 was
incomplete: all functions defined in arch/powerpc/include/asm/ptrace.h and
arch/powerpc/include/asm/syscall.h that use ccr are broken when scv is used.
This includes syscall_get_error() and all its users including
PTRACE_GET_SYSCALL_INFO API, which in turn makes strace unusable
when scv is used.

See also https://bugzilla.redhat.com/1929836

I see, thanks. Using latest strace from github.com, the attached kernel
patch makes strace -k check results a lot greener.

Some of the remaining failing tests look like this (I didn't look at all
of them yet):

signal(SIGUSR1, 0xfacefeeddeadbeef) = 0 (SIG_DFL)
write(1, "signal(SIGUSR1, 0xfacefeeddeadbe"..., 50signal(SIGUSR1, 0xfacefeeddeadbeef) = 0 (SIG_DFL)
) = 50
signal(SIGUSR1, SIG_IGN) = 0xfacefeeddeadbeef
write(2, "errno2name.c:461: unknown errno "..., 41errno2name.c:461: unknown errno 559038737) = 41
write(2, ": Unknown error 559038737\n", 26: Unknown error 559038737
) = 26
exit_group(1) = ?

I think the problem is glibc testing for -ve, but it should be comparing
against -4095 (+cc Matheus)

  #define RET_SCV \
      cmpdi r3,0; \
      bgelr+; \
      neg r3,r3;

This glibc patch at least gets that signal test working. Haven't run the
full suite yet because of trouble making it work with a local glibc
install...

Thanks,
Nick

What about syscalls like times(2) which can return -1 without it being an error?

Jocke

Excerpts from Joakim Tjernlund's message of May 19, 2021 5:33 pm:

I always figured the ppc way was superior. It begs the question if not the other archs should
change instead?

Jocke

Excerpts from Joakim Tjernlund's message of May 19, 2021 6:08 pm:

Excerpts from Joakim Tjernlund's message of May 19, 2021 5:33 pm:
> > Hi,
> >
> > [...]
> > > - Error handling: The consensus among kernel, glibc, and musl is to move to
> > > using negative return values in r3 rather than CR0[SO]=1 to indicate error,
> > > which matches most other architectures, and is closer to a function call.
>
> What about syscalls like times(2) which can return -1 without it being an error?

They do become errors / indistinguishable and have to be dealt with by
libc or userspace. Which does follow what most architectures do (all
except ia64, mips, sparc, and powerpc actually).

Interesting question though, it should have been noted.

Thanks,
Nick

I always figured the ppc way was superior. It begs the question if not the other archs should
change instead?

It is superior in some ways, not enough to be worth being different.

Other archs are unlikely to change because it would be painful for
not much benefit. New system calls just should be made to not return
error numbers. If we ever had a big new version of syscall ABI in
Linux, we can always use another scv vector number for it.

Thanks,
Nick

[...]

With this patch, I think the ptrace ABI should mostly be fixed. I think
a problem remains with applications that look at system call return
registers directly and have powerpc specific error cases. Those probably
will just need to be updated unfortunately. Michael thought it might be
possible to return an indication via ptrace somehow that the syscall is
using a new ABI, so such apps can be updated to test for it. I don't
know how that would be done.

Is there any sane way for these applications to handle the scv case?
How can they tell that the scv semantics is being used for the given
syscall invocation? Can this information be obtained e.g. from struct
pt_regs?

For example, in strace we have the following powerpc-specific code used
for syscall tampering:

$ cat src/linux/powerpc/set_error.c
/*
* Copyright (c) 2016-2021 The strace developers.
* All rights reserved.

Excerpts from Dmitry V. Levin's message of May 19, 2021 8:24 pm:

[...]

With this patch, I think the ptrace ABI should mostly be fixed. I think
a problem remains with applications that look at system call return
registers directly and have powerpc specific error cases. Those probably
will just need to be updated unfortunately. Michael thought it might be
possible to return an indication via ptrace somehow that the syscall is
using a new ABI, so such apps can be updated to test for it. I don't
know how that would be done.

Is there any sane way for these applications to handle the scv case?
How can they tell that the scv semantics is being used for the given
syscall invocation? Can this information be obtained e.g. from struct
pt_regs?

Not that I know of. Michael suggested there might be a way to add
something. ptrace_syscall_info has some pad bytes, could
we use one for flags bits and set a bit for "new system call ABI"?

As a more hacky thing you could make a syscall with -1 and see how
the error looks, and then assume all syscalls will be the same.

Thanks,
Nick

Excerpts from Nicholas Piggin's message of May 19, 2021 6:42 pm:

Excerpts from Joakim Tjernlund's message of May 19, 2021 6:08 pm:

Excerpts from Joakim Tjernlund's message of May 19, 2021 5:33 pm:
> > Hi,
> >
> > [...]
> > > - Error handling: The consensus among kernel, glibc, and musl is to move to
> > > using negative return values in r3 rather than CR0[SO]=1 to indicate error,
> > > which matches most other architectures, and is closer to a function call.
>
> What about syscalls like times(2) which can return -1 without it being an error?

They do become errors / indistinguishable and have to be dealt with by
libc or userspace. Which does follow what most architectures do (all
except ia64, mips, sparc, and powerpc actually).

Interesting question though, it should have been noted.

Thanks,
Nick

I always figured the ppc way was superior. It begs the question if not the other archs should
change instead?

It is superior in some ways, not enough to be worth being different.

Other archs are unlikely to change because it would be painful for
not much benefit. New system calls just should be made to not return
error numbers. If we ever had a big new version of syscall ABI in
Linux, we can always use another scv vector number for it.

Or... is it possible at syscall entry to peek the address of
the instruction which caused the call and see if that was a
scv instruction? That would be about as reliable as possible
without having that new flag bit.

Thanks,
Nick

Nicholas Piggin via Libc-alpha <libc-alpha@sourceware.org> writes:

As a more hacky thing you could make a syscall with -1 and see how
the error looks, and then assume all syscalls will be the same.

I'm not sure this would work.
Even in glibc, it's expected that early syscalls will use sc while scv is used
later in the execution.