Status of stack walking in LLVM on Win64?

Message: 3
Date: Sun, 3 Jul 2016 17:49:50 -0700
From: Michael Lewis via llvm-dev <llvm-dev@lists.llvm.org>
To: Hayden Livingston <halivingston@gmail.com>
Cc: llvm-dev <llvm-dev@lists.llvm.org>
Subject: Re: [llvm-dev] Status of stack walking in LLVM on Win64?
Message-ID:
<CAEm7p3svyOi6JU6r_RCCtRfGhTgTHeRw-SR0iD+9Edv2pi71Dw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

> For JITs it would appear that there is a patch needed for some kind of
> relocations.
>
> https://llvm.org/bugs/show_bug.cgi?id=24233
>
> Is the patch really needed? What does it do? I'm not an expert here so
> asking.

I'm not really interested in the JIT case as I said originally, so I can't
answer that question.

>> I can confirm that LLVM emits correct data when used in an AoT configuration
>> for x64, exception handling would be totally broken without it.

Two points of clarification:

- Are you talking about Win64 or just x64 in general (i.e. *nix/MacOS)?
Again given the presence of bugs going back to 2015 (including one linked
in this thread) and other scant data from the list, I really can't tell
what the expected state of this functionality is on Win64.

- Are you referring to data generated by LLVM that is embedded in COFF
object files and then placed in the binary image by the linker? This data
is at a minimum relocated by link.exe on Windows as near as I can tell. I
do not want a dependency on link.exe. I can handle doing my own relocations
prior to emitting the final image, but I want to know if there's a turnkey
implementation of this already or if I have to roll my own here.

Thanks,

- Mike

Windows/x64 ABI is pretty well documented.

- The parameter passing is probably not the same as any other system.
(Unless people are using LLVM for UEFI development?)
Ignoring floating point, the first four integer parameters
are in rcx, rdx, r8, r9. The rest are on the stack.

- The exception handling might *resemble* other systems, but
surely has unique details.

- There is absolutely an unremovable dependency on a linker;
it doesn't have to be the Microsoft linker, I believe GNU ld
already implements this.
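
To make the parameter-passing rule from the first bullet concrete, here is a minimal Python sketch. The function name is made up for illustration, floating point is ignored as in the text, and the stack offsets assume the view at callee entry (return address at [rsp], followed by the 4*8 bytes of scratch space, followed by the stack-passed arguments).

```python
# Sketch of Win64 integer argument placement (floating point ignored).
# integer_arg_location is a hypothetical helper, not part of any real API.
INT_ARG_REGS = ["rcx", "rdx", "r8", "r9"]
RETURN_ADDRESS_SIZE = 8
HOME_SPACE_SIZE = 4 * 8  # scratch slots the caller always reserves


def integer_arg_location(index):
    """Where the callee finds integer argument `index` (0-based) on entry."""
    if index < len(INT_ARG_REGS):
        return INT_ARG_REGS[index]
    # On entry [rsp] holds the return address, then comes the 32-byte
    # scratch area, then the stack-passed arguments in order.
    offset = (RETURN_ADDRESS_SIZE + HOME_SPACE_SIZE
              + 8 * (index - len(INT_ARG_REGS)))
    return f"[rsp+{offset:#x}]"


for i in range(6):
    print(i, integer_arg_location(i))
```

The scratch area exists even when fewer than four arguments are passed, which is why the fifth argument always sits at the same offset.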

The documentation should be used.

I can summarize and such, but it is documented.

Roughly, ignoring parameter passing and focusing only on exception handling,
it goes like this:

- At any point in any program, "the stack" must be "unwindable".
I've never seen this clearly described.
It boils down to really "non volatile registers must be restorable"
by "a runtime" via a documented/standardized metadata, such as to
appear as if control was returned to any function on the call stack,
w/o running any generated code in any of the functions between
the current stack location and the resumed-to location.

   The stack pointer is often called out specially, but in fact
   it is just another non volatile register and not really a special case.
   

 So then some details:
   A "leaf function" is a function that does not change any non volatile registers,
   including the stack pointer. Leaf functions can do pretty much anything,
   but they must not change any non volatile registers -- which is a severe
   restriction. Having locals essentially makes you non-leaf -- even if you
   don't call anything. A leaf function is *not* simply a function that makes no calls,
   but calls do make a function a non-leaf, as a call changes the stack pointer.

   The slight exception here is that all functions, including leaves, do have
   4*8 bytes of scratch space in the stack available to them -- so local
   variables can be had, in that space and in volatile registers.

  
  The stack is walked from a leaf function merely by reading the return address from [rsp].
  A leaf function can make a syscall, so it isn't necessarily at the bottom of the stack.
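
As a minimal sketch of that leaf case in Python: memory is modelled as a plain dict from address to 8-byte value, and the function name is invented for illustration.

```python
def unwind_leaf(memory, rsp):
    """Step out of a leaf frame: return (caller_rip, caller_rsp).

    A leaf function never moved rsp, so the return address pushed by the
    `call` instruction is still at [rsp]. Popping it is the whole unwind.
    """
    return_address = memory[rsp]
    return return_address, rsp + 8


# Hypothetical stack contents: one return address at the top of the stack.
stack = {0x1000: 0x401234}
print([hex(v) for v in unwind_leaf(stack, 0x1000)])
```

No metadata lookup is involved, which is why leaf functions need no pdata entry.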

   
  Non-leaf functions are the interesting ones.
  They can change rsp, including via a call, and can change non-volatile
  registers, but all such changes (or rather, the saving of said registers) must
  be described by metadata, and the metadata
  must be findable -- via looking up a code address on the stack.

  
  Roughly speaking, all dlls have "pdata" -- procedure data.
  There are 3 UINT32s per non-leaf function.
  These are offsets into the image. Images are limited to 4GB in size.
  They point to the start of the function, the end of the function, and to additional metadata.
  The additional metadata is called "xdata" or exception data.
  The offset to the metadata may be absent or 0, but that should be rare/nonexistent
  in practice -- it is for revealing leaf functions to static analysis for example.
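
The three-UINT32 layout can be sketched in Python as follows. The helper names are made up, and the lookup assumes the entries are sorted by start offset, as they are in a linked image.

```python
import struct
from bisect import bisect_right

# One pdata entry: start offset, end offset, offset of the xdata.
# All three are offsets from the image base (little-endian UINT32s).
ENTRY = struct.Struct("<III")


def parse_pdata(blob):
    """Split a raw pdata section into (begin, end, unwind_info) tuples."""
    usable = len(blob) - len(blob) % ENTRY.size
    return [ENTRY.unpack_from(blob, off) for off in range(0, usable, ENTRY.size)]


def lookup(entries, rva):
    """Find the entry covering `rva`; None means 'treat as a leaf function'."""
    begins = [entry[0] for entry in entries]
    i = bisect_right(begins, rva) - 1
    if i >= 0 and entries[i][0] <= rva < entries[i][1]:
        return entries[i]
    return None


# Two hypothetical functions at 0x1000-0x1040 and 0x1040-0x10A0.
pdata = ENTRY.pack(0x1000, 0x1040, 0x3000) + ENTRY.pack(0x1040, 0x10A0, 0x300C)
entries = parse_pdata(pdata)
print(lookup(entries, 0x1050))
```

A stack walker would subtract the image base from a code address found on the stack and pass the result to a lookup like this one.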
  

  The "xdata" is then what describes how to restore non volatile registers,
  such as the order to pop them, or what offset they were saved at relative to the
  frame pointer or stack pointer (and which register if any is the frame pointer -- it doesn't have to be rbp,
  and most functions don't have one).
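
For readers who want to see the shape of that metadata, here is a hedged Python sketch that decodes the documented version 1 header and unwind-code slots. The function and dictionary names are invented, the example bytes are hand-built for a prologue of `push rbx; sub rsp, 0x20`, and the operands stored in extra slots are skipped rather than decoded.

```python
import struct

# Unwind operation codes from the documented x64 UNWIND_CODE format (v1).
UWOP_NAMES = {
    0: "PUSH_NONVOL", 1: "ALLOC_LARGE", 2: "ALLOC_SMALL", 3: "SET_FPREG",
    4: "SAVE_NONVOL", 5: "SAVE_NONVOL_FAR", 8: "SAVE_XMM128",
    9: "SAVE_XMM128_FAR", 10: "PUSH_MACHFRAME",
}
# Extra 16-bit slots an operation consumes after its own slot.
EXTRA_SLOTS = {"SAVE_NONVOL": 1, "SAVE_NONVOL_FAR": 2,
               "SAVE_XMM128": 1, "SAVE_XMM128_FAR": 2}


def parse_unwind_info(blob, offset=0):
    """Decode the 4-byte header and list the unwind operations."""
    ver_flags, prolog_size, slot_count, frame = struct.unpack_from(
        "<4B", blob, offset)
    codes = []
    slot = 0
    while slot < slot_count:
        code_offset, op_byte = struct.unpack_from(
            "<2B", blob, offset + 4 + 2 * slot)
        name = UWOP_NAMES[op_byte & 0x0F]
        info = op_byte >> 4
        codes.append((code_offset, name, info))
        if name == "ALLOC_LARGE":
            slot += 2 if info == 0 else 3  # size lives in 1 or 2 extra slots
        else:
            slot += 1 + EXTRA_SLOTS.get(name, 0)
    return {
        "version": ver_flags & 0x07,
        "flags": ver_flags >> 3,
        "prolog_size": prolog_size,
        "frame_register": frame & 0x0F,  # 0 means no frame pointer
        "frame_offset": frame >> 4,      # in units of 16 bytes
        "codes": codes,
    }


# Codes are stored in reverse order of the prologue instructions.
EXAMPLE = bytes([0x01, 0x05, 0x02, 0x00,   # version 1, 5-byte prologue, 2 slots
                 0x05, 0x32,               # at offset 5: ALLOC_SMALL, 3*8+8 = 32
                 0x01, 0x30])              # at offset 1: PUSH_NONVOL rbx (reg 3)
print(parse_unwind_info(EXAMPLE))
```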
  

  There are restrictions on code generation -- rsp changes and non volatile saves
  must be describable with this metadata. There is a notion of the end of the prologue;
  at this point all non volatiles that will be changed have been saved, and rsp changes
  are done. This is misleading though in that almost arbitrary code can be interleaved
  within the prologue, i.e. changes to volatile registers.
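
A rough Python sketch of how a runtime could undo a prologue from such metadata, including from the middle of it. Everything here is a simplification made for illustration: the tuple format, the function name, the dict-based memory, and the fact that only two operations are handled.

```python
REG_NAMES = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
             "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def virtual_unwind(codes, offset_in_function, regs, memory):
    """Return the caller's register state.

    codes: (code_offset, op_name, info) tuples in stored order, i.e. the
           last prologue instruction first. code_offset is the offset of
           the instruction *following* the one being described.
    offset_in_function: how far rip is past the function start.
    """
    regs = dict(regs)
    for code_offset, op, info in codes:
        if code_offset > offset_in_function:
            continue  # that prologue instruction has not executed yet
        if op == "ALLOC_SMALL":
            regs["rsp"] += info * 8 + 8
        elif op == "PUSH_NONVOL":
            regs[REG_NAMES[info]] = memory[regs["rsp"]]
            regs["rsp"] += 8
        else:
            raise NotImplementedError(op)
    # With the prologue undone, [rsp] holds the return address.
    regs["rip"] = memory[regs["rsp"]]
    regs["rsp"] += 8
    return regs


# Prologue: push rbx (ends at offset 1); sub rsp, 0x20 (ends at offset 5).
codes = [(5, "ALLOC_SMALL", 3), (1, "PUSH_NONVOL", 3)]
memory = {0x1020: 0xAAAA, 0x1028: 0x401000}  # saved rbx, return address

in_body = virtual_unwind(codes, 0x30, {"rsp": 0x1000, "rbx": 0}, memory)
mid_prologue = virtual_unwind(codes, 1, {"rsp": 0x1020, "rbx": 0}, memory)
print(in_body, mid_prologue)
```

Both calls recover the same caller state, which is the "unwindable from any instruction" property the text describes.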
 

  As well, as background, generally Windows/x64 functions don't change rsp
  except in their prologue and via the call instruction.
  They are not "pushy/poppy". A function that uses _alloca is the exception
  to this. Such functions must have a frame pointer, such as rbp,
  though it doesn't have to be rbp and often is not.
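
A small Python sketch of why the frame pointer is needed in the _alloca case: once rsp has moved by an amount unknown to the metadata, the unwinder recomputes it from the frame register. The function name is invented; the scaling by 16 follows the documented frame-offset field of the xdata header.

```python
def rsp_at_frame_setup(frame_register_value, frame_offset_field):
    """Undo a frame pointer setup.

    The prologue established frame_reg = rsp + 16 * frame_offset_field,
    so the rsp at that point can be recovered from the frame register
    no matter how far _alloca has moved rsp since.
    """
    return frame_register_value - 16 * frame_offset_field


# Hypothetical frame: `lea rbp, [rsp+0x20]` ran when rsp was 0x2000,
# then _alloca lowered rsp to some other value that the unwinder ignores.
rbp = 0x2000 + 0x20
print(hex(rsp_at_frame_setup(rbp, frame_offset_field=2)))
```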
  

  There is also a notion of chaining the data. This is useful when
  a function has "early out" paths that only change some non volatiles.
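
As a sketch of how chaining is encoded, assuming the documented version 1 layout: when the chain flag is set in the xdata header, another start/end/xdata triple follows the unwind-code array, which is padded to an even number of slots. The function name below is made up.

```python
import struct

UNW_FLAG_CHAININFO = 0x4


def chained_entry(blob, offset=0):
    """Return the chained (begin, end, unwind_info) triple, or None."""
    ver_flags, _prolog_size, slot_count, _frame = struct.unpack_from(
        "<4B", blob, offset)
    if not (ver_flags >> 3) & UNW_FLAG_CHAININFO:
        return None
    padded_slots = (slot_count + 1) & ~1  # code array is padded to even count
    return struct.unpack_from("<III", blob, offset + 4 + 2 * padded_slots)


# Hand-built example: one unwind code, one padding slot, then the chain.
xdata = (bytes([0x01 | (UNW_FLAG_CHAININFO << 3), 0x00, 0x01, 0x00,
                0x04, 0x60,    # a single unwind-code slot (contents unused here)
                0x00, 0x00])   # padding slot
         + struct.pack("<III", 0x1000, 0x1040, 0x3000))
print(chained_entry(xdata))
```

An unwinder would apply the codes of the current entry first and then continue with the codes of the entry it is chained to.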
  

  Also there is allowance for discontiguous functions.
  

  Also there is no metadata for epilogues. If an exception occurs in an epilogue,
  the runtime actually looks at the code being run, detects that it is an epilogue
  and simulates it. As such, epilogue code generation is constrained.
  (And breakpoints within epilogues mess things up!)
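
To illustrate what "looks at the code being run" can mean, here is a deliberately simplified Python sketch that recognises only one epilogue shape: an optional `add rsp, imm`, then any number of `pop` instructions, then `ret`. The function name is invented, and the real rules (documented for RtlVirtualUnwind) cover more forms, such as `lea rsp, ...` and tail-call jumps, which this sketch does not attempt.

```python
def looks_like_epilogue(code, pos=0):
    """Return True if the bytes at `pos` match the simplified epilogue shape."""
    n = len(code)
    # Optional stack deallocation:
    #   48 83 C4 ib  -> add rsp, imm8
    #   48 81 C4 id  -> add rsp, imm32
    if code[pos:pos + 3] == b"\x48\x83\xC4":
        pos += 4
    elif code[pos:pos + 3] == b"\x48\x81\xC4":
        pos += 7
    # Any number of `pop r64`: opcode 58+r, with a 41 prefix for r8-r15.
    while pos < n:
        if 0x58 <= code[pos] <= 0x5F:
            pos += 1
        elif code[pos] == 0x41 and pos + 1 < n and 0x58 <= code[pos + 1] <= 0x5F:
            pos += 2
        else:
            break
    # The sequence must end in `ret` (C3).
    return pos < n and code[pos] == 0xC3


epilogue = bytes.fromhex("4883C420" "5B" "C3")  # add rsp,0x20; pop rbx; ret
body = bytes.fromhex("4889D8" "C3")             # mov rax,rbx; ret
print(looks_like_epilogue(epilogue), looks_like_epilogue(body))
```

A breakpoint instruction patched over any of these bytes would make the pattern fail to match, which is the problem the parenthetical above alludes to.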
  

  To repeat -- the unwindability is from any single instruction, be it in the
  middle of a prologue, the middle of an epilogue, or in the body of a function
  outside of the prologue/epilogue.

  This unwindability serves both exception dispatch and debugger stack walking,
  and other things, like sampling profiler stack walking, or "leak tracking
  stack walking" -- stack walking is always possible, modulo bugs.
  The most common bugs are probably in hand-written assembly, since
  assembly programmers basically have to do the work themselves.

  There is provision for providing the pdata at runtime for JITed code.

  
  The linker has to combine all the pdata and place a pointer (offset) to it
  in a documented place in the PE, similar to how imports and exports and base
  relocations are recorded.
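
A hedged Python sketch of where that pointer lives, for 64-bit (PE32+) images only: the exception entry is slot 3 of the data directory array in the optional header. The function name is made up, and the demonstration image is a zero-filled buffer with only the fields this code reads filled in, not a loadable file.

```python
import struct

IMAGE_DIRECTORY_ENTRY_EXCEPTION = 3
PE32_PLUS_MAGIC = 0x20B


def exception_directory(image):
    """Return (offset_from_image_base, size_in_bytes) of the pdata table."""
    (e_lfanew,) = struct.unpack_from("<I", image, 0x3C)
    if image[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    optional = e_lfanew + 4 + 20  # skip signature and 20-byte COFF header
    (magic,) = struct.unpack_from("<H", image, optional)
    if magic != PE32_PLUS_MAGIC:
        raise ValueError("not a PE32+ image")
    (count,) = struct.unpack_from("<I", image, optional + 108)
    if count <= IMAGE_DIRECTORY_ENTRY_EXCEPTION:
        raise ValueError("image has no exception directory slot")
    directories = optional + 112  # data directories start here in PE32+
    return struct.unpack_from(
        "<II", image, directories + 8 * IMAGE_DIRECTORY_ENTRY_EXCEPTION)


# Minimal synthetic header, just enough for the reads above.
demo = bytearray(0x200)
demo[0:2] = b"MZ"
struct.pack_into("<I", demo, 0x3C, 0x80)
demo[0x80:0x84] = b"PE\0\0"
opt = 0x80 + 24
struct.pack_into("<H", demo, opt, PE32_PLUS_MAGIC)
struct.pack_into("<I", demo, opt + 108, 16)
struct.pack_into("<II", demo, opt + 112 + 8 * 3, 0x5000, 0x18)
print(exception_directory(bytes(demo)))
```

The size divided by 12 gives the number of pdata entries, since each entry is three UINT32s.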
  
  
  Anyway, see the documentation.

  - Jay



There is metadata for epilogues (UWOP_EPILOG) but it is only available on
Windows 8.1 and newer.

> There is metadata for epilogues (UWOP_EPILOG) but it is only available on Windows 8.1 and newer.

I'm aware of this.
I believe it is so sampling profilers can walk the kernel stack including through paged code -- i.e. the epilogue data is not paged, while the related epilogue code might be.
Do you see it used, i.e. in usermode? (where the pdata/xdata/code are all equally paged).
It would allow for e.g. breakpoints in epilogues as well, but that doesn't seem to be a consideration.
Perhaps debuggers are supposed to detect epilogues and use hardware breakpoints instead??

And ps, while the documentation is good, I think this basic point of what the goal is -- restoration of non-volatiles from arbitrary points, with the clarification/emphasis that rsp is a slightly special non-volatile -- is not clearly documented.
It is from this motivation that everything pretty directly follows imho.

For example, this is why the ymm registers are all volatile -- because the xdata design precedes their existence and therefore cannot describe their preservation/restoration.

- Jay

>> There is metadata for epilogues (UWOP_EPILOG) but it is only available
>> on Windows 8.1 and newer.
>
> I'm aware of this.
> I believe it is so sampling profilers can walk the kernel stack including
> through paged code -- i.e. the epilogue data is not paged, while the
> related epilogue code might be.
> Do you see it used, i.e. in usermode? (where the pdata/xdata/code are all
> equally paged).
> It would allow for e.g. breakpoints in epilogues as well, but that doesn't
> seem to be a consideration.
> Perhaps debuggers are supposed to detect epilogues and use hardware
> breakpoints instead??

I don't see it used in practice but I can imagine JITs wanting to use it to
liberate themselves from the normal x64 ABI rules regarding epilogues.
Reid and I spent a lot of time implementing the x64 compliant
prologue/epilogue emission in LLVM and it would have been easier if
UWOP_EPILOG was always around.

> And ps, while the documentation is good

The documentation is good but it could be a little more clear. I wish I
could contact whoever maintains the specification...

> , I think this basic point of what the goal is -- restoration of
> non-volatiles from arbitrary points, with the clarification/emphasis that
> rsp is a slightly special non-volatile -- is not clearly documented.
> It is from this motivation that everything pretty directly follows imho.

Thanks all - looks like RuntimeDyldCOFFX86_64 is indeed the missing link.
I'm temporarily using my own code to relocate the pdata section during
linking (for unrelated reasons) but I'll definitely explore the dynamic
loader in more detail. I'd much prefer to use well-tested code over my
sloppy 20-minute hack job :)

If you want an AOT linker and you don’t want to use link.exe, you can use LLD. At this point it is a fully functional PE COFF linker. It lacks PDB support, but it sounds like you don’t need that yet.

Can we accept the patch which makes this work? There is a patch in the
bug report for the MCJIT case. I'm confused why the AOT and MCJIT cases
would be different, but without the patch the information isn't registered.

Do you have specific questions?

I noticed the current revision mixes up leaf and non-leaf. I guess the meaning is so obvious that nobody notices the words.

Be sure to use the Microsoft linker's dump mode: link /dump /pdata /unwindinfo.

- Jay

> Do you have specific questions?

Not questions exactly but recommendations on how the document could be
clarified for implementers.

For example, the x64 document does not list the instructions which are
considered as epilogue markers. These seem to only be documented in the
remarks for RtlVirtualUnwind.

Details like this are important to get unwinding correctly implemented.

You mean the ADDR32NB relocation patch? I'd be pretty surprised if things
just worked with that.

Ok. What is ADDR32NB even? I thought relocations were ADDR64, ADDR32 and REL32.

“no base”?, aka image relative, i.e. a 32bit offset from the start of a PE?

- Jay
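
To make the difference between the four relocation kinds mentioned above concrete, here is a small Python sketch. The numeric type values are the documented COFF constants for x64; the function name, its parameters, and the addresses in the example are made up, and overflow checking is left out.

```python
import struct

IMAGE_REL_AMD64_ADDR64 = 0x0001    # full 64-bit virtual address
IMAGE_REL_AMD64_ADDR32 = 0x0002    # 32-bit virtual address
IMAGE_REL_AMD64_ADDR32NB = 0x0003  # 32-bit offset from the image base
IMAGE_REL_AMD64_REL32 = 0x0004     # 32-bit offset from the next instruction


def apply_relocation(section, offset, reloc_type, symbol_va, image_base,
                     section_va):
    """Patch `section` (a bytearray) in place at `offset`.

    COFF keeps the addend in the bytes being patched, so it is read first.
    """
    if reloc_type == IMAGE_REL_AMD64_ADDR64:
        (addend,) = struct.unpack_from("<Q", section, offset)
        value = (symbol_va + addend) & 0xFFFFFFFFFFFFFFFF
        struct.pack_into("<Q", section, offset, value)
        return
    (addend,) = struct.unpack_from("<i", section, offset)
    if reloc_type == IMAGE_REL_AMD64_ADDR32:
        value = symbol_va + addend
    elif reloc_type == IMAGE_REL_AMD64_ADDR32NB:
        value = symbol_va - image_base + addend
    elif reloc_type == IMAGE_REL_AMD64_REL32:
        value = symbol_va + addend - (section_va + offset + 4)
    else:
        raise NotImplementedError(hex(reloc_type))
    struct.pack_into("<I", section, offset, value & 0xFFFFFFFF)


# A pdata-style field pointing at a function 0x1040 bytes into the image.
field = bytearray(4)
apply_relocation(field, 0, IMAGE_REL_AMD64_ADDR32NB,
                 symbol_va=0x140001040, image_base=0x140000000,
                 section_va=0x140005000)
print(hex(struct.unpack("<I", field)[0]))
```

Because the stored value does not depend on where the image is loaded, fields relocated this way need no base relocation at load time, which fits the pdata description earlier in the thread.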