[AVR] alloca bug?

Hi,

Apologies if this is just me being dumb!

Either I’m going mad, or alloca on the AVR back end doesn’t work in a way that makes sense to me?

I have some code that creates a load of alloca instructions. The program hung, and when I stepped through it in avr-gdb it appeared to be corrupting the stack.

The IR…

define protected i32 @main(i32 %0, i8** nocapture readnone %1) local_unnamed_addr addrspace(1) #0 !dbg !31 {
 %3 = tail call swiftcc addrspace(1) %swift.metadata_response @"$ss11_ArrayIndexVMa"(i16 0) #6, !dbg !43
 %4 = extractvalue %swift.metadata_response %3, 0, !dbg !43
 %5 = getelementptr inbounds %swift.type, %swift.type* %4, i16 -1, !dbg !43
 %6 = bitcast %swift.type* %5 to i8***, !dbg !43
 %7 = load i8**, i8*** %6, align 2, !dbg !43, !invariant.load !2, !dereferenceable !65
 %8 = getelementptr inbounds i8*, i8** %7, i16 8, !dbg !43
 %9 = bitcast i8** %8 to i16*, !dbg !43
 %10 = load i16, i16* %9, align 2, !dbg !43, !invariant.load !2
 %11 = alloca i8, i16 %10, align 16, !dbg !43
 call addrspace(1) void @llvm.lifetime.start.p0i8(i64 -1, i8* nonnull %11), !dbg !43
 %12 = bitcast i8* %11 to %swift.opaque*, !dbg !43
 %13 = alloca i8, i16 %10, align 16, !dbg !66
 call addrspace(1) void @llvm.lifetime.start.p0i8(i64 -1, i8* nonnull %13), !dbg !66
 %14 = bitcast i8* %13 to %swift.opaque*, !dbg !66
…etc., repeated something like 20 times.

When I looked at the assembly lowered from one of the alloca instructions, I saw this (as an example)…

    360:    2d b7           in    r18, 0x3d    ; 61
    362:    3e b7           in    r19, 0x3e    ; 62
    364:    28 1b           sub    r18, r24
    366:    39 0b           sbc    r19, r25
    368:    20 7f           andi    r18, 0xF0    ; 240
    36a:    49 01           movw    r8, r18
    36c:    0f b6           in    r0, 0x3f    ; 63
    36e:    f8 94           cli
    370:    3e bf           out    0x3e, r19    ; 62
    372:    0f be           out    0x3f, r0    ; 63
    374:    2d bf           out    0x3d, r18    ; 61

…basically repeated again and again. Obviously there’s an inefficiency in repeatedly loading and saving SP, but the real issue is that it looks off by one to me. Say the stack pointer starts at 0x8FF: each time you PUSH, the CPU stores the value in RAM at the address SP points to (0x8FF), then decrements SP (to 0x8FE).

The alloca decrements SP by a chunk to create a stack allocation, say 16 bytes, then saves a pointer for use later (in this case the pointer is saved in r8/r9). My issue is that the code above saves a pointer to the current SP, which is one past the end of the allocation. I think it should be saving a pointer to SP+1, shouldn’t it?

The stack corruption happens later in my code when a subroutine is called: the call pushes the return address onto the stack, and inside the subroutine, data is stored at the address returned by one of these faulty alloca instructions. The return address is partly overwritten, so when the subroutine hits RET it jumps back to the wrong place and chaos ensues!

What’s strange to me is that this is LLVM head from the end of August, and I’d expect everyone to be going crazy if alloca were genuinely broken like this. Am I doing something silly?

Thanks for any advice and help.

Carl

These are dynamic allocas, which are fairly unusual/exotic. I wouldn’t be terribly surprised if this was buggy.

I think you’re right that it’s buggy.

AVR is pretty unusual in using an “empty descending” stack; most targets are “full descending”, i.e. the stack pointer points to the most recently occupied slot. On top of that, the backend seems to tell generic code to handle a variable alloca (setOperationAction(ISD::DYNAMIC_STACKALLOC, MVT::i16, Expand) in AVRISelLowering.cpp), and presumably the generic expansion can’t cope with that.

As you say, it’s an off-by-one error. It wouldn’t be terribly difficult to fix.

Fantastic help as ever guys! :slight_smile:

@arsenm
That makes sense. So because the size of these things comes out of metadata (from some kind of C function in the runtime), LLVM figures this is a “dynamic alloca”. Presumably if it were an alloca with a constant size, or one that could be deduced at compile time after constant folding, LLVM would make it a “static alloca”? And I’m assuming that’s much more efficient because the compiler can just adjust the stack frame that’s built in the function prologue?

Also, more importantly for me, it would take a different code path, one that’s probably well tested and works.

As a workaround on my end, I might be able to get the front end to emit static allocations, because I can tweak the Swift going in (it’s standard library Swift that I control).

And this all makes sense: as you say, dynamic allocas are exotic and so less well tested, and I’ve just added much more complex code to the stdlib that’s being lowered here. That explains why the bug is “new” for me.

@TNorthover
Again, that makes perfect sense. When I grepped in lib/Target/AVR I found very little code and very few comments relating to “alloca”, so it seemed that AVR was probably falling back on common code that isn’t appropriate here. That’s happened many times before with AVR bugs.

That’s very interesting that most other processors have SP pointing to the end of the stack, i.e. at the most recently added bytes. I didn’t know that! It would totally explain the off-by-one!

I’m up for attempting to fix dynamic allocas properly on AVR. Should be fun. :slight_smile:

To get an understanding of how this works, what code should I be adding, or what should I read? It’s not TableGen, right? I have to write some special-case code to emit an adjusted dynamic alloca?

Thank yous!!

The simplest fix would be in AVRISelLowering.cpp. Roughly:

  1. Change the AVR setOperationAction lines to Custom.
  2. Add the case to LowerOperation in the same file.
  3. Adapt the implementation from the generic ExpandDYNAMIC_STACKALLOC code in SelectionDAGLegalize to cope with the off-by-one. At first glance that means reducing the first SUB by one, and adding an extra Tmp1 = DAG.getNode(ISD::SUB, DL, VT, Tmp1, DAG.getConstant(1, DL, VT)); at the end, just before the value gets copied back to SP.
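For steps 1 and 2, the hookup might look something like this (a sketch of the shape only; I haven’t checked the exact surrounding code in AVRISelLowering.cpp, and the LowerDYNAMIC_STACKALLOC helper is the one step 3 asks you to write):

```cpp
// In the AVRTargetLowering constructor: lower the node ourselves rather
// than letting the generic expansion apply full-descending assumptions.
setOperationAction(ISD::DYNAMIC_STACKALLOC, MVT::i8, Custom);
setOperationAction(ISD::DYNAMIC_STACKALLOC, MVT::i16, Custom);

// In AVRTargetLowering::LowerOperation's switch over opcodes:
case ISD::DYNAMIC_STACKALLOC:
  return LowerDYNAMIC_STACKALLOC(Op, DAG);
```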

Good advice. Also I see ARM has an implementation. I often copy theirs!

p.s. Oddly the custom dynamic alloca lowering for ARM is only for Windows (Windows… on ARM!!)

I decided to try this approach; do you think it’s OK?

SDValue
AVRTargetLowering::LowerDYNAMIC_STACKALLOC(SDValue Op, SelectionDAG &DAG) const {
  SDNode *Node = Op.getNode();

  Register SPReg = DAG.getTargetLoweringInfo().getStackPointerRegisterToSaveRestore();
  assert(SPReg && "Target cannot require DYNAMIC_STACKALLOC expansion and"
          " not tell us which reg is the stack pointer!");
  SDLoc dl(Node);
  EVT VT = Node->getValueType(0);
  SDValue Tmp1 = SDValue(Node, 0);
  SDValue Tmp2 = SDValue(Node, 1);
  SDValue Tmp3 = Node->getOperand(2);
  SDValue Chain = Tmp1.getOperand(0);

  // Chain the dynamic stack allocation so that it doesn't modify the stack
  // pointer when other instructions are using the stack.
  Chain = DAG.getCALLSEQ_START(Chain, 0, 0, dl);

  SDValue Size  = Tmp2.getOperand(1);
  SDValue SP = DAG.getCopyFromReg(Chain, dl, SPReg, VT);
  Chain = SP.getValue(1);
  Align Alignment = cast<ConstantSDNode>(Tmp3)->getAlignValue();
  const TargetFrameLowering *TFL = DAG.getSubtarget().getFrameLowering();
  unsigned Opc =
    TFL->getStackGrowthDirection() == TargetFrameLowering::StackGrowsUp ?
    ISD::ADD : ISD::SUB;

  Align StackAlign = TFL->getStackAlign();
  Tmp1 = DAG.getNode(Opc, dl, VT, SP, Size);       // Value
  if (Alignment > StackAlign)
    Tmp1 = DAG.getNode(ISD::AND, dl, VT, Tmp1,
                       DAG.getConstant(-Alignment.value(), dl, VT));
  Chain = DAG.getCopyToReg(Chain, dl, SPReg, Tmp1);     // Output chain

  // now add one to our pointer to fix the off by one error seen on AVR
  Tmp1 = DAG.getNode(ISD::ADD, dl, VT, Tmp1,
                       DAG.getConstant(1, dl, VT));

  Tmp2 = DAG.getCALLSEQ_END(Chain, DAG.getIntPtrConstant(0, dl, true),
                            DAG.getIntPtrConstant(0, dl, true), SDValue(), dl);

  // Tmp1 is the pointer result of the dynamic alloca.
  // Tmp2 is the chain from CALLSEQ_END.

  SDValue Ops[2] = { Tmp1, Tmp2 };
  return DAG.getMergeValues(Ops, dl);
}

…more or less copied from SelectionDAGLegalize::ExpandDYNAMIC_STACKALLOC as you suggested.

I thought about your approach, but because it’s dynamic you don’t know at code-emission time how large the allocation actually is (right?), so to reduce the SUB by one I’d need to add another register and an extra calculation node, which seemed a bit less efficient. So instead I just correct the pointer at the end by ADD-ing one to it. In my local tests it produces working code that fixes the stack corruption (as before, I stepped through with a simulator and avr-gdb to make sure the stack was OK).

I had to work out how to fit it into a routine in AVRTargetLowering::LowerOperation, where the call site is slightly different, and I’m still new to LLVM back-end structures, so I had to guess/crib a bit. That’s why I added the getMergeValues, which I’ll happily admit I don’t really understand, to get an SDValue to return; there might be a much better approach? Looking at dot diagrams of the DAG produced before and after this patch, and comparing them to the existing code (using lldb, breakpointing the function and calling DAG.dumpDotGraph), it looks roughly right to me as far as I can see.

The getCALLSEQ_START and getCALLSEQ_END calls seem a bit odd; I thought those were for bracketing function calls? I guess in this case they help maintain a Chain outside the individual node legalisation/expansion steps, so that sequential dynamic alloca operations are kept strictly ordered with respect to SP access?

The ADD operation I added isn’t part of this Chain in the DAG, and in my example program it actually gets scheduled much later, well past the block of allocas, just before the pointer is passed to a function; but that’s fine and does no harm as far as I can see.

If you guys think the approach I’ve taken is OK, I’ll create some unit tests and submit a patch via Phabricator?

Not adjusting the original SUB amount seems reasonable, I did wonder about it at the time.

So I took the approach to just correct the pointer at the end instead by ADD-ing one to it.

Doesn’t this violate the alignment requirements of the alloca (which is the whole reason for the AND operation)?

If you’re planning to upstream the patch, I’d probably strip out the ability to handle stacks going both up and down. Presumably AVR only does one of these, so it’s just extra code that’ll never be used here.

Yeah, you’re right about alignment; it needs to respect that. I think AVR doesn’t require stack alignment (see “What alignment is needed on 8-bit AVR?” on Stack Overflow), so probably the frontend that’s creating these dynamic allocas should be fixed to use alignment == 1. Then, as I read it, the AND operation will not be emitted.

This should also create much more efficient RAM usage on the stack in the generated code.

But regardless of that, my patched code must respect the alignment passed as a parameter, in case people want to use it aligned in the future. The thing I’ve got to work out is what “stack alignment” means here: if someone asks for 16-byte-aligned stack memory, do they expect the returned pointer to be on a 16-byte boundary, or do they expect SP to be left on a 16-byte boundary? Given that on this platform the two will differ by one byte, it’s a bit hard to know!

If someone asks for 16 byte aligned stack memory, do they expect the pointer returned to be on a 16-byte boundary?

Yes, definitely this. The stack pointer is a hidden detail the compiler manages, there’s basically nothing the user can do if it’s 16-byte aligned.

An actual pointer, on the other hand… You can stash spare data in the lower bits; aligned accesses are often more efficient (probably not on AVR, but still); hardware devices might need aligned buffers.

The more I think about it, the less sure I am what the valid use case for a dynamic alloca with an alignment other than 1 is, and I’m now sure the front end I’m using shouldn’t be emitting these instructions with an alignment of 16. But for completeness I want the bugfix to cover all cases, in case someone needs this in the future. And after a bit of reading around, I’m sure the returned pointer should be aligned, not the new value of SP. SP being “after the end of the stack” makes the code emitted in these cases a bit of a headache.

This is what I came up with in the end. I think 99 times out of 100 people will be emitting this with alignment == 1, where it’s much simpler, so I wanted the majority case to emit more efficient code. Let me know what you think…

SDValue
AVRTargetLowering::LowerDYNAMIC_STACKALLOC(SDValue Op, SelectionDAG &DAG) const {
  SDNode *Node = Op.getNode();
  SDLoc dl(Node);
  EVT VT = Node->getValueType(0);
  SDValue Tmp1 = SDValue(Node, 0);
  SDValue Tmp2 = SDValue(Node, 1);
  SDValue Tmp3 = Node->getOperand(2);
  SDValue Chain = Tmp1.getOperand(0);

  // Chain the dynamic stack allocation so that it doesn't modify the stack
  // pointer when other instructions are using the stack.
  Chain = DAG.getCALLSEQ_START(Chain, 0, 0, dl);

  SDValue Size  = Tmp2.getOperand(1);
  SDValue SP = DAG.getCopyFromReg(Chain, dl, AVR::SP, VT);
  Chain = SP.getValue(1);

  Align Alignment = cast<ConstantSDNode>(Tmp3)->getAlignValue();
  if (Alignment > Align(1)) {
    // We need slightly more complicated handling when the alignment > 1,
    // in order to get the pointer aligned on the correct boundary. For
    // example, if SP started at 0x8FF and we need a dynamic alloca aligned
    // to a 16-byte boundary, we must finish with a value like 0x8F0: if the
    // requested size was 16 bytes, the pointer returned would be 0x8F0 and
    // SP would be set to 0x8EF; likewise if we requested 10 bytes, etc.
    // To make this work, we subtract (Size - 1) from SP, AND the value
    // with a suitable bitmask, return that as the pointer, then subtract
    // one more from that value to place in SP.
    Size = DAG.getNode(ISD::SUB, dl, VT, Size, DAG.getConstant(1, dl, VT));
    Tmp1 = DAG.getNode(ISD::SUB, dl, VT, SP, Size);       // Value
    Tmp1 = DAG.getNode(ISD::AND, dl, VT, Tmp1,
                       DAG.getConstant(-Alignment.value(), dl, VT));
    SDValue NewSP = DAG.getNode(ISD::SUB, dl, VT, Tmp1, DAG.getConstant(1, dl, VT));
    Chain = DAG.getCopyToReg(Chain, dl, AVR::SP, NewSP);     // SP Output chain
  } else {
    // The handling in the more usual case is much simpler:
    // subtract Size from SP and return a pointer one higher than SP.
    Tmp1 = DAG.getNode(ISD::SUB, dl, VT, SP, Size);       // Value
    Chain = DAG.getCopyToReg(Chain, dl, AVR::SP, Tmp1);     // SP Output chain
    Tmp1 = DAG.getNode(ISD::ADD, dl, VT, Tmp1, DAG.getConstant(1, dl, VT)); // output ptr
  }

  Tmp2 = DAG.getCALLSEQ_END(Chain, DAG.getIntPtrConstant(0, dl, true),
                            DAG.getIntPtrConstant(0, dl, true), SDValue(), dl);

  // Tmp1 is the pointer result of the dynamic alloca.
  // Tmp2 is the chain from CALLSEQ_END.

  SDValue Ops[2] = { Tmp1, Tmp2 };
  return DAG.getMergeValues(Ops, dl);
}

…in my first test it emitted sensible, functioning code and didn’t corrupt the stack. If this is the right approach, I’ll write detailed unit tests covering all the cases.

p.s. I’ve also removed the generalised code as you suggested, and put in the AVR values.

void foo(size_t n) {
    int a[n];
    ...
}

That gives a DYNAMIC_ALLOCA of alignment _Alignof(int), normally 4 (assuming AVR isn’t weird like m68k).

void foo(size_t n) {
    void *p = __builtin_alloca(n);
    ...
}

That gives a DYNAMIC_ALLOCA of alignment _Alignof(max_align_t).

I think AVR is pretty weird. :slight_smile:

If the dynamic alloca has a higher alignment requirement than the ABI-guaranteed stack alignment, the correct way to handle it is to realign the stack pointer in PrologEpilogInserter. You shouldn’t try to deal with this in the lowering code.

I’m out of my depth, I’ll happily admit, but I’m not sure how to handle the super-weird AVR stack pointer requirements efficiently without the above code. Especially in my case, where the frontend emits a load of complex dynamic allocas, each of which might require some alignment on the returned pointer as well as an exact size, and which need to be as efficient as possible on these teeny devices.

Maybe the best approach is for me to raise the “PR” (wrong word, I know) on Phabricator, including your comment above (and of course suitable unit tests), for the attention of the AVR maintainers, who probably understand the weirdness of this platform better than I do?

Then they can rip apart my ideas and we can get somewhere. Is that OK?

Yes, post the patch to Phabricator. I would recommend against special-casing the align-1 case; just emit the code to align the pointer. If you need to implement dynamic stack realignment, that would be a separate patch.

Normally you’d branch on Alignment > TFL->getStackAlignment() and emit the overaligning code only when needed.