R_ARM_ABS32 disassembly with integrated-as

I'm attempting to detect encoding bugs by comparing disassembly when
using GCC's 'as' versus LLVM's integrated assembler. Generally this
has gone very well, but one thing that adds a lot of noise is that a
.word marked with an R_ARM_ABS32 relocation is disassembled as an
instruction and not as data. Please see the attached 'dump.diff', which
was generated by diffing the "objdump -d --all-headers" output for each
object file.

Is this a bug? If so, how can I fix it?

Thanks,
Greg

dump.diff (10.7 KB)

Hi Greg,

Is this a bug? If so, how can I fix it?

It's somewhere between a bug and a quality-of-implementation issue.
ARM often uses literal pools in the middle of code when it needs to
materialize a large constant (or variable address more likely for
R_ARM_ABS32). This results in a sequence roughly like:

    ldr r0, special_lit_sym
    [...]
    b past_literals
special_lit_sym:
    .word variable_desired
past_literals:
    [...instructions...]

In general, deciding whether to disassemble a given location as code
or data is a very hard problem (think of all the evil tricks you could
play with dual-purpose), so the ARM ELF ABI
(http://infocenter.arm.com/help/topic/com.arm.../IHI0044D_aaelf.pdf)
specifies something called mapping symbols, which assemblers should
insert to tell disassemblers what's actually needed.

The idea is that a $a should be inserted at the start of each section
of ARM code, $t before Thumb and $d before data (including these
embedded litpools). In the above example, $a would be somewhere before
the first ldr, $d at "special_lit_sym" and $a again at
"past_literals". objdump will then use these to decide how to display
a given address.
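
To illustrate the lookup described above, here is a minimal sketch of how a tool might classify an address from a sorted list of mapping symbols. The type and function names (RegionKind, MappingSymbol, kindOf, classify) are hypothetical and are not taken from binutils or LLVM; the handling of "$a.<suffix>" names assumes the suffixed form that the ARM ELF ABI permits.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical types for illustration; not binutils or LLVM code.
enum class RegionKind { Arm, Thumb, Data, Unknown };

struct MappingSymbol {
  std::uint32_t Offset; // section-relative value of the symbol
  std::string Name;     // "$a", "$t", "$d", or a "$a.<suffix>" form
};

// Map a symbol name to the kind of region it starts.
RegionKind kindOf(const std::string &Name) {
  if (Name.size() < 2 || Name[0] != '$')
    return RegionKind::Unknown;
  if (Name.size() > 2 && Name[2] != '.')
    return RegionKind::Unknown; // e.g. "$ab" is not a mapping symbol
  switch (Name[1]) {
  case 'a': return RegionKind::Arm;
  case 't': return RegionKind::Thumb;
  case 'd': return RegionKind::Data;
  default:  return RegionKind::Unknown;
  }
}

// Syms must be sorted by Offset. The last symbol at or before Addr decides.
RegionKind classify(const std::vector<MappingSymbol> &Syms,
                    std::uint32_t Addr) {
  auto It = std::upper_bound(
      Syms.begin(), Syms.end(), Addr,
      [](std::uint32_t A, const MappingSymbol &S) { return A < S.Offset; });
  if (It == Syms.begin())
    return RegionKind::Unknown; // no mapping symbol precedes Addr
  return kindOf(std::prev(It)->Name);
}
```

With symbols "$a" before the ldr, "$d" at special_lit_sym and "$a" at past_literals, any address inside the literal pool would classify as data and everything else as ARM code.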

If you dump the symbol table with "readelf -s" (objdump hides them on
my system at least) you should see these in the GCC binary, but almost
certainly not in the LLVM one.

There's some kind of half-written support already in LLVM I believe,
but it's been broken for as long as I can remember. You'd need to make
the MC emitters properly understand when they're switching between
code and data areas, and insert the appropriate symbols.

Hope this helps.

Tim.

FWIW, I believe the following bugzilla issue reports/covers that
mapping symbols are not being produced:
http://llvm.org/bugs/show_bug.cgi?id=9582

Yes... I meant to fix that and it kinda slipped... ;)

Greg,

There should be enough information in the bug for this to be an easy fix.

If you *really* don't want to look at it, I might have some time in
the near future...

The recent MachO data-in-code support should have fixed a lot of the problems. There are probably still some quirks in the specifics ($a vs. $t and making sure the symbols get into the ELF properly), but the core functionality to know how to mark data regions is there and works very well.

-Jim

Hi Jim,

I'm trying to help Greg track this down. From your recent commits, I
take it you're reusing a data-in-code detection, previously used only
for ASM output, for object output, via
EmitDataRegion/EmitDataRegionEnd.

I haven't looked too deep in the MC, but I'm supposing that will work
automatically when the output streamer is printing object code and
meets a non-code region, so in theory, changing MCELFStreamer
accordingly (overriding those functions in there) would take care of
data vs. code issue in ELF.

Assuming LLVM doesn't generate ARM/Thumb veneers inside the same
function (i.e. a Thumb function has only Thumb code), Greg could use
EmitDataRegion and EmitDataRegionEnd, with the former saving the
state of the current code (Thumb/ARM) and the latter restoring it, by
emitting the $d and $a/$t respectively.

Does it seem like a good initial approach?

Continuing... It seems MCELFStreamer already has an EmitThumbFunc,
which looks to me like the wrong place for it. I'd imagine MCELFStreamer
would have EmitFunc and MCARMELFStreamer (or whatever) would identify
its type and call the appropriate EmitThumbFunc/EmitARMFunc. Being
pedantic, even that is still too high level because of the ARM/Thumb
veneers, but we don't want to worry about that if LLVM doesn't even
try to mix ARM and Thumb (and assuming external libraries would have
the symbols, if they do).

Whether or not LLVM generates them, its disassembler should know about
those symbols and should be able to mark them accordingly. Where would
be the best place to put those symbols (in an enum or table), so that
MCStreamer and the disassembler could reference a single place?

Hi Jim,

I'm trying to help Greg track this down. From your recent commits, I
take it you're reusing a data-in-code detection, previously used only
for ASM output, for object output, via
EmitDataRegion/EmitDataRegionEnd.

It's a bit more than that. Those Emit* methods are new for this support. There was spotty support for the raw $a/$t/$d stuff before, and this abstracted and extended it to support both asm and binary emission, and added uses of the methods to the various bits in the ARM backend where data-in-code regions get created (jump tables, constant pools, et al.).

I haven't looked too deep in the MC, but I'm supposing that will work
automatically when the output streamer is printing object code and
meets a non-code region, so in theory, changing MCELFStreamer
accordingly (overriding those functions in there) would take care of
data vs. code issue in ELF.

Yep. They'll likely be implemented as, effectively, an EmitLabel().

Assuming LLVM doesn't generate ARM/Thumb veneers inside the same
function (i.e. a Thumb function has only Thumb code), Greg could use
EmitDataRegion and EmitDataRegionEnd, with the former saving the
state of the current code (Thumb/ARM) and the latter restoring it, by
emitting the $d and $a/$t respectively.

Does it seem like a good initial approach?

Continuing... It seems MCELFStreamer already has an EmitThumbFunc,
which looks to me like the wrong place for it.

That's just the handler for the .thumb_func directive. It has nothing to do with emitting the contents of the actual function.

I'd imagine MCELFStreamer
would have EmitFunc and MCARMELFStreamer (or whatever) would identify
its type and call the appropriate EmitThumbFunc/EmitARMFunc. Being
pedantic, even that is still too high level because of the ARM/Thumb
veneers, but we don't want to worry about that if LLVM doesn't even
try to mix ARM and Thumb (and assuming external libraries would have
the symbols, if they do).

This is complicated a bit by needing to work for plain .s files, not just compiler generated files. Those can intermix arm and thumb code in crazy ways.

The assembler already has a thumb vs. arm mode state (which gets adjusted via the .arm/.thumb directives and the .code synonyms). ELF will want to check that state and use it to determine whether a data-region-end directive should result in a $a or a $t in the output ELF.
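
As a rough sketch of that idea, the following toy tracker records the current ARM/Thumb state and picks "$a" or "$t" when a data region ends. It is a standalone model only: MappingSymbolTracker and its methods are hypothetical names, not LLVM's MC interfaces, and the offsets are passed in by hand rather than taken from a real section.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model; none of these names are LLVM's API.
class MappingSymbolTracker {
public:
  using Entry = std::pair<std::uint32_t, std::string>;

  // Would be driven by .arm / .thumb (or the .code synonyms).
  void setThumbMode(bool Thumb) { IsThumb = Thumb; }

  // A data-in-code region (literal pool, jump table) starts here.
  void beginDataRegion(std::uint32_t Offset) {
    Symbols.emplace_back(Offset, "$d");
  }

  // The data region ends; code resumes in whatever mode is current.
  void endDataRegion(std::uint32_t Offset) {
    Symbols.emplace_back(Offset, IsThumb ? "$t" : "$a");
  }

  const std::vector<Entry> &symbols() const { return Symbols; }

private:
  bool IsThumb = false; // assemblers start in ARM mode in this model
  std::vector<Entry> Symbols;
};
```

Because the mode is read only when the region ends, a .thumb or .arm directive that appears before the end of the data region would still be honoured in this model.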

Whether or not LLVM generates them, its disassembler should know about
those symbols and should be able to mark them accordingly. Where would
be the best place to put those symbols (in an enum or table), so that
MCStreamer and the disassembler could reference a single place?

It's not the disassembler itself that should know about them, but the driver for the disassembler. In this case, llvm-objdump. The disassembler doesn't have that kind of gestalt knowledge.
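
One way a driver could use the symbols, sketched here under assumptions, is to split a section into contiguous regions up front and then hand each code region to the disassembler while printing data regions as raw words. The Region struct and splitRegions function are hypothetical and do not reflect how llvm-objdump is actually structured.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical region descriptor for a disassembler driver.
struct Region {
  std::uint32_t Begin; // inclusive, section-relative
  std::uint32_t End;   // exclusive
  char Kind;           // 'a' = ARM, 't' = Thumb, 'd' = data
};

// Syms holds (offset, kind) pairs sorted by offset. Each region runs from
// one mapping symbol to the next, or to the end of the section.
std::vector<Region>
splitRegions(const std::vector<std::pair<std::uint32_t, char>> &Syms,
             std::uint32_t SectionSize) {
  std::vector<Region> Out;
  for (std::size_t I = 0; I < Syms.size(); ++I) {
    std::uint32_t Begin = Syms[I].first;
    std::uint32_t End = I + 1 < Syms.size() ? Syms[I + 1].first : SectionSize;
    if (Begin < End) // skip empty regions
      Out.push_back({Begin, End, Syms[I].second});
  }
  return Out;
}
```

The disassembler itself stays unaware of mapping symbols in this arrangement; only the driver decides which bytes it is asked to decode.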

-Jim

Thanks Jim!

I have updated the bug with your comments, I think it's a good start.

Greg, let me know if that's not enough, I think I can help you from now on.

cheers,
--renato

Cool; glad to help.

When I added the data region bits, I tried to keep the ARM-style annotations in mind a bit, so hopefully things will fit together without too much trouble.

-Jim

Great, thanks for your help. I'll take a crack at it and contact
Renato if I have questions.

-Greg

Getting closer… When emitting symbols, how do I set the symbol’s value to the address of the current instruction? Do I need to emit a label in the current section and another that uses the former to point to the latter? If possible, a code sample would be very helpful.

And probably questions for Tim, are these “section-relative” mapping symbols, as defined in 4.6.5.1 of the ELF for ARM document? And what to put in the alignment field? I see GCC outputting 1, 3, 4, but I don’t see a description of that field in the doc.

Lastly, from MCELFStreamer, how do I determine if we are generating an ARM or Thumb ELF? I can catch Thumb from the EmitThumbFunc, but that seems a little odd. Suggestions?

Here’s what I have so far:

$ readelf -s via-llvm-as.o | grep "\$."
     2: 00000000     0 NOTYPE  LOCAL  DEFAULT    4 $d
     3: 00000000     0 NOTYPE  LOCAL  DEFAULT    4 $t

$ readelf -s via-gcc-as.o | grep "\$."
     5: 00000000     0 NOTYPE  LOCAL  DEFAULT    1 $t
    15: 0000020c     0 NOTYPE  LOCAL  DEFAULT    1 $d
    17: 00000218     0 NOTYPE  LOCAL  DEFAULT    1 $t
    44: 00000732     0 NOTYPE  LOCAL  DEFAULT    1 $d
    45: 0000073e     0 NOTYPE  LOCAL  DEFAULT    1 $t
    65: 00000000     0 NOTYPE  LOCAL  DEFAULT    4 $d
    66: 00000000     0 NOTYPE  LOCAL  DEFAULT    3 $d

Thanks,
Greg

Hi Greg,

I'm afraid I've not looked into the infrastructure Jim put into place,
so I've not really been able to answer the "how should I do it"
questions, but hopefully I can comment on the ABI.

And probably questions for Tim, are these "section-relative" mapping
symbols, as defined in 4.6.5.1 of the ELF for ARM document?

Yes, they are.

And what to put in the alignment field? I see GCC outputting 1, 3, 4, but I don't see a
description of that field in the doc.

I don't think individual symbols have an alignment in ELF (except
COMMON ones, which repurpose the st_value field -- not the case here).
If the 1, 3, 4 are coming from the last column of the dumps you
produced below, they're referring to the section the symbol is
relative to (st_shndx in the documentation).
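
For reference, the fields readelf prints map onto a 32-bit ELF symbol table entry roughly as in the sketch below. The struct mirrors the Elf32_Sym layout from the ELF specification but is declared locally under different names (SymEntry32, kStbLocal, kSttNotype, makeMappingSymbol are all illustrative) so the example stands alone.

```cpp
#include <cstdint>

// Local mirror of the Elf32_Sym layout; names are illustrative.
struct SymEntry32 {
  std::uint32_t st_name;  // offset of the name in the string table
  std::uint32_t st_value; // in a .o file: offset within the section
  std::uint32_t st_size;  // 0 for mapping symbols
  std::uint8_t st_info;   // binding in the high nibble, type in the low
  std::uint8_t st_other;  // visibility (0 = DEFAULT)
  std::uint16_t st_shndx; // index of the section the symbol belongs to
};
static_assert(sizeof(SymEntry32) == 16, "Elf32_Sym is 16 bytes");

constexpr std::uint8_t kStbLocal = 0;  // STB_LOCAL
constexpr std::uint8_t kSttNotype = 0; // STT_NOTYPE

// Same packing as the ELF32_ST_INFO macro.
constexpr std::uint8_t stInfo(std::uint8_t Bind, std::uint8_t Type) {
  return static_cast<std::uint8_t>((Bind << 4) | (Type & 0xf));
}

// Build the entry for a mapping symbol such as "$d". There is no
// alignment field to fill in; SectionIndex is what shows up as Ndx.
SymEntry32 makeMappingSymbol(std::uint32_t NameOffset,
                             std::uint32_t SectionOffset,
                             std::uint16_t SectionIndex) {
  return SymEntry32{NameOffset, SectionOffset, 0,
                    stInfo(kStbLocal, kSttNotype), 0, SectionIndex};
}
```

Read this way, the "15: 0000020c 0 NOTYPE LOCAL DEFAULT 1 $d" line from the GCC object is a zero-sized local symbol at offset 0x20c of section 1.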

Cheers.

Tim.

Jim, can you help me out with the implementation details here? How do
I set the Value of the MCSymbolData to the address of the data region?

Per Tim's comments, here's what I have so far:

void MCELFStreamer::EmitMappingSymbol(StringRef Name) {
  MCSymbol *Symbol = getContext().GetOrCreateSymbol(Name);
  MCSymbolData &SD = getAssembler().getOrCreateSymbolData(*Symbol);
  MCELF::SetType(SD, ELF::STT_NOTYPE);
  MCELF::SetBinding(SD, ELF::STB_LOCAL);
  SD.setExternal(false);
  Symbol->setSection(*getCurrentSection());
}

...
EmitMappingSymbol("$d");

Thanks,
Greg
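
On the question of giving the symbol the address of the data region: Jim's earlier remark that these would effectively be an EmitLabel() suggests binding the symbol at the current emission point instead of setting a value by hand. The toy model below shows only that idea; ToySection, emitBytes and emitLabel are hypothetical and are not the MC classes, which track fragments and offsets in a more involved way.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Toy model of a section being streamed; not LLVM's MC layer.
class ToySection {
public:
  using Label = std::pair<std::string, std::uint32_t>;

  // Append raw bytes (instructions or data) to the section.
  void emitBytes(const std::vector<std::uint8_t> &Data) {
    Bytes.insert(Bytes.end(), Data.begin(), Data.end());
  }

  // Bind Name to the current end of the section, as a label would be.
  // Names may repeat: every "$d" is a separate local symbol.
  void emitLabel(const std::string &Name) {
    Labels.emplace_back(Name, static_cast<std::uint32_t>(Bytes.size()));
  }

  const std::vector<Label> &labels() const { return Labels; }

private:
  std::vector<std::uint8_t> Bytes;
  std::vector<Label> Labels;
};
```

Emitting "$d" right before the bytes of a literal pool and "$t" right after them gives each symbol the offset of the region it introduces, without any explicit value computation.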

Lastly, from MCELFStreamer, how do I determine if we are generating an ARM or
Thumb ELF?

That was the only part I didn't know how to get. Jim should know.

I can catch Thumb from the EmitThumbFunc, but that seems a
little odd.

Ignore EmitThumbFunc, it has nothing to do with your change.

$ readelf -s via-llvm-as.o | grep "\$."
     2: 00000000 0 NOTYPE LOCAL DEFAULT 4 $d
     3: 00000000 0 NOTYPE LOCAL DEFAULT 4 $t

Clearly, you're not detecting all code/data changes, or the direct ELF
emission is not creating too many constant pools.

Can you attach the assembly generated and the ELF object created from
both the inline asm and the gcc asm?

I *think* he's comparing the same assembly in both cases: in effect
"clang -integrated-as" vs. "clang". I'm basing this on the names of
his previous attachments: "llvm-via-gcc.dump" and
"llvm-via-integrated-as.dump".

And I'd trust LLVM to emit the constant pools it had decided on,
otherwise things would have gone horribly wrong long before mapping
symbols became an issue to anyone.

As you say, it looks like there are some missing, but that should be
reasonably easy and local to fix once we know how to find out what
needs emitting.

Tim.

Attached is an example of how to reproduce the issue. It uses a C
file that happens to have a bunch of switch statements which are
encoded as jump tables, giving us data-in-code. Usage:

To build object files with clang via the -integrated-as versus via GCC:

$ export NDK_DIR=<my_ndk_dir>
$ export LLVM_DIR=<my_llvm_bin_dir>
$ make

To test that the generated objects contain the same Mapping Symbols:

$ make test

If "make test" fails, a diff is printed containing what GCC generates
versus LLVM.

To bypass clang and gcc (say you don't want to install the NDK), you
can build the same LLVM object file with just:

$ make ll

To bypass llc, you can try "make asm" to first generate a .s and then
compile that. But if you do this, you run into two more bugs.
First, the MC layer fails to parse ARM ELF, only MachO. Second, clang
fails to care, bypassing the integrated-as and instead generating the
.o via GCC. If you happen to have -ccc-gcc-name set, you will think
your test passes when what actually happened is that both objects were
compiled with GCC!

Thanks,
Greg

scaffold.C (20.5 KB)

tsthd.h (3.21 KB)

scaffold-arm.ll (45.2 KB)

Makefile (1.76 KB)

Hi Jim,

The diff below is not intended to be a patch, but a starting point.
It is the shortest path (I hope) to getting LLVM to emit ARM mapping
symbols to the ELF without changing any shared interfaces. Could you
have a look at the FIXME comments and offer some pointers on how to
get this code out of MCELFStreamer?

Thanks,
Greg

diff --git a/lib/MC/MCELFStreamer.cpp b/lib/MC/MCELFStreamer.cpp
index 8107005..153ca78 100644
--- a/lib/MC/MCELFStreamer.cpp
+++ b/lib/MC/MCELFStreamer.cpp
@@ -40,12 +40,14 @@ class MCELFStreamer : public MCObjectStreamer {
public:
   MCELFStreamer(MCContext &Context, MCAsmBackend &TAB,
                   raw_ostream &OS, MCCodeEmitter *Emitter)
- : MCObjectStreamer(Context, TAB, OS, Emitter) {}
+ : MCObjectStreamer(Context, TAB, OS, Emitter),
+ IsThumb(false), MappingSymbolCounter(0) {}

   MCELFStreamer(MCContext &Context, MCAsmBackend &TAB,
                 raw_ostream &OS, MCCodeEmitter *Emitter,
                 MCAssembler *Assembler)
- : MCObjectStreamer(Context, TAB, OS, Emitter, Assembler) {}
+ : MCObjectStreamer(Context, TAB, OS, Emitter, Assembler),
+ IsThumb(false), MappingSymbolCounter(0) {}

   ~MCELFStreamer() {}
@@ -58,6 +60,7 @@ public:
   virtual void EmitLabel(MCSymbol *Symbol);
   virtual void EmitAssemblerFlag(MCAssemblerFlag Flag);
   virtual void EmitThumbFunc(MCSymbol *Func);
+ virtual void EmitDataRegion(MCDataRegionType Kind);
   virtual void EmitAssignment(MCSymbol *Symbol, const MCExpr *Value);
   virtual void EmitWeakReference(MCSymbol *Alias, const MCSymbol *Symbol);
   virtual void EmitSymbolAttribute(MCSymbol *Symbol, MCSymbolAttr Attribute);
@@ -108,6 +111,7 @@ public:
private:
   virtual void EmitInstToFragment(const MCInst &Inst);
   virtual void EmitInstToData(const MCInst &Inst);
+ virtual void EmitMappingSymbol(bool IsData);

   void fixSymbolsInTLSFixups(const MCExpr *expr);

@@ -119,6 +123,11 @@ private:
   std::vector<LocalCommon> LocalCommons;

   SmallPtrSet<MCSymbol *, 16> BindingExplicitlySet;

+ virtual void EmitMappingSymbol(bool IsData);

I'd use an enum, or have multiple internal implementations...

EmitDataMappingSymbol -> { nop on base class, on ARM, prints "$d" }
EmitCodeMappingSymbol -> { nop on base class, calling either
EmitThumbMappingSymbol or EmitARMMappingSymbol (private) on ARM }
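
A minimal standalone sketch of that split might look like the following. The class names (ElfStreamerBase, ArmElfStreamer) are hypothetical stand-ins, the emitted names are collected in a vector instead of being written to an object file, and the Thumb flag is set by hand where a real streamer would consult the assembler state.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a generic ELF streamer: both hooks are no-ops.
class ElfStreamerBase {
public:
  virtual ~ElfStreamerBase() = default;
  virtual void emitDataMappingSymbol() {}
  virtual void emitCodeMappingSymbol() {}
};

// Hypothetical ARM-specific streamer that knows about $a/$t/$d.
class ArmElfStreamer : public ElfStreamerBase {
public:
  void setThumb(bool Thumb) { IsThumb = Thumb; }

  void emitDataMappingSymbol() override { Emitted.push_back("$d"); }

  void emitCodeMappingSymbol() override {
    if (IsThumb)
      emitThumbMappingSymbol();
    else
      emitArmMappingSymbol();
  }

  const std::vector<std::string> &emitted() const { return Emitted; }

private:
  // Private helpers: callers only ask for "code" or "data".
  void emitArmMappingSymbol() { Emitted.push_back("$a"); }
  void emitThumbMappingSymbol() { Emitted.push_back("$t"); }

  bool IsThumb = false;
  std::vector<std::string> Emitted;
};
```

Target-independent code would call only the two public hooks through the base class, so non-ARM targets pay nothing and the ARM/Thumb choice stays inside the ARM streamer.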

+void MCELFStreamer::EmitMappingSymbol(bool IsData) {
+ // FIXME: The following is specific to the ARM. This should be moved
+ // to ARMAsmBackend.

Maybe MCARMELFStreamer (or whatever sounds nicer than that). ARMAsm is
a big bag of code and nowadays, most of it is format agnostic, I think
(asm, elf).

Thanks Renato. I'm finishing up a patch for this and will post it to llvm-commits. But one concern: to create an ARMELFStreamer as you recommend, I had to move MCELF.h to "include/llvm/MC" and add an MCELFStreamer.h to the same directory. Is that okay to do?

-Greg

This is not a trivial question, and I'll let others chip in.

Superficially, you'd think so and it might make sense in the long run,
but you have to consider why it wasn't there in the first place.

It may be that it didn't need to (in which case, it's ok to move), or
it may be that the design of MC needed those headers to be more
private (in which case it might not be ok).

Certainly, Tim or Jim know better than I do.