Some basic questions about LLVM version 1.8 bytecode format

I generated LLVM bytecode for a “hello world!” program just to get the basic bytecode structure. I have a few questions about the global info module and the global constants module where there have apparently been changes since 1.4. I would be happy to collect these differences and do an edit pass of the bytecode spec once my decoder is fully up-to-snuff again. I’ve put an annotated bytecode file after my questions to illustrate what I’m trying to sort out about the bytecode.

  1. In the global info module, it looks like an extra bit has been added to global and function definitions. I’m just guessing this because it appears to make the type slot info work out. What is the extra bit for? In this simple example, it appears to always be 1.

  2. There are only two function calls in this little file, and the first one decodes fine, but the second one appears to have the wrong type slot information. Just a guess: is this type slot info maybe always the actual function type slot minus 1 instead of the slot of the pointer to the function?

  3. Looks like library dependencies section is empty even though I would be expecting libc to be here. Unused?

  4. Looks like constant strings are initialized in the constants section now, since it looks like this section ID stuff in the globals module is not used or has changed? Also, I’m finding my constant string is definitely in the constants section when I expected to find just a type slot number.

  5. Again looks like the function pointer type wants to be 0x12 instead of 0x11 here?

  6. After this my decode of the last few bytes of the constants section just started to break down. Any insight you can give me re the meaning of these last few bytes in the constants module would be appreciated.

Here’s the bytecode file I’m looking at (annotated). Interesting bits are marked with five question marks:

Signature = llvc0
00000000 6c 6c 76 63 30

Module block ID = 0x01 and size = 0x0a3
01 00 00 00 a3 00 00 00

Format information
50 = 01010000
            ^ Target is little endian
           ^- Target pointers are 32-bit
          ^-- Target has endianness
         ^--- Target has pointer size
     ^^^^---- Bytecode format 5
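
A minimal sketch of how the format byte above breaks down, following the annotation: the low four bits are flags and everything above them is the format number. The function and key names are made up for illustration. The flag polarity (a set bit meaning big endian, 64-bit pointers, no endianness, no pointer size) is inferred from the annotation rather than checked against the LLVM sources.

```python
def decode_format_info(word: int) -> dict:
    """Split the format word per the annotation above (names are hypothetical)."""
    return {
        "big_endian": bool(word & 0x01),       # bit 0 clear -> little endian
        "pointers_64bit": bool(word & 0x02),   # bit 1 clear -> 32-bit pointers
        "no_endianness": bool(word & 0x04),    # bit 2 clear -> target has endianness
        "no_pointer_size": bool(word & 0x08),  # bit 3 clear -> target has pointer size
        "format": word >> 4,                   # remaining bits: bytecode format number
    }

info = decode_format_info(0x50)
print(info["format"])  # 5
```

For 0x50 all four flag bits are clear, which matches the annotated reading of a little-endian, 32-bit target.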

Hi Robert,

I generated LLVM bytecode for a "hello world!" program just to get the
basic bytecode structure. I have a few questions about the global
info module and the global constants module where there have
apparently been changes since 1.4.

Okay.

  I would be happy to collect these differences and do an edit pass of
the bytecode spec once my decoder is fully up-to-snuff again.

Great!

I've put an annotated bytecode file after my questions to illustrate
what I'm trying to sort out about the bytecode.

Very nice.

1) In the global info module, it looks like an extra bit has been
added to global and function definitions. I'm just guessing this
because it appears to make the type slot info work out. What is the
extra bit for? In this simple example, it appears to always be 1.

Bits 5 and higher are used for the slot table index. There is no special
significance. If you're seeing them all be 1, then you're only looking
at the ones with odd slot numbers (since bit 5 is the least significant
bit in the slot number).

There is one special case. If the linkage field is internal (value 3)
and the initializer field is 0 (false) then it indicates that the global
uses an extension word for its info. This is necessary if it has a
non-zero alignment or a section. Unfortunately, I don't think this is
currently documented. See lib/Bytecode/Writer at line 980 for the logic.
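
As a rough sketch of the layout described above (constant flag in bit 0, initializer flag in bit 1, linkage in bits 2-4, slot index from bit 5 up), the field can be pulled apart like this. The function and key names are invented for illustration, and the input is assumed to be the already-decoded integer for one global. The layout of the extension word itself is not covered here.

```python
INTERNAL_LINKAGE = 3  # the linkage value named in the explanation above

def split_global_field(value: int) -> dict:
    """Split one global-variable field into its parts (hypothetical helper)."""
    is_constant = bool(value & 0x1)       # bit 0
    has_initializer = bool(value & 0x2)   # bit 1
    linkage = (value >> 2) & 0x7          # bits 2-4
    return {
        "is_constant": is_constant,
        "has_initializer": has_initializer,
        "linkage": linkage,
        "type_slot": value >> 5,          # bits 5 and higher
        # Special case: internal linkage with no initializer flags an extension word.
        "uses_extension_word": linkage == INTERNAL_LINKAGE and not has_initializer,
    }

# Made-up example value: slot 13, internal linkage, constant with an initializer.
example = (13 << 5) | (INTERNAL_LINKAGE << 2) | 0b11
print(split_global_field(example))
```

When `uses_extension_word` comes back true, the remaining bits presumably need to be read according to the extension-word rules rather than as a plain slot number.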

2) There are only two function calls in this little file, and the
first one decodes fine, but the second one appears to have the wrong
type slot information. Just a guess: is this type slot info maybe
always the actual function type slot minus 1 instead of the slot of
the pointer to the function?

Slot 0 is reserved for arrays of sbyte .. an optimization for strings.

3) Looks like library dependencies section is empty even though I
would be expecting libc to be here. Unused?

Completely depends on your source language compiler. It's quite valid for
it to be empty. If it was generated with an old llvm-gcc3, it's possible
that the deplibs feature is not in your version of llvm-gcc3. Either
that or it doesn't depend on libc? I can't tell .. don't know how your
bytecode file was created.

4) Looks like constant strings are initialized in the constants
section now, since it looks like this section ID stuff in the globals
module is not used or has changed?

It is used. See the code I mentioned above.

Also, I'm finding my constant string is definitely in the constants
section when I expected to find just a type slot number.

Constant strings are handled specially. Instead of having a bunch of
values in the "sbyte" slot (one for each character), which was the
original design, we now detect constant array of sbyte as a special
case, assign its type as slot 0 and write the entire string of
characters as the value (instead of a value for each char).

5) Again looks like the function pointer type wants to be 0x12 instead
of 0x11 here?

I'm not following this question.

6) After this my decode of the last few bytes of the constants section
just started to break down. Any insight you can give me re the
meaning of these last few bytes in the constants module would be
appreciated.

Have you used llvm-bcanalyzer to read your bytecode files? It might help
you with your analysis.

Here's the bytecode file I'm looking at (annotated). Interesting bits
are marked with five question marks:

Signature = llvc0
00000000 6c 6c 76 63 30

Module block ID = 0x01 and size = 0x0a3
01 00 00 00 a3 00 00 00

Format information
50 = 01010000
            ^ Target is little endian
           ^- Target pointers are 32-bit
          ^-- Target has endianness
         ^--- Target has pointer size
     ^^^^---- Bytecode format 5

***********************************************************

Global type pool ID = 0x06 and size = 0x014
86 02 |llvc0........P..|
00000010 00 00

Global type pool

Number of definitions = 7
07

0x0d = Pointer to array of sbyte[18]
10 0e

0x0e = Array of sbyte[18]
0f 03 12

0x0f = Pointer to function int ()
  10 10

0x10 = Function int ()
0d 07 00

0x11 = Pointer to function int ( sbyte*, ... )
10 13

0x12 = Pointer to sbyte
10 |................|
00000020 03

0x13 = Function int ( sbyte*, ... )
0d 07 02 12 00
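
For reference, the seven definitions above can be decoded mechanically with a few lines. This sketch follows the annotation only. The leading byte of each definition is read as the kind (0x10 pointer, 0x0f array, 0x0d function), the first definition is assumed to land in slot 0x0d, every number is assumed to fit in one byte, and the function and variable names are made up.

```python
POINTER, ARRAY, FUNCTION = 0x10, 0x0F, 0x0D  # kind bytes as they appear in the dump

def decode_type_pool(data: bytes, first_slot: int = 0x0D) -> dict:
    """Return {slot: description tuple} for the pool bytes (count byte first)."""
    count, pos = data[0], 1
    pool = {}
    for slot in range(first_slot, first_slot + count):
        kind = data[pos]
        if kind == POINTER:      # kind, element type slot
            pool[slot] = ("pointer", data[pos + 1])
            pos += 2
        elif kind == ARRAY:      # kind, element type slot, element count
            pool[slot] = ("array", data[pos + 1], data[pos + 2])
            pos += 3
        elif kind == FUNCTION:   # kind, return type slot, param count, param slots...
            nparams = data[pos + 2]
            params = list(data[pos + 3 : pos + 3 + nparams])
            pool[slot] = ("function", data[pos + 1], params)
            pos += 3 + nparams
        else:
            raise ValueError(f"unhandled type kind 0x{kind:02x} at offset {pos}")
    return pool

# The type pool bytes from the dump above, starting with the count byte.
pool_bytes = bytes.fromhex("07 100e 0f0312 1010 0d0700 1013 1003 0d07021200")
for slot, desc in decode_type_pool(pool_bytes).items():
    print(hex(slot), desc)
```

The trailing 0x00 in the last function's parameter list is what the annotation renders as "...". The 20 bytes consumed here match the stated pool size of 0x014.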

***********************************************************

Module globals info ID = 0x05 and size = 0x01e
c5 03 00 00

Global definition
af 03 = 0000001110101111
                       ^ Is a constant
                      ^- Has an initializer
                   ^^^-- Linkage = internal
                  ^----- ??? <--- see question #1

There's nothing special about this bit, it's part of the slot number.

        ^^^^^^^^^^------ Type slot = sbyte[18]
01 = Value slot number of the initializer

End of globals
00

Function definition
e1 03 = 0000001111100001
                    ^^^^ Calling convention = 1
                   ^---- Internal
                  ^----- ??? <--- see question #1

Same thing. Part of the slot number.

        ^^^^^^^^^^------ Type slot = int (*)()

Function definition
00000030 b1 04 = 0000010010110001
                             ^^^^ Calling convention = 1
                            ^---- External
                           ^----- ??? <--- see question #1
                 ^^^^^^^^^^------ Type slot = 0x012 = sbyte*??? <--- see question #2

End of functions
00

Depends on no libraries??? <— see question #3
00

Target triple = "i686-pc-linux-gnu"
11 69 36 38 36 2d 70 63 2d 6c 69 6e |.....i686-pc-lin|
00000040 75 78 2d 67 6e 75

Section strings for globals: none??? <----- see question #4
00

I don't think I understand the question here.

Inline asm block: none
00

***********************************************************

Module constant pool ID = 0x03 and size = 0x01f
  e3 03 00 00

Module constant pool

One constant string sbyte[18] = "Hello RKM world!\n" ??? <--- see question #4
01 00 0e 48 |ux-gnu…H|
00000050 65 6c 6c 6f 20 52 4b 4d 20 77 6f 72 6c 64 21 0a |ello RKM world!.|
00000060 00

Yes, this is the value of the sbyte[18].

Hi Reid,

Thanks for your prompt response and thanks to both you and Chris for your patient help with my dumb and occasionally downright crazy questions.

As usual, I am being dense and more questions or clarifications of my questions are interspersed below:

Reid Spencer wrote:

Hi Robert,

  
I generated LLVM bytecode for a "hello world!" program just to get the
basic bytecode structure.  I have a few questions about the global
info module and the global constants module where there have
apparently been changes since 1.4.
    

Okay.

  
  I would be happy to collect these differences and do an edit pass of
the bytecode spec once my decoder is fully up-to-snuff again.  
    

Great!

  
I've put an annotated bytecode file after my questions to illustrate
what I'm trying to sort out about the bytecode.
    

Very nice.

  
1) In the global info module, it looks like an extra bit has been
added to global and function definitions.  I'm just guessing this
because it appears to make the type slot info work out.  What is the
extra bit for?  In this simple example, it appears to always be 1.
    

Bits 5 and higher are used for the slot table index. There is no special
significance. If you're seeing them all be 1, then you're only looking
at the ones with odd slot numbers (since bit 5 is the least significant
bit in the slot number).

There is one special case. If the linkage field is internal (value 3)
and the initializer field is 0 (false) then it indicates that the global
uses an extension word for its info. This is necessary if it has a
non-zero alignment or a section.  Unfortunately, I don't think this is
currently documented. See lib/Bytecode/Writer at line 980 for the logic.
  

The extension word part is in fact documented in the current bytecode spec.

Here’s my problem. Referring to the areas of my bytecode example below that say “see question #1”, when I decode the slot numbers starting from bit 5 as you say, I get 0x01d for the first type slot, 0x025 for the second type slot, etc. But as I’ve decoded the type table, my type slots end at 0x013, so I have no slot 0x01d or slot 0x025. It could be that I’m decoding my global types table wrong, but it appears to decode the same way it has since the early days of LLVM. If I assume bit 5 is some other junk and start looking for a type slot at bit 6, things match up better (the global string is the correct type, the function is a function type, etc.).

If I should really be starting with bit five, can you give me an example of how one of these globals points to its type table entry?
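
One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: the bytecode spec describes variable bit rate (VBR) integers, with seven payload bits per byte and the high bit acting as a continuation flag. The sketch below compares reading the three byte pairs from the example as plain 16-bit little-endian words against reading them as VBR values. It assumes these fields are VBR-encoded, which nothing in this thread confirms, and `read_vbr` is a hypothetical helper name.

```python
def read_vbr(data: bytes, pos: int = 0):
    """Read one VBR integer: 7 payload bits per byte, high bit set means more follow."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos

for raw in ("af 03", "e1 03", "b1 04"):
    data = bytes.fromhex(raw)
    as_word = int.from_bytes(data, "little")  # plain 16-bit little-endian reading
    as_vbr, _ = read_vbr(data)                # VBR reading (assumption)
    # Columns: bytes, word >> 5, word >> 6, vbr >> 5
    print(raw, hex(as_word >> 5), hex(as_word >> 6), hex(as_vbr >> 5))
# af 03 0x1d 0xe 0xd
# e1 03 0x1f 0xf 0xf
# b1 04 0x25 0x12 0x11
```

Under the VBR reading, taking the slot from bit 5 gives 0x0d, 0x0f, and 0x11. In the type table as decoded in the example, those are the pointer to sbyte[18] and the two pointer-to-function entries. That is only an observation about the arithmetic and would still need confirming against the writer code.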

  
2) There are only two function calls in this little file, and the
first one decodes fine, but the second one appears to have the wrong
type slot information.  Just a guess: is this type slot info maybe
always the actual function type slot minus 1 instead of the slot of
the pointer to the function?
    

Slot 0 is reserved for arrays of sbyte .. an optimization for strings.
  

I think I understand the part about the strings. Here I’m talking about the portion of my decoded bytecode example below that says “see question #2.” This refers to the second function, whose type slot appears to decode either as type slot 0x012 or type slot 0x025 depending on whether you start from bit five or bit six. In either case, it doesn’t appear to point to a function pointer type slot like the other function does. This could be explained if my interpretation of my type table below is messed up.

  
3) Looks like library dependencies section is empty even though I
would be expecting libc to be here.  Unused?
    

Completely depends on your source language compiler. It's quite valid for
it to be empty. If it was generated with an old llvm-gcc3, it's possible
that the deplibs feature is not in your version of llvm-gcc3. Either
that or it doesn't depend on libc?  I can't tell .. don't know how your
bytecode file was created.

  
4) Looks like constant strings are initialized in the constants
section now, since it looks like this section ID stuff in the globals
module is not used or has changed? 
    

It is used.  See the code I mentioned above.
  

Here I’m referring to the Module Global Info section of the current bytecode spec. In this section the sixth entry in the Module Global Info is described as “A length list of strings that defines a table of section strings for globals. A global’s SectionID is an index into this table.” In this example, this table now appears to be empty. Sounds like constant strings were at one time stored here, but now in this example I’m seeing them in the actual separate constants section. Am I misunderstanding? Are these different strings other than the ones now appearing in the constants section?

  
 Also, I'm finding my constant string is definitely in the constants
section when I expected to find just a type slot number.
    

Constant strings are handled specially. Instead of having a bunch of
values in the "sbyte" slot (one for each character), which was the
original design, we now detect constant array of sbyte as a special
case, assign its type as slot 0 and write the entire string of
characters as the value (instead of a value for each char).

Okay, so now I’m always going to see the strings appear directly after their definitions in the constants table as in the example below?

 

  
5) Again looks like the function pointer type wants to be 0x12 instead
of 0x11 here?
    

I'm not following this question.
  

On the line in the example below that says “see question #5” this appears to be another example of what I was talking about in question #2. It looks like again type slot 0x012 or 0x025 is being called out for what clearly should be a function pointer type, but the entry in the type table as I decoded it isn’t a function pointer type. Can you see where I’m going wrong in my decoding here?

  
6) After this my decode of the last few bytes of the constants section
just started to break down.  Any insight you can give me re the
meaning of these last few bytes in the constants module would be
appreciated.
    

Have you used llvm-bcanalyzer to read your bytecode files? It might help
you with your analysis.
  

I will give this a spin and see what it does!

Thanks again for all your help.

Cheers,

– Robert.