Making library calls for obj2yaml functionalities

Hi All,

Following up on https://lists.llvm.org/pipermail/llvm-dev/2020-July/143512.html, and https://reviews.llvm.org/D85408, we would like to consider a design which allows external tools to read the structured contents of the .bb_addr_map section with library calls into an LLVM library. At the same time, we need to have tools/obj2yaml tests in place for bb_addr_map. So it sounds like the perfect place to do it. However, the current structure does not expose the obj2yaml functionalities under lib/ObjectYAML.

In general, there seems to be an inconsistency regarding how obj2yaml and yaml2obj tools are structured. The latter has a nice wrapper which calls environment-dependent functions under lib/ObjectYAML, but the former has environment-dependent source files under tools/obj2yaml (elf2yaml.cc, coff2yaml.cc, etc).

I wanted to reach out to ask if there is any alternative or get an idea about the amount of refactoring work that is required to make the structure friendlier.

bests,

Hi Rahman,

Traditionally, the ability to read sections is a feature added to llvm-readobj/llvm-readelf. For some sections, it delegates to methods in places like the Object library and BinaryFormat, but for the more specialised sections, it typically has code local to itself doing the work. The same is true for other dumping tools like obj2yaml and llvm-objdump, which means in some cases, we have multiple varieties of parsers for the same thing. I’m not sure there’s necessarily a strong motivation for doing so, however, so I’d be happy to support functionality being added elsewhere in one of those libraries, which tools like obj2yaml and llvm-readobj can hook into. I’m also happy to support refactoring that improves code reuse within the tools, though I don’t have any further ideas on this.

Can I ask what your motivation for using obj2yaml is in this context? If it’s just for testing purposes, adding support to llvm-readobj/llvm-readelf would be the more normal way, as it allows you to dump just that section.

Only tangentially relatedly, I’ve only just seen your previous patch/email thread, and I do have one thing I’d like to ask if it can be changed. At the moment, the section type is SHT_PROGBITS, but I think it would be better, if possible, to define a new SHT_* type for the new section? In general, it is bad design to rely on section names to distinguish between different kinds of sections - this requires the linker and other tools to have to do unnecessary string comparisons, which are slower and messier than switching on the sh_type field.

James

James,
Thanks for the detailed response. Please see my thoughts inline.

Hi Rahman,

Traditionally, the ability to read sections is a feature added to llvm-readobj/llvm-readelf. For some sections, it delegates to methods in places like the Object library and BinaryFormat, but for the more specialised sections, it typically has code local to itself doing the work. The same is true for other dumping tools like obj2yaml and llvm-objdump, which means in some cases, we have multiple varieties of parsers for the same thing. I’m not sure there’s necessarily a strong motivation for doing so, however, so I’d be happy to support functionality being added elsewhere in one of those libraries, which tools like obj2yaml and llvm-readobj can hook into. I’m also happy to support refactoring that improves code reuse within the tools, though I don’t have any further ideas on this.

Can I ask what your motivation for using obj2yaml is in this context? If it’s just for testing purposes, adding support to llvm-readobj/llvm-readelf would be the more normal way, as it allows you to dump just that section.

Other than testing, we currently have code in an external tool called create_llvm_prof for parsing the “.bb_addr_map” section (+Han Shen who’s the main developer of that tool) and loading it in memory. It would’ve been great if we could just link with an LLVM library which includes the data-structure and parsing support. It looks like llvm-readelf is more structured around dumping/printing. However, I can see structures like ELFYAML::StackSizesSection in lib/ObjectYAML/ELFYAML.h which could be simply passed around.

Only tangentially relatedly, I’ve only just seen your previous patch/email thread, and I do have one thing I’d like to ask if it can be changed. At the moment, the section type is SHT_PROGBITS, but I think it would be better, if possible, to define a new SHT_* type for the new section? In general, it is bad design to rely on section names to distinguish between different kinds of sections - this requires the linker and other tools to have to do unnecessary string comparisons, which are slower and messier than switching on the sh_type field.

Yes, having a specific ELF section type would be great. Though I am clueless about how this change will be received given that the “.stack_sizes” section also uses SHT_PROGBITS. I can definitely look into it if I get some assurance.

James,
Thanks for the detailed response. Please see my thoughts inline.

Hi Rahman,

Traditionally, the ability to read sections is a feature added to llvm-readobj/llvm-readelf. For some sections, it delegates to methods in places like the Object library and BinaryFormat, but for the more specialised sections, it typically has code local to itself doing the work. The same is true for other dumping tools like obj2yaml and llvm-objdump, which means in some cases, we have multiple varieties of parsers for the same thing. I’m not sure there’s necessarily a strong motivation for doing so, however, so I’d be happy to support functionality being added elsewhere in one of those libraries, which tools like obj2yaml and llvm-readobj can hook into. I’m also happy to support refactoring that improves code reuse within the tools, though I don’t have any further ideas on this.

Can I ask what your motivation for using obj2yaml is in this context? If it’s just for testing purposes, adding support to llvm-readobj/llvm-readelf would be the more normal way, as it allows you to dump just that section.

Other than testing, we currently have code in an external tool called create_llvm_prof for parsing the “.bb_addr_map” section (+Han Shen who’s the main developer of that tool) and loading it in memory. It would’ve been great if we could just link with an LLVM library which includes the data-structure and parsing support. It looks like llvm-readelf is more structured around dumping/printing. However, I can see structures like ELFYAML::StackSizesSection in lib/ObjectYAML/ELFYAML.h which could be simply passed around.

The structures within the ObjectYAML library are really for serializing/deserializing the various section types. They’re not really designed with any other thought in mind. Whilst they could work, the usage of things like Optionals to represent a field that may or may not be present in the YAML, with a default value if it isn’t, and the various data types within the yaml namespace (e.g. Hex64 etc for distinguishing between numbers that should be represented in decimal or hex) makes these unsuitable for more general purposes. It’s quite likely that a representation could be shared between ObjectYAML and other libraries, but I don’t think it would belong in that code, and it would need to be adapted for the serialization process realistically, to maintain the expressiveness of the existing structs.

Only tangentially relatedly, I’ve only just seen your previous patch/email thread, and I do have one thing I’d like to ask if it can be changed. At the moment, the section type is SHT_PROGBITS, but I think it would be better, if possible, to define a new SHT_* type for the new section? In general, it is bad design to rely on section names to distinguish between different kinds of sections - this requires the linker and other tools to have to do unnecessary string comparisons, which are slower and messier than switching on the sh_type field.

Yes, having a specific ELF section type would be great. Though I am clueless about how this change will be received given that the “.stack_sizes” section also uses SHT_PROGBITS. I can definitely look into it if I get some assurance.

I don’t think there would be any pushback at all - if you take a look at the ELF.h in BinaryFormat, you’ll see several other existing LLVM-specific section type values. If the section type is something that would be more generically useful, it might actually be best to propose it elsewhere e.g. on the GNU mailing list, or even the ELF gABI list, if it is completely general, in which case a different name and possibly value for the type would be appropriate. I think I’m probably not the only one to consider .stack_sizes being specified by name only to be a mis-step. We might change it in the future, but then we’d probably need to keep support for both variations for a while after that for compatibility reasons.

James,
Thanks for the detailed response. Please see my thoughts inline.

Hi Rahman,

Traditionally, the ability to read sections is a feature added to llvm-readobj/llvm-readelf. For some sections, it delegates to methods in places like the Object library and BinaryFormat, but for the more specialised sections, it typically has code local to itself doing the work. The same is true for other dumping tools like obj2yaml and llvm-objdump, which means in some cases, we have multiple varieties of parsers for the same thing. I’m not sure there’s necessarily a strong motivation for doing so, however, so I’d be happy to support functionality being added elsewhere in one of those libraries, which tools like obj2yaml and llvm-readobj can hook into. I’m also happy to support refactoring that improves code reuse within the tools, though I don’t have any further ideas on this.

Can I ask what your motivation for using obj2yaml is in this context? If it’s just for testing purposes, adding support to llvm-readobj/llvm-readelf would be the more normal way, as it allows you to dump just that section.

Other than testing, we currently have code in an external tool called create_llvm_prof for parsing the “.bb_addr_map” section (+Han Shen who’s the main developer of that tool) and loading it in memory. It would’ve been great if we could just link with an LLVM library which includes the data-structure and parsing support. It looks like llvm-readelf is more structured around dumping/printing. However, I can see structures like ELFYAML::StackSizesSection in lib/ObjectYAML/ELFYAML.h which could be simply passed around.

The structures within the ObjectYAML library are really for serializing/deserializing the various section types. They’re not really designed with any other thought in mind. Whilst they could work, the usage of things like Optionals to represent a field that may or may not be present in the YAML, with a default value if it isn’t, and the various data types within the yaml namespace (e.g. Hex64 etc for distinguishing between numbers that should be represented in decimal or hex) makes these unsuitable for more general purposes. It’s quite likely that a representation could be shared between ObjectYAML and other libraries, but I don’t think it would belong in that code, and it would need to be adapted for the serialization process realistically, to maintain the expressiveness of the existing structs.

I see your point that these structures are mainly intended for representation purposes. Deserialization is all we need. Optional field is OK, but we don’t really need the default value. I guess the default value, Hex64, etc are introduced for representation and your concern is that we might end up mixing representation with semantics.

Only tangentially relatedly, I’ve only just seen your previous patch/email thread, and I do have one thing I’d like to ask if it can be changed. At the moment, the section type is SHT_PROGBITS, but I think it would be better, if possible, to define a new SHT_* type for the new section? In general, it is bad design to rely on section names to distinguish between different kinds of sections - this requires the linker and other tools to have to do unnecessary string comparisons, which are slower and messier than switching on the sh_type field.

Yes, having a specific ELF section type would be great. Though I am clueless about how this change will be received given that the “.stack_sizes” section also uses SHT_PROGBITS. I can definitely look into it if I get some assurance.

I don’t think there would be any pushback at all - if you take a look at the ELF.h in BinaryFormat, you’ll see several other existing LLVM-specific section type values. If the section type is something that would be more generically useful, it might actually be best to propose it elsewhere e.g. on the GNU mailing list, or even the ELF gABI list, if it is completely general, in which case a different name and possibly value for the type would be appropriate. I think I’m probably not the only one to consider .stack_sizes being specified by name only to be a mis-step. We might change it in the future, but then we’d probably need to keep support for both variations for a while after that for compatibility reasons.

I can start working on that change right away. I think this might even be useful for deserialization (taking SHT_LLVM_CALL_GRAPH_PROFILE as example).
1- Add a section type like SHT_LLVM_BB_ADDR_MAP.
2- Add appropriate structures in Object/ELF.h to deserialize that section.
3- Add a custom function like getBBAddrMapContent to parse that section (getSectionContentsAsArray won’t work because the parsing is more involved).

We can then call this code for both testing and deserialization in external tools. What do you think?

James,
Thanks for the detailed response. Please see my thoughts inline.

Hi Rahman,

Traditionally, the ability to read sections is a feature added to llvm-readobj/llvm-readelf. For some sections, it delegates to methods in places like the Object library and BinaryFormat, but for the more specialised sections, it typically has code local to itself doing the work. The same is true for other dumping tools like obj2yaml and llvm-objdump, which means in some cases, we have multiple varieties of parsers for the same thing. I’m not sure there’s necessarily a strong motivation for doing so, however, so I’d be happy to support functionality being added elsewhere in one of those libraries, which tools like obj2yaml and llvm-readobj can hook into. I’m also happy to support refactoring that improves code reuse within the tools, though I don’t have any further ideas on this.

Can I ask what your motivation for using obj2yaml is in this context? If it’s just for testing purposes, adding support to llvm-readobj/llvm-readelf would be the more normal way, as it allows you to dump just that section.

Other than testing, we currently have code in an external tool called create_llvm_prof for parsing the “.bb_addr_map” section (+Han Shen who’s the main developer of that tool) and loading it in memory. It would’ve been great if we could just link with an LLVM library which includes the data-structure and parsing support. It looks like llvm-readelf is more structured around dumping/printing. However, I can see structures like ELFYAML::StackSizesSection in lib/ObjectYAML/ELFYAML.h which could be simply passed around.

The structures within the ObjectYAML library are really for serializing/deserializing the various section types. They’re not really designed with any other thought in mind. Whilst they could work, the usage of things like Optionals to represent a field that may or may not be present in the YAML, with a default value if it isn’t, and the various data types within the yaml namespace (e.g. Hex64 etc for distinguishing between numbers that should be represented in decimal or hex) makes these unsuitable for more general purposes. It’s quite likely that a representation could be shared between ObjectYAML and other libraries, but I don’t think it would belong in that code, and it would need to be adapted for the serialization process realistically, to maintain the expressiveness of the existing structs.

I see your point that these structures are mainly intended for representation purposes. Deserialization is all we need. Optional field is OK, but we don’t really need the default value. I guess the default value, Hex64, etc are introduced for representation and your concern is that we might end up mixing representation with semantics.

Only tangentially relatedly, I’ve only just seen your previous patch/email thread, and I do have one thing I’d like to ask if it can be changed. At the moment, the section type is SHT_PROGBITS, but I think it would be better, if possible, to define a new SHT_* type for the new section? In general, it is bad design to rely on section names to distinguish between different kinds of sections - this requires the linker and other tools to have to do unnecessary string comparisons, which are slower and messier than switching on the sh_type field.

Yes, having a specific ELF section type would be great. Though I am clueless about how this change will be received given that the “.stack_sizes” section also uses SHT_PROGBITS. I can definitely look into it if I get some assurance.

I don’t think there would be any pushback at all - if you take a look at the ELF.h in BinaryFormat, you’ll see several other existing LLVM-specific section type values. If the section type is something that would be more generically useful, it might actually be best to propose it elsewhere e.g. on the GNU mailing list, or even the ELF gABI list, if it is completely general, in which case a different name and possibly value for the type would be appropriate. I think I’m probably not the only one to consider .stack_sizes being specified by name only to be a mis-step. We might change it in the future, but then we’d probably need to keep support for both variations for a while after that for compatibility reasons.

I can start working on that change right away. I think this might even be useful for deserialization (taking SHT_LLVM_CALL_GRAPH_PROFILE as example).
1- Add a section type like SHT_LLVM_BB_ADDR_MAP.
2- Add appropriate structures in Object/ELF.h to deserialize that section.
3- Add a custom function like getBBAddrMapContent to parse that section (getSectionContentsAsArray won’t work because the parsing is more involved).

We can then call this code for both testing and deserialization in external tools. What do you think?

That sounds about right to me. You’ll want to make sure you’ve got something that can test the values and parsing early on - gtest unit tests might be sufficient. You are probably also best off, if it’s not too involved, adding yaml2obj support early. That way it’s easier to create test inputs for e.g. llvm-readobj and obj2yaml.