[RFC] ObjectYAML with Coverage Map sections

  • Enhance ObjectYAML to handle covmap sections.
    • obj2yaml can optionally dump covmap sections in pretty-printed form.
    • yaml2obj can parse covmap sections described in YAML and emit them.

Background

llvm/test/tools/llvm-cov/Inputs contains some object files. Because they are treated as opaque blobs, it is hard to tell what they contain, and when one is updated it is hard to tell what changed.

If they were described in text-formatted YAML, we could see what they contain and track them. Changes would show up as diffs, and we could expect a cleaner view of upcoming binary format changes.

Currently, obj2yaml dumps covmap-specific sections as blobs. obj2yaml could understand covmap sections and dump them pretty-printed, and yaml2obj could be used to generate object files for testing.

There is a discussion about removing test binaries.

Requirements

(in order of importance)

  • yaml2obj should emit object files that are usable as llvm-cov test inputs.
  • obj2yaml should emit YAML files that can be understood and parsed by yaml2obj.
  • obj2yaml should optionally emit covmap sections pretty-printed.
  • The YAML structure should ideally tolerate content changes, and its diffs should stay simple.
  • The YAML structure should ideally be flexible enough to apply not only to the current version but also to past and future versions.

Steps

I am implementing and testing a prototype.

Enhance ObjectYAML

Since ObjectYAML is not very extensible via plugins, the enhancements may be implemented in ObjectYAML itself. Changes to the tools (obj2yaml and yaml2obj) will be smaller.

Migrate llvm-cov tests to yaml2obj

FYI, we can shrink the test inputs by reducing their sections to __llvm_prf_names, __llvm_covmap, and __llvm_covfun. This can be done with:

llvm-objcopy \
    --only-section=__llvm_covfun \
    --only-section=__llvm_covmap \
    --only-section=__llvm_prf_names \
    --strip-unneeded

Make YAML output more prettyprinted

Hash values and opaque indices are not very readable. I think the YAML could be annotated with information from other sections.

I don't think our goal is to make the YAML as human-readable as possible. It is more important that it serves development and CI.

Integrate covmap encoder and decoder with LLVMCoverage's

This may be future work. Currently my prototype implementation is standalone.

Enhance YAML for non-current covmap versions

This may be future work.

Appendix: Example YAML snippets

--- !ELF
FileHeader:
  Class:           ELFCLASS64
  Data:            ELFDATA2LSB
  OSABI:           ELFOSABI_GNU
  Type:            ET_REL
  Machine:         EM_X86_64
  SectionHeaderStringTable: .strtab
Sections:
  - CustomSectionName: __llvm_covfun
    CovFun:
      - FnRef:           0x92FEBEFF6DCBC98A
        Signature:       0x8844F3EA913AC6F7
        FilenamesRef:    0x936D2263272E38EB
        fns:             [ 1 ]
        Expressions:
          - [ { Ref: 0 }, { Ref: 3 } ]
          - [ { Ref: 3 }, { Ref: 4 } ]
          - [ { Ref: 2 }, { Ref: 5 } ]
          - [ { Ref: 5 }, { Ref: 6 } ]
          - [ { Ref: 0 }, { Ref: 8 } ]
          - [ { Ref: 8 }, { Ref: 9 } ]
          - [ { Ref: 7 }, { Ref: 11 } ]
          - [ { Ref: 11 }, { Ref: 12 } ]
          - [ { Ref: 0 }, { Ref: 15 } ]
          - [ { Ref: 15 }, { Ref: 16 } ]
          - [ { Ref: 14 }, { Ref: 17 } ]
          - [ { Ref: 17 }, { Ref: 18 } ]
        Records:
          - { File: 0, dLoc: [ 10, 43, 11, 2 ], Ref: 0 }
          - { File: 0, dLoc: [ 1, 1, 0, 1 ], Skip: {  } }
          - { File: 0, dLoc: [ 1, 7, 0, 15 ], Ref: 0 }
          - { File: 0, dLoc: [ 0, 7, 0, 27 ], Ref: 0 }
          - { File: 0, dLoc: [ 0, 7, 0, 27 ], Decision: { BIdx: 7, NCond: 4 } }
          - { File: 0, dLoc: [ 0, 8, 0, 9 ], Ref: 0 }
          - { File: 0, dLoc: [ 0, 8, 0, 9 ], Branch: { True: { Ref: 3 }, False: { Sub: 0 }, MCDC: [ 1, 3, 2 ] } }
          - { File: 0, dLoc: [ 0, 13, 0, 14 ], Ref: 3 }
          - { File: 0, dLoc: [ 0, 13, 0, 14 ], Branch: { True: { Ref: 4 }, False: { Sub: 1 }, MCDC: [ 3, 0, 2 ] } }
          - { File: 0, dLoc: [ 0, 19, 0, 27 ], Ref: 2 }
          - { File: 0, dLoc: [ 0, 20, 0, 21 ], Ref: 2 }
          - { File: 0, dLoc: [ 0, 20, 0, 21 ], Branch: { True: { Ref: 5 }, False: { Sub: 2 }, MCDC: [ 2, 4, 0 ] } }
          - { File: 0, dLoc: [ 0, 25, 0, 26 ], Ref: 5 }
          - { File: 0, dLoc: [ 0, 25, 0, 26 ], Branch: { True: { Ref: 6 }, False: { Sub: 3 }, MCDC: [ 4, 0, 0 ] } }
          - { File: 0, dLoc: [ 0, 28, 1, 5 ], isGap: true, Ref: 1 }
          - { File: 0, dLoc: [ 1, 5, 0, 36 ], Ref: 1 }
          - { File: 0, dLoc: [ 1, 1, 0, 1 ], Skip: {  } }
          - { File: 0, dLoc: [ 1, 7, 0, 8 ], Ref: 0 }
          - { File: 0, dLoc: [ 0, 7, 0, 13 ], Ref: 0 }
          - { File: 0, dLoc: [ 0, 7, 0, 13 ], Decision: { BIdx: 10, NCond: 2 } }
          - { File: 0, dLoc: [ 0, 7, 0, 8 ], Branch: { True: { Ref: 8 }, False: { Sub: 4 }, MCDC: [ 1, 2, 0 ] } }
          - { File: 0, dLoc: [ 0, 12, 0, 13 ], Ref: 8 }
          - { File: 0, dLoc: [ 0, 12, 0, 13 ], Branch: { True: { Ref: 9 }, False: { Sub: 5 }, MCDC: [ 2, 0, 0 ] } }
          - { File: 0, dLoc: [ 0, 14, 0, 15 ], isGap: true, Ref: 7 }
          - { File: 0, dLoc: [ 0, 15, 1, 36 ], Ref: 7 }
          - { File: 0, dLoc: [ 0, 19, 0, 20 ], Ref: 7 }
          - { File: 0, dLoc: [ 0, 19, 0, 25 ], Ref: 7 }
          - { File: 0, dLoc: [ 0, 19, 0, 25 ], Decision: { BIdx: 13, NCond: 2 } }
          - { File: 0, dLoc: [ 0, 19, 0, 20 ], Branch: { True: { Ref: 11 }, False: { Sub: 6 }, MCDC: [ 1, 2, 0 ] } }
          - { File: 0, dLoc: [ 0, 24, 0, 25 ], Ref: 11 }
          - { File: 0, dLoc: [ 0, 24, 0, 25 ], Branch: { True: { Ref: 12 }, False: { Sub: 7 }, MCDC: [ 2, 0, 0 ] } }
          - { File: 0, dLoc: [ 0, 26, 1, 5 ], isGap: true, Ref: 10 }
          - { File: 0, dLoc: [ 1, 5, 0, 36 ], Ref: 10 }
          - { File: 0, dLoc: [ 1, 1, 0, 1 ], Skip: {  } }
          - { File: 0, dLoc: [ 1, 7, 0, 15 ], Ref: 0 }
          - { File: 0, dLoc: [ 0, 7, 1, 15 ], Ref: 0 }
          - { File: 0, dLoc: [ 0, 7, 1, 15 ], Decision: { BIdx: 18, NCond: 4 } }
          - { File: 0, dLoc: [ 0, 8, 0, 9 ], Ref: 0 }
          - { File: 0, dLoc: [ 0, 8, 0, 9 ], Branch: { True: { Ref: 15 }, False: { Sub: 8 }, MCDC: [ 1, 3, 0 ] } }
          - { File: 0, dLoc: [ 0, 13, 0, 14 ], Ref: 15 }
          - { File: 0, dLoc: [ 0, 13, 0, 14 ], Branch: { True: { Ref: 16 }, False: { Sub: 9 }, MCDC: [ 3, 2, 0 ] } }
          - { File: 0, dLoc: [ 1, 7, 0, 15 ], Ref: 14 }
          - { File: 0, dLoc: [ 0, 8, 0, 9 ], Ref: 14 }
          - { File: 0, dLoc: [ 0, 8, 0, 9 ], Branch: { True: { Ref: 17 }, False: { Sub: 10 }, MCDC: [ 2, 4, 0 ] } }
          - { File: 0, dLoc: [ 0, 13, 0, 14 ], Ref: 17 }
          - { File: 0, dLoc: [ 0, 13, 0, 14 ], Branch: { True: { Ref: 18 }, False: { Sub: 11 }, MCDC: [ 4, 0, 0 ] } }
          - { File: 0, dLoc: [ 0, 16, 1, 5 ], isGap: true, Ref: 13 }
          - { File: 0, dLoc: [ 1, 5, 0, 36 ], Ref: 13 }
  - CustomSectionName: '__llvm_covfun (1)'
    CovFun:
      - FnRef:           0xDB956436E78DD5FA
        Signature:       0x18
        FilenamesRef:    0x936D2263272E38EB
        fns:             [ 1 ]
        Expressions:     []
        Records:
          - { File: 0, dLoc: [ 24, 1, 12, 2 ], Ref: 0 }
          - { File: 0, dLoc: [ 5, 1, 0, 1 ], Skip: {  } }
          - { File: 0, dLoc: [ 4, 1, 0, 1 ], Skip: {  } }
  - CustomSectionName: __llvm_covmap
    CovMap:
      - FnBlobHash:      0x936D2263272E38EB
        WD:              ''
        Filenames:
          - mcdc-general.cpp
  - CustomSectionName: __llvm_prf_names
    PrfNames:
      - Names:
          - _Z4testbbbb
          - main
  - Type:            SectionHeaderTable
    Sections:
      - Name:            .strtab
      - Name:            __llvm_covfun
      - Name:            '__llvm_covfun (1)'
      - Name:            __llvm_covmap
      - Name:            __llvm_prf_names
      - Name:            .symtab
Symbols:
  - Name:            __llvm_covmap
    Type:            STT_SECTION
    Section:         __llvm_covmap
  - Name:            __llvm_prf_names
    Type:            STT_SECTION
    Section:         __llvm_prf_names
  - Name:            __covrec_92FEBEFF6DCBC98Au
    Type:            STT_OBJECT
    Section:         __llvm_covfun
    Binding:         STB_WEAK
    Size:            0x17C
    Other:           [ STV_HIDDEN ]
  - Name:            __covrec_DB956436E78DD5FAu
    Type:            STT_OBJECT
    Section:         '__llvm_covfun (1)'
    Binding:         STB_WEAK
    Size:            0x2F
    Other:           [ STV_HIDDEN ]
...

I'm not familiar with the content of these sections, but I am familiar with the ELF YAML stuff. I'm a strong believer in expanded YAML support for sections instead of using binary blobs. That being said, there may be occasions where assembly is more appropriate. A question you need to ask yourself is whether the yaml2obj support would be more useful than just using llvm-mc to turn some assembly into an object file.

Assuming YAML is the right path, this broadly seems a reasonable output to me. Do any of these entries refer to other symbols or similar that could be named instead of the raw value? If so, I'd prefer to support that from the get go as it'll make the YAML easier to understand and modify. Raw indexes/offsets/addresses can still be supported in addition, as this allows you to craft entries that don't refer to a specific symbol etc.

Another thing to consider is which, if any, of the fields can have natural values without being explicitly specified. Many fields in other ELF sections are automatically derived from other information, or have an obvious fallback (e.g. a zero value or an empty list). These fields should then be optional in the YAML, and if unspecified, should fall back to the automatic value.

Other miscellaneous thoughts/queries:

  • "CustomSectionName" - this should just be "Name", just as in many other section types supported by yaml2obj.
  • Do the entries in "Expressions" always consist of two Ref fields? If so, the list is unnecessary, and you could have simply "Ref1" and "Ref2" fields in each expression.
  • What are the empty Skip entries for in some Record entries?

Thanks for the comment. I'm happy to hear you are familiar with ELFYAML.

> A question you need to ask yourself is whether the yaml2obj support would be more useful than just using llvm-mc to turn some assembly into an object file.

In the case of covmap, many records are encoded as blobs in LLVM IR.
See also: LLVM Code Coverage Mapping Format (LLVM 20.0.0git documentation)
We didn't have a better way to generate object files than clang -fcoverage-mapping. Likewise, we didn't have a good way to dump covmap other than clang -fcoverage-mapping -Xclang -dump-coverage-mapping.

This is why I want to implement covmap in ObjectYAML.

> Do any of these entries refer to other symbols or similar that could be named instead of the raw value? If so, I'd prefer to support that from the get go as it'll make the YAML easier to understand and modify.

I had thought of it as a later step, since I first wanted to land YAML inputs in the llvm-cov tests. I'll give it higher priority, as you suggest. :)

My example has CovMap::FnBlobHash, but it is not an actual field; it is a virtual value (calculated from the strings below it) for convenience in debugging. CovFun::FilenamesRef points to it. CovFun::fns[] (or the expanded File in Records) could be shown as the Filenames in CovMap.
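As an illustration of replacing raw hashes with names, here is a minimal Python sketch. It assumes that FnRef is the low 64 bits of the MD5 digest of the function name, which is how LLVM's profile name hash is usually described; the helper names md5_low64 and resolve_fn_refs are hypothetical and not part of any LLVM tool. The FnRef values and names are taken from the example YAML above.

```python
import hashlib

def md5_low64(data: bytes) -> int:
    """Low 64 bits of the MD5 digest: the first 8 bytes read little-endian.

    Assumption: this mirrors how FnRef is derived from a function name.
    """
    return int.from_bytes(hashlib.md5(data).digest()[:8], "little")

def resolve_fn_refs(fn_refs, names):
    """Map each FnRef to a name from PrfNames if its hash matches, else None."""
    by_hash = {md5_low64(name.encode()): name for name in names}
    return {ref: by_hash.get(ref) for ref in fn_refs}

# Values copied from the example YAML (CovFun FnRef and PrfNames entries).
names = ["_Z4testbbbb", "main"]
resolved = resolve_fn_refs([0x92FEBEFF6DCBC98A, 0xDB956436E78DD5FA], names)
for ref, name in resolved.items():
    print(f"0x{ref:016X} -> {name}")
```

A dumper could use such a lookup to print a name next to each FnRef while still keeping the raw value, so that unresolvable or deliberately malformed references remain expressible.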

I think CovMap should not be dropped even if version numbers and all Filenames are resolved. It should be kept so that identical sections can be regenerated from the YAML.

> Another thing to consider is which, if any, of the fields can have natural values without being explicitly specified.

Good point. I had assumed they would always be omitted. On reflection:

  • obj2yaml may provide an option -redundant=true to emit obvious values.
  • yaml2obj should accept explicit values.
    • It may validate such explicit values (either emitting malformed files with warnings, or reporting errors).

Do these sound reasonable?

> "CustomSectionName" - this should just be "Name", just as in many other section types supported by yaml2obj.

I've confirmed that Name works in my prototype, and I'll switch to it in the next revision. Name was unavailable in my early attempts because of how I distinguished the section class between the predefined RawContent and custom sections.

> Do the entries in "Expressions" always consist of two Ref fields? If so, the list is unnecessary, and you could have simply "Ref1" and "Ref2" fields in each expression.

A term can be not only a Ref value but also another Expression, referenced with Sub or Add. My example above was too simple.

        Expressions:
          - [ { Add: 1 }, { Ref: 10 } ]
          - [ { Add: 2 }, { Ref: 8 } ]
          - [ { Add: 3 }, { Ref: 7 } ]
          - [ { Add: 4 }, { Ref: 6 } ]
          - [ { Add: 5 }, { Ref: 5 } ]
          - [ { Ref: 3 }, { Ref: 4 } ]

Note that an Expression has no operator of its own, only a pair of terms. I think they could be printed in a more human-friendly way with further analysis.

The reason I use a two-element sequence rather than a {LHS, RHS} mapping is that I want to keep lines short in the flow format.
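To make the pair-of-terms structure concrete, here is a small Python sketch that expands a term into an infix string. It rests on one reading of the examples in this thread, not on a confirmed specification: { Ref: N } names counter N, while { Add: N } and { Sub: N } refer to Expressions[N] and supply the operator. That reading is consistent with the first example, where a Branch has True: { Ref: 3 } and False: { Sub: 0 } with Expressions[0] being [ { Ref: 0 }, { Ref: 3 } ]. The function name render and the c<N> notation are made up for illustration.

```python
def render(term, expressions):
    """Render a term such as {"Ref": 3} or {"Add": 5} as an infix string."""
    tag, index = next(iter(term.items()))
    if tag == "Ref":
        return f"c{index}"
    # Assumption: the operator lives in the referencing term, not the expression.
    op = {"Add": "+", "Sub": "-"}[tag]
    lhs, rhs = expressions[index]
    return f"({render(lhs, expressions)} {op} {render(rhs, expressions)})"

# The Expressions list from the post above.
expressions = [
    [{"Add": 1}, {"Ref": 10}],
    [{"Add": 2}, {"Ref": 8}],
    [{"Add": 3}, {"Ref": 7}],
    [{"Add": 4}, {"Ref": 6}],
    [{"Add": 5}, {"Ref": 5}],
    [{"Ref": 3}, {"Ref": 4}],
]

print(render({"Add": 0}, expressions))
```

Under that assumption, { Add: 0 } expands to a left-nested sum of counters 3, 4, 5, 6, 7, 8 and 10, which is the kind of output that "further analysis" could flatten even more.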

> What are the empty Skip entries for in some Record entries?

(I've just noticed that Region is a more appropriate name than Record.)

I wanted value-less keys (e.g. Skip: None), so I chose an empty mapping. I thought it would be worse to use Skip: 0 or Skip: false, since 0 and false carry meaning of their own. Could you suggest a better null value for value-less keys?

Skip is one of the RegionKinds and has no counter value.

In my early attempts, I tried introducing std::variant. It was beyond my skill. Would it be worth introducing?

Please refer to prior art for existing ELF data. You'll note that we don't emit "obvious" values for other data, and I don't see a need to be able to. In general, obj2yaml may be useful for getting the basis of a representation for a test, but it should not be considered the end result, as there's usually lots of cruft that has no bearing on the behaviour the test is interested in (e.g. your example uses the SectionHeaderStringTable field, but it's not actually needed explicitly to generate a valid ELF that would demonstrate the llvm-cov sections).

Regarding "validating" explicit values in yaml2obj: it depends. Again, prior art is to be as permissive as possible. This allows for creating malformed inputs that can be used for exercising error case paths in the tests. We deliberately don't validate these fields, because it's not possible to know whether they are intentionally invalid or not. However, we do validate in some cases, where an input value cannot be represented, or would cause problems in representing other parts of the object later on. In these cases, we always emit an error. To my knowledge, there are no fields where a warning is emitted, because developers will generally not see those warnings (the warning will be hidden in the test output).

I think I overlooked what you were doing before. In general, yaml2obj determines the section format (and therefore the expected fields of the section) by the Type field. E.g. a section entry with fields as follows:

- Name: .rel.foo
  Type: SHT_RELA

This tells yaml2obj that the section is a relocation section and the fields can include things like relocation entries. This is in keeping with the general ELF style that sections with special meanings should be distinguished by their types, not their names, to reduce costly string comparisons. However, not all sections follow this principle, and I guess it's not always possible to retrofit a type to the section. As such, I think it's reasonable to add a two-part check, if needed. First, check the type, then if it's something like SHT_PROGBITS (which tends to be the section type used for "generic" sections), check to see if the name matches a known name that has special behaviour. Once you have this check, you can then add behaviour to recognise the fields you have when the section matches the expected name/type combo.
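The two-part check can be sketched in a few lines of Python. This is only an illustration of the dispatch order, not ObjectYAML code: the kind labels ("Relocation", "CovFun", and so on), the table contents and the helper names are hypothetical, and stripping a trailing " (N)" suffix is an assumption based on the '__llvm_covfun (1)' name in the example YAML.

```python
import re

# Hypothetical table: section types that fully determine the section kind.
KIND_BY_TYPE = {"SHT_RELA": "Relocation"}

# Hypothetical table: names with special behaviour among generic sections.
KIND_BY_NAME = {
    "__llvm_covfun": "CovFun",
    "__llvm_covmap": "CovMap",
    "__llvm_prf_names": "PrfNames",
}

def base_name(name: str) -> str:
    """Drop a trailing ' (N)' suffix used to tell same-named sections apart."""
    return re.sub(r" \(\d+\)$", "", name)

def section_kind(sec_type: str, name: str) -> str:
    """Check the type first; consult the name only for SHT_PROGBITS."""
    if sec_type in KIND_BY_TYPE:
        return KIND_BY_TYPE[sec_type]
    if sec_type == "SHT_PROGBITS":
        return KIND_BY_NAME.get(base_name(name), "RawContent")
    return "RawContent"

print(section_kind("SHT_PROGBITS", "__llvm_covfun (1)"))
```

With this ordering, a section named __llvm_covmap but typed SHT_RELA is still treated as a relocation section, so the name only matters for generic section types.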

I'm not sure I follow what you mean by "the flow format". Could you elaborate here, please?

I'm concerned that a fixed-size sequence will just require more validation within the ObjectYAML code, to check that there are exactly two entries, which unnecessarily bloats the code. By having two named fields, you can use mapRequired (or whatever the name is - I haven't looked at the code in a while) to have yaml2obj complain if one field is missing. I believe this would also naturally ensure there are no additional fields, but I'm not 100% sure on that one.

I think the problem is that you're trying to model the kinds as key/values themselves, when in fact they're a field in and of themselves, with a separate value field. In other words, I would suggest something like (written in the long-form for clarity):

        Regions:
          - File: 0
            dLoc: [ 24, 1, 12, 2 ]
            Kind: Ref
            Value: 0
          - File: 0
            dLoc: [ 5, 1, 0, 1 ]
            Kind: Skip

In this way, Value is an "optional" field (in terms of the mapping type), with its presence required for some kinds and forbidden for Skip kinds (depending on how the underlying data actually looks, it may or may not be appropriate to check that Value isn't specified for Skip - see my earlier comments).
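The Kind/Value rule can be illustrated with a short Python sketch over plain dictionaries. It is a model of the rule only, not of YAMLTraits: the function name check_region, the strict flag and the set of value-less kinds are assumptions made for the example. The strict flag reflects the point that rejecting a Value on Skip may or may not be appropriate.

```python
KINDS_WITHOUT_VALUE = {"Skip"}  # assumption: only Skip carries no Value

def check_region(region: dict, strict: bool = True) -> list:
    """Return a list of problems found in one region mapping."""
    problems = [f"missing required key: {key}"
                for key in ("File", "dLoc", "Kind") if key not in region]
    kind = region.get("Kind")
    has_value = "Value" in region
    if kind is not None and kind not in KINDS_WITHOUT_VALUE and not has_value:
        problems.append(f"Kind {kind} requires Value")
    # Permissive mode tolerates a stray Value, e.g. for crafting odd inputs.
    if strict and kind in KINDS_WITHOUT_VALUE and has_value:
        problems.append(f"Kind {kind} must not have Value")
    return problems

regions = [
    {"File": 0, "dLoc": [24, 1, 12, 2], "Kind": "Ref", "Value": 0},
    {"File": 0, "dLoc": [5, 1, 0, 1], "Kind": "Skip"},
]
for region in regions:
    print(region["Kind"], check_region(region))
```

Both regions from the suggestion above pass the check; a Ref region without Value, or (in strict mode) a Skip region with one, is reported.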

I don't know what you were trying to achieve, but I doubt std::variant is needed.


Excuse me, let me answer partially and quickly.

I assume that "custom sections" are derived from RawContent, so I'd like to make SHT_PROGBITS implicit (at least for ELF).

I meant "a one-liner, like JSON". Sorry, I'm new to LLVM YAML; I assumed "flow" was the term for it. For Region records, I think one record per line is easy to handle.

I'm using std::array<Cntr,2> for it. Let me check tomorrow whether YAMLTraits can validate such a fixed-length sequence. I'd like to omit tag keys for such a simple pair.

The encoded format doesn't have an explicit "Kind" key value, so I think it would be better to determine the kind of a Region from the combination of known keys. For example,

{Branch: {True: x, False: y, ...}}

would be simpler than:

{Kind: Branch, Branch: {True: x, False: y, ...}}

Note that Branch is determined by the tag in an extended CounterAndTag. A Branch then has no primary Counter, only its dedicated pair of Counters.
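Determining the kind from the keys that are present could look like the following Python sketch. It is a guess at the rule based on the region shapes in the example YAML (Branch, Decision, Skip, isGap with a counter, or a bare counter); the labels "Code" and "Gap", the function name region_kind and the error handling are assumptions for illustration.

```python
KIND_KEYS = ("Branch", "Decision", "Skip")  # keys that name a kind directly
COUNTER_KEYS = ("Ref", "Add", "Sub")        # keys that carry a primary counter

def region_kind(region: dict) -> str:
    """Infer the kind of a Region from the combination of keys present."""
    found = [key for key in KIND_KEYS if key in region]
    if len(found) > 1:
        raise ValueError(f"conflicting kind keys: {found}")
    if found:
        return found[0]
    if any(key in region for key in COUNTER_KEYS):
        return "Gap" if region.get("isGap") else "Code"
    raise ValueError("cannot determine region kind")

samples = [
    {"File": 0, "dLoc": [10, 43, 11, 2], "Ref": 0},
    {"File": 0, "dLoc": [1, 1, 0, 1], "Skip": {}},
    {"File": 0, "dLoc": [0, 7, 0, 27], "Decision": {"BIdx": 7, "NCond": 4}},
    {"File": 0, "dLoc": [0, 28, 1, 5], "isGap": True, "Ref": 1},
]
for sample in samples:
    print(region_kind(sample))
```

The cost of this approach is visible in the function: the parser has to decide what to do when two kind keys appear together or none does, which an explicit Kind field would avoid.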

I'm focusing on simplifying Regions with the "flow" one-line format.

Gotta sleep. Good night.

If you haven't already, please check out how other ELF sections are handled (see llvm-project/llvm/lib/ObjectYAML/ELFYAML.cpp at 86e4beb702fde407a35938a1c37279a61c0291e7 · llvm/llvm-project · GitHub and onwards for the core behaviour). I don't think there's any reason the llvm-cov sections should be handled any differently. A lot of what you're talking about is already demonstrated here. We should avoid adopting a different convention for specific section types, because that'll make them harder to maintain and potentially won't benefit from things like the commonSectionMapping behaviour.

RawContentSection is just one of the various section types, and is intended for sections where you specify a binary blob as the contents. That isn't the case here. The type can't be implicit if you're specifying your custom section type correctly, because it's from the type (and name) that the real ObjectYAML type is determined.

I wouldn't recommend trying to fight with YAMLTraits, if it doesn't already naturally support it. You're better off following existing patterns to achieve your goals rather than inventing new ones.

It's not uncommon in ELF YAML data types to have fields that don't correspond directly to ELF file format fields. This allows us to use the YAML to describe the behaviour in a straightforward way. For example, a section's "Name" field value is actually stored in a completely unrelated section, but we don't (normally) write that section's content in the YAML directly, and instead infer its contents from the values listed throughout the YAML document. As such, I wouldn't get hung up on whether the field exists in ELF or not. Instead, I'd focus on what is straightforward to implement (without costing readability in a dramatic way).

I've created a pull request. It is a skeleton interface. Enhancements will follow unless serious concerns come up.

FYI, I've created example branches, based on my WIP obj2yaml.

They are reduced, converted and prettyprinted (but not regenerated) files.

They are regenerated. #112694 affects some tests.

Let's bring the conversation back to the RFC, rather than in the PR, for better visibility. @MaskRay, do you have any thoughts?

From the PR:

I have a plan to integrate the covmap encoder with LLVMCoverage and the YAML sections. I am proposing a "plugin" since I want to avoid making LLVMObjectYAML depend on LLVMCoverage. LLVMObjectYAML can stay unaware of blob contents defined externally.

Do you think it would be better for LLVMObjectYAML to import the LLVMCoverage stuff?

I could be wrong, but I don't think LLVMObjectYAML has been designed to support a plugin mechanism and based on your PR, I don't think it's particularly straightforward or necessary at this stage.

I think it would be much more straightforward to allow ObjectYAML to rely on the Coverage library, if it is strictly needed. However, a part of me is wondering whether it is actually needed. For example, yaml2obj supports DWARF emission, but doesn't reference the DWARF library in LLVM. What exactly in LLVMCoverage does yaml2obj need, if we are following the approach I've outlined in the PR?

I am proposing "a plugin mechanism". Can I take it that you are against introducing plugins? Note that at first I didn't think my proposal could use plugins. As I progressed, I found it could be rewritten as a plugin.

It is true that it would be easier to implement the covmap-specific stuff in LLVMObjectYAML. In fact, I started prototyping in LLVMObjectYAML. I still think covmap is just a payload in ELF (and other object formats).

I will refactor CoverageReader and CoverageWriter to handle lightweight covmap records, which should be compatible with the YAML records. Right now they assume llvm-cov is the only user. Since the coverage map format is versioned, I don't think it would be good for another reader and writer to be implemented elsewhere.

I don't think DWARF is a good example to justify this, since it is well defined outside our project.

Isn't that what I said? I don't think it's necessary: your proposal is the first time, to my knowledge, that this has come up and plenty of other custom section types specific to LLVM are supported by yaml2obj without it.

This isn't unique to covmap though. DWARF is a particularly interesting example because it is supported by both ELF and Mach-O in a cross-format way (i.e. you could in theory take a DWARF block in an ELF YAML document and attach it to a Mach-O and yaml2obj would probably consume it just fine). If you are keen to support covmap in a manner that is flexible across formats, we can discuss alternative approaches that might work like the DWARF approach. However, the original proposal looked like you wanted it to look in the YAML the same as any other section (i.e. it's inlined in the Sections array).

Picking up on this point, it's entirely up to you, but if the format is versioned, then ideally yaml2obj would support all versions, so you could use it to test support for older and newer versions. We try to do this for other versioned section types (I believe the BB Addr stuff does this, but it's been a while since I looked at that). Another point is that you have to be careful of circular testing, i.e. you assume that your reading/writing code is good because your test says so, but the test could potentially use the same faulty code for checking the behaviour.

I've created a revised version of the covmap YAML.