Ideas for removing test binaries from the git repo

We have a lot of binary files in the llvm-project repository that are used as inputs for test cases. This is one of the things we get a low score on the security scorecard is the presence of binary files in our repository. The main reason that having binaries in the repository is discouraged is because they cannot be reviewed and inspected in the same way that code can, so it can be difficult to determine if they contain malicious code.

I’m looking for ideas on how we could remove these files from the repository while still maintaining our current test coverage, and if this isn’t possible if anyone has any ideas for how we could support a checkout/build mode where the project binaries can be built without having any of these binary files on the builder system.

Thanks,
Tom

Some of the files could be removed with some effort. I remember looking through some instances of binary files for testing the binary utilities (llvm-objdump, llvm-readobj, etc.), and it seems like they were checked in before we had proper tooling like yaml2obj and obj2yaml. These tests can probably be rewritten without too much effort, but that effort does need to be made. I’m not sure if yaml2obj or obj2yaml need to be extended for some of these cases though.

There are other instances where the binary form is an essential part of the test, like the bitcode compatibility tests. For these we’re essentially testing that we can load bitcode from an older version of LLVM, so we need to save the binary unless we have a checkout and builds of multiple LLVM versions readily available.

Then there are a couple other instances. Memprof has a couple profiles checked in that are binary for testing purposes due to the lack of a textual profile format (from what I believe, someone correct me if this is incorrect). There are also maybe a couple ML models check in (I believe ir2native is the only one) for testing purposes. That specific instance has been on the docket to be removed for a while though, and if we wanted to keep it in, could probably be rewritten to work with a model generated by a python script (as some of the others already are).

This list is definitely not exhaustive, just the examples I remember seeing at various points. Additionally it only covers stuff in llvm/.

I think with some reasonable effort, we could get rid of quite a few of the binary files present. Some have to remain for testing purposes like the bitcode compatibility tests. Others will require more effort.

A quick and dirty approach to a checkout that doesn’t include any arbitrary binaries that seems to work is just deleting the test directories, given that’s where the majority/all of the binaries are from what I can tell. This requires a one line patch to the CMake to get working for llvm/. That of course prevents any testing however.

It also would be possible (and probably even useful) to have obj2yaml, and yaml2obj support for llvm Bitcode. I use those tools a lot for inspecting binary files because the yaml representations tend to print more literal representations of the binary files than objdump. Having the same support for Bitcode could be handy.

But we don’t execute these binaries or include them in our project right? They are just “input test data”?
What is the impact? Is this just about a checkbox on an overly rigid scorecard, or is there a real issue here?

2 Likes

It’s still a risk, because someone could modify the build system in a subtle way to execute one of the binaries instead of reading them as test input.

e.g.

RUN: llvm-readelf %s/Inputs/malicious_binary | FileCheck
RUN: llvm-readelf; %s/Inputs/malicious_binary | FileCheck

In a large patch, this kind of change could easily be missed by a reviewer, and could be played off as a simple mistake by the contributor.

With the recent xz incident, the attack was executed using malicious test inputs, so I don’t think the scorecard is being overly rigid in this case.

4 Likes

Thanks Tom!

Wouldn’t rm -Rf */test before building the toolchain eliminate this risk? Or just directly Git - git-sparse-checkout Documentation ?
(the example you mention about lit command line being tweaked is only about running test, not really building the project)

Edit: oh you mentioned it initially:

if this isn’t possible if anyone has any ideas for how we could support a checkout/build mode where the project binaries can be built without having any of these binary files on the builder system.

So yeah, git sparse checkout seems to be what you need on a bot that builds your binaries I suspect (assuming that would please the scorecard).

And @boomanaiden154-1 mentioned it as well:

A quick and dirty approach to a checkout that doesn’t include any arbitrary binaries that seems to work is just deleting the test directories, given that’s where the majority/all of the binaries are from what I can tell.

The scorecard would still complain, but that’s OK. I’m more looking for a solution for distributors.

I wonder if having a no tests tarball would be useful too?

If people would use it, sure. It’s relatively little effort and if we put it somewhere like attached to the Github Release, there aren’t really any costs associated with that and the infrastructure to publish it is relatively trivial.

I never worked on packging for anything super security focused, but when I did packaging for a linux distro, I would’ve much rather had tests available than have them not available due to the lack of easy auditing of binary input data in tests. Others might have different opinions however.

Other solutions that don’t involve fixing all the tests that currently use binary inputs are probably easier/more timely.

But if you’re interested in fixing all the tests - yeah, as others have mentioned, yaml2obj improvements to make it easier to author tests that way (though a textual dump of binary data doesn’t make this any more secure in terms of code review, etc - it’s still totally opaque, so we’d want more of the high level yaml2obj descriptions so that the intent is more legible than hex/base64/whatever encoded binary data).

Some of the old debug info parsing (llvm-dwarfdump) tests used checked in binary objects before we conceded that checking in assembly and assembling it on the fly was good enough. So a bunch of old tests could be ported to that approach.

I think some of the early llvm-dwp tests probably used checked-in dwp files as inputs because there’s no tools for producing dwp files as assembly code. I guess llvm-dwp could be modified to produce such output, maybe (though since it doesn’t parse many of its input/output sections, they might still be pretty illegible unless we added a mode where it does parse and then produce structured, annotated output assembly for those sections), and use that as test input instead of the checked in binary .dwp files.

I believe many PDB binary files were added to LLVM during the process of developing the PDB libraries. This was necessary at the time because we had not bootstrapped PDB-file-writing functionality, and if we checked in only the build scripts to run MSVC to generate the files, we would only be able to run tests in particular build environments, and we would not be able to test PDB writing on non-MSVC environments. We have several lld-link tests that test for compatibility with MSVC-style PCH objects, and it’s not always possible to replicate these artifacts without an MSVC toolchain. Our YAML pdb conversion is not always 100% faithful.

I think today we could go back and reengineer most of these tests to use YAML inputs or lld-link outputs, but it raises the general question of how to write tests for reading undocumented binary file formats. We can check them in as base64 encoded text or ASCII hex or something, but it’s just as easy to hide malicious inputs in those formats. Do folks feel like that would be a meaningful improvement? If we need to meet this requirement, should we make it our project policy to always use ASCII encodings of binary inputs when they are required for testing?

I don’t think “tricking” the automated report generator by e.g. ascii-encoding opaque binary data (in base64 or equivalent) would be of any use to anyone.

I think the only viable way to resolve this concern is by having a fully-reproducible build that results in the same output whether the **/test and **/unittest directories are present in the build environment or not.

Of course, as the xz attack showed, even if we are validating this property of the build in LLVM’s build/CI systems, an attacker could condition their attack to only ever trigger on a distro’s downstream package build. So, to be safe, all downstream users would need to independently validate that the test data doesn’t affect the build output in each of their own build environments. Which is an unfortunate burden, but I can’t really see an alternative.

Could we start with the opposite? How to prevent people from adding more binaries?
Add to LLVM Developer Policy — LLVM 20.0.0git documentation a section on binaries:

  • use YAML2OBJ
  • compile assembler to objects and use llvm-readelf for testing
  • more creative ways

I was think this as well. We could easily enforce it with a PR status check as well.

I think if we can eliminate most of the remaining binary tests with yaml2obj/obj2yaml, then “vet” the few remaining ones (if their number is sufficiently small) and prevent addition of the new ones, it’d be good enough.

I’m not in favor of changing our policy yet without some accommodation for the general use case of writing tests for binary file format reading tools. Surely we are going to do this again at some point in the lifetime of the project.

Even if we have a heavy handed policy, where we just exile such tests from the monorepo, or say that is our long term goal without any followup, that’s a real answer. It at least points the way to what we plan to do with the long tail of binary test inputs.

This seems like the best way to start to me. However, I think we need to first decide if we want to go in this direction at all. How motivated are people to remove binaries from the repository? It does seem like it would make writing test cases harder. Is that cost worth the security benefits from having a binary free repo?

Convert the binaries to hex encoded ASCII. Executing that in malicious fashion won’t do anything, they’re fractionally less annoying to read in an editor, converting back to bin on the fly is trivial and cheap.

That’ll make the no binary files judgement thing pass, costs no test coverage, doesn’t block us adding more “binary” test inputs in the future. We get all the security benefits with zero ongoing cost and close to zero initial cost.

I’ve been doing this with binary test files in a proprietary system for ages, though to make git diff somewhat useful as opposed to on security grounds. Works great.

I’m not sure this is feasible for all the binaries within the repository. There are checked in binaries that are on the order of high single digit MB. Converting to hex encoded ASCII makes these binaries even more expensive to store than they already are.

These binaries are limited to a couple cases though and some of them (like the checked in ML models) should probably be removed entirely.

I think that some of the conversions, especially converting object files that predate yaml2obj into *.yaml files would be good even outside of reducing test binaries. This not only helps make things more auditable as it gets rid of binaries, but it also makes the tests significantly more understandable (you can read the input instead of needing to decipher it in a hex editor). I think this would be the case for the vast majority of test cases that currently utilize raw binaries as inputs.

I think there are two big hurdles to actually getting this done though. One is that new tooling would need to be developed/existing tooling extended. If we want to have a yaml2obj equivalent for Bitcode for example, we will need tooling to actually produce bitcode files from a textual format. yaml2obj and friends would also probably need some work to support all the cases currently present in the repository. Then there’s the actual hurdle of rewriting the tests. There are at least some tests where the first hurdle isn’t a blocker, and they just haven’t been rewritten yet as no one has gotten around to it.

I can’t think of any super obvious downsides to cleaning these up within the repository, but it will probably require a concentrated effort by someone/a group that wants to see this through. I think patches removing binaries would for the most part be appreciated.

I’d rather double the size of the files than lose the test coverage. Changing from binary to the ASCII yaml that constructs elf is probably bigger on disk than the ASCII hex anyway.

Not sure how this avoids the security problem. The malicious actor would just as easily “convert back to bin on the fly” for their purposes.