Using `__attribute__((format))` on the `format` function

Hi LLVM,

In my spare time over the last year, I’ve been improving __attribute__((format)) to make -Wformat-nonliteral more useful. -Wformat-nonliteral catches serious issues, such as accidentally using user-provided strings as format strings. Despite the embarrassment that it can save, unfortunately, the diagnostic gets little use. I believe that improving __attribute__((format)) can make -Wformat-nonliteral a lot more useful, to a point that–dare I dream–platform owners who care could conceivably enable it by default.

In clang-15, one of the biggest __attribute__((format)) improvements we made is the ability to use the attribute on non-variadic functions. This unblocks many uses, with two salient ones being:

  • annotating legacy functions in the style of the ancient __eprintf
  • annotating C++ variadic function templates that forward their arguments to a printf-family function

The latter is the interesting one for LLVM itself, since we define a format function in llvm/include/lib/Support/Format.h that does just that and that has over a thousand uses in the code base.

Since clang-15 landed, I did some work on applying __attribute__((format)) to our format function and found somewhere north of a hundred issues. These are some of the most common problems:

  • using a 64-bit specifier to print a 32-bit value (this coincidentally works in 64-bit builds but goes sideways in 32-bit builds)
  • using a size_t for a * precision or width specifier, which requires an int (this usually works in 32-bit builds but huge values will make it go sideways in 64-bit builds)
  • using %p to print values which are not pointers, and which are occasionally not pointer-sized altogether

There are also a few endearing issues, such as code confusing llvm::format with std::format and printing {:x8} to a debug stream. Additionally, there’s been an instance or two of missing format arguments or passing in too many format arguments. There’s only one instance of a format string coming from “user input”, in tablegen, and I don’t think that it’s a security hazard because the user input in this case is a .td file that ships along with the rest of the LLVM code (and we still found a safe way to handle it).

I’m starting this thread because I’m not sure where to go from here. Obviously, I’d like to contribute these fixes back to LLVM. However, I have one technical issue and one process issue that I’d like to discuss.

The technical issue, which is hopefully easily resolved, is that while gcc and older clangs support __attribute__((format)), they don’t support it on non-variadic functions. This means we need to find a way to figure out if using __attribute__((format)) in this way is going to work or not, and we can’t use __has_attribute(format) for it. I don’t think that it should be tested by CMake and added to llvm-config.h because that header can then be distributed and used with other compilers, which could result in the attribute being used when it shouldn’t, or it not being used when it could be.

The most reasonable solution seems to be to test for the feature with something like __has_extension(nonvariadic_format_attribute). However, I don’t think that there is precedent for __has_extension to test whether an extension to an extension is present. I’d like to hear how clang folks feel about that.

The process issue is that the change is sort of big and I’m not sure how to break it down so that the right folks review it. While the diff itself isn’t very large, it touches a fair spread of modules:

  • clang/lib/AST/Interp/Disasm.cpp
  • clang/lib/CodeGen/CodeGenModule.cpp
  • llvm/include/llvm/Support/Format.h
  • llvm/lib/CodeGen/MachinePipeliner.cpp
  • llvm/lib/DebugInfo/DWARF (3 files)
  • llvm/lib/DebugInfo/LogicalView (7 files)
  • llvm/lib/ExecutionEngine (7 files)
  • llvm/lib/MC/MCInstPrinter.cpp
  • llvm/lib/Object/MachOObjectFile.cpp
  • llvm/lib/ProfileData/GCOV.cpp
  • llvm/lib/Target/{ARM,Mips,PowerPCX86} (one file each)
  • llvm/tools/llvm-cov/CoverageReport.cpp
  • llvm/tools/llvm-objdump (2 files)
  • llvm/tools/llvm-profdata/llvm-profdata.cpp
  • llvm/tools/llvm-profgen/PerfReader.h
  • llvm/tools/llvm-size/llvm-size.cpp
  • llvm/utils/TableGen (2 files)

As I’m a new-ish contributor, I’d like to get advice on how to proceed with this to avoid needless change separation overhead while still getting the right people to look at what’s changing.

These two can be tackled in parallel, as I can contribute fixes to format warnings even if the warning isn’t enabled upstream. However, it might make verifying the fixes more tedious and it might cause a little extra churn if people accidentally introduce more incorrect uses of format while we’re working on those.

Thanks!

1 Like

Regardless of the attribute issue, I think this shows why llvm::format should transition from dangerous printf-style to std::format-like interface (ideally being replaced by it entirely once the compiler support is there). Not to mention the potential confusion it causes now (like in the code you linked).