[RFC] Format attribute: `__attribute__((format_like(...)))`

A few years ago, I worked to apply __attribute__((format)) to non-variadic functions, plugging the most significant hole that was left at the time in the -Wformat-nonliteral story. I’m coming back today to get some feedback on improving format checking to make -Wformat-nonliteral yet more useful: accepting format strings given types that are not represented in the function signature.

-Wformat-nonliteral warns when Clang is unable to infer the format specifiers used in a format string. For instance:

func myprintf(const char *str, ...) {
    va_list ap; va_start(ap, str);
    vprintf(str, ap); // warning: format string is not a string literal
    va_end(ap);
}

We warn in this instance because we don’t know how to guarantee that str is a valid format string for the argument list. This specific instance is resolved by annotating myprintf:

__attribute__((format(printf, 1, 2)))
func myprintf(const char *str, ...) {
    va_list ap; va_start(ap, str);
    vprintf(str, ap); // no diagnostic
    va_end(ap);
}

Since call sites of myprintf will warn for incorrect format strings, the implementation of myprintf does not introduce format string unsafety and the diagnostic is not emitted.

-Wformat-nonliteral is (was?) Clang community’s pet example of a diagnostic that wouldn’t be accepted by today’s standards. This is disappointing because format string problems are security problems, continue to surface every now and then, and they are preventable. A few years ago, I surveyed the reasons -Wformat-nonliteral triggers in our code bases and found the following categories, roughly in order of most prevalent to least prevalent:

  1. a format function is missing __attribute__((format))
  2. the format string is localized
  3. the format string is unknown at the point of passing it, but it could be verified
  4. the format string formats formal arguments instead of variable arguments (for instance, void error(const char *fmt, const char *file, int line))
  5. the format string is constructed at runtime to specialized precision fields
  6. the format string is a constant but it is held in a variable

These are now all addressable:

  1. add __attribute__((format))
  2. use __attribute__((format_arg))
  3. verify the format string with fmtcheck (or your format string flavor’s equivalent), which can be implemented with __attribute__((format_arg))
  4. as of clang-15, __attribute__((format)) applies to functions without variadic arguments or fall back to solution #3
  5. use variable precision flags such as printf("%*.*f", leading, decimals, value) or fall back to solution #3
  6. constify the variable or fall back to solution #3

(Note that although I’m only listing the busy work, as a result of these changes, we often find format string problems.)

As you see, solution #3 can carry a lot of water. There’s one use case that it can cover but that I think we can have a better solution for: when a function accepts a format string with fixed arguments, but those arguments are not passed to the function. For instance:

const char *firstname;
double temperature;

void print_temperature(const char *format) {
    printf(format, firstname, temperature);
}

This can be addressed today by using printf(fmtcheck(format, "Hi %s, it's %g degrees outside"), temperature), but it defers a preventable error to runtime.

I would like to propose a new attribute, __attribute__((format_like)), to check at compile-time that a format string is appropriate for a list of argument types. The syntax would be like __attribute__((format))'s, except that instead of indicating the first format argument, you would write out a format string that the argument should be compatible with:

__attribute__((format_like(
    flavor,
    format_string_param_index,
    format_string_literal)))

Used on print_temperature, it would look like this:

__attribute__((format_like(printf, 1, "%s %g")))
void print_temperature(const char *format);

With -Wformat, clang will then verify that callers of print_temperature pass a format string with specifiers compatible with “%s %g”. Inside the body of print_temperature, format will be assumed to have %s %g parameters for the purposes of verification.

As an alternative to a format string literal, I thought of accepting a list of argument types. This makes it potentially more useful in the context of C++ callers that might want to create that list by template expansion. On the other hand, I don’t believe that this is supported at all today, and if this is the only motivation, it will never happen. Additionally, it creates substantially longer attribute spellings in the regular case, and it causes problems with -Wformat that assumes there’s a source location for the specifier to point to and propose fixits for. It’s also more likely to be something that other compilers can support without massively going out of their way.

The bar to new attributes in clang is relatively low, but the area of format attributes in particular is already fairly crowded, so I wanted to preflight this with an RFC. In particular, there is partial overlap between __attribute__((format)) used on non-variadic functions and __attribute__((format_like)). For instance, although they have slightly different failure modes for code that doesn’t work, they are equivalent in code that does work:

__attribute__((format(printf, 1, 2)))
void print_windspeed(const char *fmt, const char *firstname, double speed);

__attribute__((format_like(printf, 1, "%s %g")))
void print_windspeed(const char *fmt, const char *firstname, double speed);

However, there are still cases that can only realistically be served by format applying to non-variadic functions, such as C++ variadic template functions:

template<typename... Args>
__attribute__((format(printf, 1, 2)))
void print(const char *fmt, Args ...args);

In the end, I don’t think that it’s a problem that in some cases, you can get away with either. Also, GCC (as of 14.2) still does not accept the format attribute on non-variadic functions and makes it an error; folks that could get away with either could find it easier to work with format_like since you can either test for its existence with __has_attribute(format_like) or let GCC ignore it as unknown.

Thoughts?

:white_check_mark: Clang consensus called in this message.

On the surface, I don’t see an issue with an attribute like this, though I’m still having trouble with the motivation above.

My biggest issue is the meaning of:
pass a format string with specifiers compatible with “%s %g”

If we mean ‘exactly those-in-that-order’, I think my implementation concerns are abated, though there are obvious portability/usability issues there.

I guess my concern really is “what do we mean by compatible” here? Would we support a %p instead of a %d? Or vice-versa? etc.

Thanks Erich. I’m using “compatible” in the same sense that a function with format_arg has to return a “compatible” format string (ie: roughly the same specifiers in the same place). The specifiers have to be in the same order. The general direction is to warn that two specifiers are incompatible if values that are valid for one specifier would be invalid for the other following -Wformat rules. %p is incompatible with %d even on platforms where pointers are int-sized because printf("%p", 1) emits a -Wformat diagnostic.

Specifically, here’s the AttrDocs.td novella (where I already bikeshedded myself and decided that format_matches is a better name than format_like). There are a handful of other cases that don’t derive very well from this rule that I’m including at the bottom. For instance, %p is not compatible with %s because then you could use %s in a format string where %p is expected and we cannot guarantee that this is safe.


The format attribute is the basis for the enforcement of diagnostics in the -Wformat family, but it only handles the case where the format string is passed along with the arguments it is going to format. It cannot handle the case where the format string and the format arguments are passed separately from each other. For instance:

static const char *first_name;
static double todays_temperature;
static int wind_speed;

void say_hi(const char *fmt) {
  printf(fmt, first_name, todays_temperature); // warning: format string is not a string literal
  printf(fmt, first_name, wind_speed); // warning: format string is not a string literal
}

int main() {
  say_hi("hello %s, it is %g degrees outside");
  say_hi("hello %s, it is %d degrees outside!"); // no diagnostic
}

In this example, fmt is expected to format a const char * and a double, but these values are not passed to say_hi. Without the format attribute (which cannot apply in this case), the -Wformat-nonliteral diagnostic triggers in the body of say_hi, and incorrect say_hi call sites do not trigger a diagnostic.

To complement the format attribute, Clang also defines the format_matches attribute. Its syntax is similar to the format attribute’s, but instead of taking the index of the first formatted value argument, it takes a C string literal with the expected specifiers:

static const char *first_name;
static double todays_temperature;
static int wind_speed;

__attribute__((__format_matches__(printf, 1, "%s %g")))
void say_hi(const char *fmt) {
  printf(fmt, first_name, todays_temperature); // no dignostic
  printf(fmt, first_name, wind_speed); // warning: format specifies type 'int' but the argument has type 'double'
}

int main() {
  say_hi("hello %s, it is %3.1f degrees outside"); // no diagnostic
  say_hi("it is %g degrees outside, have a good day %s!");
  // warning: format specifies 'double' where 'const char *' is required
  // warning: format specifies 'const char *' where 'double' is required
}

The third argument to format_matches is expected to evaluate to a C string literal even when the format string would normally be a different type for the given flavor, like a CFStringRef or a NSString *.

In the implementation of a function with the format_matches attribute, format verification works as if the format string was identical to the one specified in the attribute.

At the call sites of functions with the format_matches attribute, format verification instead compares the two format strings to evaluate their equivalence. Each format flavor defines equivalence between format specifiers. Generally speaking, two specifiers are equivalent if they format values of the same type. For instance, in the printf flavor, %2i and %-0.5d are compatible. When -Wformat-signedness is disabled, %d and %u are compatible. For a negative example, %ld is incompatible with %d.

Do note the following un-obvious cases:

  • Passing NULL as the format string is accepted without diagnostics.
  • When the format string is not NULL, it cannot miss specifiers, even in trailing positions. For instance, %d is not accepted when the required format is %d %d %d.
  • Specifiers for integers as small or smaller than int (such as %hhd) are all mutually compatible because standard argument promotion ensures that integers are at least the size of int when passed as variadic arguments. With -Wformat-signedness, mixing specifier for types with a different signedness still results in a diagnostic.
  • Format strings expecting a variable modifier (such as %*s) are incompatible with format strings that would itemize the variable modifiers (such as %i %s), even if the two specify ABI-compatible argument lists.
  • All pointer specifiers, modifiers aside, are mutually incompatible. For instance, %s is not compatible with %p, and %p is not compatible with %n, and %hhn is incompatible with %s, even if the pointers are ABI-compatible or identical on the selected platform. However, %0.5s is compatible with %s, since the difference only exists in modifier flags. This is not overridable with -Wformat-pedantic or its inverse, which control similar behavior in -Wformat.

I’m hopeful we can get more comments here… I have only just returned from WG21 (and this entire RFC was basically within the time that jsut about everyone was travelling), so haven’t done a deep dive on this yet (though I’ll review the patch soon!).

I’d like to hear what others thing about this, so that we can eventually call consensus/approve this RFC.

Alright, I read more closely this time, and I’m generally in favor. I am against the solution for compatible, and would prefer that we go with identical format strings.

I’m not sure I completely grok the -Wformat rules, but basically: If you have a format_matches attribute, you must have the exact same format string ‘items’ in the exact same order (but can add modifiers I think?).

SO basically: if you have a %s %d, you have to have exactly both of those, in that order, and no more in the format_matches attribute.

Would that be acceptable for your use cases? I think we can examine requests to relax this in the future, but would rather we start as strict as possible. I see us trying to derive ‘compatible’ as being error-prone on various ABIs.

Do flags have to match? e.g., is it fine to use %d and %+d or are those non-matching?

What about other differences like whitespace? e.g., are %d %d and %d %d fine?

Or non-whitespace? e.g., is %d toodles %d acceptable, or can it only have format specifier characters?

It’s turning out that most of my concerns are related to the fact that the format specifier string is being specified in two places and can have differences between them and those differences may matter. e.g., whitespace doesn’t matter in general, but it does for scanf specifiers including [, c, or n specifiers.

My interpretation is that he uses ‘compatible’ in a way I disagree iwth, but whitespace/non-flags doesn’t matter, just the same/compatible flags in the same order.

I was on vacation last week, I’m back now.

The parts of a format specifier that do not change the type of argument that it formats do not have to match, so flags, width and precision (when non-variable) do not have to match. Length specifiers have to match (but h and hh are treated as being ‘int’ type).

There are no restrictions on non-format-specifier characters in the format_matches format string. format_matches(printf, 1, "%s") has the same effect as format_matches(printf, 1, "Hello, %s!"). Allowing arbitrary contents can be useful to document the kind of string the function is expecting.

I expect that format_matches may not be very useful for scanf, for what it’s worth. I think that it’s mostly useful for printf-type formats, like printf, NSString/CFString, os_log, BSD kprintf, etc. My expectation for scanf would be that using format_matches will ensure there is no undefined behavior due to the format string specifiers not matching the format arguments, but parsing could still end up being wildly different.

1 Like

I don’t believe that we have a mutual understanding of “compatible” versus “identical”. It would be easier for me if we processed what parts of “compatible” you don’t like. Here’s a list of as many distinct examples as I could think of, with “my answer” (the current format_matches answer) and “the -Wformat answer” (given some valid argument list for the second specifier, whether substituting with the first specifier would diagnose).

  1. Flag difference: do you consider %#x to be valid against %x? My answer is “yes”, -Wformat’s answer is “yes”.
  2. Output difference: do you consider %x to be valid against %u? My answer is “yes”, -Wformat’s answer is “yes”.
  3. Output difference: do you consider %f to be valid against %g? My answer is “yes”, -Wformat’s answer is “yes”.
  4. Sign difference: do you consider %i to be valid against %u? My answer is “yes unless -Wformat-signedness is enabled”, -Wformat’s answer is the same.
  5. Width/precision: do you consider %8.4s to be valid against %s? My answer is “yes”, -Wformat’s answer is “yes”.
  6. Variable width/precision: do you consider %*s to be valid against %i %s? My answer is “no”, -Wformat’s answer is “yes”.
  7. Length: do you consider %c to be valid against %i? My answer is “yes” (due to standard argument promotion), -Wformat’s answer is “yes”.
  8. Length: do you consider %hhi to be valid against %i? My answer is “yes” (due to standard argument promotion), -Wformat’s answer is “yes”.
  9. Length: do you consider %i to be valid against %hhi? My answer is “yes” (due to standard argument promotion), -Wformat’s answer is “yes”.
  10. Length: do you consider %hi to be valid against %i? My answer is “yes” (due to standard argument promotion), -Wformat’s answer is “yes”.
  11. Length: do you consider %hhi to be valid against %hi? My answer is “yes” (due to standard argument promotion), -Wformat’s answer is “no”.
  12. Length: do you consider %li to be valid against %i? My answer is “no”, -Wformat’s answer is “no”.
  13. Length: do you consider %lli to be valid against %li? My answer is “no”, -Wformat’s answer is “no”.
  14. Pointee: do you consider %p to be valid against %s? My answer is “no”, -Wformat’s answer is “yes unless -Wformat-pedantic is enabled”.
  15. Pointee: do you consider %s to be valid against %hhn? My answer is “no”, -Wformat’s answer is “only for non-const char pointers”.
  16. Ordering: do you consider %u %s to be valid against %2$s %1%u? My answer is “yes”, -Wformat’s answer is “yes”.
  17. Missing arguments: do you consider %i to be valid against %i %i? My answer is “no”, -Wformat’s answer is “no”. (It’s worth saying this would be safe to do.)
  18. FreeBSD kprintf: do you consider %D to be valid against %i %s? My answer is “no”, -Wformat’s answer is “yes”.
  19. os_log: do you consider %{public}s to be valid against %s? My answer is “no”, -Wformat’s answer is “yes”.

I imagine that out of these, there’s a number where I’m saying “yes” and you would like the answer to be “no”. What are they? I may be able to derive other examples from your answers, but are there other cases that I didn’t bring up where you would like the answer to be “no” and you suspect it to be “yes” at this time?

As a table instead:

# Summary Given… Replacing with… format_matches -Wformat
1 Flag printf(“%x”, x) printf(“%#x”, x) yes yes
2 Output printf(“%u”, x) printf(“%x”, x) yes yes
3 Output printf(“%f”, x) printf(“%g”, x) yes yes
4 Signedness printf(“%i”, x) printf(“%u”, x) respect -Wformat-signedness respect -Wformat-signedness
5 Width/precision printf(“%8.4s”, x) printf(“%s”, x) yes yes
6 Variable width/precision printf(“%*s”, x) printf(“%i %s”, x) no yes
7 Length printf(“%i”, x) printf(“%c”, x) yes yes
8 Length printf(“%i”, x) printf(“%hhi”, x) yes yes
9 Length printf(“%hhi”, x) printf(“%i”, x) yes yes
10 Length printf(“%hi”, x) printf(“%i”, x) yes yes
11 Length printf(“%hi”, x) printf(“%hhi”, x) yes no
12 Length printf(“%i”, x) printf(“%li”, x) no no
13 Length printf(“%li”, x) printf(“%lli”, x) no no
14 Length printf(“%s”, x) printf(“%p”, x) no respect -Wformat-pedantic
15 Length printf(“%s”, x) printf(“%hhn”, x) no if x is non-const
16 Ordering printf(“%u %s”, x, y) printf(“%2$s %1%u”, x, y) yes yes
17 Missing arguments printf(“%i %i”, x, y) printf(“%i”, x, y) no no
18 FreeBSD kprintf kprintf(“%D”, x, y) kprintf(“%i %s”, x, y) no yes
19 os_log os_log(“%{public}s”, x) os_log(“%s”, x) no yes

I guess I immediately go to “we should just accept identical only” where ‘identical’ is “exactly the same textually” for the flags (so the only differences we accept are “things between the flags”).

Thinking a bit more, I think I’m OK with anything that changes “output” only, and doesn’t affect legality/input (at least until we’ve spent some time exploring each individual exception to this policy).

That said, given your questions, I could LIVE with:

“Yes” to me, this is one where the only thing that changes is output, so I think that is OK.

Kneejerk is “no”. But since they are the same size (both unsigned), and the only difference is format, I think I’m a “Yes”.

“Yes”, I think ‘e’ ‘f’ and ‘g’ are probably fine, again, just formats for the same input type.

This is a “no” for me. This is changing how the input bytes are interpreted as a ‘type’ level.

Agreed, width/precision I am OK with.

Definitely “no”.

I’m “no” on this one. I get that this is likely safe, but until we can come up with a good amount of use cases, I am against this.

“No” again, same reason.

Again, “No”.

Same “no”.

Still “no”.

Agreed with those “No”.

DEFINITELY “No”.

“No”.

I THINK my answer here is ‘no’? But I’m not super familiar with the ‘ordering’ component. That said, I’d be 100% ok rejecting this now (and hopefully simplifying the patch?) and revisiting in a little bit once this patch has settled.

DEFINITELY no.

Agreed on “No”.

Agreed on “No”.

So we only have a few differences there? I don’t like TYPES changing as I think those are encoded in the attribute intentionally/with meaning. Floats changing output kind/precision/etc, is fine for me. Non-length ‘flags’ I think are going to be all OK to change.

Thanks. Enumerating the differences: eliminate pretty much all leniency with length modifiers, eliminate leniency for signedness changes, rejecting ordering modifiers.

I’m not attached enough to the leniency on length modifiers to make a serious argument for it. This creates some differences with how -Wformat works, but they are not very serious differences, and it’s usually an easy fix on the project side (except the %i%c case).

I would like to see if you would be open to changing your mind on the other two. Specifically:

  • Diagnostics on specifier signedness are currently controlled by -Wformat-signedness for attribute((format)). This is respected by my patch. Would you be open to keep it this way? The complexity cost is about 9 lines (see lines 7158-7166).
  • What should the user experience be when developers try to use order modifiers? Should we emit incorrect diagnostics, cross-referencing specifiers in the order they appear instead of in the specified order? Should there be a diagnostic saying “order modifiers are not currently supported by format_matches” or equivalent? Should we allow the format string without checking? None of these options seem very good to me, so I would like to see if you would be OK keeping it working. The complexity cost is a little bit more spread out but still not very much (we just record the specifier position and std::sort the specifiers by it before comparing).

What’s causing changes all over the file is that I need an extra argument to CheckFormatString, and this reverberates somewhat deeply. About 13 of about 45 change blocks are just piping through the argument, and those are made worse by the fact these haven’t been changed since clang-format became enforced. If you’re able to mostly tune out argument list changes to functions that sound like CheckFormatString, the change will look about one-third smaller.

The sign mismatch issues is one that i feel somewhat strongly about? I guess not really from a technical perspective, but more one of an ‘interface design’ perspective. It seems that if the writer of the function is expecting the signedness/unsignedness to have meaning, removing it seems improper.

At the moment, I think a “just doesn’t match” is sufficient for bullet #2.

I suspect that this is very much just “I don’t feel comfortable enough with the order feature” though.

I (mostly) agree that assuming a green field implementation, -Wformat should warn for signedness errors, but it doesn’t. I don’t know how I would explain to users that they can control signedness warnings for attribute((format)) with -Wformat-signedness, but that they cannot for attribute((format_matches)). I think that this is different from the %s%p case (where -Wformat-pedantic controls the equivalent for attribute((format)), and we always warn for attribute((format_matches))). There are frequent security implications/memory safety concerns when changing whether the pointer is dereferenced or not, or whether the pointer value is printed or not, and there is a serious argument that the compiler should police those, and there is a serious argument that the source of ambiguity is new from attribute((format_matches)), justifying the extra enforcement. For signedness, I don’t think that there are frequent security implications, and I think that the exact same ambiguity already exists with attribute((format)), and already have -Wformat-signedness as a solution. There’s also that non-decimal formats (octal, binary, hexadecimal) all accept unsigned integers only, so it would be illegal to print an %i with %x if we warned by default for it. I’m not sure that reading an integer out of a va_list with the wrong signedness is well-defined, but I don’t believe any compilers try to exploit it for optimizations and I don’t know any platform where it doesn’t work reliably.

Regarding ordering modifiers, we don’t have a “format strings don’t match” diagnostic. We only have “these two specifiers don’t match” with source locations for each. As I understand it, the change you are suggesting is effectively to add the “order modifiers are not currently supported” diagnostic, except worded to say “format strings don’t match” with no indication of what doesn’t match. Is that correct?

I guess I would assume these would be different flags. IMO, they should.

That is what I’m getting at, but I am feeling less strongly now. IMO, the signedness differences are a greater consternation for me at the moment (for the simple fact that it is part of the interface-guarantees that this attribute is adding), and I’m hopeful someone with better knowledge of the ordering can come along and assure me that this isn’t concerning.

I think I’m coming up with different answers because of “compatible”. With __attribute__((format)), compatible means “do the number and types of the given values match the ones expected by the format specifier string.” But with format_like, compatible is a bit more broad because there’s an extra layer of indirection (the format_like function calls a formatted function with a format specifier it doesn’t control).

As a contrived example, imagine an interface like:

__attribute__((format_like(printf, 2, "%x")))
void log_important_data(char *buffer, const char *format);

which writes the data to the given string buffer via vsprintf. Passing %#x may not be sufficiently compatible because the alternative printing form (#) may change the number of characters written out to the buffer and cause a buffer overrun.

That said, it seems like an incredibly fragile design to have a function which calls some formatted output with arguments, but that function doesn’t control the format specifier string. Does this situation come up often enough in practice to warrant supporting it?

I should point out that in the provided example, the unsafety isn’t related to the attribute: accepting that you need to do this, then the function is exactly as unsafe with or without format_like/format_matches. This brand of unsafety is caused by calling sprintf, which format_matches does not encourage over its sane alternatives like snprintf or asprintf.

I have helped teams enable -Wformat-nonliteral in over 800 projects since 2021, and in my experience, this is the last reason that some projects are still balking at the work and that isn’t a one-off. This mainly comes up in localization use cases, like this:

@implementation Book {
	NSString *title;
	unsigned pages;
	unsigned currentPage;
}

-(NSString *)progressDisplayStringWithFormat:(NSString *)intlFormat {
	// expects something like @"Page %u out of %u"
	return [NSString stringWithFormat:intlFormat, currentPage, pages];
}

@end

then users of the Book class use progressDisplayStringWithFormat: to create strings that are shown in UIs.

I would go as far as to say that it would be realistic to enable -Wformat-nonliteral by default when this lands (in the sense that although there would be lots of diagnostics, there would be realistic solutions for everyone), but I’m leaving this conversation for another day.

Okay, it sounds like there may be sufficient use cases to warrant this functionality. As with the -Wformat-nonliteral case, I’m quite surprised but happy enough to assume that’s accurate.

I think the RFC has consensus, but I’d like @erichkeane to make the final call in terms of whether he’d like to see changes to the design still or not.

1 Like

I’m uncomfortable with some of level of permissibility that this diagnostic is starting with, but frankly, that could very well just be my discomfort with format strings. I’d vastly prefer we be ‘as restrictive as possible’ at first, and visit loosening up restrictions on a 1x1 basis, but it seems that the submitter has uses for all the loosenings already, so I won’t stand in the way.

I think the RFC is acceptable as-it-sits now.

1 Like