[libc++][RFC] Implementing Directory Entry Caching for Filesystem

Hi All,


I have a couple of questions and concerns regarding implementations of P0317R1 [1], and I would like the community's help in answering them. The paper changes directory_entry to cache a number of attributes, including file status, permissions, file size, number of hard links, last write time, and more. For reference, this is the interface of directory_entry before this paper [2], and this is the interface after [3].
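For readers who haven't looked at the new interface yet, here is a small C++17 sketch of the post-P0317R1 directory_entry in use; which of these observers (if any) answer from a cache is exactly the latitude in question:

#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

int main() {
  for (const fs::directory_entry& ent : fs::directory_iterator(".")) {
    // Post-P0317R1, these observers are allowed to be answered from values
    // cached when the entry was produced by directory iteration or by
    // refresh(), instead of hitting the filesystem on every call.
    if (ent.is_regular_file())
      std::cout << ent.path() << " " << ent.file_size() << '\n';
  }

  // A non-const entry can be re-synchronized explicitly:
  fs::directory_entry self(".");  // the constructor calls refresh()
  self.refresh();                 // re-reads whatever the implementation caches
}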


The implementation has a lot of latitude over which attributes it caches, if any, and how it caches them. As I’ll explain below, each different approach can cause non-obvious behavior, possibly leading to TOCTOU bugs and security vulnerabilities!


My question for the community is this:


Given the considerations below, what should libc++ choose to do?


  1. Cache nothing?

  2. Cache as much as possible, when possible?

  3. Something in between?


I would like to land std::filesystem before the 7.0 branch, and doing so requires deciding on which implementation to provide now.


Note that this write-up only considers POSIX-based implementations of filesystem.

TOCTOU Violations

MSVC++’s caching is as atomic as the underlying platform can be (in that we call only one API when refreshing). When enumerating a directory, FindFirstFileW/FindNextFileW return a WIN32_FIND_DATAW, which contains all of the data that directory_entry can cache. Note that the data returned is information about a reparse point (e.g. symlink or junction), never the reparse point target. If the results indicate that a reparse point is present, most data becomes uncached, though we can still answer some questions, like is_directory, without following a reparse point. We treat IO_REPARSE_TAG_SYMLINK as a symbolic link, and IO_REPARSE_TAG_MOUNT_POINT as an implementation-defined file_type::junction. All other reparse points are treated like ordinary directories, as that is the intended behavior of hierarchical storage management, clustered FS, and similar. refresh() is similarly atomic in that it calls only GetFileAttributesExW, which returns a WIN32_FILE_ATTRIBUTE_DATA, containing everything WIN32_FIND_DATAW does except the reparse point type tag; so if we see a reparse point at all, we don’t/can’t cache whether it is a symlink or not if refresh() has been called.
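Roughly, the refresh path looks like the following sketch (names are illustrative, not our actual internals); the point is that a single GetFileAttributesExW call yields everything that gets cached, so there is no window between caching one attribute and the next:

#include <windows.h>
#include <cstdint>

struct cached_entry_data {                   // hypothetical cache layout
  DWORD         attributes = INVALID_FILE_ATTRIBUTES;
  std::uint64_t file_size  = 0;
  FILETIME      last_write = {};
  bool          valid      = false;
};

bool refresh_from_single_call(const wchar_t* path, cached_entry_data& out) {
  WIN32_FILE_ATTRIBUTE_DATA data;
  if (!GetFileAttributesExW(path, GetFileExInfoStandard, &data))
    return false;                            // cache stays invalid on failure

  out.attributes = data.dwFileAttributes;
  out.file_size  = (std::uint64_t(data.nFileSizeHigh) << 32) | data.nFileSizeLow;
  out.last_write = data.ftLastWriteTime;
  out.valid      = true;

  // WIN32_FILE_ATTRIBUTE_DATA carries no reparse tag, so if
  // FILE_ATTRIBUTE_REPARSE_POINT is set we cannot tell a symlink from a
  // mount point here -- exactly the limitation described above.
  return true;
}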

I think the best solution is to revert these changes to the standard

I think we have to be strongly against this, as we have already shipped this in production, and lack of caching prevents us from exposing data returned directly by our platform directory enumeration API (this is why we argued so strongly to put this in in the first place).

I think all stat()-like APIs are intrinsically vulnerable to the TOCTOU problem you describe. For example, even with “atomic attribute access”, your example “is_empty_regular_file” is broken before it even returns. If you built the same function out of a stat call it would be just as broken. I don’t think that is a problem std::filesystem can solve.

Billy3

MSVC++’s caching is as atomic as the underlying platform can be (in that we call only one API when refreshing). When enumerating a directory, FindFirstFileW/FindNextFileW return a WIN32_FIND_DATAW, which contains all of the data that directory_entry can cache. Note that the data returned is information about a reparse point (e.g. symlink or junction), never the reparse point target. If the results indicate that a reparse point is present, most data becomes uncached, though we can still answer some questions, like is_directory, without following a reparse point. We treat IO_REPARSE_TAG_SYMLINK as a symbolic link, and IO_REPARSE_TAG_MOUNT_POINT as an implementation-defined file_type::junction. All other reparse points are treated like ordinary directories, as that is the intended behavior of hierarchical storage management, clustered FS, and similar. refresh() is similarly atomic in that it calls only GetFileAttributesExW, which returns a WIN32_FILE_ATTRIBUTE_DATA, containing everything WIN32_FIND_DATAW does except the reparse point type tag; so if we see a reparse point at all, we don’t/can’t cache whether it is a symlink or not if refresh() has been called.

I think the best solution is to revert these changes to the standard

I think we have to be strongly against this, as we have already shipped this in production, and lack of caching prevents us from exposing data returned directly by our platform directory enumeration API (this is why we argued so strongly to put this in in the first place).

I suspected and stated that it is probably too late to do this. Nonetheless, I wanted to state my opinion. I really don’t like any of the options I have for implementing this.

Additionally, I still think there is a need for a type which “atomically” caches all of the attributes available via stat or lstat, so users can access a handful of them without multiple calls to the underlying API.
Though that can still be done orthogonally to this paper.
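Roughly the shape I have in mind (a hypothetical sketch, not proposed wording): a single lstat() captures every attribute at one point in time, and the accessors only read that snapshot.

#include <sys/stat.h>
#include <cstdint>
#include <stdexcept>
#include <string>

class file_snapshot {
public:
  explicit file_snapshot(const std::string& p) {
    if (::lstat(p.c_str(), &st_) != 0)
      throw std::runtime_error("lstat failed: " + p);
  }

  bool is_regular_file() const { return S_ISREG(st_.st_mode); }
  bool is_symlink()      const { return S_ISLNK(st_.st_mode); }
  std::uintmax_t file_size()       const { return st_.st_size; }
  std::uintmax_t hard_link_count() const { return st_.st_nlink; }

private:
  struct stat st_;  // the whole snapshot comes from one system call
};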

I think all stat()-like APIs are intrinsically vulnerable to the TOCTOU problem you describe. For example, even with “atomic attribute access”, your example “is_empty_regular_file” is broken before it even returns. If you built the same function out of a stat call it would be just as broken. I don’t think that is a problem std::filesystem can solve.

Yes, my is_empty_regular_file is still broken before it even returns. The point I was trying to make is that it should be obvious to the user how it’s broken and why.
When a user calls is_regular_file(ent) && file_size(ent) == 0, at least the existence of the bug is a lot more obvious upon reading, instead of being hidden behind a buggy interface.

And yes, POSIX is vulnerable to this everywhere. And although I can’t fix that, I don’t want to make libc++ more prone to TOCTOU than it needs to be.
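For comparison, the two spellings side by side; the free functions take a path (directory_entry converts implicitly), so each call is visibly a fresh query:

#include <filesystem>

namespace fs = std::filesystem;

bool cached_form(const fs::directory_entry& ent) {
  return ent.is_regular_file() && ent.file_size() == 0;  // may use the cache
}

bool explicit_form(const fs::directory_entry& ent) {
  // Each of these hits the filesystem again; the race window is visible
  // right there in the source.
  return fs::is_regular_file(ent) && fs::file_size(ent) == 0;
}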

Thank you for putting this together, I greatly appreciate it!

Out of curiosity, is the caching behavior something you expect you'll
make configurable for users of libc++ (for instance, a macro to enable
or disable caching), or are there issues preventing such an
implementation? My feeling is that we're going to have the usual push
and pull between security and performance where different users will
have different (and equally valid) use cases in mind justifying the
need for the performance provided by caching or the security red flags
enhanced by not caching. We should obviously pick a good default
behavior (and can argue over the definition of "good" in that
context), but I think no matter what option we pick, users will want
the other option to be available.

~Aaron

Does anyone have an example of something that is secure without caching that is insecure with caching? All of the examples posted as problematic are still insecure.

Billy3

Thank you for putting this together, I greatly appreciate it!

Out of curiosity, is the caching behavior something you expect you’ll
make configurable for users of libc++ (for instance, a macro to enable
or disable caching), or are there issues preventing such an
implementation? My feeling is that we’re going to have the usual push
and pull between security and performance where different users will
have different (and equally valid) use cases in mind justifying the
need for the performance provided by caching or the security red flags
enhanced by not caching. We should obviously pick a good default
behavior (and can argue over the definition of “good” in that
context), but I think no matter what option we pick, users will want
the other option to be available.

I hadn’t thought of making it configurable, but I’m happy to do so.
It’ll have to maintain the same ABI, so the cache will simply be unused,
but providing a configuration option seems feasible.

Adding CC’s for some other potentially interested parties.

Does anyone have an example of something that is secure without caching that is insecure with caching? All of the examples posted as problematic are still insecure.

+1. By my reading, different caching decisions can only make TOCTOU problems more or less acute; caching can’t introduce TOCTOU problems that don’t already exist. From that point of view, caching could actually be helpful, by making the TOCTOU problems more obvious. The only change we could meaningfully make here is to give users some conditions under which non-modifying operations on the same file are guaranteed to reflect a single self-consistent state of the file (which may or may not be the present state), but the benefit of that seems minimal.

I think this ship sailed when we decided to standardize a TOCTOU-prone API; all we can do now is try to educate users about what not to use std::filesystem for (e.g. anything remotely related to security).

Thank you for putting this together, I greatly appreciate it!

Out of curiosity, is the caching behavior something you expect you’ll
make configurable for users of libc++ (for instance, a macro to enable
or disable caching), or are there issues preventing such an
implementation? My feeling is that we’re going to have the usual push
and pull between security and performance where different users will
have different (and equally valid) use cases in mind justifying the
need for the performance provided by caching or the security red flags
enhanced by not caching. We should obviously pick a good default
behavior (and can argue over the definition of “good” in that
context), but I think no matter what option we pick, users will want
the other option to be available.

I hadn’t thought of making it configurable, but I’m happy to do so.
It’ll have to maintain the same ABI, so the cache will simply be unused,
but providing a configuration option seems feasible.

After attempting to implement a configuration option to disable caching, I don’t think it’s possible.
Because so much of <filesystem> lives across a library boundary, it’s tricky to make that library code sensitive to a macro defined later by the user.

That being said, perhaps we could file an issue to add directory_entry::clear_cache() or similar,
so that users can explicitly make their directory_entry cacheless?
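Roughly (hypothetical, no such member exists today):

// class directory_entry {
// public:
//   void clear_cache() noexcept;  // discard cached attributes; later
//                                 // observers re-query the filesystem
// };

Until something like that exists, the closest a user can get is to ignore the entry's cached observers and call the free functions on its path:

#include <filesystem>

namespace fs = std::filesystem;

bool is_regular_uncached(const fs::directory_entry& ent) {
  return fs::is_regular_file(ent.path());  // always a fresh query
}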

/Eric

Does anyone have an example of something that is secure without caching that is insecure with caching? All of the examples posted as problematic are still insecure.

+1. By my reading, different caching decisions can only make TOCTOU problems more or less acute; caching can’t introduce TOCTOU problems that don’t already exist. From that point of view, caching could actually be helpful, by making the TOCTOU problems more obvious. The only change we could meaningfully make here is to give users some conditions under which non-modifying operations on the same file are guaranteed to reflect a single self-consistent state of the file (which may or may not be the present state), but the benefit of that seems minimal.

I think this ship sailed when we decided to standardize a TOCTOU-prone API; all we can do now is try to educate users about what not to use std::filesystem for (e.g. anything remotely related to security).

Thanks for the excellent input. I’ll go ahead and implement it with caching.

/Eric

Does anyone have an example of something that is secure without caching
that is insecure with caching? All of the examples posted as problematic are
still insecure.

+1. By my reading, different caching decisions can only make TOCTOU problems
more or less acute; caching can't introduce TOCTOU problems that don't
already exist.

That is my belief as well. However, I don't find the argument to be
persuasive, so it's a -1 from me.

From that point of view, caching could actually be helpful,
by making the TOCTOU problems more obvious.

Caching is hidden from the user's view, and I fail to see how
something hidden from the user's view will make anything more
obvious. What caching does do is extend the length of time over which
TOCTOU bugs can occur, which is worse behavior from a security
perspective. Users will definitely introduce exciting TOCTOU bugs into
their code with or without the caching mechanism, but I'm worried
about worsening the effects in practice.

The only change we could
meaningfully make here is to give users some conditions under which
non-modifying operations on the same file are guaranteed to reflect a single
self-consistent state of the file (which may or may not be the present
state), but the benefit of that seems minimal.

I think this ship sailed when we decided to standardize a TOCTOU-prone API;
all we can do now is try to educate users about what not to use
std::filesystem for (e.g. anything remotely related to security).

"This doesn't introduce new security issues, it just makes existing
ones worse" is not an argument that leaves me with warm, fuzzy
feelings. The goal is obviously to prevent the TOCTOU bugs in the
first place, but if we cannot achieve that, we shouldn't exacerbate
the TOCTOU problems only because we can't achieve the ideal.

~Aaron

Thank you for putting this together, I greatly appreciate it!

Out of curiosity, is the caching behavior something you expect you'll
make configurable for users of libc++ (for instance, a macro to enable
or disable caching), or are there issues preventing such an
implementation? My feeling is that we're going to have the usual push
and pull between security and performance where different users will
have different (and equally valid) use cases in mind justifying the
need for the performance provided by caching or the security red flags
enhanced by not caching. We should obviously pick a good default
behavior (and can argue over the definition of "good" in that
context), but I think no matter what option we pick, users will want
the other option to be available.

I hadn't thought of making it configurable, but I'm happy to do so.
It'll have to maintain the same ABI, so the cache will simply be unused,
but providing a configuration option seems feasible.

After attempting to implement a configuration option to disable caching, I
don't think it's possible.
Because so much of <filesystem> lives across a library boundary, it's tricky
to make that library
code sensitive to a macro defined later by the user.

Crud, I was worried that might be the case. :(

That being said, perhaps we could file an issue to add
`directory_entry::clear_cache()` or similar,
so that users can explicitly make their `directory_entry` cacheless?

I'd have to think about this further, but my concern is primarily with
users who already have TOCTOU bugs that will have their attack vectors
expanded by caching. If they knew their code had a TOCTOU bug in it
and cared about that, they wouldn't need this function at all because
they'd fix their code. Based on that, I'm not certain it'd be much of
a win and could potentially lead to even worse code, like thinking
clear_cache() fixes the TOCTOU issues rather than fixing the real
issue.

~Aaron

Does anyone have an example of something that is secure without caching
that is insecure with caching? All of the examples posted as problematic are
still insecure.

+1. By my reading, different caching decisions can only make TOCTOU problems
more or less acute; caching can’t introduce TOCTOU problems that don’t
already exist.

That is my belief as well. However, I don’t find the argument to be
persuasive, so it’s a -1 from me.

From that point of view, caching could actually be helpful,
by making the TOCTOU problems more obvious.

Caching is hidden from the user’s view, and I fail to see how
something hidden from the user’s view will make anything more
obvious. What caching does do is extend the length of time over which
TOCTOU bugs can occur, which is worse behavior from a security
perspective. Users will definitely introduce exciting TOCTOU bugs into
their code with or without the caching mechanism, but I’m worried
about worsening the effects in practice.

Generally speaking, I would expect “worsening the effects” of the bugs to make those bugs more obvious.

The only change we could
meaningfully make here is to give users some conditions under which
non-modifying operations on the same file are guaranteed to reflect a single
self-consistent state of the file (which may or may not be the present
state), but the benefit of that seems minimal.

I think this ship sailed when we decided to standardize a TOCTOU-prone API;
all we can do now is try to educate users about what not to use
std::filesystem for (e.g. anything remotely related to security).

“This doesn’t introduce new security issues, it just makes existing
ones worse” is not an argument that leaves me with warm, fuzzy
feelings. The goal is obviously to prevent the TOCTOU bugs in the
first place, but if we cannot achieve that, we shouldn’t exacerbate
the TOCTOU problems only because we can’t achieve the ideal.

That’s not so obvious to me. If exacerbating the TOCTOU problems makes people more aware of those problems, that could well be a net win. Conversely, to the extent that the caching design makes it easier to use std::filesystem in security-sensitive contexts, that is likely to be a net loss, because std::filesystem simply should not be used in security-sensitive contexts.

Keep in mind, security exploits are generally not statistical in nature; halving the time window for a TOCTOU attack does not halve your vulnerability, because the components of the attack are not occurring at random; they’re being induced by an attacker. Shortening the time window may have some security benefit on the margins in terms of increasing the effort required for a successful exploit, but that’s highly contingent and uncertain. I expect the security benefits of eliminating a TOCTOU vulnerability (e.g. by using a safer filesystem API) to dwarf the benefits of almost any reduction in the vulnerability time window.

As I'll explain below, each different approach can cause non-obvious
behavior, possibly leading to TOCTOU bugs and security
vulnerabilities!

Anyone who has real TOCTOU concerns (like operating in directories controlled by other users) has to do special things anyway: in C++17, they can current_path("foo/bar") to avoid re-resolving foo and bar on future lookups, while in POSIX the better answer is to use the *at functions to avoid interference between threads and libraries via the process-wide shared CWD state.
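For anyone less familiar with the *at style, a minimal POSIX sketch (the "foo/bar" and "baz" names are just placeholders): the directory is opened once, and every later lookup is resolved relative to that descriptor, independent of the process-wide CWD.

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int dirfd = ::open("foo/bar", O_RDONLY | O_DIRECTORY);
  if (dirfd < 0)
    return 1;

  struct stat st;
  // Resolve "baz" relative to the already-opened directory; with
  // AT_SYMLINK_NOFOLLOW the link itself is examined, not its target.
  if (::fstatat(dirfd, "baz", &st, AT_SYMLINK_NOFOLLOW) == 0)
    std::printf("mode: %o\n", (unsigned)(st.st_mode & 07777));

  ::close(dirfd);
}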

Above, a TOCTOU violation may occur if another process acts on the
file referenced by `ent`, changing the attributes between the point
the directory_entry was created and the point at which the
`is_symlink` condition is checked.

But another process can already act between the is_symlink and the call to remove. Absent a userspace dentry-level interface, it's not clear what the additional concern is from the is_symlink value being cached.

However, at least the above example makes it clear *when* the check
is occurring.

Why does it matter that the location of the check is obvious so long as (it's at least as obvious that) there's a race window after it?

Using the `refresh` function, either called directly by the user, or
implicitly by the constructors or modifying methods (Ex. `assign`).
`refresh` has the ability to fully populate the cache

It would certainly be nice to cache all the results from [l]stat(2) (C++17 Late NB comment 20). However, I'm not sure that you can cache anything more in refresh() than in directory_iterator::operator++. The normative encouragement in [fs.class.directory_entry]/2 specifically says "during directory iteration" without suggesting that other attributes could be cached at other times.

Of course, then I'm not sure that even d_type can be cached: [fs.dir.entry.obs]/27 says that a file_status is either cached or not, and it also contains the permissions (from stat(2)).

if (!ent.is_symlink())
  return false;

// We should have the file_status for the symlink cached, right?
file_status st = ent.symlink_status();

// Nope! Only when `refresh` has been called are the perms cached.
assert(is_symlink(st)); // May fail!

The assertion failure is surprising only if the author assumed that the implementation was caching anything. Caching of any particular attribute can only increase consistency with regard to it, in that there are fewer system calls that might observe updates. So code that is correct without caching will not become more racy when it is introduced.

bool is_empty_regular_file(path p) {
  directory_entry ent(p); // Calls refresh()
  return ent.is_regular_file() && ent.file_size() == 0;
}

It's fine to write code this way in an attempt to capitalize on caching as an optimization, but any coherence it adds is non-portable. There is simply no way in C++17 to make this function reliable (thus the NB comment). Of course, there's only so much a race-free implementation could get you, given the race window before any subsequent operation.
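For comparison, the closest one can get outside std::filesystem on POSIX is to take both answers from a single fstat() on an already-opened descriptor, so "regular" and "size == 0" at least describe the same inode at the same instant (sketch only; there is still a race between this function and whatever the caller does next):

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <string>

bool is_empty_regular_file_fd(const std::string& p) {
  int fd = ::open(p.c_str(), O_RDONLY);  // follows symlinks, like status()
  if (fd < 0)
    return false;
  struct stat st;
  bool result = ::fstat(fd, &st) == 0 && S_ISREG(st.st_mode) && st.st_size == 0;
  ::close(fd);
  return result;
}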

Worse yet, the latter case, which fully populates the cache
non-atomically, potentially commits a TOCTOU violation itself, and
this violation is hidden from the user, preventing them from knowing
it exists or being able to do anything about it.

First, note that both this caching and is_empty_regular_file (without any caching) can produce "impossible" results:

In a directory that never contains any empty regular files, is_empty_regular_file can return true because a regular file is replaced with an empty non-regular file (e.g., a device file).

In a directory that never contains a symlink to a directory, a non-atomic caching directory_entry can have is_symlink and is_directory true simultaneously because a symlink was replaced with a directory (or vice versa).

The latter case concerns me a bit more: because the order of the system calls might not match the order of the queries, they can return results inconsistent with what is known to be a single possibly-concurrent change to a file. The motivated user can use refresh to recover from such an impossibility, while in the case of an unbounded number of concurrent changes nothing can be guaranteed anyway.
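A sketch of the non-atomic population being described (hypothetical cache struct, POSIX calls), showing where the "impossible" combination comes from:

#include <sys/stat.h>
#include <string>

struct entry_cache {       // hypothetical
  bool is_symlink   = false;
  bool is_directory = false;
};

bool refresh_nonatomic(const std::string& p, entry_cache& c) {
  struct stat lst, st;
  if (::lstat(p.c_str(), &lst) != 0)  // call #1: examines the link itself
    return false;
  c.is_symlink = S_ISLNK(lst.st_mode);

  // <-- if a symlink-to-a-file is swapped for a real directory right here,
  //     the cache ends up reporting is_symlink && is_directory, a state no
  //     single file in that directory ever had.

  if (::stat(p.c_str(), &st) != 0)    // call #2: follows symlinks
    return false;
  c.is_directory = S_ISDIR(st.st_mode);
  return true;
}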

Afterwards, a better and safer proposal can be put forward which provides efficient and atomic access to multiple attributes for the same entity.

It's hard to standardize atomic access to properties that are not atomically accessible everywhere. But one could more easily entertain separate accessors for cached values (which return "unknown" as appropriate), so that the decision to use them (where available) falls to the user.
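Roughly what such accessors might look like (hypothetical names, just to illustrate the shape): they report only what is already cached and return nullopt otherwise, so the decision to rely on a possibly-stale value is the user's.

#include <cstdint>
#include <optional>

enum class file_type_hint { regular, directory, symlink, other };

class cached_entry {  // hypothetical
public:
  std::optional<file_type_hint> cached_type() const noexcept { return type_; }
  std::optional<std::uintmax_t> cached_size() const noexcept { return size_; }

private:
  std::optional<file_type_hint> type_;  // set only if iteration happened
  std::optional<std::uintmax_t> size_;  // to provide the value
};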

Davis

There is a long history of intentionally making security bugs ‘worse’ to improve security. Many memory safety techniques, for example, turn a possibly-benign buffer overflow into a segmentation violation. Now, rather than silent memory corruption that might not be a problem in the absence of a malicious adversary, you get an obvious bug that you have to fix.

The filesystem API seems poorly designed on many levels, but turning a security vulnerability that is only triggered in the presence of an attacker into a bug that is likely to affect normal users seems like a good model in general. The more likely a bug is to occur, the more likely it is to be fixed.

David

>>
>> Does anyone have an example of something that is secure without caching
>> that is insecure with caching? All of the examples posted as
>> problematic are
>> still insecure.
>
>
> +1. By my reading, different caching decisions can only make TOCTOU
> problems
> more or less acute; caching can't introduce TOCTOU problems that don't
> already exist.

That is my belief as well. However, I don't find the argument to be
persuasive, so it's a -1 from me.

> From that point of view, caching could actually be helpful,
> by making the TOCTOU problems more obvious.

Caching is hidden from the user's view, and I fail to see how
something hidden from the user's view will make anything more
obvious. What caching does do is extend the length of time over which
TOCTOU bugs can occur, which is worse behavior from a security
perspective. Users will definitely introduce exciting TOCTOU bugs into
their code with or without the caching mechanism, but I'm worried
about worsening the effects in practice.

Generally speaking, I would expect "worsening the effects" of the bugs to
make those bugs more obvious.

I don't think that applies to cases that are timing based. You have to
know to be actively testing that scenario to notice the worsening
effects, at which point you'd likely have already fixed the TOCTOU.

> The only change we could
> meaningfully make here is to give users some conditions under which
> non-modifying operations on the same file are guaranteed to reflect a
> single
> self-consistent state of the file (which may or may not be the present
> state), but the benefit of that seems minimal.
>
> I think this ship sailed when we decided to standardize a TOCTOU-prone
> API;
> all we can do now is try to educate users about what not to use
> std::filesystem for (e.g. anything remotely related to security).

"This doesn't introduce new security issues, it just makes existing
ones worse" is not an argument that leaves me with warm, fuzzy
feelings. The goal is obviously to prevent the TOCTOU bugs in the
first place, but if we cannot achieve that, we shouldn't exacerbate
the TOCTOU problems only because we can't achieve the ideal.

That's not so obvious to me. If exacerbating the TOCTOU problems makes
people more aware of those problems, that could well be a net win.

How is it making more people aware of the problem, though? Silently
exacerbating the issue doesn't mean people are going to become aware
of it when they previously were not, unless you consider "there's a
new CVE out today" being the kind of education that raises awareness,
but I'd hardly call that a net win.

Conversely, to the extent that the caching design makes it easier to use
std::filesystem in security-sensitive contexts, that is likely to be a net
loss, because std::filesystem simply should not be used in
security-sensitive contexts.

Last time I looked into it, Clang and LLVM had quite a few TOCTOU
bugs. No one was compelled to fix them because hey, we're just a
compiler, so who cares -- it's not a security-sensitive context. Well,
until people started running it on web servers... (It's been a number
of years since I looked, and we may have improved this since clangd
came along, but the example still holds.)

The trouble is that "security-sensitive context" is a moving target
for a whole lot of code.

Keep in mind, security exploits are generally not statistical in nature;
halving the time window for a TOCTOU attack does not halve your
vulnerability, because the components of the attack are not occurring at
random; they're being induced by an attacker. Shortening the time window may
have some security benefit on the margins in terms of increasing the effort
required for a successful exploit, but that's highly contingent and
uncertain. I expect the security benefits of eliminating a TOCTOU
vulnerability (e.g. by using a safer filesystem API) to dwarf the benefits
of almost any reduction in the vulnerability time window.

We're in agreement here that eliminating the TOCTOU is the only way to
eliminate the vulnerability. However, much of security is about taking
many small steps to mitigate security issues and reduce attack vectors
as part of defense-in-depth.

I guess my point is: let's not be quick to say "well, we can't solve
this to the ideal, so we can do whatever we want in the name of
performance." As history has unkindly demonstrated, that way lies many
years of CVEs. I think a lit review of TOCTOU papers might be
interesting to see if anyone has explored the ramifications of a
widened time window. I'll try to do that when I have a spare moment,
but if anyone gets to it before me, I'd love to hear the results.

~Aaron

Does anyone have an example of something that is secure without caching
that is insecure with caching? All of the examples posted as
problematic are
still insecure.

+1. By my reading, different caching decisions can only make TOCTOU
problems
more or less acute; caching can’t introduce TOCTOU problems that don’t
already exist.

That is my belief as well. However, I don’t find the argument to be
persuasive, so it’s a -1 from me.

From that point of view, caching could actually be helpful,
by making the TOCTOU problems more obvious.

Caching is hidden from the user’s view, and I fail to see how
something hidden from the user’s view will make anything more
obvious. What caching does do is extend the length of time over which
TOCTOU bugs can occur, which is worse behavior from a security
perspective. Users will definitely introduce exciting TOCTOU bugs into
their code with or without the caching mechanism, but I’m worried
about worsening the effects in practice.

Generally speaking, I would expect “worsening the effects” of the bugs to
make those bugs more obvious.

I don’t think that applies to cases that are timing based. You have to
know to be actively testing that scenario to notice the worsening
effects, at which point you’d likely have already fixed the TOCTOU.

I agree that “more obvious” may still not be obvious enough.

The only change we could
meaningfully make here is to give users some conditions under which
non-modifying operations on the same file are guaranteed to reflect a
single
self-consistent state of the file (which may or may not be the present
state), but the benefit of that seems minimal.

I think this ship sailed when we decided to standardize a TOCTOU-prone
API;
all we can do now is try to educate users about what not to use
std::filesystem for (e.g. anything remotely related to security).

“This doesn’t introduce new security issues, it just makes existing
ones worse” is not an argument that leaves me with warm, fuzzy
feelings. The goal is obviously to prevent the TOCTOU bugs in the
first place, but if we cannot achieve that, we shouldn’t exacerbate
the TOCTOU problems only because we can’t achieve the ideal.

That’s not so obvious to me. If exacerbating the TOCTOU problems makes
people more aware of those problems, that could well be a net win.

How is it making more people aware of the problem, though? Silently
exacerbating the issue doesn’t mean people are going to become aware
of it when they previously were not, unless you consider “there’s a
new CVE out today” being the kind of education that raises awareness,
but I’d hardly call that a net win.

If the problems are exacerbated to the point that they’re visible even in non-attack scenarios (which Eric seemed to be concerned about), that could make people more aware of the problem.

Conversely, to the extent that the caching design makes it easier to use
std::filesystem in security-sensitive contexts, that is likely to be a net
loss, because std::filesystem simply should not be used in
security-sensitive contexts.

Last time I looked into it, Clang and LLVM had quite a few TOCTOU
bugs. No one was compelled to fix them because hey, we’re just a
compiler, so who cares – it’s not a security-sensitive context. Well,
until people started running it on web servers… (It’s been a number
of years since I looked, and we may have improved this since clangd
came along, but the example still holds.)

The trouble is that “security-sensitive context” is a moving target
for a whole lot of code.

Personally, I agree, but WG21’s decision to standardize std::filesystem reflects a judgement that programmers can be expected to distinguish contexts where TOCTOU is an issue from contexts where it is not.

Keep in mind, security exploits are generally not statistical in nature;
halving the time window for a TOCTOU attack does not halve your
vulnerability, because the components of the attack are not occurring at
random; they’re being induced by an attacker. Shortening the time window may
have some security benefit on the margins in terms of increasing the effort
required for a successful exploit, but that’s highly contingent and
uncertain. I expect the security benefits of eliminating a TOCTOU
vulnerability (e.g. by using a safer filesystem API) to dwarf the benefits
of almost any reduction in the vulnerability time window.

We’re in agreement here that eliminating the TOCTOU is the only way to
eliminate the vulnerability. However, much of security is about taking
many small steps to mitigate security issues and reduce attack vectors
as part of defense-in-depth.

I guess my point is: let’s not be quick to say “well, we can’t solve
this to the ideal, so we can do whatever we want in the name of
performance.”

I’m not saying we can’t “solve this to the ideal”, I’m saying we can’t “solve” it at all; at best we can apply mitigations that may reduce the vulnerability by some degree that’s unknown, but probably small.

As history has unkindly demonstrated, that way lies many
years of CVEs.

What history do you have in mind? My sense is that “let’s patch up this fundamentally vulnerable API to try to make it a little harder to attack” has also been a path to many years of CVEs.

I think a lit review of TOCTOU papers might be
interesting to see if anyone has explored the ramifications of a
widened time window. I’ll try to do that when I have a spare moment,
but if anyone gets to it before me, I’d love to hear the results.

That makes sense, although I’d be mildly surprised if there was enough literature to settle the question. Ideally we’d consult with some security experts about whether this is worth addressing.

We’re in agreement here that eliminating the TOCTOU is the only way to
eliminate the vulnerability. However, much of security is about taking
many small steps to mitigate security issues and reduce attack vectors
as part of defense-in-depth.

Defense in depth is about taking a system that you think is secure, and building additional barriers such that compromise of one part of that system does not compromise the machine. Defense in depth is not a thing when the original thing is already full of holes. There is no defense-in-depth measure you can apply to an API like the one that exists here to avoid this problem.

Billy3