Some DTrace probes for measuring per-file time.

Thanks for the dtrace script. It's an important use case for our
upcoming bpf+tracing infrastructure in the Linux kernel.
Looks like we're still missing a vtimestamp equivalent (which should
be easy to add), and strings are not well supported yet.
The rest of the features required to make your script work are there.

Linux also doesn't have USDT support yet, though a few patches
were proposed. Looks like in this case, if clang is compiled with
debug info, regular uprobes will work just fine.
I'm not worried about the locking problem. I think the uprobe
architecture doesn't have it, but I will double check.
Thanks for all the pointers!

> Attached you'll find a patch for clang which adds some USDT probes, along
> with a DTrace script for measuring total time spent in each header file
> using those probes (there are also "phony" files which track a handful of
> other things). This script measures time on CPU (DTrace's vtimestamp), so
> time blocked on IO (which is hidden by the parallelism of the build) does
> not come into play.

Yeah, if you implement USDT on top of uprobes, then you should avoid the
issue altogether. One advantage of USDT is that the pages in memory are
only instrumented for the process itself (and only if it is actually being
traced). But that is a disadvantage in this case, when the goal is to
trace all instances of the same executable.

The locking issue is just an implementation detail, but it shows the added
complexity cost of the more general interface. In the script you will
notice that all the probes are `clangpp*:::....`. That `*` is actually a
globbing character. The actual probes are `clangpp12345:::.....` where
12345 is the pid being traced. There is a special hack inside DTrace in the
kernel that recognizes this sort of globbing and ensures that whenever any
program that has `clangpp` probes starts up, it is intercepted and the
probes are enabled (the locking comes into play along this code path). So
as far as the core of DTrace is concerned, there are individual probes for
each process.
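
To make the matching concrete, here is a rough Python model of it. This is
only a sketch, not DTrace's implementation: the probe names and pids are
made up, and it assumes every description is written with all four
`provider:module:function:name` fields, where an empty field matches
anything.

```python
from fnmatch import fnmatchcase


def matches(pattern, probe):
    """Return True if a 4-field probe pattern selects a 4-field probe name."""
    pattern_fields = pattern.split(":")
    probe_fields = probe.split(":")
    if len(pattern_fields) != 4 or len(probe_fields) != 4:
        raise ValueError("expected provider:module:function:name")
    # An empty pattern field acts as a wildcard; otherwise glob-match the field.
    return all(
        pat == "" or fnmatchcase(field, pat)
        for pat, field in zip(pattern_fields, probe_fields)
    )


# Hypothetical per-process probes, one provider per pid.
probes = [
    "clangpp12345:clang:lex:enter_file",
    "clangpp12346:clang:lex:enter_file",
    "somethingd777:somethingd:main:request_start",
]

# The glob in the provider field picks up every clangpp process at once.
selected = [p for p in probes if matches("clangpp*:::enter_file", p)]
```

The point is that the glob is resolved against a separate provider per
process, so each new `clangpp` process adds new entries that have to be
matched and enabled.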

I think that the way this works is a product of history and oriented
towards tracing long-lived server processes (which makes sense on Solaris).
E.g. you say `somethingd12345` to trace the long-lived process `somethingd`
with pid 12345. I believe that this was extended to support the globbing
with the intention of keeping an eye on "future server invocations", of
which there should be relatively few (hence the implementation that grabs a
global lock). This turns out to be a different use case from tracing
compiler invocations, which are a large number of short-lived processes
whose pids are never known ahead of time.

Uprobes (as far as I understand) take a different approach where the
instrumentation happens by binary and is global to the system. This is a
really good design for tracing compiler invocations, but is suboptimal for
long-lived server processes. E.g. if you instrument malloc in libc.so, my
understanding is that with uprobes every process in the system is now
trapping when entering malloc (you can filter it out, but ideally you would
like to have *zero* overhead for processes not being traced).
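
As a toy illustration of that trade-off, here is a small Python model that
compares how many calls trap under per-binary versus per-process
instrumentation. It is not how uprobes or fasttrap are implemented; the
pids, the binary name, and the `count_traps` helper are all made up.

```python
def count_traps(calls, probed_binary, traced_pids, per_process):
    """Count (traps taken, events recorded) for calls into a probed function.

    calls is a list of (pid, binary) pairs, one per call into the function.
    """
    traps = 0
    recorded = 0
    for pid, binary in calls:
        if binary != probed_binary:
            continue
        wanted = pid in traced_pids
        # Per-process: only traced processes are patched, so only they trap.
        # Per-binary: every process mapping the binary traps; a filter then
        # discards the events nobody asked for.
        if wanted or not per_process:
            traps += 1
        if wanted:
            recorded += 1
    return traps, recorded


# Three processes call into the probed function; only pid 200 is traced.
calls = [(100, "libc.so"), (200, "libc.so"), (200, "libc.so"), (300, "libc.so")]
per_binary = count_traps(calls, "libc.so", {200}, per_process=False)
per_proc = count_traps(calls, "libc.so", {200}, per_process=True)
```

Both modes record the same events; the difference is the extra traps paid
by the untraced processes in the per-binary case.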

The fundamental model of dtrace is that there is a list of probes `a:b:c:d`
in the kernel and that they can be enabled/disabled. Unfortunately this
conceptual model doesn't fit some real use cases very well. The first place
where issues crop up is the pid provider, e.g. `pid12345::malloc:entry`.
The way this is handled is that the userland dtrace(1M) tool actually will
grab pid 12345 using a debugger-like API and walk its symbols; when it
finds malloc, it will then do an ioctl telling the kernel to *create*
probes at the address of malloc in pid 12345. With the probe created,
dtrace(1M) then tells the kernel to enable that probe, just like it would a
regular probe. The second place where issues crop up is the situation I
described above, where you have a glob on pid numbers (under the hood, USDT
and the pid provider are implemented on top of "fasttrap", which is the one
and only "meta provider":
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/sys/dtrace.h#2090
); in this case, the kernel itself and the dynamic linker conspire to
create the probes as necessary for the current process.
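
Roughly, the create-then-enable flow for the pid provider looks like the
following Python model. It is a sketch of the concept only: `ProbeTable`,
`trace_function`, and the symbol table are hypothetical stand-ins, not the
real dtrace(1M) or kernel interfaces.

```python
class ProbeTable:
    """Stand-in for the kernel's list of probes, keyed by the 4-tuple."""

    def __init__(self):
        self._enabled = {}  # (provider, module, function, name) -> bool

    def create(self, provider, module, function, name):
        key = (provider, module, function, name)
        self._enabled.setdefault(key, False)  # new probes start disabled
        return key

    def enable(self, key):
        if key not in self._enabled:
            raise KeyError("no such probe: " + ":".join(key))
        self._enabled[key] = True

    def is_enabled(self, key):
        return self._enabled.get(key, False)


def trace_function(table, pid, symbols, function):
    """Mimic the userland tool: find the symbol, create the probe, enable it."""
    module, _address = symbols[function]  # KeyError if the symbol is missing
    key = table.create("pid%d" % pid, module, function, "entry")
    table.enable(key)
    return key


table = ProbeTable()
symbols = {"malloc": ("libc.so.1", 0x1000)}  # made-up module and address
key = trace_function(table, 12345, symbols, "malloc")
```

Enabling a probe that was never created fails, which is the mismatch with
the "fixed list of probes" model: for the pid provider the list has to be
extended on demand before anything can be enabled.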

Anyway... hopefully some of that will be helpful or provide some ideas.

-- Sean Silva