FW: RFC: YAML as an intermediate format for clang::tooling::Replacement data on disk

Adding Manuel.

+djasper

Hi all,

This discussion began on cfe-commits as the result of a commit (Tareq's poor header replacement patch that keeps getting reverted due to Windows build-bot issues). The start of the thread is here if you want background info: http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20130729/084881.html.

The proposal: The C++11 Migrator needs to write Replacement data (offset, length, and replacement text) to disk. The replacement data describes changes made to a header while transforming one or more TUs. After an entire code base has been transformed, a separate tool would gather up all the replacement data and merge it together to produce the actual changes to headers. So the point is to serialize Replacement data as a form of inter-process communication, using the file system as the communication link. Real inter-process communication is a possibility but not portable.
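
As a rough illustration (not part of the original RFC; the struct, the YAML key names, and the dumpReplacements helper are made up), serializing such records with LLVM's YAML I/O could look something like this:

// A sketch only: one possible mapping of a Replacement-like record to YAML.
#include "llvm/Support/YAMLTraits.h"
#include "llvm/Support/raw_ostream.h"
#include <string>
#include <vector>

struct SerializedReplacement {
  std::string FilePath;        // header the edit applies to
  unsigned Offset;             // byte offset into that file
  unsigned Length;             // number of bytes being replaced
  std::string ReplacementText; // text to insert in their place
};

namespace llvm {
namespace yaml {
template <> struct MappingTraits<SerializedReplacement> {
  static void mapping(IO &Io, SerializedReplacement &R) {
    Io.mapRequired("FilePath", R.FilePath);
    Io.mapRequired("Offset", R.Offset);
    Io.mapRequired("Length", R.Length);
    Io.mapRequired("ReplacementText", R.ReplacementText);
  }
};
} // namespace yaml
} // namespace llvm

LLVM_YAML_IS_SEQUENCE_VECTOR(SerializedReplacement)

// Each migrator invocation could dump the edits it made to headers; a later
// merge step would read all such files back and combine them.
void dumpReplacements(std::vector<SerializedReplacement> &Edits) {
  llvm::yaml::Output Out(llvm::outs());
  Out << Edits;
}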

I have to wonder whether it's not easier to just ensure that headers are only transformed once.

I understand there's the issue of deciding what compiler flags to use when processing a header. My thoughts on that:
* For some projects, there aren't any per-file compiler flags, so it would be sufficient to just pass a general set of flags to the tool on the command line (e.g., with made up parameter syntax, something like 'cpp11-migrate *.h --compile-flags="-I../include -DFOO"'…)
* For other projects, a simple heuristic of matching "foo.{cpp,cc,cxx,…}" to "foo.{h,hh,hpp,…}" might be enough (lots of details to sort out here, like how to specify the directory structure, but hopefully you get the idea)
* For more complicated cases, one could add (whether manually or using some tool) entries to the compilation database for header files

With that in mind, why not treat header files like source files and process them separately?
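
As a rough sketch of the second and third bullets above (the guessFlagsForHeader helper and its lookup strategy are made up for illustration; a real tool would also need a fallback to user-provided flags):

#include "clang/Tooling/CompilationDatabase.h"
#include "llvm/ADT/SmallString.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/FileSystem.h"
#include "llvm/Support/Path.h"
#include <vector>

// Try to reuse the compile command of a sibling source file for a header.
std::vector<clang::tooling::CompileCommand>
guessFlagsForHeader(const clang::tooling::CompilationDatabase &DB,
                    llvm::StringRef HeaderPath) {
  for (const char *Ext : {".cpp", ".cc", ".cxx"}) {
    llvm::SmallString<128> Sibling(HeaderPath);
    llvm::sys::path::replace_extension(Sibling, Ext);
    if (llvm::sys::fs::exists(Sibling))
      return DB.getCompileCommands(Sibling);
  }
  // No sibling found: fall back to flags given on the command line, or to a
  // manually added compilation database entry for the header itself.
  return {};
}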

If the issue is parallel compilation, deferring the replacements makes perfect sense as a way to resolve any read-write conflicts (transforming one header while it's being parsed as part of another TU). However, if you ensure that a header isn't touched by multiple transformations, and generally ensure that transformations don't clobber each other by design, there's no need to merge anything.

Personally I would even accept a slightly more limited set of transformations in exchange for never having to worry about merging.

Stefanus

From: Vane, Edwin
Sent: Wednesday, July 31, 2013 11:40 AM
To: Clang Dev List (cfe-dev@cs.uiuc.edu<mailto:cfe-dev@cs.uiuc.edu>)
Subject: RFC: YAML as an intermediate format for
clang::tooling::Replacement data on disk

Hi all,

This discussion began on cfe-commits as the result of a commit (Tareq's
poor header replacement patch that keeps getting reverted due to Windows
build-bot issues). The start of the thread is here if you want background
info:
http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20130729/084881.html.

The proposal: The C++11 Migrator has a need to write Replacement data:
offset, length, and replacement text, to disk. The replacement data
describes changes made to a header while transforming one or more TU's. All
the replacement data would be gathered up after an entire code-base is
transformed by a separate tool and merged together to produce actual
changes to headers. So the point is to serialize Replacement data as a form
of inter-process communication using the file system as the communication
link. Real inter-process communication is a possibility but not portable.

I have to wonder whether it's not easier to just ensure that headers are
only transformed once.

I understand there's the issue of deciding what compiler flags to use when
processing a header. My thoughts on that:
* For some projects, there aren't any per-file compiler flags, so it
would be sufficient to just pass a general set of flags to the tool on the
command line (e.g., with made up parameter syntax, something like
'cpp11-migrate *.h --compile-flags="-I../include -DFOO"'…)
* For other projects, a simple heuristic of matching "foo.{cpp,cc,cxx,…}"
to "foo.{h,hh,hpp,…}" might be enough (lots of details to sort out here,
like how to specify the directory structure, but hopefully you get the idea)
* For more complicated cases, one could add (whether manually or using
some tool) entries to the compilation database for header files

With that in mind, why not treat header files like source files and
process them separately?

How do you propose to treat template instantiations?

For example:
a.h:
template <typename T> class A { public: void x(T t) { t.y(); } };

x.cc:
A<C> a; a.x(C());

Imagine we want to change C::y -> C::z. Now depending on which types A is
instantiated with, it might be totally safe to refactor t.y() in A or not.
So there needs to be a postprocessing step that figures that out anyway.
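
To illustrate with a hypothetical second TU (not from the thread):

// a.h (as above): the body of A<T>::x() calls t.y().
template <typename T> class A { public: void x(T t) { t.y(); } };

// x.cc: here T is C, so t.y() really is C::y and should become t.z()
// after the rename.
struct C { void y() {} void z() {} };
void f() { A<C> a; a.x(C()); }

// y.cc (hypothetical): A is also instantiated with D, whose y() is an
// unrelated member that must keep its name. Rewriting t.y() in a.h
// unconditionally would break this TU, so some step has to look at the
// full set of instantiations before touching the header.
struct D { void y() {} };
void g() { A<D> a; a.x(D()); }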

If the issue is parallel compilation, deferring the replacements makes
perfect sense as a way to resolve any read-write conflicts (transforming
one header while it's being parsed as part of another TU). However, if you
ensure that a header isn't touched by multiple transformations, and
generally ensure that transformations don't clobber each other by design,
there's no need to merge anything.

Personally I would even accept a slightly more limited set of
transformations in exchange for never having to worry about merging.

"Merging" is usually merely deduplication, which is not hard. I have the
feeling that you think there's lots of complexity where it isn't. I'd
definitely say that the heuristics you propose in order to be able to
process headers on their own are much higher than the issue of
deduplicating edits.

Cheers,
/Manuel

> From: Vane, Edwin
> Sent: Wednesday, July 31, 2013 11:40 AM
>
> To: Clang Dev List (cfe-dev@cs.uiuc.edu<mailto:cfe-dev@cs.uiuc.edu>)
> Subject: RFC: YAML as an intermediate format for
clang::tooling::Replacement data on disk
>
> > Hi all,
> >
> > This discussion began on cfe-commits as the result of a commit
(Tareq's poor header replacement patch that keeps getting reverted due to
Windows build-bot issues). The start of the thread is here if you want
background info:
> > http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20130729/084881.html
> >
> > The proposal: The C++11 Migrator has a need to write Replacement
data: offset, length, and replacement text, to disk. The replacement data
describes changes made to a header while transforming one or more TU's.
All the replacement data would be gathered up
> > after an entire code-base is transformed by a separate tool and
merged together to produce actual changes to headers. So the point is to
serialize Replacement data as a form of inter-process communication using
the file system as the communication link. Real
> > inter-process communication is a possibility but not portable.

> I have to wonder whether it's not easier to just ensure that headers
are only transformed once.
>
> I understand there's the issue of deciding what compiler flags to use
when processing a header. My thoughts on that:
> * For some projects, there aren't any per-file compiler flags, so it
would be sufficient to just pass a general set of flags to the tool on
the command line (e.g., with made up parameter syntax, something like
'cpp11-migrate *.h --compile-flags="-I../include -DFOO"'…)
> * For other projects, a simple heuristic of matching
"foo.{cpp,cc,cxx,Š}" to "foo.{h,hh,hpp,Š}" might be enough (lots of
details to sort out here, like how to specify the directory structure,
but hopefully you get the idea)
> * For more complicated cases, one could add (whether manually or using
some tool) entries to the compilation database for header files
>
> With that in mind, why not treat header files like source files and
process them separately?

How do you propose to treat template instantiations?

For example:
a.h:
template <typename T> class A { void x(T t) { t.y(); }}

x.cc:
A<C> a; a.x();

Imagine we want to change C::y -> C::z. Now depending on which types A
is instantiated with, it might be totally safe to refactor t.y() in A or
not. So there needs to be a postprocessing step that figures that out
anyway.

Given a class like:

template<typename T>
class MyVector {
  MyVector() : m_begin(0), m_end(0) {}
  MyVector(std::size_t size) : m_begin(new T[size]), m_end(m_begin + size)
{}

  // ...

private:
  T* m_begin;
  T* m_end;
};

I would love for cpp11-migrate to be able to turn that into something like:

template<typename T>
class MyVector {
  MyVector() = default;
  MyVector(std::size_t size) : m_begin(new T[size]), m_end(m_begin + size)
{}

  // ...

private:
  T* m_begin = nullptr;
  T* m_end = nullptr;
};

And I contend that it doesn't need to know what MyVector could be
instantiated with in order to do that.

Now, I totally understand that I may be asking for something very
difficult to do in Clang today. In which case, I'll accept whatever
limitations there are. But I don't think that assuming we'll see the full
set of instantiations is a great solution either.

> If the issue is parallel compilation, deferring the replacements makes
perfect sense as a way to resolve any read-write conflicts (transforming
one header while it's being parsed as part of another TU). However, if
you ensure that a header isn't touched by
> multiple transformations, and generally ensure that transformations
don't clobber each other by design, there's no need to merge anything.
>
> Personally I would even accept a slightly more limited set of
transformations in exchange for never having to worry about merging.

"Merging" is usually merely deduplication, which is not hard. I have the
feeling that you think there's lots of complexity where it isn't. I'd
definitely say that the heuristics you propose in order to be able to
process headers on their own are much higher
than the issue of deduplicating edits.

If it's just deduplication, it's no big deal - totally agreed.

I may have been thrown off by this: https://github.com/revane/migmerge_git

That tool seems to (aim to) handle a lot more than deduplication.

More importantly: I certainly hope it's not going to be necessary to
download a separate "merge tool" in order to use cpp11-migrate.

To me, cpp11-migrate is the kind of tool I'm just not going to use if it's
not dead simple to use. As a user, I don't really care how it works
internally, and I don't want to care :). I _am_ willing to tell it (a
reasonable amount of) things about my code base that it can't easily infer
itself.

Stefanus

From: Manuel Klimek <klimek@google.com>:
>
>
> > From: Vane, Edwin
> > Sent: Wednesday, July 31, 2013 11:40 AM
> >
> > To: Clang Dev List (cfe-dev@cs.uiuc.edu<mailto:cfe-dev@cs.uiuc.edu>)
> > Subject: RFC: YAML as an intermediate format for
>clang::tooling::Replacement data on disk
> >
> > > Hi all,
> > >
> > > This discussion began on cfe-commits as the result of a commit
>(Tareq's poor header replacement patch that keeps getting reverted due to
>Windows build-bot issues). The start of the thread is here if you want
>background info:
> > >
> > > http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20130729/084881.html
> > >
> > > The proposal: The C++11 Migrator has a need to write Replacement
>data: offset, length, and replacement text, to disk. The replacement data
>describes changes made to a header while transforming one or more TU's.
>All the replacement data would be gathered up
> > > after an entire code-base is transformed by a separate tool and
>merged together to produce actual changes to headers. So the point is to
>serialize Replacement data as a form of inter-process communication using
>the file system as the communication link. Real
> > > inter-process communication is a possibility but not portable.
>
> > I have to wonder whether it's not easier to just ensure that headers
>are only transformed once.
> >
> > I understand there's the issue of deciding what compiler flags to use
>when processing a header. My thoughts on that:
> > * For some projects, there aren't any per-file compiler flags, so it
>would be sufficient to just pass a general set of flags to the tool on
>the command line (e.g., with made up parameter syntax, something like
>'cpp11-migrate *.h --compile-flags="-I../include -DFOO"'…)
> > * For other projects, a simple heuristic of matching
>"foo.{cpp,cc,cxx,Š}" to "foo.{h,hh,hpp,Š}" might be enough (lots of
>details to sort out here, like how to specify the directory structure,
>but hopefully you get the idea)
> > * For more complicated cases, one could add (whether manually or using
>some tool) entries to the compilation database for header files
> >
> > With that in mind, why not treat header files like source files and
>process them separately?
>
> How do you propose to treat template instantiations?
>
> For example:
> a.h:
> template <typename T> class A { void x(T t) { t.y(); }}
>
> x.cc:
> A<C> a; a.x();
>
> Imagine we want to change C::y -> C::z. Now depending on which types A
>is instantiated with, it might be totally safe to refactor t.y() in A or
>not. So there needs to be a postprocessing step that figures that out
>anyway.

Given a class like:

template<typename T>
class MyVector {
  MyVector() : m_begin(0), m_end(0) {}
  MyVector(std::size_t size) : m_begin(new T[size]), m_end(m_begin + size)
{}

  // ...

private:
  T* m_begin;
  T* m_end;
};

I would love for cpp11-migrate to be able to turn that into something like:

template<typename T>
class MyVector {
  MyVector() = default;
  MyVector(std::size_t size) : m_begin(new T[size]), m_end(m_begin + size)
{}

  // ...

private:
  T* m_begin = nullptr;
  T* m_end = nullptr;
};

And I contend that it doesn't need to know what MyVector could be
instantiated with in order to do that.

I'm not sure what an example of a change that doesn't need to know the
instantiations has to do with it. I fully agree that those exist :). I'm
merely proposing that the others exist, too.

Now, I totally understand that I may be asking for something very
difficult to do in Clang today. In which case, I'll accept whatever
limitations there are. But I don't think that assuming we'll see the full
set of instantiations is a great solution either.

I think it'll be a necessary solution for many refactoring tools.

> > If the issue is parallel compilation, deferring the replacements makes
>perfect sense as a way to resolve any read-write conflicts (transforming
>one header while it's being parsed as part of another TU). However, if
>you ensure that a header isn't touched by
> > multiple transformations, and generally ensure that transformations
>don't clobber each other by design, there's no need to merge anything.
> >
> > Personally I would even accept a slightly more limited set of
>transformations in exchange for never having to worry about merging.
>
>
> "Merging" is usually merely deduplication, which is not hard. I have the
>feeling that you think there's lots of complexity where it isn't. I'd
>definitely say that the heuristics you propose in order to be able to
>process headers on their own are much higher
> than the issue of deduplicating edits.

If it's just deduplication, it's no big deal - totally agreed.

I may have been thrown off by this: https://github.com/revane/migmerge_git

That tool seems to (aim to) handle a lot more than deduplication.

More importantly: I certainly hope it's not going to be necessary to
download a separate "merge tool" in order to use cpp11-migrate.

I actually agree that it makes sense to have cpp11-migrate work on single
TUs (or multiple TUs) without the need for an extra tool, by just running
in process over all the TUs and applying all the changes.

To me, cpp11-migrate is the kind of tool I'm just not going to use if it's
not dead simple to use. As a user, I don't really care how it works
internally, and I don't want to care :). I _am_ willing to tell it (a
reasonable amount of) things about my code base that it can't easily infer
itself.

cpp11-migrate is something that I'd expect most users to run *once* in
their life, over a whole code-base. I don't see that the extra overhead of
running a special tool would be too much effort. I also think it'd be great
if it still worked without the extra tool (as noted above), but I don't
think it's a necessity.

Cheers,
/Manuel


To me, cpp11-migrate is the kind of tool I'm just not going to use if
it's
not dead simple to use. As a user, I don't really care how it works
internally, and I don't want to care :). I _am_ willing to tell it (a
reasonable amount of) things about my code base that it can't easily
infer
itself.

cpp11-migrate is something that I'd expect most users to run *once* in
their
life, over a whole code-base. I don't see that the extra overhead of
running
a special tool would be too much effort. I also think it'd be great if it
still worked without the extra tool (as noted above), but I don't think
it's
a necessity.

I think that's precisely why the cost of getting started needs to be
really low. I'm not going to invest in a lot of up-front "learning" for a
tool I'm likely to use once. I want to type a command and see the results.
Get me hooked, then I'll start playing with parameters :)

Stefanus

From: Manuel Klimek [mailto:klimek@google.com]
Sent: Wednesday, July 31, 2013 12:46 PM
To: Du Toit, Stefanus
Cc: Vane, Edwin; Daniel Jasper; Clang Dev List (cfe-dev@cs.uiuc.edu)
Subject: Re: [cfe-dev] FW: RFC: YAML as an intermediate format for
clang::tooling::Replacement data on disk

...

"Merging" is usually merely deduplication, which is not hard. I have the feeling
that you think there's lots of complexity where it isn't. I'd definitely say that the
heuristics you propose in order to be able to process headers on their own are
much higher than the issue of deduplicating edits.

Is it? Is it not possible for conflicts to occur? If so, we must detect these conflicts for the user to sort out. Admittedly, migmerge_git approaches the merge problem via contextual diffs, but it's the difficulty of detecting conflicts that made that approach so appealing in the first place.
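
For what it's worth, a sketch of the kind of conflict check being discussed (Edit is a stand-in for clang::tooling::Replacement and the function name is made up): identical edits are dropped as duplicates, and any remaining edits to the same file whose ranges overlap are flagged for the user.

#include <algorithm>
#include <cstddef>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

struct Edit {
  std::string FilePath;
  unsigned Offset;
  unsigned Length;
  std::string Text;
};

// Assumes all edits in the vector target the same file.
std::vector<std::pair<Edit, Edit>> findConflicts(std::vector<Edit> Edits) {
  std::sort(Edits.begin(), Edits.end(), [](const Edit &A, const Edit &B) {
    return std::tie(A.Offset, A.Length, A.Text) <
           std::tie(B.Offset, B.Length, B.Text);
  });
  // Exact duplicates (the common case when several TUs produce the same
  // header edit) are simply dropped.
  Edits.erase(std::unique(Edits.begin(), Edits.end(),
                          [](const Edit &A, const Edit &B) {
                            return A.Offset == B.Offset &&
                                   A.Length == B.Length && A.Text == B.Text;
                          }),
              Edits.end());
  // Whatever remains and still overlaps is a genuine conflict to report.
  std::vector<std::pair<Edit, Edit>> Conflicts;
  for (std::size_t I = 1; I < Edits.size(); ++I)
    if (Edits[I].Offset < Edits[I - 1].Offset + Edits[I - 1].Length)
      Conflicts.push_back({Edits[I - 1], Edits[I]});
  return Conflicts;
}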

From: Manuel Klimek [mailto:klimek@google.com]
Sent: Wednesday, July 31, 2013 1:28 PM
To: Du Toit, Stefanus
Cc: Vane, Edwin; Daniel Jasper; Clang Dev List (cfe-dev@cs.uiuc.edu)
Subject: Re: [cfe-dev] FW: RFC: YAML as an intermediate format for
clang::tooling::Replacement data on disk

...

cpp11-migrate is something that I'd expect most users to run *once* in their life,
over a whole code-base. I don't see that the extra overhead of running a special
tool would be too much effort. I also think it'd be great if it still worked without
the extra tool (as noted above), but I don't think it's a necessity.

A single-command solution is pretty easy to achieve with something like the clang driver model: we get multiple TUs transformed and then finish with a post-processing step. I think the two-command solution is a reasonable first step and, as you point out, not exactly taxing for the user.

Just to clarify an important design constraint from my perspective:

It shouldn't be just any tool to aggregate, deduplicate, and apply all the
changes. It should specifically be the *same* code path that a tool uses
when it runs over all TUs in-process. This is to me *really* important to
ensure we get consistent behavior across these two possible workflows:

1) Run tool X over code in a.cc, b.cc, and c.cc in a single invocation.
Internally hands rewrites to a library in Clang that de-dups and applies
the edits.

2) Run tool X over code in a.cc, b.cc, and c.cc; one invocation per file.
Each run writes out its edits in some form. Run tool Y which reads these
edits and hands them to a library in Clang that de-dups and applies the
edits.

So, I essentially think that it *can't* be a separate system from Clang
itself, it is intrinsically tied or it will yield different behavior (and
endless confusion).

I agree in principle, but I think the common core should live in
lib/Tooling (and already does), and it's really really small (as it simply
deduplicates the edits, perhaps reports conflicts, and then just uses the
Rewriter to apply the changes - those are around 10 LOC).
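
Roughly what that small core could look like (a sketch, not the actual lib/Tooling code; at the time of this thread tooling::Replacements is a std::set, so exact duplicates are already collapsed by the container):

#include "clang/Basic/LangOptions.h"
#include "clang/Basic/SourceManager.h"
#include "clang/Rewrite/Core/Rewriter.h"
#include "clang/Tooling/Refactoring.h"

// Deduplicated edits in, rewritten files out. Conflict reporting would sit
// just before applyAllReplacements.
bool applyEdits(clang::tooling::Replacements &Replaces,
                clang::SourceManager &Sources) {
  clang::LangOptions DefaultLangOptions;
  clang::Rewriter Rewrite(Sources, DefaultLangOptions);
  if (!clang::tooling::applyAllReplacements(Replaces, Rewrite))
    return false;                          // some replacement didn't apply
  return !Rewrite.overwriteChangedFiles(); // returns true on error
}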

Everything else in the extra tool is about reading the changes from disk,
and using multi-threading to apply them in a scalable way.

Cheers,
/Manuel

Yes, erring out when conflicts occur is good, but I also consider it a
rather trivial amount of code.

From: Manuel Klimek [mailto:klimek@google.com]
Sent: Wednesday, July 31, 2013 2:04 PM
To: Chandler Carruth
Cc: Vane, Edwin; Daniel Jasper; Clang Dev List (cfe-dev@cs.uiuc.edu)
Subject: Re: [cfe-dev] FW: RFC: YAML as an intermediate format for
clang::tooling::Replacement data on disk

...

Everything else in the extra tool is about reading the changes from disk, and
using multi-threading to apply them in a scalable way.

How did you envision parallelism being used? Reading files off the disk right now means iterating over the directory structure looking for files. De-duplicating and reporting conflicts, done as part of clang::tooling, would need a rewrite to use a parallel algorithm if this is where you intended parallelism. Then applying the deduplicated replacements would require the Rewriter to be thread-safe. This all presumes there would be a large number of change descriptions for a given header. Is this something that happens in your experience? Or am I misunderstanding where you want multi-threading?

As noted, I think the design generally makes sense - my main concern is
that the tool that applies the replacements should not be
cpp11-migrator-specific. This would be useful for basically *all*
refactorings.

I think in the end we'll want 2 things:
- some functions in lib/Tooling/ that allow outputting those replacement
collections from clang-tools
- a tool to aggregate, deduplicate and apply all the changes

Just to clarify an important design constraint from my perspective:

It shouldn't be just any tool to aggregate, deduplicate, and apply all
the changes. It should specifically be the *same* code path that a tool
uses when it runs over all TUs in-process. This is to me *really* important
to ensure we get consistent behavior across these two possible workflows:

1) Run tool X over code in a.cc, b.cc, and c.cc in a single invocation.
Internally hands rewrites to a library in Clang that de-dups and applies
the edits.

2) Run tool X over code in a.cc, b.cc, and c.cc; one invocation per file.
Each run writes out its edits in some form. Run tool Y which reads these
edits and hands them to a library in Clang that de-dups and applies the
edits.

So, I essentially think that it *can't* be a separate system from Clang
itself, it is intrinsically tied or it will yield different behavior (and
endless confusion).

I agree in principle, but I think the common core should live in
lib/Tooling (and already does), and it's really really small (as it simply
deduplicates the edits, perhaps reports conflicts, and then just uses the
Rewriter to apply the changes - those are around 10 LOC).

Everything else in the extra tool is about reading the changes from disk,

This makes sense to be in the helper tool...

and using multi-threading to apply them in a scalable way.

I think this should be in the core code. If you have 10k files that you
want to edit and you can analyze them in-process fast enough (may become
realistic w/ modules), we should also apply the edits in a scalable way.

Clearly, this can always be an incremental thing, I'm just trying to
clarify the important constraint for me on the end state.

So, the way we parallelize is:
-> read all the replacements
-> shard them by source file on which they're applied
-> in parallel for each source file:
--> deduplicate
--> apply / write

The last part can be slow when networked file systems or strange source
control systems are involved (which is our case), and parallelizing it also
helps to hide some I/O latency.
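
A sketch of that pipeline (Edit again stands in for Replacement, deduplicateAndApply for the sequential per-file core, and std::async for whatever parallelization framework ends up driving this):

#include <future>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Edit { std::string FilePath; unsigned Offset; unsigned Length; std::string Text; };

// Stands in for the sequential per-file core: deduplicate, then rewrite and
// write the file back.
void deduplicateAndApply(std::vector<Edit> Edits) { (void)Edits; /* see the earlier sketches */ }

void applyAllInParallel(const std::vector<Edit> &AllEdits) {
  // Shard the replacements by the file they apply to.
  std::map<std::string, std::vector<Edit>> Shards;
  for (const Edit &E : AllEdits)
    Shards[E.FilePath].push_back(E);

  // One task per file; each shard is deduplicated and applied independently,
  // which also hides per-file I/O latency.
  std::vector<std::future<void>> Tasks;
  for (auto &Shard : Shards)
    Tasks.push_back(std::async(std::launch::async, deduplicateAndApply,
                               std::move(Shard.second)));
  for (auto &T : Tasks)
    T.get(); // propagate any worker exceptions
}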

As noted, I think the design generally makes sense - my main concern is
that the tool that applies the replacements should not be
cpp11-migrator-specific. This would be useful for basically *all*
refactorings.

I think in the end we'll want 2 things:
- some functions in lib/Tooling/ that allow outputting those
replacement collections from clang-tools
- a tool to aggregate, deduplicate and apply all the changes

Just to clarify an important design constraint from my perspective:

It shouldn't be just any tool to aggregate, deduplicate, and apply all
the changes. It should specifically be the *same* code path that a tool
uses when it runs over all TUs in-process. This is to me *really* important
to ensure we get consistent behavior across these two possible workflows:

1) Run tool X over code in a.cc, b.cc, and c.cc in a single invocation.
Internally hands rewrites to a library in Clang that de-dups and applies
the edits.

2) Run tool X over code in a.cc, b.cc, and c.cc; one invocation per
file. Each run writes out its edits in some form. Run tool Y which reads
these edits and hands them to a library in Clang that de-dups and applies
the edits.

So, I essentially think that it *can't* be a separate system from Clang
itself, it is intrinsically tied or it will yield different behavior (and
endless confusion).

I agree in principle, but I think the common core should live in
lib/Tooling (and already does), and it's really really small (as it simply
deduplicates the edits, perhaps reports conflicts, and then just uses the
Rewriter to apply the changes - those are around 10 LOC).

Everything else in the extra tool is about reading the changes from disk,

This makes sense to be in the helper tool...

and using multi-threading to apply them in a scalable way.

I think this should be in the core code. If you have 10k files that you
want to edit and you can analyze them in-process fast enough (may become
realistic w/ modules), we should also apply the edits in a scalable way.

I think the sequential part of that step should be in the core code, but
the parallelization will be in a (small) tool. This is also where
special-case logic for custom networked file systems or strange version
control systems would fall in.

Clearly, this can always be an incremental thing, I'm just trying to
clarify the important constraint for me on the end state.

Agreed.

I really disagree. I don't know why we wouldn't aim to eventually have the
core clang logic know all the necessary tricks for scaling all aspects of
this up. It shouldn't be sequestered like this.

Consider, if you have the ability to do any amount of parallelism in the
tool, you'd often like to pipeline the analysis and the updating of files,
all while maintaining de-dup coherency etc. This would essentially require
the same code to be managing both the running of the tool and the writing
of the updates. In the case where you *don't* serialize the edits through
some other layer, I think you still will want this, and thus it belongs in
common shared library code.

Naturally, parallelism within a process of this kind (or any kind) is a long
way off. I just don't want us to lose sight of it when thinking about what
the end goal looks like.

I think the sequential part of that step should be in the core code, but
the parallelization will be in a (small) tool. This is also where
special-case logic for custom networked file systems or strange version
control systems would fall in.

I really disagree. I don't know why we wouldn't aim to eventually have the
core clang logic know all the necessary tricks for scaling all aspects of
this up. It shouldn't be sequestered like this.

Consider, if you have the ability to do any amount of parallelism in the
tool, you'd often like to pipeline the analysis and the updating of files,
all while maintaining de-dup coherency etc. This would essentially require
the same code to be managing both the running of the tool and the writing
of the updates. In the case where you *don't* serialize the edits through
some other layer, I think you still will want this, and thus it belongs in
common shared library code.

While in an ideal world I would agree, I think the possible parallelization
frameworks we'll want to support are different enough that we always want
the parallelization framework to drive the analysis, and provide atoms that
can be put together effectively to achieve that.

I'm not saying that we shouldn't provide some common mechanism for
parallelizing in the core, I just think we need to architect it in a way
that the parallelization can be driven from outside, at which point the
parallelization in the core becomes less central to the whole design, and
thus I think less important to argue about.