Module build - tokenized form of intermediate source stream

Hi all,

Currently, building a module involves creating an intermediate source stream that includes/imports each header composing the module. This source stream is then parsed as if it were a source file. So, to build a module, several transformations must be done (a rough sketch of the intermediate artifacts follows the list):

  • The module map is parsed to produce module objects (clang::Module),
  • The module objects are used to build a source stream (llvm::MemoryBuffer) that contains include directives,
  • The source stream is parsed to produce the module content.
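
As an illustration of the two intermediate artifacts, assuming a hypothetical module MyLib with headers a.h and b.h (the exact contents of the synthesized buffer may differ between Clang versions):

    // module.modulemap -- parsed into clang::Module objects
    module MyLib {
      header "a.h"
      header "b.h"
    }

    // Synthesized source stream (llvm::MemoryBuffer), parsed to build the module content:
    #include "a.h"
    #include "b.h"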

The build process could be simpler if, instead of a textual source stream, we prepared a sequence of annotation tokens: annot_module_begin, annot_module_end, and some new token, say annot_module_header, which would represent a header of the module. It would be something like a pretokenized header, but without a counterpart in the file system.
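
Conceptually, the intermediate input for the same hypothetical MyLib module would then be a token sequence rather than text, roughly (this is only a sketch of the proposal, not existing syntax):

    annot_module_begin(MyLib)
      annot_module_header("a.h")   // stands in for the contents of a.h
      annot_module_header("b.h")
    annot_module_end(MyLib)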

Such a redesign would help solve the performance degradation reported in PR24667 ([Regression] Quadratic module build time due to Preprocessor::LeaveSubmodule). The cause of the problem is that we leave the module after each header, even if the next header belongs to the same module. Leaving the module only after the last header would be a solution, but it is hard to tell whether the header just parsed is the last one: there is no way to look ahead to the next include directive. With tokenized input, module ends could be marked easily.
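
A back-of-the-envelope view of where the quadratic behavior comes from (this is an assumption about the mechanism, not something stated in the PR):

    // H headers in a single submodule, roughly M macros introduced per header.
    // If leaving the submodule after header i walks the ~i*M macro records
    // accumulated so far, the total work is about
    //   M * (1 + 2 + ... + H) = M * H * (H + 1) / 2
    // i.e. quadratic in the number of headers for a fixed M.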

Is there any reason why the textual form of the intermediate source stream should be kept? Does implementing a tokenized form of it make sense?

We generally recommend that each header goes in its own submodule, so
optimizing for this case doesn't address the problem for a lot of cases.

Leaving the module only after the last header would be a solution, but it is hard to tell whether the header just parsed is the last one: there is no way to look ahead to the next include directive. With tokenized input, module ends could be marked easily.

I have a different approach in mind for that case: namely, to produce a
separate submodule state for distinct submodules even when not in local
visibility mode, and lazily populate its Macros map when identifiers are
queried. That way, the performance is linear in the number of macros the
submodule actually defines or uses, not in the total number defined or used
by the top-level module.

We generally recommend that each header goes in its own submodule, so
optimizing for this case doesn't address the problem for a lot of cases.

The "one huge submodule" approach with no local visibility is actually very
useful to have because it (for better or for worse) is very close to the
semantics of PCH (which are very simple). This makes it a nice incremental
step away from PCH and very easy to understand.

Also, I think "we generally recommend" is a bit strong considering that
this isn't documented anywhere to my knowledge. In fact, the documentation
I've written internally for my customers recommends the exact opposite for
the reason described above.

-- Sean Silva

We generally recommend that each header goes in its own submodule, so
optimizing for this case doesn't address the problem for a lot of cases.

These are different use cases, and there is nothing wrong with solving the problem by different means. If a user follows this recommendation and puts each header into a separate submodule, they won't suffer from the tokenized form of the intermediate input stream. If the user chooses to put many headers into one module, this change can solve the problem. The cited PR refers to exactly the latter case.

The "one huge submodule" approach with no local visibility is actually
very useful to have because it (for better or for worse) is very close to
the semantics of PCH (which are very simple). This makes it a nice
incremental step away from PCH and very easy to understand.

Also, I think "we generally recommend" is a bit strong considering that
this isn't documented anywhere to my knowledge. In fact, the documentation
I've written internally for my customers recommends the exact opposite for
the reason described above.

The "one huge module" approach is very convenient for users. Usually it is much simpler to write something like #include "clang.h" than to list a dozen includes. When an API is spread across many headers, a user must first figure out where the necessary piece is declared. In the pre-modules era, splitting an API up was an unavoidable evil, since it reduced compile time. With modules we can enable more convenient solutions.

Is there a technical reason for this? Is there a difference (say, bigger module size or slower deserialization) between a header file per submodule and a header file per standalone module?

We have stumbled upon cases (see attachment) which compile just fine without modules and with standalone modules (i.e. header per module). They do not compile with the submodule model. In the attached example, including B.h registers the top-most module, which pollutes the lookup (with the lookup entries of A), so the behavior differs from a non-module build.

Vassil

submodule_visibility.diff (1.1 KB)

The technical reason is that it gives more precise control over name export
/ import -- that is, if you don't do this then #including a modular header
file can make too many names visible, and if you develop using that
approach then your builds will likely fail due to use of undeclared names
when you build without modules enabled.
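
A minimal sketch of that failure mode, with hypothetical headers placed in one big module:

    // a.h
    int foo(void);

    // b.h
    int bar(void);

    // module.modulemap
    module MyLib { header "a.h" header "b.h" }

    // user.c
    #include "a.h"
    int use(void) { return bar(); }  // OK with modules: including a.h imports MyLib,
                                     // which makes bar() visible too.  Without modules,
                                     // bar is undeclared and the build fails.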

We have stumbled upon cases (see attachment) which compile just fine without modules and with standalone modules (i.e. header per module). They do not compile with the submodule model.

That's somewhat separate from what we're talking about; this also doesn't
compile with the "all the headers in the same module with no submodules"
approach. The problem here is that you're violating
[basic.scope.declarative]p4, so your program is ill-formed, and with
modules enabled Clang is able to detect and diagnose this. (I'm inclined to
permit your example -- we should only really be diagnosing redeclaration
conflicts between entities if either both have linkage or the old
declaration is visible -- and if we did so then the
one-submodule-per-header approach would work and the one-big-module
approach would fail for your testcase.)

These are different use cases, and there is nothing wrong with solving the problem by different means. If a user follows this recommendation and puts each header into a separate submodule, they won't suffer from the tokenized form of the intermediate input stream. If the user chooses to put many headers into one module, this change can solve the problem. The cited PR refers to exactly the latter case.

I think you're missing my point. We seem to have a choice between a general
solution that addresses the problem in all cases, and a solution that only
helps for the "one big module with no submodules" case (which is not the
case that you get for, say, an umbrella directory module / umbrella header
/ libc++ / Darwin's libc / ...). If these solutions don't have drastically
different technical complexity, the former seems like the better choice.

I'm not opposed to providing a token sequence rather than text for the
synthesized module umbrella header, but we'd need a reasonably strong
argument to justify the added complexity, especially as we still need our
current mode to handle umbrella headers on the file system, #includes
within modular headers, and so on. If we want something like that, a
simpler approach might be to add a pragma for starting / ending a module,
and emit that into the header file we synthesize, and then teach
PPLexerChange not to do the extra work when switching modules if the source
and destination module are actually the same.
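
For example, the synthesized file for the hypothetical one-big-module MyLib might then look something like this (the pragma spelling is invented here for illustration; no concrete spelling is proposed above):

    #pragma clang module begin MyLib
    #include "a.h"
    #include "b.h"
    #pragma clang module end

With explicit markers like these, PPLexerChange could see that both includes stay within MyLib and skip the per-header enter/leave work.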

The "one huge submodule" approach with no local visibility is actually

very useful to have because it (for better or for worse) is very close to
the semantics of PCH (which are very simple). This makes it a nice
incremental step away from PCH and very easy to understand.

Also, I think "we generally recommend" is a bit strong considering that
this isn't documented anywhere to my knowledge. In fact, the documentation
I've written internally for my customers recommends the exact opposite for
the reason described above.

The "one huge module" approach is very convenient for users. Usually it is much simpler to write something like #include "clang.h" than to list a dozen includes. When an API is spread across many headers, a user must first figure out where the necessary piece is declared. In the pre-modules era, splitting an API up was an unavoidable evil, since it reduced compile time. With modules we can enable more convenient solutions.

I agree, but it seems to me that this should be the choice of the user of
the API. If they want to import all of the Clang API, that should work (and
if you add an umbrella "clang.h" header, it will work), but if they just
#include some small part of that interface, should they really get the
whole thing?

The technical reason is that it gives more precise control over name
export / import -- that is, if you don't do this then #including a modular
header file can make too many names visible, and if you develop using that
approach then your builds will likely fail due to use of undeclared names
when you build without modules enabled.

My experience with modularizing (and the advice that I give to my customers) is to first use the "all the headers in the same module with no submodules" approach, and then treat the submodule feature as a way of tightening things up once things work. This seems to be the most understandable and easiest path, since during the initial step of building the "one huge module with no submodules", errors can be diagnosed very similarly to PCH/textual inclusion, which is intuitive for users. Incrementally tightening things up then happens at fine granularity, and the issues are easy to pinpoint.

The errors that occur when submodules are used tend to be extremely
difficult to decipher, since they cannot be debugged with a "textual"/"PCH"
mental model. I say this as a person who has tried and failed to modularize
significant amounts of real-world code before devising this approach as a
way to systematically succeed at the task.

Got it, thanks. Yes, it doesn't compile with "all the headers in the same module with no submodules". It compiles just fine with standalone modules (commenting out module Top in the example). I am all for allowing this to work; it would make the migration to modules in our case (and maybe not only ours) a lot easier. Could I help address this issue, and where should I start?

Vassil

I have a different approach in mind for that case: namely, to produce a
separate submodule state for distinct submodules even when not in local
visibility mode, and lazily populate its Macros map when identifiers are
queried. That way, the performance is linear in the number of macros the
submodule actually defines or uses, not in the total number defined or used
by the top-level module.

That is, we would maintain an object of type SubmoduleState for each module in all modes. SubmoduleState is extended with a new field that maps IdentifierInfo* to ModuleMacro and is populated when the preprocessor checks whether an identifier used in the source is a macro. LeaveSubmodule no longer builds ModuleMacros. Instead, just before the module is serialized, SubmoduleState::Macros is scanned, and a ModuleMacro is created for each identifier that does not yet have one.

We probably need to introduce a new flag in IdentifierInfo, something like 'NotAMacro', to mark identifiers that were checked for being macro names and found not to be; that would avoid repeated lookups. If the HasMacro flag is set, this flag is cleared.

We seem to need this relatively complex procedure because we must support the case where one header defines a macro and another header only uses it. In that case the macro state must be kept somewhere if LeaveSubmodule is called between the headers.
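
A rough sketch of the shape of this change (SubmoduleState, IdentifierInfo, and ModuleMacro are existing Clang classes; the field name and comments are hypothetical, and this is only a sketch of the idea, not a patch):

    // Inside clang::Preprocessor:
    struct SubmoduleState {
      // ... existing per-submodule macro state ...

      // Hypothetical new field: ModuleMacros created lazily, the first time the
      // preprocessor asks "is this identifier a macro?" while this submodule is
      // active.  Scanned once more just before serialization so that identifiers
      // still lacking a ModuleMacro get one.
      llvm::DenseMap<const IdentifierInfo *, ModuleMacro *> LazyModuleMacros;
    };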

What do you think of such an implementation?

That seems pretty invasive. I'm not sure it is worth it; the case that I reduced to PR24667 was fairly extreme (all the headers of a large project (~the size of LLVM) in a single top-level module). I'm not sure how likely it is that this will be run into in practice. It's definitely worth fixing on the principle of avoiding quadratic behavior, but it isn't (currently) blocking a real-world use case, so I hesitate to make very invasive changes.

-- Sean Silva

Modules should be most valuable precisely for large projects, where the compile-time savings can be substantial. So the problem described in PR24667 should be solved anyway, with this approach or another.

--Serge

Large projects are composed of many small sub-parts. I don't think any real project has a "module" with >1000 headers. I just tested the performance, varying the number of headers in the module and the number of macros per header. My results, plotted in Mathematica, can be seen here: http://i.imgur.com/E6g0g0M.png (testit_formathematica.py attached). Even for the case of 128 headers with 500 macros each (in a single top-level module with no submodules), the slowdown is less than 3x vs. Clang 3.6. So this isn't the end of the world (it's not like compilations won't finish; in the case I ran into in "practice" in my experiments, with >1000 headers in a single top-level module with no submodules, the module was taking ~60 seconds to build).

Like I said, this is worth fixing on the principle of avoiding quadratic behavior. The most likely case I can think of where this quadratic behavior would really bite in practice is when initially modularizing an entire SDK top-level include directory that has a bunch of stuff in it; for both Mac and PS4 this is a couple hundred headers. With 512 headers and 100 macros per header (testing this with a modified version of my Mathematica notebook), this gives a 9x slowdown, which is a lot, but this sort of situation is rare. Once the modularization is done, there are submodules at much smaller granularity, so the problem disappears.

I think Richard's suggestion of having a #pragma for starting/stopping a submodule is a really good idea. Among other things, it would make module maps nothing but syntax sugar for something that can be done directly in the language. It fixes this issue, and I think it may also make it easier for users to understand what is happening. It also might make it very easy to migrate from PCH to modules. In practice I have actually explained the way module maps work in terms of the synthesized header file (and it seems to be a good way to describe it), so it is natural to allow the synthesized header file to actually be written out (even if at first we use __reserved names for the pragma so it isn't available to users; someday we might open it up if this proves useful).

-- Sean Silva

testit_formathematica.py (1.15 KB)

Interesting investigation. Having read this, I get the impression that the quadratic compile time is more a matter of design purity than a real user headache. Indeed, large numbers of includes are more typical of big projects, where it is feasible to spend some effort on modularization. I guess that even with degraded module-compilation performance, end users still gain compile time for their application. So in choosing a way to fix the problem, we should weigh the trade-offs between good design and a quick solution.

Do you think introducing the #pragma is a temporary solution because we do not have a general one now? Or is it a help for people doing modularization that is worth having even if the general solution is implemented?

An attempt to implement a minimally invasive fix for the compile-time problem: http://reviews.llvm.org/D13987.