Musings on the TableGen -emit-dag-isel backend

A rather notorious aspect of TableGen is the time required to run the
-emit-dag-isel backend on some targets, including AMDGPU and X86. I added a
timing feature to TableGen and timed the AMDGPU run.

This is great! Thanks Paul!

I think that the 9x reduction in compile-time is well worth the 4% size increase. TableGen's run-time has been a sore point and a source of complaints for quite some time.

I wouldn’t want to be too hasty about simply removing the relaxation algorithm. The size and speed of the compiler affects all users, but the time to compile the compiler “only” affects us compiler developers. And I speak as a developer who is heavily affected by the time to compile the AMDGPU backend.

One off-the-cuff idea (I haven’t even looked at the code yet): could we pass in a null output stream while doing the relaxation phase, and use that to skip most of the actual textual formatting, followed by a final pass with a real output stream once the offset has been determined?

Jay.

Here’s a slightly more on-the-cuff idea. The problem with the current implementation is that it iterates at each level of the tree, which gives rise to the “O(1.7^(n-1)), where n is the depth of the pattern matching tree” cost.

But you can get the same behaviour with just two (linear) passes over the whole tree. The first pass computes the size in bytes of each node, and the second one actually emits it. The key here is that to compute the size of a node, you can add up the sizes of its children and /then/ add the size of the VBR size field. By contrast, when you emit the record, you have to emit the VBR size field /first/. You only need to pass in accurate offset (“CurrentIdx”) information for the second pass, since that info does not affect the byte size of the records emitted. On the second pass you could assert that the emitted size is the same as what was calculated in the first pass.

As for how you change the current implementation to work that way… I would leave that to the tablegen code owner :slight_smile: But I suggest that passing in a null output stream is still a good way to tell EmitMatcher that it is being called as part of the first pass.

Jay.

Sure. I wouldn’t think that 4% of increase in the table size would have a significant impact on performance, but yes, the final decision would need to take that into account.

This is the size of the table, not the size of the overall binary, right? I would imagine that a 4% growth in the size of the table is a substantially smaller growth in the total executable size of, say, clang. If the overall growth is minuscule (say, under 1%), then this seems like the clear path forward.

I’m also optimistic that we might be able to find other ways to shrink the tables to make up the difference if we need to.

—Owen

Can we skip the relaxation step if comments aren’t being emitted? Comment emission is off by default in CMake configuration.