Looking for suggestions: Inferring GPU memory accesses

Hi all,

As part of my research I want to investigate the relation between the grid’s geometry and the memory accesses of a kernel in common GPU benchmarks (e.g. Rodinia, Polybench, etc.). As a first step I want to answer the following question:

  • Given a kernel function with M possible memory accesses, for how many of those M accesses can we statically infer the accessed location, given concrete values for the grid/block dimensions and the executing thread?

(Assume CUDA only for now)

My initial idea is to replace all uses of dim-related values, e.g.:
__cuda_builtin_blockDim_t::__fetch_builtin_x()
__cuda_builtin_gridDim_t::__fetch_builtin_x()

and index-related values, e.g.:
__cuda_builtin_blockIdx_t::__fetch_builtin_x()
__cuda_builtin_threadIdx_t::__fetch_builtin_x()

with ConstantInts, then run constant folding on the result and check how many GEPs end up with all-constant indices.
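To make this concrete, here is a rough, untested sketch of what I have in mind (assuming the kernel has already been lowered to NVPTX IR, where the builtins above become llvm.nvvm.read.ptx.sreg.* intrinsics; the helper names and the example constants are mine):

```cpp
// Rough sketch, untested: pin the launch-geometry intrinsics to concrete
// values, constant-fold, then count the GEPs whose indices became constant.
#include "llvm/ADT/STLExtras.h"
#include "llvm/Analysis/ConstantFolding.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/IntrinsicsNVPTX.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Replace every call to one sreg intrinsic (e.g. threadIdx.x) with a constant.
static void pinIntrinsic(Function &F, Intrinsic::ID ID, uint64_t Val) {
  SmallVector<CallInst *, 8> Calls;
  for (Instruction &I : instructions(F))
    if (auto *II = dyn_cast<IntrinsicInst>(&I))
      if (II->getIntrinsicID() == ID)
        Calls.push_back(II);
  for (CallInst *C : Calls) {
    C->replaceAllUsesWith(ConstantInt::get(C->getType(), Val));
    C->eraseFromParent();
  }
}

// Fold until a fixed point, then count GEPs with all-constant indices.
static unsigned countInferableGEPs(Function &F) {
  const DataLayout &DL = F.getParent()->getDataLayout();
  bool Changed = true;
  while (Changed) {
    Changed = false;
    for (Instruction &I : make_early_inc_range(instructions(F)))
      if (Constant *C = ConstantFoldInstruction(&I, DL)) {
        I.replaceAllUsesWith(C);
        I.eraseFromParent();
        Changed = true;
      }
  }
  unsigned N = 0;
  for (Instruction &I : instructions(F))
    if (auto *GEP = dyn_cast<GetElementPtrInst>(&I))
      if (GEP->hasAllConstantIndices()) // offset off the base ptr fully known
        ++N;
  return N;
}

// Example: thread (5,0,0) of block (2,0,0), with 128x1x1 blocks.
unsigned analyzeKernel(Function &KernelF) {
  pinIntrinsic(KernelF, Intrinsic::nvvm_read_ptx_sreg_tid_x, 5);
  pinIntrinsic(KernelF, Intrinsic::nvvm_read_ptx_sreg_ctaid_x, 2);
  pinIntrinsic(KernelF, Intrinsic::nvvm_read_ptx_sreg_ntid_x, 128);
  // ...same for the .y/.z variants and nctaid (gridDim) as needed...
  return countInferableGEPs(KernelF);
}
```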

Would something like this work or are there complications I am not thinking of? I’d appreciate any suggestions.

P.S. I am new to LLVM.

Thanks in advance,
Ees

CUDA/GPU programs are written for a SIMT (single instruction, multiple threads) execution model: programmers write a single program in such a way that each thread executes it on different data. There is one physical copy of the program, but it is virtually run by many threads, so the grid/thread IDs are integral to the semantics of the program. You can’t replace thread-specific variables with a single thread ID.

Hence, I don’t think what you’re proposing would have much applicability in real-world benchmarks like Rodinia. If you have a strong motivating example then please provide a counter-argument, but in my experience it won’t be very useful.

In some corner cases it would be useful, but those would generally be cases of uniform code blocks.

Hi Madhur and thanks for your answer.

> You can’t replace thread-specific variables with a single thread ID.

Why not? Let me rephrase. What I’m looking for at this stage is to be able to pick a thread in a block and see, for this particular thread, how many memory accesses in the kernel are (statically) inferable.

For instance, for these kernels https://github.com/yuhc/gpu-rodinia/blob/0739f8045ca9d8153b06973a8b10f6d97485cd72/cuda/gaussian/gaussian.cu#L309 if you provide concrete values for the grid, block, and index, as well as the scalar arguments, you can tell (manually) which offsets off of the pointer arguments are being accessed by the kernel.
In contrast, in a kernel like this https://github.com/yuhc/gpu-rodinia/blob/0739f8045ca9d8153b06973a8b10f6d97485cd72/cuda/huffman/hist.cu#L34 you can’t infer them all, because some indices are data-dependent.
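As a toy illustration of the difference (my own example, not from Rodinia):

```cuda
// Given concrete values for the grid, block, and thread IDs, every access
// in 'inferable' is statically known; the store in 'data_dependent' is not,
// because idx[i] is only known at run time.
__global__ void inferable(float *a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    a[i] = 2.0f * a[i];       // offset is a pure function of the IDs
}

__global__ void data_dependent(float *a, const int *idx, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    a[idx[i]] = 0.0f;         // the load of idx[i] is inferable, a[idx[i]] is not
}
```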

What I’m looking for - and again, this is only a first step to something bigger - is to automate this process.

On Sat, Aug 22, 2020 at 5:38 PM, Madhur Amilkanthwar <madhur13490@gmail.com> wrote:

Hi Ees,

A while back we started a project with a similar scope.
Unfortunately development slowed down, and the plans to revive it this summer were tanked by the US travel restrictions.

Anyway, there is some existing code that might be useful, though it is in a prototype stage. While I'm obviously biased, I would suggest we continue from there.

@Alex @Holger, can we put the latest version on GitHub or some other place to share it? I'm unsure if the code I (might) have access to is the latest.

@Ees I attached a recent paper and you might find the following links useful:

* 2017 LLVM Developers’ Meeting: J. Doerfert, “Polyhedral Value & Memory Analysis”, https://youtu.be/xSA0XLYJ-G0

* “Automated Partitioning of Data-Parallel Kernels using Polyhedral Compilation”, P2S2 2020 (slides and video: https://www.mcs.anl.gov/events/workshops/p2s2/2020/program.php)

Let us know what you think :)

~ Johannes


icppworkshops20-13.pdf (911 KB)

@Ees,
Oh, I see what you mean now. Doing such an analysis would be more useful for a whole thread block than for just a single thread, but as you say, you are onto something bigger than a single thread.

We published a short paper at ICS around this, which uses polyhedral techniques to do such analysis and reason about uncoalesced access patterns in CUDA programs. You can find the paper at
https://dl.acm.org/doi/10.1145/2464996.2467288
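To give a flavor of what “uncoalesced” means here, a toy sketch (mine, not code from the paper):

```cuda
// Adjacent threads of a warp reading adjacent words can be served by a few
// memory transactions; a strided pattern scatters the warp's accesses.
__global__ void coalesced(const float *in, float *out) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = in[i];             // warp touches one contiguous segment
}

__global__ void uncoalesced(const float *in, float *out, int stride) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = in[i * stride];    // stride > 1 spreads loads across segments
}
```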

Hello Johannes,

Thank you very much for the material. I will have a look and get back to you (possibly with questions, if you don't mind :) ).
I would also appreciate the code if that's available.

- Ees

@Madhur Thank you, I will have a look at the paper.

> Doing such an analysis would be more useful for a whole thread block than for just a single thread

Do you have any concrete use cases in mind?

I was thinking that I could use such an analysis to, for instance, visualize the memory accesses performed by the kernel (or at least the ones that can be inferred). The relevant literature I find always involves tracing every access, so I’m thinking that with something like this, tracing could potentially be reduced significantly.

-Ees

I don’t have any concrete cases off the top of my head, but the work by Johannes et al. is definitely interesting to me. I think some more literature survey along similar lines would be useful for your research. Later work that cited our paper has built on some of the ideas we proposed, so you could look at those use cases.

Hey Ees, Johannes,

AFAIK the latest version of the code should be publicly accessible here (it just isn’t maintained):
https://github.com/UniHD-CEG/mekong-cuda

Johannes’ low-level polyhedral analysis is in “llvm-patches/”, which you should be able to apply to LLVM trunk from around Jan 8, 2018 (I don’t remember the commit, but it’s unlikely there are changes breaking the relevant APIs just randomly on that day).
It already comes with support for recognizing the standard “threadIdx.{} + blockIdx.{} * blockDim.{}” pattern (with {} being x, y, or z), making the whole expression a “constant”, removing the non-linearity and enabling the polyhedral analysis.
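That is, the canonical global-index computation, as in this illustrative kernel (my example, not from the repo):

```cuda
// Illustrative only: the recognized pattern, treated as a single "constant".
__global__ void saxpy(int n, float a, const float *x, float *y) {
  int gid = threadIdx.x + blockIdx.x * blockDim.x;
  if (gid < n)
    y[gid] = a * x[gid] + y[gid];
}
```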

The post-processing we’re applying to the raw analysis result is in “lib/MeKernelAnalysis.cpp”.
I’m not sure how helpful that will be for you, because we were going in a different direction, combining all accesses of a thread, but it might serve as a fairly simple example of how to use the polyhedral analysis from within LLVM (and of course it was developed shooting from the hip, without tests).

P.S.: I like your idea of inlining everything, then making the CUDA intrinsics constants and applying constant folding. It sounds simple enough to get something working quickly that can then be expanded upon.

Cheers,
Alex