BSGP: bulk-synchronous GPU programming

A SIGGRAPH paper by Qiming Hou, Kun Zhou, Baining Guo, abstract:

We present BSGP, a new programming language for general purpose computation on the GPU. A BSGP program looks much the same as a sequential C program. Programmers only need to supply a bare minimum of extra information to describe parallel processing on GPUs. As a result, BSGP programs are easy to read, write, and maintain. Moreover, the ease of programming does not come at the cost of performance. A well-designed BSGP compiler converts BSGP programs to kernels and combines them using optimally allocated temporary streams. In our benchmark, BSGP programs achieve similar or better performance than well-optimized CUDA programs, while the source code complexity and programming time are significantly reduced. To test BSGP's code efficiency and ease of programming, we implemented a variety of GPU applications, including a highly sophisticated X3D parser that would be extremely difficult to develop with existing GPU programming languages.

The language acts to simplify CUDA, which reminds me of assembly code even if it uses C syntax, with, among other things, a higher-level memory model and implicit data-flow (so you don't have to explicitly partition your code between different kernels). Here is one trick that really impressed me:

findFaces(int* pf, int* hd, int* ib, int n) {
  spawn(n*3) {
    rk = thread.rank;
    f = rk/3;  
    v = ib[rk];
    thread.sortby(v); 
    require
      owner = dtempnew[n]int;
    rk = thread.rank;
    pf[rk] = f;
    owner[rk] = v;
    barrier;
    if (rk == 0||owner[rk-1] != v)
      hd[v] = rk;
  }
}

After the call to sortby, all threads are sorted by rank according to the values of v, rather than explicitly sorting a list or some other auxiliary data structure that would have to be allocated into memory. In other words, the call forces a reality where all the threads are coincidentally arranged in the way we want them to be...an interesting PL concept.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

References

Very briefly mentioned before. There's also a video presentation.

Intel Ct

Can someone point-out the difference between this and Intel ct hierarchical, synchronous tasks (HSTs)

Just from a cursory look at

Just from a cursory look at what Intel Ct is, they seem to be more ambitious about abstracting away parallelism on multi-cores, whereas BSGP is a less ambitious simplification of the CUDA model. The domains they are aiming at don't really overlap (Ct for multi-core, BSGP for GPU programming) in terms of their programming models.