16th of May, 2016
I've been working on a project here at QMUL for a while now, using Fortran 90, CUDA and a library called OP2. It's been quite a ride really, with lots of things to learn. I've finally had a bit of a breakthrough getting things working, so it seems like a good time to share some thoughts.
Fortran remains popular in the sciences, and I think there are a few reasons for this. Firstly, later versions of Fortran are actually not so different from C, at least to a programmer. Secondly, Fortran is set up for maths work. One clue is that its array indexing starts from 1! Can you believe it? As programmers we all use 0 as the first index into an array or matrix, but the mathematicians much prefer counting from 1. There are other things built into the language based around the ranks and shapes of multidimensional arrays, but overall it's a compiled language with the longest lineage of any high-level language still in widespread use, dating back to the 1950s. I've heard many horror stories about Fortran 77 and I certainly wouldn't want to be using that dialect today. When something works, academia is loath to change it. This can be a benefit in certain circumstances, when it's the science that matters, but it does put an extra load on people like me who need to learn the language. Good thing I enjoy it!
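As a quick taste of this (a little demo of my own, nothing from the project), bounds, shapes and whole-array operations are all first-class in Fortran 90:

```fortran
program indexing_demo
  implicit none
  real :: a(5)       ! indices run 1..5 by default
  real :: b(0:4)     ! though you can pick any lower bound
  real :: m(3,4)     ! a rank-2 array
  integer :: i

  do i = 1, 5        ! loops naturally start at 1
     a(i) = real(i)
  end do

  print *, lbound(b) ! prints 0: bounds are queryable
  print *, shape(m)  ! prints 3 4: shape is built in
  print *, sum(a)    ! whole-array intrinsics: prints 15.0
end program indexing_demo
```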
Fortran has another upside: automatic differentiation. A program called Tapenade exists which takes the source code of a function and generates new code computing that function's derivative. So if you have a function producing a value, you can automatically create its first or second order derivative. I'm still not sure why Fortran is so amenable to this, or how it works under the hood. It really does sound like voodoo to me, but it's impressive nonetheless. This sort of thing, I'm told, is very helpful for people working in fluid dynamics research.
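To give a flavour of what such a tool produces, here's a hand-written sketch of forward-mode differentiation. To be clear, this is my illustration, not actual Tapenade output; Tapenade's naming and structure will differ:

```fortran
! Original function: f(x) = x**2 * sin(x)
subroutine f(x, y)
  implicit none
  real, intent(in)  :: x
  real, intent(out) :: y
  y = x**2 * sin(x)
end subroutine f

! Hand-written tangent (forward-mode) version: given x and a seed
! direction xd, compute both y and its derivative yd = dy/dx * xd.
! An AD tool derives this mechanically, line by line, via the chain rule.
subroutine f_d(x, xd, y, yd)
  implicit none
  real, intent(in)  :: x, xd
  real, intent(out) :: y, yd
  yd = (2.0*x*sin(x) + x**2*cos(x)) * xd
  y  = x**2 * sin(x)
end subroutine f_d
```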
The actual project I'm working on is porting a fluid dynamics solver to run on a CUDA-capable card. More specifically, we are looking at new ways of parallelising the code base to take advantage of all this sweet, modern hardware! Rather than write CUDA code directly, we have been asked to trial a library called OP2. This library takes the approach that all the data should be organised into sets, with mappings between sets and data attached to the items in each set. Some strict rules need to be followed to avoid race conditions; for example, two set items must not, through indirection, point to the same data item. If you plan things out, though, you can then call OP2's parallel loop constructs, and OP2 does a clever job of either running serially on a single processor, or splitting the work with MPI, OpenMP or CUDA, with very little additional code.
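The pattern looks roughly like this. The names are invented and the signatures are from my reading of OP2's Fortran API, so treat it as a sketch rather than gospel:

```fortran
! Declare the mesh: two sets, a mapping between them, and data on each.
call op_decl_set(nnode, nodes, "nodes")
call op_decl_set(nedge, edges, "edges")
call op_decl_map(edges, nodes, 2, edge_node_data, edge2node, "edge2node")
call op_decl_dat(nodes, 1, "real(8)", pressure_data, p_pressure, "pressure")
call op_decl_dat(edges, 1, "real(8)", flux_data, p_flux, "flux")

! Ask OP2 to run edge_kernel over every edge: it reads the pressure at
! the two nodes each edge maps to, and writes a flux on the edge itself.
! OP2 decides how to parallelise the loop.
call op_par_loop_3(edge_kernel, "edge_kernel", edges, &
     op_arg_dat(p_pressure, 1, edge2node, 1, "real(8)", OP_READ), &
     op_arg_dat(p_pressure, 2, edge2node, 1, "real(8)", OP_READ), &
     op_arg_dat(p_flux, -1, OP_ID, 1, "real(8)", OP_WRITE))
```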
A large part of the work involves flattening a lot of function calls to create a single function, or kernel, that is executed in parallel, with the CUDA kernels being slightly different to those for OpenMP or MPI. The data in our original program is organised quite differently too, so that has required a lot of re-organisation. After this rather laborious process, I finally managed to get a version running on one of our Tesla cards here at the university. We achieved a speed equivalent to about 12 Intel i7 cores. Fast, but nowhere near as fast as it could be.
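The kind of kernel that drops out of this flattening is a small, self-contained subroutine that OP2 can invoke once per set element; the edge_kernel referenced above might look like this (the body is a made-up stand-in, not our actual physics):

```fortran
! One self-contained subroutine per set element, with everything the
! computation needs inlined so OP2 can turn it into a CUDA kernel.
subroutine edge_kernel(p1, p2, flux)
  implicit none
  real(8), intent(in)  :: p1(1), p2(1)  ! pressure at the edge's two nodes
  real(8), intent(out) :: flux(1)       ! flux on the edge itself
  flux(1) = 0.5d0 * (p1(1) + p2(1))     ! stand-in for the real physics
end subroutine edge_kernel
```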
One of the cool things about the CUDA toolkit is the set of profiling tools that come bundled in. nvprof is a really neat little program that provides some very useful stats, and runs on the command line. The Visual Profiler provides an awful lot of advice on top of that, with pretty graphs and suggestions on how to improve your code. It turned out, as I suspected, that we were copying to and from the GPU too often, incurring all the speed penalties you get crossing the PCIe bus. So how best to improve on that?
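Using it is as simple as wrapping your binary (./solver is a stand-in name here):

```
$ nvprof ./solver                    # summary: time spent in each kernel and in memcpy
$ nvprof --print-gpu-trace ./solver  # per-call trace, handy for spotting excess transfers
```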
The first thing is to try to load all the data once and perform all the problem computation on the GPU. This should be possible, and it's my next step. It means writing even more kernels, squashing functions down into one monstrous function, which seems like a pity. It also means we need a hefty GPU with enough memory to hold the entire problem space, but then Tesla cards are fairly beefy these days. If you ever do get to the point where you can't fit the entire problem set onto a card, you can use the asynchronous streaming API and load data whilst still performing computation on what you already have in memory, effectively pipelining the process.
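Here's a minimal sketch of that pipelining idea in CUDA Fortran, assuming a made-up kernel and sizes; the stream calls themselves are the standard CUDA Fortran API:

```fortran
module kernels
contains
  attributes(global) subroutine scale_chunk(a, off, n)
    real :: a(*)
    integer, value :: off, n
    integer :: i
    ! thread indices are 1-based in CUDA Fortran
    i = off + (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= off + n) a(i) = 2.0 * a(i)   ! stand-in for real work
  end subroutine scale_chunk
end module kernels

program pipeline
  use cudafor
  use kernels
  implicit none
  integer, parameter :: n = 2**20, nchunk = 4, csize = n / nchunk
  real, pinned, allocatable :: a(:)       ! pinned host memory, needed for async copies
  real, device, allocatable :: a_d(:)
  integer(kind=cuda_stream_kind) :: streams(nchunk)
  integer :: i, istat, off

  allocate(a(n), a_d(n))
  a = 1.0
  do i = 1, nchunk
     istat = cudaStreamCreate(streams(i))
  end do

  ! Each chunk gets its own stream, so chunk i's copy can overlap
  ! with chunk i-1's kernel: copy and compute are pipelined.
  do i = 1, nchunk
     off = (i - 1) * csize
     istat = cudaMemcpyAsync(a_d(off+1), a(off+1), csize, streams(i))
     call scale_chunk<<<csize/256, 256, 0, streams(i)>>>(a_d, off, csize)
  end do
  istat = cudaDeviceSynchronize()
  a = a_d   ! copy results back (synchronously, for simplicity)
end program pipeline
```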
If you are using Python, C, C++ or the like, you have access to many compilers and tools. With Fortran, however, NVIDIA were caught on the hop! Realising there was a large Fortran community out there, NVIDIA bought the Portland Group's compiler (indeed the whole company, PGI) and began adapting it to work with CUDA. Unfortunately, this means you need to buy a licence if you wish to use it past a year, which is a shame. I've been told PGI has a few more bugs than you might be used to, but it seems to perform OK for me.
We've changed hardware just recently, moving to a newer GeForce card with the latest Ubuntu and PGI compiler. Unfortunately the code is fragile and the build no longer works. Problems like this are very closely tied to the hardware: you really do need to care about how floating-point precision behaves on each card, and whether you need doubles or floats, since we are dealing with engineering and scientific simulation. One piece of software is coming to the rescue though: Docker. Images of particular operating systems and CUDA versions are available, so building and testing against different platforms becomes possible.
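For instance, a hypothetical Dockerfile along these lines pins the toolchain; the nvidia/cuda base images are published on Docker Hub, though the exact tag and our PGI install step are assumptions on my part:

```dockerfile
# Pin the OS and CUDA version so the build environment stops
# drifting with the host machine.
FROM nvidia/cuda:7.5-devel
RUN apt-get update && apt-get install -y build-essential git
# ... install the PGI compiler and OP2 here (licensed, so not shown) ...
WORKDIR /src
COPY . /src
RUN make
```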
I still know very little about fluid dynamics, mind you, though it does sound really interesting! :D