## DP flops on the cheap

06 09 15 - 11:44

I'm not the only one with piles of leftover hardware from cryptocoin mining days. Usually this is some AM3+ board with a couple of x16 PCIe slots, the cheapest cpu and a bunch of AMD (ex ATI) gpus. It is hard to give new and interesting life to such hardware, but with sufficiently motivating hobbies, there's a way.

Recently I came across an interesting project called DroneCFD. It's a bunch of python that does everything that needs to be done in order to put a 3D geometry of something that pretends to fly into a virtual wind tunnel and create some nice pictures with OpenFOAM and Paraview. Which is something that I wanted to do for some time now, since my hobby is fast developing into a proper sport (we already have world cup and world championship is within 6 years) and we'll want to have the best possible models.

So we need OpenFOAM. What's that? It's something called CFD, computational fluid dynamics. Actually it's a library of many different solvers for various fluid simulations, from laminar, turbulent, supersonic, recently it started to spread into multiphysics as well (heat, EM). It's completely opensource and popular in academic circles. There are many commercial tools out there that cover the same problem areas, but their cost is completely out of scope for hobbyists. If you want to explore CFD further, there are a couple of courses on youtube that can give you an insight into the math involved.

The gist of the problem is this: in CFD you describe your problem with complex sets of differential equations that you need to solve for each time step. When this is translated into numerical algorithms, you end up with large sparse matrices that you shuffle in and out of main memory. Since they're usually too big to fit into cpu caches, the memory bandwidth becomes your limiting factor. So if you want to run your simulations fast, you're looking for a system with highest possible memory bandwidth. And here is where our gpus come into play.

I've set up a system from my leftovers, got a second hand FX8320 cpu for it and cashed out for 32GB of cl9 1866MHz memory. Then I set up the whole optimized environment with the help of excellent EasyBuild framework. It consists of latest GNU GCC 5.2.0 compiler, AMD ACML 5.3.1 math libraries, OpenMPI 1.6.5 and OpenFOAM 2.3.1. With this setup I ran the simulation with example geometry, included with DroneCFD and it needed ExecutionTime = 36820.82s, ClockTime = 38151s to perform 3000 time steps of simulation.

Then I noticed that AMD receltny released clMath libraries on github. They implement some commonly used math routines in OpenCL 1.2 and 2.0, which means that you can run them on cpu, gpu or any other device that implements OpenCL. One nice thing about these libraries is that at least clBLAS includes a handy client and a bit of python scripts that enable one to do some benchmarking of the hardware. And that's exactly what I did.

First, I ran some tests to demonstrate cpu vs gpu difference. I used 7970 gpu here with OpenCL 1.2 version of the clMath libs. This is what I got:

X axis presents matrix size, Y is Gflops measured by the library. Here I performed single and double precision general matrix-matrix multiplication on cpu and on gpu. Y scale should almost be logarithmic to see cpu performance in more detail ;) There are a lot of interesting things worth more detailed discussions on this graph, but it serves a purpose I wanted to demonstrate - gpus are many many times faster than cpus for this kind of work. No wonder scientific communities are jumping on them like crazy.

Since 7970 is not the only kind of gpu I have lying around, I replaced it with R9-290, rebuilt clMath libs with OpenCL 2.0 and rerun the tests:

Couple of things to note here. R9 290 is based on a new, different architecture (called Hawaii) than 7970 (Tahiti). While the older architecture has about half the performance at double precision compared to single precision (which makes sense, as dp numbers take twice as much space in memory as sp numbers), newer architecture fails to reach the dp performance of the older one for most of the explored range. If single precision is good enough for your problem, then newer equals better. But with most of the engineering problems demanding double precision math, it turns out that previous generation of gpus offers more.

There's one limiting factor with these gaming gpus: they have relatively small amout of memory. While r9 290 has 4GB, 7970 has only 3GB and these are both small if you want to run some decent numeric simulation. There are two ways to grow beyond that: first is to cash out for "professional" gpu products with up to 32GB or memory and then, if even that is not enough, distribute your simulation across many gpus and many systems with MPI. But that is beyond our hobby again.

There are two things I want to do for next step: first I want to run OpenFOAM linked with these clMath libraries and measure any improvement. I assume that copying data to and from gpu for each time step will kill any performance bennefits that gpu can offer. But I want to have something to compare to, as I discovered a company that ported exactly the solvers I'm interested in to run fully on gpu, only doing the copy at the beginning and at the end of simulation. Also they offered affordable prices for their work for us hobbyists so stay tuned for part 2 :)