DP flops on the cheap

06 09 15 - 11:44

I'm not the only one with piles of leftover hardware from cryptocoin mining days. Usually this is some AM3+ board with a couple of x16 PCIe slots, the cheapest cpu and a bunch of AMD (ex ATI) gpus. It is hard to give new and interesting life to such hardware, but with sufficiently motivating hobbies, there's a way.

Recently I came across an interesting project called DroneCFD. It's a bunch of python that does everything that needs to be done in order to put a 3D geometry of something that pretends to fly into a virtual wind tunnel and create some nice pictures with OpenFOAM and Paraview. Which is something that I wanted to do for some time now, since my hobby is fast developing into a proper sport (we already have world cup and world championship is within 6 years) and we'll want to have the best possible models.

So we need OpenFOAM. What's that? It's something called CFD, computational fluid dynamics. Actually it's a library of many different solvers for various fluid simulations, from laminar, turbulent, supersonic, recently it started to spread into multiphysics as well (heat, EM). It's completely opensource and popular in academic circles. There are many commercial tools out there that cover the same problem areas, but their cost is completely out of scope for hobbyists. If you want to explore CFD further, there are a couple of courses on youtube that can give you an insight into the math involved.

The gist of the problem is this: in CFD you describe your problem with complex sets of differential equations that you need to solve for each time step. When this is translated into numerical algorithms, you end up with large sparse matrices that you shuffle in and out of main memory. Since they're usually too big to fit into cpu caches, the memory bandwidth becomes your limiting factor. So if you want to run your simulations fast, you're looking for a system with highest possible memory bandwidth. And here is where our gpus come into play.

I've set up a system from my leftovers, got a second hand FX8320 cpu for it and cashed out for 32GB of cl9 1866MHz memory. Then I set up the whole optimized environment with the help of excellent EasyBuild framework. It consists of latest GNU GCC 5.2.0 compiler, AMD ACML 5.3.1 math libraries, OpenMPI 1.6.5 and OpenFOAM 2.3.1. With this setup I ran the simulation with example geometry, included with DroneCFD and it needed ExecutionTime = 36820.82s, ClockTime = 38151s to perform 3000 time steps of simulation.

Then I noticed that AMD receltny released clMath libraries on github. They implement some commonly used math routines in OpenCL 1.2 and 2.0, which means that you can run them on cpu, gpu or any other device that implements OpenCL. One nice thing about these libraries is that at least clBLAS includes a handy client and a bit of python scripts that enable one to do some benchmarking of the hardware. And that's exactly what I did.

First, I ran some tests to demonstrate cpu vs gpu difference. I used 7970 gpu here with OpenCL 1.2 version of the clMath libs. This is what I got:

cpu vs tahiti

X axis presents matrix size, Y is Gflops measured by the library. Here I performed single and double precision general matrix-matrix multiplication on cpu and on gpu. Y scale should almost be logarithmic to see cpu performance in more detail ;) There are a lot of interesting things worth more detailed discussions on this graph, but it serves a purpose I wanted to demonstrate - gpus are many many times faster than cpus for this kind of work. No wonder scientific communities are jumping on them like crazy.

Since 7970 is not the only kind of gpu I have lying around, I replaced it with R9-290, rebuilt clMath libs with OpenCL 2.0 and rerun the tests:

Couple of things to note here. R9 290 is based on a new, different architecture (called Hawaii) than 7970 (Tahiti). While the older architecture has about half the performance at double precision compared to single precision (which makes sense, as dp numbers take twice as much space in memory as sp numbers), newer architecture fails to reach the dp performance of the older one for most of the explored range. If single precision is good enough for your problem, then newer equals better. But with most of the engineering problems demanding double precision math, it turns out that previous generation of gpus offers more. 

There's one limiting factor with these gaming gpus: they have relatively small amout of memory. While r9 290 has 4GB, 7970 has only 3GB and these are both small if you want to run some decent numeric simulation. There are two ways to grow beyond that: first is to cash out for "professional" gpu products with up to 32GB or memory and then, if even that is not enough, distribute your simulation across many gpus and many systems with MPI. But that is beyond our hobby again.

There are two things I want to do for next step: first I want to run OpenFOAM linked with these clMath libraries and measure any improvement. I assume that copying data to and from gpu for each time step will kill any performance bennefits that gpu can offer. But I want to have something to compare to, as I discovered a company that ported exactly the solvers I'm interested in to run fully on gpu, only doing the copy at the beginning and at the end of simulation. Also they offered affordable prices for their work for us hobbyists so stay tuned for part 2 :)



22 12 14 - 16:20

With my curiosity and passion for innovation I often discover patterns around me that appear as pendulum motion, from one extreme to the other, back and forth. Plotted on a time line, they may appear as a sine curve, but looking from evolutionary point of view, a spiral is a much better representations. Sometimes those patterns even gravitate towards a common endpoint, such as a center of a spiral.

One such pattern is now in motion and starting to gravitate towards a center.


HPC world

18 10 12 - 14:32

Swimming in the HPC waters for the past two years, I got some sense of how wet and cold the water is.


Server vs. server

09 01 10 - 21:46

Some people like strong cars, I like powerful computers. Since I got on the internet in 1994 or so, this reads powerful servers. My last workstation oriented powerful computer was Pentium 200 in 1996 and it soon landed in the role of a server, with 4x 25GB drives in 1999. Btw, it still works perfectly ;)

Since home server is mostly limited by budget, one looks at price/performance factor. And since I have all the storage management (raid/lvm) expirience, I spent some time putting together a data-reliable machine, that serves the usual music/movies entertainment stuf, runs a samba share for few windoze boxen to share their data and a few NFS and AoE exports for my desktop. Since my desktop is diskless (actualy has no moving parts), it is very dependant on latency and io throughput the server offers it.

For a few years now this role was handled by a HP tc2120 "server" I got cheaply second-hand. Pentium 4, 64bit pci bus, looks like a decent IO oriented pc. I hang a pair of ide disks on onboard ide, a few more on addon ide cards and a bunch of sata disks on LSI 1068. All disks were in mirrored pairs (same size, different manufacturer) and joined together in one volume group. Out of that I carved various logical volumes for various tasks. One such volume was used for torrent downloads.

And how was I satisfied with it? It was barely useful. Every time software mirrors were doing a resync, latencies grow beyond anything I would consider useable. Every time a torrent was active, the whole system lagged. IO to one volume killed IO on others. I tried many things, like decreasing raid resync speed (helped only a little and slowed down resync time to a week or so), changing schedulers (no observable change), readahead settings (marginally better on read-mostly nfs mounts). At the end I concluded that it must be the interactions of all the storage layers that create artificial relations among diffrerent disks that are standing in my way and greatly decreasing the overall system performance.

Because I'm moving soon to a different house, I decided on setting up a less powerful server here and taking this "nice" HP box with me, configured differently, of course. I dug up an old slot1 Pentium 3 pc, put desktop gigabit nic and 3ware 6800 in it and stuffed it with disks. There's only one mirror now and 3ware takes care of it, all other disks are jbod, standalone, no lvm. Each disk has its own scheduler and readahead settings that depend on how it is used. Expirience: same throughput as HP "server", much better latencies, much better paralelism. Suprisingly this 12 years old pc is capable of sustained 50MB/s over ethernet to different disks, a number that I've only seen in bursts on HP "server".  Oh, and it also draws about half as much power as Pentium 4. And it serves you this page you're reading now.

 So much about "new and improved" technology. Complex setups are rarely better than simple setups. Cheap, fast, reliable: pick any two, you can't have all three. What more can I say? ;)


Sad state of linux filesystems

08 12 06 - 01:33

Faced with a need for a reliable, well performing file system, a linux admin today is in a bit of a trouble. Ext2 is still in the kernel just for academic purposes, ext3 is "good enough" for majority but uselessly slow for what I want from it, reiser3 is in "don't tuch if it's not b0rken" mode but suffers from BKL use which limit its scalability, xfs and jfs are nowhere near a reliable state and reiser4 is "almost there" but due to Hans arrest, one wonders if it will ever be finished.

The more I think, the more I see there are only two usefull options for me: use Veritas VXFS, which recently became free for "smaller" installs or dump linux altogether and go with ZFS in Solaris 10. Or maybe Nexenta.

One wonders if so much choice really does make sense ... because these days it looks like FS knowhow is too much spread out and not focused on making just one, but really good FS ... As much as I despise RedHat for sticking with just ext3 in their RHEL line, at the end it would seem that was the right longterm choice.

Now if only chunkfs would become a reallity sooner ...


Cyrus HA

10 03 05 - 03:21

There's more and more discussion about Cyrus high availability setups on the cyrus mailing list. Recent thread started with a familiar question of how to achieve best possible availability and drifted to the discussion about implementing an active-active application level redundancy into the cyrus itself. Since i love to play with disks, raids, volumes & co, I posted my raid-based sysadmin view on how would i try to implement it

  • 1