Can parallel processing really cut it?

When Larrabee was first delayed and then “postponed”, most of us weren’t surprised (at least I wasn’t). Parallel computing, though advocated as a world saver, isn’t the easiest model to program for. Doing everything in “software” (graphics, HPC and all) might not be as easy as was anticipated. The cold, hard reality is that languages like C++, Java and their derivatives (mostly OOP ones) were never really designed for parallelism. A little multi-threading here and some asynchronous calls there don’t really cut it; using the full potential of parallel devices is very challenging indeed. Ironically, most of the code that runs today’s software isn’t geared for parallel computing at all. Neither are today’s programmers.

Yet experts advocate a parallel computing model for the future. But is it easy to switch to? Is an innovation in hardware design, or a radical new compiler that optimizes away your “for() loop”, the real answer? A very interesting article to read (even if you are not into graphics and game programming) is:

http://www.brightsideofnews.com/news/2010/5/27/why-intel-larrabee-really-stumbled-developer-analysis.aspx?pageid=0

Very rarely do I quote articles, but this one is really worth a read. Well written and well said.

Supercomputer@Home

The past few weeks have seen a spate of newswires from leading GPU manufacturers showcasing GPUs with potential supercomputing capabilities, some going as far as saying that a powerful GPU could match the power of a supercomputing cluster. Building a supercomputer at home might sound like an impossible task, but the fact is, very soon you could well have one sitting on your desk. No, this is not one of those machines that is just a bit faster than the previous or current generation PC you already have. This machine could well pack a mother lode of computing power, yet it might not feel all that fast while running your GNOME, KDE or Vista UI, or for that matter your Office applications. In fact it might not feel any different at all to the average user. However, hidden beneath the UI exterior will be a system that could be a power monster, at least for applications designed to take advantage of what has become the latest buzzword in programming: parallel computing.

There has been a lot of talk about how a supercomputer could be built using top-of-the-line GPUs, but until very recently GPGPU was a difficult thing to achieve. Most GPGPU solutions in the past involved coaxing existing graphics APIs (OpenGL and Direct3D) into doing general-purpose work, a method fraught with problems, mainly because graphics APIs were never designed with general-purpose computing in mind. That’s why GPGPU technologies like CUDA were developed, and with their advent, along with an almost exponential increase in GPU power, the Supercomputer@Home has become a reality.
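To make that concrete, here is a minimal, purely illustrative CUDA sketch (the kernel name and sizes are mine, not from any real project). The work is written as an ordinary C-like function run by thousands of GPU threads, with no textures, render targets or shader tricks standing in for the computation:

#include <cuda_runtime.h>

// Each GPU thread doubles one element of the array -- general-purpose
// work expressed directly, not disguised as a rendering pass.
__global__ void double_elements(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;                  // arbitrary problem size
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float)); // buffer lives in GPU memory
                                            // (left uninitialised -- this
                                            // is only a structural sketch)
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    double_elements<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}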

No, this is not another entry that praises GPGPU. Well, it is, but not entirely. Let’s be fair: GPGPU is only a part of the solution. As I have repeatedly said in earlier entries, GPGPU has tremendous potential when applied to the right problem. If you were to believe GPU manufacturers, it would seem that the GPU is the answer to every problem out there. The fact, however, remains that the GPU is only part of the solution. GPUs are designed to handle data parallelism very efficiently, and the roots of this obviously lie in how graphics is processed. Graphics, or should I say modern game graphics, generally requires a lot of parallel data processing, and GPUs have evolved to address precisely that.

Current generation GPUs have immense computing power, and most of us already know that. However, GPUs are not all that great at task-based parallelism. Simply put, GPUs are not designed to run several different tasks concurrently, so for problems involving many separate tasks that could benefit from simultaneous execution, GPUs are of little help. However, if you have a large data set composed of many smaller data units and want to perform the same operation on each of those units, the GPU can give you a huge performance boost. The solution involves streaming the entire data set onto the GPU and executing the operation on all the units simultaneously. GPUs excel in such situations.
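Here is a rough sketch of that pattern (again hypothetical, with made-up sizes and a simple SAXPY as the stand-in operation): ship the whole data set to the GPU in one go, apply the same operation to every element in parallel, then copy the result back.

#include <cuda_runtime.h>
#include <vector>

// The same operation -- y[i] = a * x[i] + y[i] -- applied to every
// data unit by a separate GPU thread.
__global__ void saxpy(const float *x, float *y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 22;                       // a few million elements
    std::vector<float> x(n, 1.0f), y(n, 2.0f);   // dummy input data

    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));

    // Stream the entire data set onto the GPU in one shot...
    cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // ...run the same operation on all units simultaneously...
    int threads = 256;
    saxpy<<<(n + threads - 1) / threads, threads>>>(d_x, d_y, 3.0f, n);

    // ...and pull the results back in one transfer.
    cudaMemcpy(y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}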

Which brings us to data streaming. Streaming data onto the GPU can be a costly operation and is best done in large chunks. Streaming too often and in chunks that are too small can easily offset the performance benefit the GPU has to offer. While this is not a limitation of the GPU per se, and more a limitation of the bus, data transfers to and from the GPU should be kept to a minimum to achieve maximum throughput. Graphics programmers are already aware of this: OpenGL and Direct3D both encourage programmers to send as much data as possible to the GPU in one go, in what is called “batching”. Both APIs advise the programmer to stall the GPU as little as possible, and in many instances even accept a memory/resource overhead to achieve better throughput.
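A hypothetical illustration of why batching matters (the buffer size and chunk count are made up): both versions below move exactly the same data, but the second one pays the fixed per-transfer cost a few thousand times over.

#include <cuda_runtime.h>
#include <vector>

int main()
{
    const int n = 1 << 22;
    std::vector<float> host(n, 1.0f);
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));

    // Batched: one large transfer, the bus overhead is paid once.
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Unbatched: the same data in 4096 small pieces. Each cudaMemcpy
    // carries a fixed setup cost, which quickly swamps whatever benefit
    // the GPU was supposed to deliver.
    const int chunks = 4096;
    const int chunk  = n / chunks;
    for (int c = 0; c < chunks; ++c)
        cudaMemcpy(dev + c * chunk, host.data() + c * chunk,
                   chunk * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(dev);
    return 0;
}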

After reading the above paragraphs, one must invariably ask, “Can all problems be efficiently data parallelized?” The answer is no. A lot of problems can be, but many computing problems can’t effectively take advantage of data parallelism and therefore can’t take advantage of the GPU in general. Likewise, if you have a small set of data and want to perform a lot of different operations on that same set, the GPU is of little help; it is your good old CPU that is designed for task parallelism. While a lot of hype surrounds the latest generation of GPUs as potential replacements for supercomputing clusters, the CPU has been left on the sidelines. Or has it?
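A quick, made-up sketch of what task parallelism looks like on the CPU: one small data set, several unrelated operations, each running concurrently on its own hardware thread (the particular statistics computed here are arbitrary).

#include <thread>
#include <vector>
#include <numeric>
#include <algorithm>
#include <cstdio>

int main()
{
    std::vector<float> data(1024, 1.5f);     // one small data set
    float sum = 0.0f, minv = 0.0f, maxv = 0.0f;

    // Three different tasks run concurrently on CPU cores -- the kind of
    // heterogeneous workload a GPU is not built for.
    std::thread t1([&] { sum  = std::accumulate(data.begin(), data.end(), 0.0f); });
    std::thread t2([&] { minv = *std::min_element(data.begin(), data.end()); });
    std::thread t3([&] { maxv = *std::max_element(data.begin(), data.end()); });

    t1.join(); t2.join(); t3.join();
    std::printf("sum=%f min=%f max=%f\n", sum, minv, maxv);
    return 0;
}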

The next question to ask is, “What role will the CPU play in your Supercomputer@Home?” Well, make no mistake about it, a very significant role indeed. In the time the GPU has been growing from strength to strength, the CPU has been going through a transition of its own. Things have moved on from the days when clock speeds were treated as marketing tools, and while there is nothing to get excited about, the CPU cannot be ignored, especially if you have to work on problems that involve multitasking or multi-threading. Most non-trivial programming problems require a mix of solutions: some parts may need a data-parallel approach, others a task-parallel one, and then there are problems that can’t be solved by either. The CPU will therefore continue to play a critical role in any supercomputing system you may build.

It’s completely possible to build a supercomputer at home today. But, building a supercomputer is the easy part. Taking advantage of all this power is whole another story. It would involve an approach where the programmer will have to split his/her design so that data parallel parts could be executed on the GPU; at the same time multiple parallel tasks are executed on the CPU. Yes, it would imply a design that is similar to modern game engines, where data is sorted, batched and the entire data is sent across to the GPU to be processed by the rendering system. While this happens parts, or rather tasks of the application can be executed in parallel by conventional multi-threading or using parallel programming (OpenMP) to achieve maximum performance.

A machine with, say, a Tesla GPU or two HD 4870 GPUs in CrossFire, paired with Intel’s new and upcoming Core i7 CPU and oodles of memory, could very well have supercomputing capability. But where would one use so much power? Obviously, computer games are one area; for a game developer like me, you just can’t have too much power. I always manage to stress out everything I have on my machines, however high-end they may be. But seriously, where else could you use all this power? Someone who wants to do heavy-duty scientific calculations could certainly benefit from such a machine. Another use is compression, especially video and audio compression, so people in the video/audio editing business could also benefit from such setups. For the average Joe, however, so much power is all but useless.