Invasion of the multi-core machines.

They are here. They were already here, but now they are really here and can no longer be ignored. Developers and programmers, especially game developers, can no longer afford to sit back and just watch the invasion of machines with multi-core processors. While hardware manufacturers have scaled their processes to bring them to us, software and compilers haven’t scaled equally well. Ironically with current programming methodologies, programming practices and compilers, programmers can’t yet take full advantage of the CPU power thrown at them. It’s not entirely the programmer’s fault, and neither are they limited by their intelligence. The fact is current generation programming models fall short of addressing the multi-core issue in a reliable way. Yes currently there are workarounds and you can definitely use them to get some advantage on a multi-core machine. Applications using even simple threads can benefit from multiple cores, but merely having multiple threads in an application doesn’t mean that application or the threads will run at twice the speed for a dual core CPU. The performance boost for a “normal” multi-threaded application running on a multi-core system will be rather minuscule compared to the computing power of what a multi-core system provides. To be frank all those cores are of little use from the “average joe” programmer’s perspective. That’s just because, if not programmed with care, the advantages provided by those cores is useless, at least in one given application.

Lets look it from a more technical perspective. Most multi threaded applications are not written to use multiple concurrent threads aggressively. A typical multi-threaded application delegates only a minuscule portion of it’s program code to a thread, often called a worker thread. Generally the most CPU intensive or blocking operations in a program are done inside this thread. Consider an example of a web-downloader application. You have one worker thread doing all the downloading while the UI thread waits and processes user input like “Cancel” or “Quit”. Here the program was explicitly designed this way so that the UI can respond to user input, while at the same time the task of downloading a file can go on. In other situations threads may be used for a different purpose. Take the case of the Doofus game. In the game shadow calculations are the most CPU intensive operations, but are not required per frame (, or cycle). So the shadow calculations are done inside a thread at lower priority, typically the calculations are distributed so that the results are obtained per 2 or 3 frames. Whatever the case maybe, the fact remains, such designs are not the most optimal way to program for multi-core machines. In the case of the web-downloader application, one thread waits while one thread does all the work. In the case of the game the situation is a little bit better, but still the task is not optimally distributed between threads. The ideal case would be to have multiple threads share the entire workload of the application so that all the threads are busy all the time. If that were indeed the case and if these threads were to run on separate cores, you would then be able to harness the true power of a multi core machine.

Programming multiple concurrent threads in an application is difficult. Thread synchronization is not for the faint hearted and having a lot of threads in an application can create bugs that are difficult to debug. Traditional multi-threading is synchronized using lock-based methods and lock-based synchronization is prone to problems like deadlocks, livelocks and race conditions. Experienced programmers will try to avoid multi-threading if and when they can and try for simpler and often single threaded solutions. This is not what is advocated by concurrent computing and parallel programming, which clearly can take advantage of multiple cores very effectively. It is true that even with current multi-threaded designs you could benefit from multi-core architecture. Even if such programs internally wouldn’t be able to use the power of multiple cores, the OS can still make full use of the architecture for multi-tasking. To put it in simpler language, multiple applications running at one time will run faster (, note the subtle difference there). For a typical system, applications like anti-virus programs and background processes along with other user applications, running all at once, will definitely benefit from additional cores. This however isn’t very helpful from a game development perspective, since obviously most games are single applications. Games typically take up most of the system resources while running and the advantages of multi-tasking are all but useless while running a game. Games therefore must be able to harness the true power of multiple cores internally. What does that mean? Does it mean a paradigm shift in how games are built? Do we need a special language to do so? Should we shun the current programming practices? Drop C/C++ and look at maybe Erlang or Haskell? Use parallel programming concepts? The questions are many and are increasingly asked by a lot of people. The truth and the solution however, is not quite so simple.

Parallel programming (or parallel computing) is difficult to implement correctly for most non-trivial applications. I am no expert in parallel programming. However from how I look at it, it is difficult to program for full parallelism in a conventional application, and hope to get a performance enhancement. Parallelism advocates the breakdown of a programming problem into independent tasks which can be executed at one time and in parallel. This is not however quite that simple. First, it may not be entirely possible to do so. Not all programming tasks can be broken into independent sub-tasks. For example, a CPU intensive algorithm which has a very liner and result dependent flow can’t be effectively parallelized. Second, and a very difficult problem to overcome is the problem of race conditions. If two tasks are critically dependent, race conditions are unavoidable. To factor in this would mean a loss of performance and could mean losing the multi-core advantage. However let us not be so quick to discount parallel programming. Parallel programming could be used for very special purpose programming, or maybe,,, OK hold a sec there! All you graphics programming guys! Are we not already using parallel programming in our games? Yes we are, and we have been doing it for quite a while now. Actually we have been doing so ever since we have had programmable hardware for graphics. Shader programming is an excellent example of parallel processing. The graphics industry has successfully exploited multiple cores and parallel programming with even main-stream graphics cards today having upwards of 100 shader units. Yes, each shader unit can be thought of as a separate core. GPUs are sometimes and in some instances more powerful than CPUs and can be used for parallel processing because they implement shaders. People are using GPUs to do all kinds of stuff from image processing to running complex AI simulations, check out GPGPU. A similar approach could be applied to next-gen graphics engines using the CPU with multiple cores. It is an interesting thought but an equally challenging one at that. CPUs are not GPUs but such approach could work at least in theory. As I said earlier “fully” parallelism in a program is difficult. However if we could use the approach used in shaders, we could very well parallelize some specific portions of a program. If you are interested in parallel programming, read this entry of mine. There are some libraries and frameworks mentioned which you can use.

Traditional lock-based systems are fraught with problems and require careful planing and implementation from the programmer’s side. Their technical complexity makes them unpopular choices. So the question to ask is, are there really better alternatives to lock-based synchronization? The answer is yes, there are, and they are called, well, lock-free concurrency control mechanisms. One of the most interesting lock-free concurrency control mechanism is STM (Software Transactional Memory). STM is a good choice because compared to lock-based methods it’s a no-brainer! Well almost. I am not getting into details about implementation of STM, but just a brief explanation is I think due. With STM you don’t have to deal with all the problems typically found under lock-based mechanisms. Things like deadlocks, livelocks, race conditions are totally avoided. Under STM a thread just simply writes to a shared memory block regardless of what is happening with others. It could happen that another thread is also writing to the same memory block at the same time. However STM believes that this will rarely happen and if it does, the runtime is smart enough to revert the changes and rerun the transactions one by one until they succeed. However all this comes at a price and a lot of book keeping goes on inside the STM runtime. Keeping track of every single transaction would imply that STM would be rather slow, well it is, but only for now. However, as the number of cores and threads increase, STM performance increases exponentially. So though STM may not be all that great for 2 or 4 core processors, it will be a great choice when we have processors with say 20+ cores. Before I end this note on STM I would just like to add, Intel has been actively developing a C++ STM compiler, you can take a look at it here.

So, as it stands today, we are still in a murky situation when it comes to parallel computing. It is still difficult to program for true parallelism with current generation languages. Yes, some languages like Erlang are very good at parallel processing, but Erlang seems to be a very domain specific language, and rather unsuitable for gaming in general. (Yes you could write a very efficient server in Erlang, or have some other specific functionality written via that language.) Haskell on the other hand is very nice platform to implement concurrent programming. However if you are successful at learning Haskell, let me know how! I tried it, and am still trying it. There is an STM implementation for Haskell, you are welcome to try it. For now however the best you have is OpenMP which can be used if your compiler supports it. It has been used by some successful games, so I have read (, can’t seem to find the link). OK before another power surge erases this entry, let me just end this long post by saying; It’s still pretty unclear which direction we are headed. Some STM implementations look to be the most promising option right now. It will be interesting to see which technology, method and language finally wins and what we adopt in the end.