A Short Guide to Mastering Thread-Safety
When I am talking on forums or to some of my students, from time to time I realize that I have gotten so used to expecting a certain vocabulary and certain concepts to be known to others that it isn’t even funny. In German we call this being betriebsblind, which basically means nothing more than expecting everyone to know what you know (at least to a certain extent), because the knowledge is so essential for what you do every day. The concept of Thread-Safety definitely belongs into this category for me, as I usually have to force myself to explain it and its implications to my students new to parallel programming, because it seems so trivial. Unfortunately, it is not. Therefore, I will spend the rest of this article trying to explain what Thread-Safety means, why it is important and also highlight some ways to achieve it without throwing your performance out of the window. I would also like to issue the usual warning to my more advanced readers: the beginning of this article is most definitely not for you, the end may be worth checking out as well :P.
What is Thread-Safety?
One of the defining properties of threads is, that multiple threads have access to the same address space. Therefore, they are able to write to the exact same memory location at the same time. I have explained why this is a problem in one of my earlier articles about mutual exclusion, therefore I can use the short version here:
Writing to the same memory location from multiple threads at the same time without any form of protection is like playing Russian Roulette: you may survive at first, but sooner or later you run out of luck and the results are disastrous.
You may quote me on that 8). For a more thorough explanation, please read the former article. Now that we have this part established, the rest is easy:
Thread-safe code is basically any code that can be called from multiple threads concurrently without breaking.
Or phrased differently: if your code does not access the same memory location from multiple threads without protection, it is most likely thread-safe. We can have thread-safe functions, thread-safe objects or thread-safe programs, although the usage of the term is most common when talking about functions.
So much for the easy part. The difficult part is actually spotting these accesses to the same memory location in your code and fixing them without loosing too much performance. But contrary to what some message-passing fans may want you to believe, this is not impossible. My students usually get it right after we have shown it to them in their programs once or twice. And with experience you will notice the little voice in the back of your head that warns you about it every time you want to write to a shared variable or call a function that may not be thread-safe. And if you don’t want to bet your career on little voices (probably a smart move on your part), let’s not forget about the high quality tools that are available to help you with this as well, e.g. the Intel Thread-Checker or the Sun Data Race Detection Tool. I have had very good experiences with the former tool and as soon as I find some time I will also try out the second one (especially since it is free – on the other hand the Intel tool is very reasonably priced as well, especially for universities).
How-to Achieve Thread-Safety
OK, so there are ways to reliably find the dangerous spots in your code. What do we do, once we have found them? Lets say we have isolated a function that is not thread-safe because it was written at a time when parallelism was not an issue. We have multiple ways to deal with it, some of which are highlighted in the following list:
- Put a big, fat lock right at the start of the function and unlock it at the end of the function. This is the easy solution, because it is doable in every threading system I know and it always works. Unfortunately, it is also the one which may cost you the most performance, since only one thread can carry out the function at the same time now. On the other hand, I have written the magic word may in the last sentence not without a reason: if the function is only rarely called, that big, fat lock may not be a problem for performance at all. Your profiler will tell you, especially if it knows about parallelism.
- Put one or many smaller locks in the function around the data that actually need protection. This finer grained approach may work better than the one sketched above, especially when there are only a few places in the function that actually need protection and there are larger areas that are thread-safe and can be carried out concurrently. It may also turn out to be a bad idea (TM), because setting and unsetting the locks also takes some time, even when there is no contention. As you can see by my repeated usage of the fine word may, there is no way to know which technique is better without taking a look at the actual code, all I am telling you here are just ideas which may or may not work.
- Protect the data instead of the code. No, I am not senile (yet), this is actually a different advice than the one I just gave you. Instead of protecting a small region of code with a lock where critical data is accessed, associate a lock with the critical data in question. This may allow you to use two different locks for the same region of code but when working on different data, allowing concurrent execution by multiple threads. An example: When you have written a function that multiplies each element of an array by two, you can associate a different lock to different arrays, allowing two threads to carry out the function concurrently when working on two different arrays.
- Use atomic operations to change the data that need protection. The availability of these operations varies from system to system, e.g. POSIX Threads does not have them at all, OpenMP gives you the possibility to mark certain simple operations on plain old data-types as atomic via the atomic-directive and in Java you can declare certain variables as atomic and all operations involving that variable will be carried out atomically. Check the documentation of your parallel programming system of choice and if you have atomic operations, it is most likely a good idea to use them (gosh, I cannot believe I wrote that whole bullet point without using may even once! :))
- Change your functions to only work on thread-private data. This may sound harder than it actually is, but many times shared data are not really necessary. Even if they may seem necessary, there is little trick that can change this quite often: delegate the shared data to the caller of the function, which passes it in as a parameter (most likely a pointer or a reference in C++) and document that the caller is responsible for protecting that data before the function is called. This may sound not like a solution, but rather like a delegation of the whole problem. But considering that the caller may have more knowledge about the present state of the program and its data, it makes a whole lot of sense. As an example, the caller may know, that no other thread presently calls that function with the same data and therefore no protection from concurrent access is necessary. Of course, the moment you employ this last technique, the function is not thread-safe anymore, so make really sure to document this properly! It also changes the interface of the function, which may not be doable for your particular case.
Further Reading
If you want to know more about thread-safety and related topics, I wholeheartedly recommend what I like to call the POSIX Threads Bible by David Butenhof – Programming with POSIX Threads. Or if you are more into Java, I hear Java Concurrency in Practice is a very good read. I have not really found a great book on Thread-Programming with C# – the best one appears to be Programming C#, although it contains only a single chapter on threaded programming. If you find a better one, please let me know. All of them contain useful information on Thread-Safety as well.
Summary
This closes my little trip into the world of Thread-Safety. I hope to have brought across, that Thread-Safety is a challenging problem, but can be managed with a little experience and the right tools. I also tried to show you some techniques, that may (that was the last time I used that verb for today, I promise :P) be useful to achieve Thread-Safety for your code.
The name-patron of my old high-school Wilhelm Ostwald had a famous maxim, which can be (roughly) translated into English like this:
Good theory must soon become practical. That is how its value can be seen.
Following this spirit, I am planning on writing a followup article on this topic, explaining how the techniques introduced here may (guess I lied :twisted:) be used to make the rand()-function thread-safe (and explain in the process, why its performance often sucks in multi-threaded code).
P.S.: This essay was sponsored by the word may, thanks for listening! 😉