Multi-Threaded Challenges in the Game Space – a Conversation with Tom Leonard of Valve Fame
A while ago I posted some comments on the press coverage of an event by Valve, where they explained threading in the Source Engine to a couple of tech journalists. Unfortunately, the coverage left many questions open, and some things even appeared to be wrong. Fortunately, Tom Leonard, the threading guru at Valve, is a nice guy and agreed to answer some of my questions via email. This post contains the conversation we had.
Michael: Both articles claim that existing threading engines were inefficient when it came to games. Inefficient is a broad description: which engines did you evaluate, and what exactly made them unsuitable for your game? Latency? Thread-switching costs? Some other kind of overhead? Or something else?
Tom: At the OS level, while we’re obliged at times to use OS features to play nice (e.g., Windows events), we generally avoid them unless totally necessary, because their cost (in time) is very hard to predict. For example, a hand-built mutex using an atomic compare-exchange spin loop with a custom back-off is way more efficient and predictable than a Windows critical section.
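To make that concrete, here is a minimal sketch of such a hand-built lock in today’s C++ – my illustration of the technique Tom describes, not Valve’s actual code (which predates std::atomic and would use the platform’s interlocked intrinsics directly):

    #include <atomic>
    #include <thread>

    // A spin lock built on an atomic compare-exchange loop with a simple
    // back-off. The uncontended path is a single atomic instruction and
    // never enters the kernel, which keeps its cost predictable.
    class SpinLock
    {
    public:
        void Lock()
        {
            int spins = 0;
            bool expected = false;
            while ( !m_locked.compare_exchange_weak( expected, true,
                        std::memory_order_acquire ) )
            {
                expected = false; // the failed CAS stored the observed value here
                // Back-off: busy-spin for a while, then start yielding the
                // time slice to behave well under heavy contention.
                if ( ++spins > 4000 )
                    std::this_thread::yield();
            }
        }

        void Unlock()
        {
            m_locked.store( false, std::memory_order_release );
        }

    private:
        std::atomic<bool> m_locked{ false };
    };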
Michael: How much work is involved in having to maintain your own threading engine along with the game? As far as I know, the Source engine even runs on multiple platforms (e.g. on the Xbox 360, which is built on PowerPC if I am not mistaken), which makes the situation even more challenging, right?
Tom: The underlying framework on the 360 is a subset of the one on the PC. The 360 tends to differ in that the code being threaded there tends to have a stronger opinion about which core it should be assigned to.
Michael: Maybe I don’t understand enough about game internals and the challenges you face in game programming to really understand that decision. What exactly is so different about games that you had to invent your own system for it? Scarce resources (especially on consoles)? Soft realtime requirements? Or something else?
Tom: Furthermore, we know so much more about what we’re doing at a macroscopic scale that we can generally make much better decisions about when and where to execute code than a general-purpose library could.
Michael: How does your threading engine cope with these problems? From the articles it appears you are working on a very low level (assembly) to solve them – but the usual threading engines can do that as well…
Michael: Is the assumption I made in my comments that you are not using lock-free algorithms correct?
Tom: Based on these primitives, I’ve built a set of higher-level systems to help programmers express things in a lock-free manner. Sometimes on Windows the code has to go into a wait state to cooperate with other processes. But on the 360 the code can run full-throttle lock-free. I’ve also used these tools to make all of our core systems (allocators, resource management, world representations) lock-free.
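To give a feel for these building blocks, here is a sketch of the classic lock-free (Treiber) stack in today’s C++ – again my illustration, not Valve’s code. A production version would pack a change counter next to the head pointer into one wide compare-exchange to defend against the ABA problem:

    #include <atomic>

    // Intrusive node: objects that want to live on the stack embed one.
    struct Node
    {
        Node *pNext;
    };

    class LockFreeStack
    {
    public:
        void Push( Node *pNode )
        {
            Node *pHead = m_pHead.load( std::memory_order_relaxed );
            do
            {
                pNode->pNext = pHead;
                // Retry if another thread moved the head in the meantime.
            } while ( !m_pHead.compare_exchange_weak( pHead, pNode,
                          std::memory_order_release,
                          std::memory_order_relaxed ) );
        }

        Node *Pop()
        {
            // Note: reading pHead->pNext below is only safe if nodes are
            // not freed while other threads may still hold them (the
            // memory-reclamation half of the ABA problem).
            Node *pHead = m_pHead.load( std::memory_order_acquire );
            while ( pHead &&
                    !m_pHead.compare_exchange_weak( pHead, pHead->pNext,
                        std::memory_order_acquire,
                        std::memory_order_acquire ) )
            {
                // pHead was reloaded by the failed CAS; just retry.
            }
            return pHead; // nullptr if the stack was empty
        }

    private:
        std::atomic<Node *> m_pHead{ nullptr };
    };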
In the case of the presentation last month, for the purposes of communicating to the layperson, it was easier to overload the term “lock-free” in a general sense than to use the more specific “lock-free”, since the former presupposes the latter, and the latter is only interesting if you’ve committed to the former as a design goal.
I’ve never really tried to articulate this idea, so forgive me if this doesn’t make much sense. I’m of the opinion that the very strict “lock-free” definition is interesting for people implementing the building blocks, but only marginally useful to general programmers. Really applying lock-free algorithms requires a corresponding code and data design philosophy in higher-level code, which is where I’ve been spending most of my effort.
For me personally, at times it reminds me of the sort of mental shift I had to go through in the late eighties as OO started surpassing structured programming. To be successful you have to get your head into this space of non-deterministic code whose reality might switch from instruction to instruction, where you leverage inference rather than locks to know part of the system is stable.
For example, the work distribution system sends prioritized jobs out to pools of threads or specific threads. Using a pairing of a set of lock-free queues and an interlocked integer, it’s able to support multiple providers and multiple servicers with good scalability. While the lock-free queues and stacks were a precondition to achieving this, the most interesting step was understanding how the API needed to express itself to allow it (e.g., a few discrete priorities), and how the worker code needed to be structured (i.e., making the fact that a worker gets dibs on a work unit disjoint from the actual unit it ends up processing).
I’m sure this sort of stuff is well understood in domains where threading has been prevalent. It’s new to gaming.
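Reading between the lines, the shape of such a distributor might look roughly like this – a speculative sketch of the pattern Tom describes, reusing the LockFreeStack from above as the per-priority container, not Valve’s implementation:

    #include <atomic>

    enum JobPriority { JP_HIGH, JP_NORMAL, JP_LOW, JP_NUM_PRIORITIES };

    // Hypothetical job record; embeds the intrusive Node from the stack
    // sketch above so it can sit in a lock-free container.
    struct JobNode : Node
    {
        void (*pfnRun)( void *pContext );
        void *pContext;
    };

    class WorkDistributor
    {
    public:
        void AddJob( JobNode *pJob, JobPriority pri )
        {
            m_pools[pri].Push( pJob );                         // lock-free
            m_nPending.fetch_add( 1, std::memory_order_release );
        }

        JobNode *GetNextJob()
        {
            // First win "dibs" on one pending unit, whichever it is, by
            // decrementing the interlocked counter...
            int n = m_nPending.load( std::memory_order_acquire );
            while ( n > 0 && !m_nPending.compare_exchange_weak( n, n - 1 ) )
            {
                // n was reloaded by the failed CAS; retry.
            }
            if ( n <= 0 )
                return nullptr; // nothing pending

            // ...then service whatever sits in the highest-priority pool.
            // Jobs are pushed before the counter is incremented, so a
            // successful claim means a job exists somewhere; scan until
            // we find it.
            for ( ;; )
            {
                for ( int i = 0; i < JP_NUM_PRIORITIES; ++i )
                    if ( Node *pNode = m_pools[i].Pop() )
                        return static_cast<JobNode *>( pNode );
            }
        }

    private:
        LockFreeStack    m_pools[JP_NUM_PRIORITIES];
        std::atomic<int> m_nPending{ 0 };
    };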
Michael: This I find very interesting. It sounds like you ended up implementing a whole lot of functionality lock-free. Personally, I have only very limited experience with lock-free programming. I generally try to stay portable, because we have access to a lot of different machines – a goal that does not mix well with lock-free programming, especially since I have not found any good, portable lock-free libraries. Some time ago, however, I did an experiment with a small lock-free library and compared a lock-free taskpool with a hand-written one (which used thread-local storage in parts to avoid locking altogether). Under heavy load the results were strongly in favor of the hand-written version, which convinced me at the time that lock-free was not something I had to get into immediately – these CAS operations still take a lot of time. Maybe I will have to revisit that decision sometime, especially since I was of course comparing apples and oranges: the lock-free version did not get to use thread-local storage.
But related to this, I have another question: avoiding locks in the first place by making certain parts of the problem private to each thread is a technique that has worked for me many times. I could imagine that it could work in a game engine as well, e.g. by partitioning the world into smaller pieces that only one thread ends up processing. Are you doing something similar? Or is that not feasible for an FPS?
Tom: Another pattern that seems to recur is some sort of lock-free system to get a “ticket” for a given instance of work or of a resource, and then run that work on the independent instance represented by that ticket. The pool of instances may be limited, but with the right tuning any real contention can be kept low.
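One way such a ticket system could look – a hypothetical sketch, with kMaxInstances and WorkState invented for illustration:

    #include <atomic>

    const int kMaxInstances = 8;

    // Hypothetical per-instance state; whoever holds the ticket owns it
    // privately, so no locking is needed while working on it.
    struct WorkState
    {
        char scratch[4096];
    };

    class TicketedPool
    {
    public:
        // Claim an instance with one CAS per probe; returns its index,
        // or -1 if all instances are currently handed out.
        int AcquireTicket()
        {
            for ( int i = 0; i < kMaxInstances; ++i )
            {
                bool expected = false;
                if ( m_inUse[i].compare_exchange_strong( expected, true,
                         std::memory_order_acquire ) )
                    return i;
            }
            return -1; // pool exhausted; with the right tuning this is rare
        }

        WorkState &Instance( int ticket ) { return m_instances[ticket]; }

        void ReleaseTicket( int ticket )
        {
            m_inUse[ticket].store( false, std::memory_order_release );
        }

    private:
        std::atomic<bool> m_inUse[kMaxInstances] = {};
        WorkState         m_instances[kMaxInstances];
    };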
Michael: My previous questions were all about the internals of your threading system, but obviously a lot of work went into making it easy to use, an area I personally find very important and challenging. From a very broad perspective, what did you and your users find so easy to work with in the system? Or asked differently: what does your threading system look like from a user’s perspective? Were you even able to hide threading from most developers – and if yes, how?
Tom: A simple one-off push to another core looks like this:
    if ( !IsEngineThreaded() )
        _Host_RunFrame_Server( numticks );
    else
        ThreadExecute( _Host_RunFrame_Server, numticks );
A simple parallel loop:
    extern void ProcessPSystem( CNewParticleEffect *&pNewEffect );
    if ( particlesToSimulate.Count() )
        ParallelProcess( particlesToSimulate.Base(),
                         particlesToSimulate.Count(), ProcessPSystem );
Queue up a bunch of work items, wait for them to complete:
    BeginExecuteParallel()
        ExecuteParallel( g_pFoo, &CFoo::Func1, arg1 );
        ExecuteParallel( &Bar, arg2, arg3 );
    EndExecuteParallel()
Semi-manual. This one is interesting, as the code uses similar template techniques to create a queue of journaled graphics calls and then executes the whole batch on another core:
    m_pActiveAsyncJob = g_pThreadPool->QueueCall( this,
        &CMaterialSystem::ThreadExecuteQueuedContext,
        pPrevContext );
Very manual:
    CJob *pJob = new CFileAsyncWriteJob( pFileName, pSrc,
        nSrcBytes, bFreeMemory, bAppend );
    m_pThreadPool->AddJob( pJob );
Now, I realize that this isn’t much to look at. But it is pretty straightforward to most programmers, and under the hood (in CThreadPool) there is a unified set of profiling and debugging tools that programmers can leverage.
Michael: Since I am also a member of the OpenMP language committee, I cannot resist the temptation to ask: did you consider using OpenMP? And if yes, what kept you from doing so? I could imagine that coupling the ease of use of OpenMP with a custom-built OpenMP runtime library optimized for games could be a viable approach to overcome some of your problems – but then again, I don’t really know enough about the challenges of game programming to have more than a speculative opinion on that :-).
Tom: Also, the type of problem OpenMP is good at, loop parallelization, is only a subset of the ways in which we want to distribute work.
Finally, we have a category of programmers here who are really game designers who program, and I felt the decoration of OpenMP would be a barrier for some of those people. This is exacerbated by the fact that OpenMP uses pragmas, which means more complicated usages can’t be hidden behind preprocessor macros, the common way of mitigating that sort of issue.
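For contrast, here is how the particle loop from earlier in this conversation would read with OpenMP (my transliteration, not Valve’s code; SimulateParticles is an invented wrapper). The parallelism lives in a pragma, which is exactly why it cannot be tucked away behind a normal preprocessor macro:

    class CNewParticleEffect;
    extern void ProcessPSystem( CNewParticleEffect *&pNewEffect );

    void SimulateParticles( CNewParticleEffect **ppEffects, int nCount )
    {
        // Compiled with /openmp (MSVC) or -fopenmp (GCC), this splits
        // the loop iterations across the available threads.
        #pragma omp parallel for
        for ( int i = 0; i < nCount; ++i )
            ProcessPSystem( ppEffects[i] );
    }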
Michael: Thanks Tom, that was very enlightening!