Threaded Building Blocks: Intel's James Reinders on writing parallel code for C++

With multi-core CPUs becoming the norm, more developers will be grappling with a problem that has been around since the early 1960s: how best to write code that takes full advantage of multiple processors. These days, the number of parallel computing environments seems endless. On a company blog, Intel researcher Tim Mattson named some them: MPI, OpenMP, Ct, HPF, Erlang, Shmemm, Portals, ZPL, BSP, CHARM++, Cilk, Co-array Fortran, PVM, Pthreads, windows threads, Tstreams, GA, Java, UPC, Titanium, Parlog, NESL,Split-C. "The list," he wrote, "goes on and on. If we aren't careful, the result could very well be a 'choice overload' experience with software vendors running away in frustration." His conclusion: "We need to spend less time creating new languages and more time making the languages we have work."

That approach of starting with the familiar is taken by TBB-Threaded Building Blocks-which uses C++ as a foundation for parallel development. The language has become a favorite for developing processor-intensive visual computing applications, including games, that will be a mainstay for Intel's "many-core" Larrabee architecture. Intel first released TBB in mid-2006, then open sourced it in July 2007, licensing TBB under the GNU General Public License.

To learn more about TBB, I spoke with James Reinders, a software engineer and director of marketing for Intel's Software Development Products. Reinders' 2007 book Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism (O'Reilly Media) was published in Japanese in early 2008 and in Chinese last winter. Reinders joined Intel Corporation in 1989, where he has specialized in high performance parallel computing, including development work on the ASCI RED, first TeraFLOP supercomputer. As one of Intel's parallel computing evangelists, he has traveled the world speaking to software developers whose range of experience reflects the broadening interest in harnessing multi-core technology.

Let's start with the basics: why did Intel develop Threading Building Blocks?

Our goal has been to help programmers write more effective parallel programs in C++. People have been writing parallel programs in C++, as well as C, for some time-but they've been doing it at a pretty low level of abstraction. The problem is that, like some programs written in assembly language decades ago, these applications don't hold up as instruction sets change. The advantage of higher level languages like C and C++ is that they can and are still running 20 or 30 years later.

But C++ wasn't designed for parallelism, so it suffers in a number of areas. Algorithms are what usually come to mind: programmers worry about how to write multiple 'while' loops that are dispatched in parallel. Another problem area is container classes. In C++, the standard template library provides a number of data structures, like hash maps and queues. There's a fine art in writing data structures that can be used by multiple parts of a program in parallel without creating a lot of contention. One of the ways is to essentially say: "if I'm working on it, you can't work on it." But if you do too much of that, you create too much contention that will slow down the other threads. The more independently you can do these tasks, the more parallelism and scaling you get. Writing data structures that do this is not something every programmer should have to worry about. It's something that should be supplied by the language. TBB addresses this.

A third area we address is memory allocation. Here again, the libraries that come with C++ weren't designed to do this particularly well, so memory allocation becomes a source of contention. So we've made a thread-safe scalable memory allocator as part of TBB. We also did some simple things like giving you the ability to lock a data structure while you are updating it. If you are working in Windows, you can use a critical section object; in Linux you would use a POSIX mutex. But with C++, there hasn't been a portable, language-specified way to do locks. So we added those.

To be fair, some of this is being worked on elsewhere. For example, the idea of portable locks has been proposed to the C++ Standards Committee. What we've attempted to do with TBB is address this all on one place. There's always more work to be done, but TBB represents by far the most comprehensive solution for pulling C++ forward into being able to handle parallel programs.

What do you say to Tim Mattson's assertion that parallel programming suffers from a "choice overload"?

I think that's fair enough. We already have choice-overload in programming and parallelism will only add to it. Over time, we'll learn which solutions to focus on. TBB is specifically designed to address all the parallel issues for C++, instead of just haphazardly solving a few problems. I'm not aware of any comparable solution to add parallelism to a C++ program. We do have a research project called Ct that will also extend C++. Instead of being broad and comprehensive like TBB, it focuses on a narrower problem involving data parallelism. We hope this might lead to some solutions that are simpler for data parallel programmers than a full TBB implementation. Time will tell.

TBB makes a point of distinguishing "tasks" from "threads". What's the difference?

This is an important part of what I mean by good abstraction. When you are writing a parallel program, most programmers think in terms of hardware threads-each of which the hardware can do at once. But the question beyond that is about load sharing: how to ensure that each thread is accomplishing about the same amount of work. And about what to do if, say, a program designed for a quad core processor runs on an eight-core processor?

If you are thinking only in threads, moving parallel code between these processor environments turns out to be a headache-because we tend to discover parallelism in a program in a nested fashion. At a high level, we might identify three or four things that can be done in parallel. Maybe one of those things is to process 100 images, each of which might constitute enough work to be done in parallel, as well. But if you already broken your program up into threads, how can thread number two, which is bogged down processing images, share its work with thread three?

Over time, we've learned that you shouldn't be writing in threads. You should write in something more abstract that we call tasks. You should be free to create more tasks than there are processor cores or hardware threads, and you should just express parallelism when you see it. For instance, when I want to do a "for loop" in parallel, I should just be able to identify it as such and leave the rest to the compiler. I create a bunch of tasks, I don't care how many, that get the job done. In other words my job as a programmer is to identify tasks that can happen in parallel.

Without having to worry about how many cores are available?

Absolutely. When I'm teaching classes on programming, I tell people that if you have a piece of code that declares the number of available processor cores, you've already failed to do the right thing. Instead, think of your mission as a parallel programmer as identifying work-task by task-that can be parallelized. That's easy to do. Whereas the job of going and finding threads and trying to map them onto a processor-that gets really ugly.

How does TBB accommodate tasks?

Everything that you specify in TBB is at the task level. Then TBB has a runtime that maps them onto threads. A single parallel "for' loop" could potentially generate a thousand tasks. At runtime, TBB will divide that task and start allocating them. As a processor core finishes one task, it goes back to the queue for more work to do.

With this approach, we can then do load balancing, which turns out to be incredibly important because you are never sure which task entails more work than the others. If every processor in the system has to go look for work in the same queue, that in itself becomes a global bottleneck. So TBB has a task scheduler that works with multiple queues. The scheduler is sometimes referred to as a "stealer" because it steals work from other queues, using some really nice algorithms that are highly scalable and don't cause contention.

Over time, we expect that operating systems themselves will have task stealing schedulers built in and implemented in a standardized way, so that multiple languages can us them. In fact, we've been working with Microsoft on what they are calling the Concurrency Runtime. Right now it's a bit of a free-for-all, but TBB is a really big step forward.

So the larger goal is for programmers not to have to assume what hardware resources are available to a program, now and in the future.

Absolutely. Programs should not be written with specific hardware in mind. You don't write a C or C++ program or a Java program trying to know how many registers there are or how big the memory is. Yet people have traditionally written parallel programs to be very tightly tied to the hardware. We need to make the process much more abstract so that the software endures.

Why did you go specifically with C++?

First of all, C and C++ have a lot of very important programs already written in them that will benefit a lot from parallelism. We're not going to all wake up tomorrow and switch to new programming languages-we'll retrofit the ones we have. The other attractive thing about C++ is that it can be extended using templates, which are defined in header files. We've made heavy use of templates in C++ to implement TBB. They are especially useful in generic programming, where you write algorithms that can be applied to a lot of situations-like a sort algorithm that doesn't care what the data elements are. Templates have helped TBB to fit seamlessly into C++, which is what a C++ programmer would expect.

Why did you open source TBB?

We did it to address two customer concerns. The first was that people wanted to know if TBB would be ported everywhere they wanted want it. We've implemented TBB on Windows, Linux and MacOS, including a PowerPC version for older Macs. We've covered Intel processors, and by virtue of doing so, AMD processors as well. Nevertheless, portability remained a concern, and some companies also worried about long-term Intel support and becoming too dependent on a single vendor. Open sourcing addressed both of these issues. It was only a month after we did so that the Autodesk Maya, which is used in Hollywood for special effects, was released using TBB. Other companies have done so, as well.

One unexpected TBB port was to the Xbox.

That's a great story. When we opened the code, we weren't sure how many people would actually do a port. Then a Ukrainian game developer called Deep Shadows ported TBB to Xbox. They were interested in taking advantage of the parallelism found both on the Microsoft console and on PCs. The first we heard of it was when they asked us if we'd like to include the code in the project. It was a complete surprise to us, but we said yes, absolutely. It's a great example of the power of open source. Beyond porting, most of the community contributions have been in the form of feedback-and that's been enormously useful.

How does TBB fit in with the Intel Parallel Composer compilers and libraries? Threading Building Blocks will be included as part of Composer. We view TBB as such a fundamental part of doing parallelism in C++, that it's a natural fit for us.

What about the forthcoming Larrabee processor?

Larrabee is a many-core processor. It's aimed at graphics and it is highly parallel. We expect that when you are writing code on Larrabee, TBB will look inviting. We don't have Larrabee hardware yet-that's coming later this year. But we've done simulations and we think that TBB is going to be an important component to programming Larrabee. While some people will view Larrabee primarily as a graphics processor for use with DirectX or OpenGL, other programmers will chose to program on Larrabee itself, including game development. We'll have to see in the coming years how successful TBB can be-but I don't see anything else that excites me more right now. Meanwhile, we're asking for feedback on any additional features that could make TBB more attractive for Larrabee development.

TBB seems to get compared with another open standard: the OpenMP API.

It gets compared a lot. When I got first got involved, people who are familiar with OpenMP would tell me that there's no advantage in going to TBB. My response is that if you aren't using either, you should probably go with OpenMP if you want keep things as simple as possible. But keep in mind that we wrote TBB because we saw limitations of OpenMP that we didn't see being overcome. The main limitation is that OpenMP fits Fortran and C pretty well, but was not written with C++ in mind. OpenMP can't help you with the data structures I was talking about. It doesn't have any concept of replacing memory allocators or giving portable locks.

Here's my advice. If you are a C++ programmer, you should use TBB. If you are a Fortran programmer you should use OpenMP. If you are a C programmer, then you need to look at whether you can get everything done with Open MP. Jumping to TBB as a C programmer doesn't require that you have a deep understanding of C++, but it does require not only that you have a C++ compiler, but that you be tolerant of C++ syntax in your code, because TBB uses C++ templates.

OpenMP has outperformed TBB in some areas. Is TBB is catching up?

TBB is written to do dynamic scheduling of tasks to threads, whereas OpenMP was aimed at scientific programs that have very well behaved loops that can be statically allocated. If, for example, you have a loop that goes from one to a million-you could simply give a quarter-million to each of four cores. Whereas TBB might give an eighth of a million to each core and when they are done, give a sixteenth of a million to each, and so on, because it is constantly thinking about load balancing. That turns out to be a better algorithm for a general purpose PC that might also be running other programs at the same time. But in a dedicated environment, the slight overhead would make OpenMP faster.

But TBB is indeed catching up. With the 2.1 release this summer, we improved the way TBB uses memory caches to take better advantage of data left over from the prior loop. In some cases, we've seen fantastic speed-ups: one benchmark ran eight times faster. Now, we are not aware of any programs that run faster using OpenMP in a real world situation than with TBB.

As a parallel computing evangelist for Intel, you travel a lot. Who do you talk to?

Usually when I travel I have a specific purpose-like attending a conference. But then I usually allow myself three or four extra days to make customer visits. That's been a fun part of the job because I never know quite what to expect. Sometimes, we get people who are brand new to parallelism; other times, the room is filled with deep experts. When I was in Australia last week, I got to talk with people who have been doing heavy duty scientific analysis for decades. One group is hoping that Australia will eventually win a bid to build a square kilometer radio telescope. On the flip side, I meet with customers who aren't yet doing any parallelism, and I try to explain it to them in a way that doesn't run them over.

Have you been to Japan?

Yes, a few times, including a couple of days at Sony headquarters where I spoke with a gathering of engineers. There's been a lot of Japanese interest in TBB and my book has sold very well there in translation, which delighted all of us. In addition, someone in Japan wrote his own book on TBB. We're thrilled about that: Japan is the only country where that's happened. To my knowledge, he never spoke with anyone at Intel, and if any of your readers know who he is, I'd like to hear from them.

With TBB and other tools making parallel development easier, where do you see the technology going?

At a really high level, I think we have the opportunity to make computers more usable. I should say here that some technologists disagree-arguing that anything done in parallel could have been done as well with a really fast single core. But I think that, in hindsight, a lot of sequential programs were written that way only out of convenience, not because that was the best approach.

For example, I may have 1,000 emails in my inbox. But the mail program will still freeze while it tries to download two more from the server, not letting go until it is convinced it is synchronized with the server. But consider that if I had received 1,000 pieces of mail from the Postal Service, adding two more would not interfere with my ability to read the ones I have. That kind of parallelism is implicit in our lives. We also have parallel processing going on in our heads. We "pre-process" visual and auditory information at the same time. But most computers that do voice recognition couldn't, until recently, do something else simultaneously. Computers don't tend to prioritize their user interface, even though the human sitting in front of them ought to be given the priority. Parallelism will help.