Multicore Embedded: A New Generation of Architectures is boosting performance without raising temperatures

If you want an idea of where embedded processors are headed, take a look at Texas Instruments' OMAP 3 architecture for the mobile phone market, which was announced last February at the 3GSM World Congress in Barcelona. OMAP 3 devices will store 12-megapixel images with less than a second between shots, handle high-definition video, with support for S-video output to a monitor or projector, and support ever more sophisticated video games. They will also, presumably, handle voice.

Max Baron, a principal analyst for the research firm In-Stat, says that the OMAP 3 architecture is a bellwether for the embedded market because "the mobile phone market represents the largest single market for multicore embedded processors-particularly for higher-end cell phones offering video and other advanced data applications." He calls the architecture "staggering."

The OMAP family has been around since 2001, with the first OMAP 2 processor going into production last year. Like other processor companies, TI announces architectures well ahead of "getting to silicon," and the first OMAP 3 chip, the OMAP 3430, will debut in the second half of 2007. "The leap forward is to the next generation ARM core: the ARM Cortex-A8," says Robert Tolbert, TI's product marketing manager for OMAP. "We've also improved the imaging video and audio accelerator, the IVA 2+, so that it supports HD playback and DVD quality camcording."

So while the new generation doesn't represent a fundamentally new architecture, it does demonstrate how embedded processors are becoming complete systems on a chip. And like many new embedded processor architectures, OMAP 3 is multicore, presenting the functionality of multiple CPUs within the form factor of a single chip. Again, this is nothing new: Freescale Semiconductor's PowerQUICC, which has been around for a decade, combines a special-purpose communications RISC engine with a general-purpose PowerPC processor. The former handles low-level protocols; the latter runs application code. But multicore architectures are growing more capable, with more cores and more transistors packed within each core.

As for the OMAP 3, it has three cores. The first is the general-purpose ARM processor, essentially a CPU, which hosts the operating system and graphical user interface, among other functions. The second is a digital signal processor that accelerates audio and video. The third is a 2D/3D graphics processor, primarily used for video games. From a programmer's standpoint, the operating system runs on the ARM core and everything else interfaces to it via an inter-processor communications link. The application doesn't know or care that it is not entirely running on the ARM. The processor family supports the major mobile phone OSs, including Linux, Symbian OS, and Windows Mobile.
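
The division of labor described above, with a host core handing work to a specialized core over an inter-processor link, can be sketched in a few lines. This is only an illustration under stated assumptions, not TI's actual interface: the names `Mailbox`, `dsp_worker`, and `decode_frame` are hypothetical, and Python threads and queues stand in for the ARM core, the DSP, and the communications link.

```python
import queue
import threading


def dsp_worker(requests, replies):
    """Stand-in for the DSP core: services jobs until it receives None."""
    while True:
        job = requests.get()
        if job is None:  # shutdown sentinel
            break
        job_id, samples = job
        # Placeholder "signal processing": scale each sample by 2.
        replies.put((job_id, [2 * s for s in samples]))


class Mailbox:
    """Hypothetical host-side wrapper around the inter-processor link."""

    def __init__(self):
        self._requests = queue.Queue()
        self._replies = queue.Queue()
        self._next_id = 0
        self._worker = threading.Thread(
            target=dsp_worker, args=(self._requests, self._replies)
        )
        self._worker.start()

    def decode_frame(self, samples):
        """Looks like a local call; the work happens on the other 'core'."""
        job_id = self._next_id
        self._next_id += 1
        self._requests.put((job_id, samples))
        reply_id, result = self._replies.get()
        if reply_id != job_id:
            raise RuntimeError("reply does not match request")
        return result

    def close(self):
        self._requests.put(None)
        self._worker.join()


def run_demo():
    link = Mailbox()
    try:
        return link.decode_frame([1, 2, 3])
    finally:
        link.close()
```

The caller of `decode_frame` sees an ordinary function call, which is the sense in which the application "doesn't know or care" where the work runs.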

"We were multicore when we introduced OMAP, we're multicore today, and will be multicore tomorrow," says Tolbert. The main advantage is that specialized processors use fewer cycles to get specialized tasks done, which in turns leads to relatively lower power consumption-a goal of virtually every embedded application.

To imagine how the three cores play out on an actual mobile phone, it helps to be Japanese: most Americans and their gadgets haven't gotten to this level of complexity. (Most DoCoMo FOMA handsets use an OMAP processor and Japanese companies were the first to adopt OMAP 2.) Tolbert suggests a scenario in which you are watching a video on the handset, while synchronizing your email over the cellular network, while getting an incoming call. "When your call comes in, your video session doesn't completely drop because your accelerator isn't overburdened doing modem applications or email synchronizations: it is only doing video." While it is tempting to think of this device as a tiny laptop replacement, Tolbert doesn't quite go that far: the processing power is comparable only to a Pentium II and local memory storage is lacking. On the other hand, you can plug in a keyboard and monitor and use your phone to drive a PowerPoint presentation.

Symmetric and asymmetric processing

Multicore embedded processing comes in two flavors: asymmetrical, or "heterogeneous," in which each core is designed for a different function, and symmetrical, in which the cores are identical and work on a task in parallel. In-Stat's Max Baron says that the asymmetrical processing represented by the OMAP family is the trend for consumer electronics applications, in which ever more features are crammed into ever smaller devices. "Each core is doing more specialized jobs, but doing them with increasing efficiency," he says. This trend, in turn, has resulted in hundreds of companies producing multicore chips. Many are licensing the ARM or MIPS architecture, or the Power architecture from IBM. Others have designed their own core in-house.

"There used to be one solution to a given problem-now there are many, with multiple vendors offering slightly different variations on each," Baron says. "And it's significantly easier today than 10 years ago for a company to create its own solution-with tools speeding up the process." The results can be dramatic, with the intelligence for an entire mobile phone being put on a single multicore chip. "That means these companies are moving from being merely chip suppliers to suppliers of entire systems." Just as there are hundreds of kinds of cell phones, there will be hundreds of cell phones-on-a-chip.

Baron points out that while high-end, high-functioning, multicore systems are getting most of the attention, lower-end processors are finding their own niche among a burgeoning market for low-cost hardware. "Before Christmas, I bought a DVD player for just $14. Nobody claims this is a high-performance unit, but for many people, that doesn't matter. A lot of people are not very sensitive to sound and image quality. So if you can get away with a codec that requires less processing power, you can use less costly embedded processors. We are now seeing multiple price points for multicore chips-each with its own customer segment."

Asymmetric processing in embedded applications mirrors the kind of off-loading seen on PCs and servers, where dedicated processors for the disk drives and video card take on tasks that would otherwise be handled by the CPU. "Such heterogeneous multiprocessor systems are also common in many embedded applications," says Dan Bouvier, director of advanced processor architecture at Freescale Semiconductor's Networking and Computing Systems Group, in an email exchange. "So the first trend is a continuation of mixing a general-purpose processing function, such as a PowerPC core, with more application-specific offload processors."

By contrast, a processor used for symmetrical applications integrates two or more general-purpose processors, "which maintains software partitioning but still achieves the cost-advantages of system density," says Bouvier. He notes that in embedded applications, traditional single-core instruction-level parallelism (ILP) has run its course, replaced by thread-level parallelism (TLP), in which multiple processors run multiple threads. The programming challenge depends on the nature of the software. Some applications are inherently parallel, "such that they can readily take advantage of multiple cores. Other applications are not so easy to convert. Luckily, the art of programming parallel processors is not new. What is new is packaging the techniques in such a way that lots of legacy code can be adapted to the newer programming paradigms and end up with a net performance advantage."
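
To make the idea of an "inherently parallel" application concrete, here is a minimal sketch in which each unit of work is independent of the others, so the same function can be spread across several threads without changing the result. The function names and the per-packet checksum are illustrative assumptions, not taken from Freescale. In standard CPython, threads show the structure of thread-level parallelism rather than a measured speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def checksum(packet):
    """Independent per-packet work: no packet depends on another."""
    return sum(packet) % 256


def process_serial(packets):
    """Single-threaded baseline."""
    return [checksum(p) for p in packets]


def process_parallel(packets, workers=4):
    """Same work, distributed across a pool of worker threads."""
    # Executor.map preserves input order, so results line up with packets.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(checksum, packets))


# Eight small sample packets of 16 bytes each.
packets = [bytes(range(i, i + 16)) for i in range(8)]
```

Code that is "not so easy to convert" is typically code where one step needs the result of the previous one, so the work cannot simply be handed out this way.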

Such symmetrical processing is represented by Freescale's MSC812x family. Here, a single multicore DSP can replace multiple discrete DSPs in a board, "thus saving system costs, power dissipation and size, and enabling OEMs to increase channel densities or processing performance in a given system," says Freescale's Barry Stern, marketing manager for multicore DSP products, in an email exchange. "Single silicon die is usually more cost-effective than multiple discrete dies because of the silicon overhead and package and ease of software development."

The MSC812x family embeds four DSP cores on a single silicon die, running at 500 MHz, sharing the same packet and external interfaces, as well as both internal memory for application programs and external memory, with all memory available to all cores. Applications include VoIP and wireless infrastructure, as well as IP-based multimedia services. Parallelization for voice and video applications is done through a single instance of the code running in parallel in all four cores, without the need for load balancing. A single development environment allows the programmer to synchronize the debugging of all four cores simultaneously. "The customer only needs to take care of the shared resources of the device, such as memories and external interfaces, and for the tools that are built into the device, such as semaphores, core-to-core communications, multichannel DMA, multiple buffer queues, and instruction cache per core," Stern says.
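
The model described above, with one copy of the same routine on every core, a fixed share of channels per core, and a semaphore around anything shared, might look roughly like the following. This is a sketch under stated assumptions rather than Freescale's software: `run_channels` and `core_main` are hypothetical names, threads stand in for DSP cores, and a Python list stands in for a shared memory buffer.

```python
import threading

NUM_CORES = 4  # matches the four-core part described above


def run_channels(channels):
    """Run one copy of the same routine per 'core' over a static share of channels."""
    shared_log = []                     # stands in for a shared memory buffer
    log_guard = threading.Semaphore(1)  # binary semaphore protecting it

    def core_main(core_id):
        # Each core takes every NUM_CORES-th channel; no load balancer is involved.
        for ch in channels[core_id::NUM_CORES]:
            frame = ch.upper()          # placeholder for per-channel processing
            with log_guard:             # shared resource: one core at a time
                shared_log.append((core_id, frame))

    threads = [
        threading.Thread(target=core_main, args=(i,)) for i in range(NUM_CORES)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared_log
```

Because the channel-to-core assignment is fixed up front, the only coordination the code needs is the semaphore around the shared buffer.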

Room for startups

Among the companies playing in the embedded processor space are relative newcomers like venture-funded P.A. Semi, a 150-person company headed by Dan Dobberpuhl, a lead designer of the DEC Alpha and StrongARM microprocessors, among others. The company announced its PWRficient family after two years of stealth development, with its official Asia launch at the NE Embedded Processor Symposium last November in Tokyo. Sample chips will be available in the third quarter of 2006, with single-core and quad-core versions due in early and late 2007, respectively, all reflecting the four-year design cycle typical of the industry. An eight-core version is planned for 2008. Potential applications are typically rack-mounted and include computing, embedded datacom and telecom infrastructure, storage, and other embedded consumer applications.

Founded in July 2003, P.A. Semi is following a well-traveled trajectory among new processor design companies: starting with a proven design rather than creating one completely from scratch. The family is based on IBM's Power Architecture and will support embedded applications that are hungry for processing power but have power constraints (hence the name, which in English sounds like "power-efficient"). P.A. Semi estimates that upon shipment, the PWRficient family will offer about half the power consumption and twice the performance of comparable Power-based chips. And compared to the current state of the art, the chips are about 10 times more efficient: a dual-core chip running at 2GHz dissipates just 5 to 13 watts, depending upon the application, which could include computing, embedded datacom and telecom, as well as storage. The company also claims the processors will be the first in their class to "integrate what is typically a three- to five-chip-set platform into a single chip" that includes the cores, memory, secondary "southbridge" functions, and high-speed I/O.

Mark Hayter, P.A. Semi's senior director of systems architecture, says that the company saw the Power architecture as the high end of the processing-speed spectrum. "ARM's sweet spot is in much lower devices, with no 64-bit support." As for the x86, he says that integration is lacking. "If you were going the Intel route, you could get a dual-processor core, but you would need a northbridge, a southbridge, and Ethernet interfaces. We're collapsing those chips into one." Hayter says that while most potential applications in this market are for symmetrical processing, some companies are running, say, Linux on one core for control functions and Wind River's VxWorks real-time operating system on another.

Intel dual-core: OS + RTOS

At the other end of the spectrum, the Intel Core Duo dual-core processor has gotten some traction in industrial control applications, where robotic equipment is increasingly being linked on the Internet. Intel and Apple jointly announced the technology last January, and followed up at the Embedded World conference by announcing "extended lifecycle support" to accommodate the harsh operating conditions of a factory floor, as well as a number of board-level products from third-party integrators. In many cases, these embedded platforms are running a real-time operating system on one core and a conventional OS on the other. Intel supports the Core Duo with the Mobile Intel 945GM Express Chipset, which provides enhanced graphics, I/O bandwidth, and memory.

"The challenge for many of these applications is that the nodes are getting connected over the Internet," says Phil Ames, Intel's embedded marketing manager. "For example, if equipment goes down on the manufacturing floor, that once required deploying someone to the spot with diagnostic equipment." The trend now is to make the fix remotely. "I can be sitting in my office in Phoenix and monitor equipment in plants around the world. I can do the diagnostics, upload firmware, reset systems, and change the OS." At Embedded World, Intel demonstrated how a second core running some flavor of Windows or Linux could run in the background to handle the firewall, virus scan, as well as do data backup and encryption, sending the monitoring information back upstream. "Priority goes to the actual application running on the real-time OS."

Intel is also seeing another use for dual-core in which one core acts as a fail-over duplicate of the other. Both cores run the same operating system and process the same instruction set. If one core fails, the other takes over-presumably skipping over the software glitch that triggered the crash in the first place.

Ames says that dual-core processors can host almost twice the number of simultaneous applications as a comparable uniprocessor, with the same heat dissipation. In a demonstration, the company ran two instances of a compute-intensive application that saturated a Pentium M processor, and four of them on the Intel Core Duo processor-hence doubling the performance with the same "thermal envelope." "It stems all the way back to Moore's Law: roughly every 24 months, we can essentially double the number of cores," Ames says. Doing so dramatically increases the performance vector over a comparable single-core processor, while keeping power dissipation in check.

Meanwhile, back on the server/desktop/mobile side, Intel has also announced a new microarchitecture, called Core, for its multicore products-with the first products due the third quarter of this year. This is a case where advances made in the embedded side of the house are working their way to more conventional CPU applications. In a presentation, Justin Rattner, Intel Senior Fellow and chief technology officer, says that Core's virtues were first seen on the Core Duo processor. Now, they stand to extend battery life for laptops, reduce form factors for servers, and lower household power bills for the family PC. At least here, embedded is becoming unembedded.

Sidebar: An interview with Robert Craig, senior developer, QNX's operating systems group

QNX's Robert Craig says that, as far as he knows, the company makes the only real-time operating system that can run symmetrically. That ability has helped give the company a jump start when it comes to multicore development, with the QNX Neutrino MultiCore Technology Development Kit winning top honors at this year's Embedded World in Nuremberg. The company's tools are aimed at symmetrical multiprocessing development, which Craig says is often a better, but less familiar, alternative to the asymmetrical model. He made his case by phone from QNX headquarters in Ottawa.

What's the relationship between multiprocessing on PCs and servers, versus the embedded space?: You're dealing with exactly the same performance issues. The server and desktop markets have been using dual-core processors to increase performance. Exactly the same sort of thing is happening in the embedded space. The big thing with embedded is the smaller form factor: having multiple processors in a single chip reduces your board area. And there are power consumption issues, as well as thermal issues. Generally, the embedded space has had lower performance because higher clock speeds result in thermal problems. Indeed, the push towards multiple cores on a single chip probably should have been driven from the embedded world, because the performance issues are so significant.
What about the challenges for programmers?: It depends on what mode of operation you are using. If you're running asymmetrically, there are a lot of issues associated with getting things set up and working in a multicore environment with a lot of shared resources. For example, you must take care that the operating system on one core doesn't trample on something that belongs to another core. That means making sure that the hardware is properly partitioned. One technique we've seen is using Linux on one core and an RTOS on the other, so that your maintenance and open source software runs on one core while the other core does data processing. But when you use two different operating systems, you have to be very careful about how they sit on the chip: one operating system can still corrupt the other OS, taking the whole chip down. It depends somewhat on the hardware as to how the separation is carried out, but it's certainly an issue that you have to be aware of when you use the asymmetric mode of operation.
Where do tools come in to help?: There's really not very much in the way of toolsets for asymmetric processing. You're essentially talking about JTAG [Joint Test Action Group, an interface used for chip debugging] to connect the cores and deal with them on an individual basis. You have to really know what's going on at that level to prevent conflicts, and the tools are still evolving. There's still a lot of work to be done at the very low hardware levels to give a complete picture of how these cores interact together. By contrast, with symmetric multiprocessing, a single operating system is responsible for all of the hardware on that chip. That means you don't have to worry about resource contention: the operating system takes care of arbitration and prevents applications from doing things that they shouldn't be doing to one another. It's a much more flexible mechanism of operation.
So symmetrical multiprocessing applications are similar, whether embedded or not.: They're pretty similar. The differences come up in terms of power consumption and thermal dissipation. From an operations and programmability point of view, they are almost identical.
Are there differences in the sense that there are fewer resources available in an embedded system?: That depends very much on your application. The Cisco CRS-1 router, for example, has many gigabytes of memory and many, many processors. Generally, when multicore chips are targeted towards high-end applications that demand performance, resources aren't as much of an issue.
What about QNX tools? How do they approach the problem of SMP?: Because our operating system has been SMP capable since its inception, the tools were developed right from the very beginning to be SMP capable. They give you a lot of information: on what CPUs are doing, what CPUs are being used, what percentage of CPU cycles are being used. On multicore environments, the Profiler shows what's happening with threads: the communications that occur, functions being executed at the kernel level, how interrupts are occurring. You see a full picture of what's happening across the whole system, as opposed to just an individual core. That can be critical if you have applications that move from core to core or interact at some level on different cores. You also see where performance bottlenecks can occur. For example, we were porting an Ethernet driver from a uniprocessor to an SMP system. The Profiler identified one resource inside our routing stack as a source of contention: it was flip-flopping back and forth between the individual cores. By simply duplicating that resource so that it wasn't a source of contention, we increased our performance by about 10 percent, which for a routing application is very significant.
What's the learning curve for programmers?: A lot of it depends on how familiar a customer is with the multiprocessing model. Cisco jumped on the SMP bandwagon very easily. Other customers are dealing with a more distributed mode of operation, such as two boards connected with Ethernet, which means the asymmetric approach is more familiar. There's an understandable fear of going to a new type of technology. And there is sometimes a lack of awareness of the benefits you get of one mode of operation over the other.
Where have you seen the most interest in multicore embedded development?: It's across the board: defense customers, medical companies, industrial control and automation. Any application in which performance is limited with the current variety of CPUs is a candidate.
Are customers looking for programming outside the embedded space? Or are they doing a lot of training?: It's hard to say. But my bet is that they'll offer training. It's not necessarily that difficult to learn how to do concurrent programming, but it does require a bit of a shift in how you think about how things are done.
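
The fix Craig describes for the routing stack, duplicating a shared resource so that cores stop fighting over it, follows a general pattern that can be sketched briefly. The code below is not QNX's and makes no performance claim: the function names are hypothetical, a simple packet counter stands in for the contended resource, and threads stand in for cores.

```python
import threading


def count_shared(num_workers, packets_each):
    """All workers update one counter behind one lock (the contended design)."""
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        for _ in range(packets_each):
            with lock:  # every packet forces a trip through the same lock
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def count_per_worker(num_workers, packets_each):
    """Each worker owns a private slot; slots are combined once at the end."""
    slots = [0] * num_workers

    def worker(idx):
        local = 0
        for _ in range(packets_each):
            local += 1  # no lock: nothing else touches this worker's count
        slots[idx] = local

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(slots)
```

Both versions produce the same total; the difference is that the second never makes one worker wait on another during the loop, which is the kind of change a system-wide profiler can point to.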