Linux Clusters: Supercomputing for the Masses

According to the latest "Top 500 Supercomputer" list (www.top500.org), the fastest computer in the world is Earth Simulator. Residing at the Earth Simulator Center in Yokohama, which predicts global climatic changes, the NEC-built system represents a performance breakthrough: 35.86 trillion floating point operations per second (teraflops), peak performance---almost five times faster than the runner-up and former champion, Lawrence Livermore National Laboratory's ASCI White, manufactured by IBM. Earth Simulator is not only fast, but big. Its 640 nodes, each with eight vector processors, are arrayed on three floors, occupying the area of four tennis courts. While NEC hasn't disclosed the cost, a competing IBM system estimated for completion in 2009 is projected to cost $224.4 million.

But as fast as the NEC machine is, its architecture represents but one version of the future. A look down the list shows that of the top 50 supercomputers, 10 are cluster systems, and two of those are Linux-based: the Universitaet Heidelberg's Megaware-developed cluster, ranked 35th, with .825 Tflops/s, and the Cplant machine at Sandia National Laboratories in New Mexico, ranked 50th with .706 Tflops/s. Just a few years ago, the thought of a Linux cluster on the Top 500 list would have been laughable. Now the number is growing, with a top-10 ranking possible next year.

"Linux clustering is by far the most interesting and exciting project I've been involved with," says Rob Pennington, senior associate director for computing and data management at the National Center for Supercomputing Applications. "Clustering combines the best of what's happening at the grass roots level, in the research labs, and in the commercial marketplace," he says. When Pennington first got involved with clustering in the late 80s, developers were immersed in the demands of parallelization, "hand coding every little thing that happened on each machine and spending enormous amounts of time trying to do synchronization coordination." Now, the parallelization task has been largely automated and the application development foundation has stabilized, making porting easier.

Built in 1997 and later expanded, Cplant---the current U.S. Linux cluster leader---has been called an "off-the-shelf" supercomputer. Compared to Sandia's Intel-built ASCI Red system (ranked number 7 with 2.38 Tflops/s), Cplant is a garage project, the equivalent of a private jet powered by multiple Volkswagen engines. But Cplant also has a different mandate within the lab. Like other ASCI machines at the U.S. national laboratories, ASCI Red primarily simulates nuclear testing. By contrast, the lab views Cplant as a cost-effective platform for what it calls "capacity computing." "On the ASCI machines, we typically run very large simulations---using 512 processors and beyond," says Ron Brightwell, principal member of Sandia's technical staff. "These applications require lots of nodes and all the resources that the machine can deliver. With capacity computing, we're talking about smaller scale jobs using less than 512 nodes, and having shorter run times."

The range of applications hosted by Cplant includes biomedical applications, genetic algorithms, chemistry codes, and fluid dynamics. "For a lot of the academic and research institutions that are not looking at ultimate scale kinds of platforms, the Linux clusters are going to be the platform of choice," Brightwell says. "They have been proven to be very cost effective, and the Linux community is quickly developing the infrastructure to install and use them."

The world's fastest Linux cluster

What's striking about Linux clusters is how much performance they've delivered using an open source operating system running on commodity hardware. "When I started in supercomputing there was Cray and there was Cray," said Dave Dixon, associate director of theory, modeling and simulation at Pacific Northwest National Laboratory (PNNL), on a conference call last April. He was helping to announce what will be the world's largest Linux computer, a 1,400-node cluster composed of Intel McKinley and Madison processors, developed by Hewlett-Packard.

PNNL's Linux cluster is slated to be up and running in 2003, and is expected to exceed 8.3 Tflops/s---which would give it the number two ranking were it in operation today. Costing $24.5 million, it will be the world's largest Linux cluster. "This will be the first really large Linux cluster to be deployed in a production environment," says Scott Studham, technical group leader, computer operations, at PNNL's molecular science computing facility.

"The brainpower behind the open source community achieved critical mass a few years ago and it is a huge wonderful beast in and of itself," Studham says. "With Linux, we can choose when we want to do an install, rather than having a vendor dictate the timing. We can select from four or five different installation processes, and we can see the source. If we don't like something, we can change it. And we do." This is PNNL's second Linux cluster. The first, a 240-processor machine, was considered a one-year experiment. Within months, it proved successful enough to become a production machine.

PNNL uses the Global Arrays toolkit to simplify development by presenting distributed memory as a global address space. On its current Linux cluster, workloads are managed with the Maui Scheduler, but the HP system will use Platform Computing's Platform LSF. PNNL will use the new system primarily for computational chemistry tasks, running on its own NWChem package. Projects will include simulating the reaction of radioactive material stored at the Department of Energy's Hanford, Washington site (Hanford is currently engaged in the world's largest cleanup of nuclear wastes) and simulating protein interactions for cancer research. Studham wasn't ready to release performance numbers on a 44-node pilot project. "But they look awesome," he says.

Beowulf and its progeny

All Linux clusters are descendants of a single machine, called Beowulf, developed by Thomas Sterling and Don Becker in 1994. The two worked for the Center of Excellence in Space Data and Information Sciences (CESDIS), a division of the Universities Space Research Association, which is located at the Goddard Space Flight Center in Greenbelt, Maryland. Beowulf used 16 Intel DX4 processors and custom Linux Ethernet drivers, which enabled network traffic to travel across two or more separate Ethernet networks---an advantage in the days of 10 Mbit/s Ethernet.

Beowulf addressed the traditional access problem with supercomputers, in which an abundance of compute cycles go to a relatively small number of competing projects. Beowulf and its progeny have become a low-cost alternative---a sort of supercomputer for the masses---simple enough for hardware-savvy engineers to build themselves. By 1996, both NASA and the Dept. of Energy were demonstrating clusters that cost less than $50,000 and provided more than a gigaflop/sec of performance. The Beowulf project is now hosted by Scyld Computing Corp., which is marketing a cluster operating system for commercial application.

At Los Alamos National Laboratory in New Mexico, Wu Feng and colleagues Michael Warren and Eric Weigle have developed a variant on the Beowulf cluster called "the bladed Beowulf." While acknowledging the cost-savings advantages of the architecture, they argue that a Beowulf cluster is hardly free. "Our own 128-CPU VALinux cluster, dubbed Little Blue Penguin (LBP), took days (and arguably weeks) to install, integrate, and configure properly and initially required daily intervention and maintenance by the technical staff," they wrote. "Even after the system stabilized, the LBP cluster generally required weekly to monthly maintenance due to the lack of reliability (or robustness) of the commodity off-the-shelf hardware and software; this included mundane tasks such as rebooting the cluster due to a system hang or system crash."

The new machine, called "Green Destiny," is an attempt to reduce downtime by lowering the heat. It takes the radical step of using 240 667MHz low-powered Transmeta CPUs. Transmeta chips trade performance for low heat dissipation and have been employed by Fujitsu and Sony in laptops. The processors are mounted onto half-inch motherboards or "blades." Twenty-four blades are mounted on a chassis. Ten chassis, together with network switches, are mounted in a standard computer rack.

Performing at a peak rate of just .16 Tflops/s, Green Destiny scarcely qualifies as a supercomputer. But it uses just ten percent of the electricity and twenty-five percent of the space of comparable clusters, and that, says Feng, translates into higher reliability. Indeed, Feng argues that raw benchmark comparisons are misleading. Downtime must also be factored in---and so far, Green Destiny has proved remarkably stable. "If you want to solve a problem that you can complete on a traditional supercomputer faster than the mean time between failures, I guarantee we'll run slower than a traditional supercomputer or virtually any other cluster," Feng said in a statement. "But if you're running something that takes weeks or months, eventually the stable machine will win the race." Feng argues that the heat dissipation problem will worsen as chips grow more dense, following Moore's law. He predicts that the microprocessor of 2010, crammed with more than a billion transistors, will emit over one kilowatt of thermal energy---more energy per square centimeter than a nuclear reactor.
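Feng's race between a fast machine and a stable one can be made concrete with a rough model. The sketch below is only an illustration, not Feng's own analysis. It assumes failures arrive at random (exponentially distributed), that every failure forces the job to restart from the beginning with no checkpointing, and that restarting costs no extra time. Under those assumptions the expected time to finish a job needing T hours of uninterrupted computing on a machine with mean time between failures M is M * (exp(T / M) - 1). The function name and all of the hours below are hypothetical.

```python
import math


def expected_hours(run_hours, mtbf_hours):
    """Expected wall-clock hours to finish a job that needs `run_hours` of
    uninterrupted computing, assuming exponentially distributed failures and
    a full restart (no checkpointing, no restart overhead) after each one."""
    rate = 1.0 / mtbf_hours  # failures per hour
    return (math.exp(rate * run_hours) - 1.0) / rate


# Hypothetical machines: the "fast" one is 2.5 times quicker but fails about
# every 50 hours; the "stable" one fails about every 2,000 hours.

# Short job: 1 hour on the fast machine, 2.5 hours on the stable one.
fast_short = expected_hours(run_hours=1, mtbf_hours=50)
stable_short = expected_hours(run_hours=2.5, mtbf_hours=2000)

# Long job: 100 hours on the fast machine, 250 hours on the stable one.
fast_long = expected_hours(run_hours=100, mtbf_hours=50)
stable_long = expected_hours(run_hours=250, mtbf_hours=2000)

print(f"short job: fast {fast_short:.2f} h, stable {stable_short:.2f} h")
print(f"long job:  fast {fast_long:.1f} h, stable {stable_long:.1f} h")
```

With these made-up figures the fast machine wins the short job easily, but on the long job its restarts cost more than its speed gains, and the slower, stable machine finishes first. Real systems use checkpointing, which narrows the gap, so the model overstates the penalty.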

In an e-mail correspondence, Feng wrote that a Bladed Beowulf cluster is composed of "cheap (high volume) commodity parts that can be purchased by common folks to build their own 'supercomputer'---it's do-it-yourself supercomputing in a small space." He said that "a Bladed Beowulf compute node is different from a traditional Beowulf compute node in that the former fits on a motherboard that is roughly 0.5" wide, 5" tall, and 15" deep while the latter typically measures at least 19" wide, 1.75" tall, and 20" deep in a rack-mount configuration. Our Bladed Beowulf requires NO special cooling facilities. . . . In fact, our Bladed Beowulf operates in a warehouse where the ambient temperature routinely hits 80-85 degrees.

"Our first Bladed Beowulf, MetaBlade, effectively took only three man-hours to put together, install, integrate, and get Mike Warren's N-body cosmology code [which simulates the earth's origins] up and running. We've had no failures since September 2001."

Linus Torvalds and Gordon Bell, known for his work on the VAX minicomputer, both attended the formal unveiling of Green Destiny. But not all computer scientists are impressed. "If I needed to have a supercomputer for my closet---it's great," says PNNL's Scott Studham. But he doesn't think the cooling problem requires such a radical solution. "You don't re-architect and remove half the performance of a $25 million computer because you have a $1 million problem."

Los Alamos spokesman Jim Danneskiold acknowledges that the architecture is still untested. "Once you get beyond 240 processors, how does it scale? That's a question that hasn't been answered," he says. "And, what is the limit to the interconnect network built around the cluster? Can you achieve the kinds of capabilities you need to run large physics and engineering codes, as we're doing on the ASCI machines? This project is still experimental, but the applications they're currently running are what it's good for---including Mike Warren's simulation, which has run on some big massively parallel machines. Warren is happy with it."

Perhaps the Green Destiny will become known as the laptop of supercomputers. Or perhaps the very term "supercomputer" has grown too blurry as the gap between the top-tier and the next tiers grows wider. Some argue that a listing on the Top 500 Supercomputer website is as good a qualification as any for what constitutes a supercomputer-class machine. But NEC's Earth Simulator is 237 times faster than the IBM SP Power3 operated by the Kaiser Foundation, which ranks at the bottom of the list.

No Linux cluster will challenge Earth Simulator any time soon. Indeed, few systems may match the PNNL cluster now on delivery. But the low cost, build-it-yourself potential, and collaborative open source virtues of Linux clusters could put high-performance computing in the hands of more researchers. IBM calls its supercomputing research effort "Deep Computing." By contrast, Linux clusters deliver "wide computing," whose benefits are only now being glimpsed.

Sidebar: A conversation with Rob Pennington, senior associate director for computing and data management, National Center for Supercomputing Applications (NCSA), at The University of Illinois, Urbana-Champaign

The NCSA makes high-performance systems available to researchers around the U.S. Recently, the center has drawn attention for its involvement in the TeraGrid project, which will provide the world's fastest unclassified supercomputer: 13.6 Tflops/s of aggregate power. The project's members also include the San Diego Supercomputer Center at the University of California at San Diego, Argonne National Laboratory, and the California Institute of Technology.

What's your assessment of the role of Linux clusters?

That depends on what you're trying to do. Supercomputing has a number of significant architectures. Linux clusters are among the most interesting in that commodity-based systems are getting fast enough and mature enough, with supporting software that is stable enough, to make a significant computing platform.

We've put together two Linux clusters using strictly commodity hardware and software, plus some open source software with some modifications we've made in the area of scalability that we've given back to the community. One cluster runs Intel Pentium III, the other Intel Itanium processors. The TeraGrid system will also have Itanium II processors, and all of this is based on essentially the same software stack. We've been testing Itanium II systems for six months now, and the combination of Itanium II and Linux is proving a very effective computational platform.

What has changed in the scheme of things to make off-the-shelf computing possible?

The speed of the microprocessors---they are going up dramatically. Our first cluster used 333 MHz processors, which were almost competitive with the big iron processors that we had on the floor. Over the last couple of years, those processors are following Moore's law very nicely. The big iron is not able to keep up, so the gap is closing.

As a result, computational scientists can put a few tens of thousands of dollars into Linux clusters at their home institution, with a little bit of help from a couple of students, and have a very capable system.

What kind of speeds can they expect?

That is absolutely application dependent. We've seen Linpack numbers in the 60 or 70 percent efficiency range, and we've seen application numbers that are a third of that. It depends on how well you write the application, whether it fits in the cache, whether you do a lot of I/O or if the application is compute-bound.
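To put those percentages in concrete terms, here is a minimal sketch of the arithmetic. Efficiency is taken to mean delivered speed divided by theoretical peak speed. The 1 Tflops/s peak and the 65 percent figure (the midpoint of the range quoted above) are assumptions chosen only for illustration, and the function name is hypothetical.

```python
def sustained_tflops(peak_tflops, efficiency):
    """Delivered speed, given theoretical peak speed and the fraction of
    that peak a benchmark or application actually achieves."""
    return peak_tflops * efficiency


# Hypothetical cluster with a theoretical peak of 1 Tflops/s.
peak = 1.0

# Linpack at 65 percent efficiency (midpoint of the 60-70 percent range).
linpack = sustained_tflops(peak, 0.65)

# An application that reaches a third of the Linpack figure.
application = linpack / 3

print(f"Linpack:     {linpack:.2f} Tflops/s")
print(f"Application: {application:.2f} Tflops/s")
```

On this hypothetical machine, Linpack would deliver about 0.65 Tflops/s, while an application at a third of that would see roughly 0.22 Tflops/s, or about 22 percent of peak.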

The other part of this is the software stack. The compilers are now to the point where they work well with C and Fortran codes, providing a lot of the bells and whistles the applications people need. Given those elements and the growth in the open source system software, we have a very capable, very stable system. That means that application folks can do essentially the same thing they do here remotely back at their home site. We're working with a number of groups, including people from Intel, IBM, Ericsson of Canada, Oak Ridge National Lab, and Indiana University to package cluster software so that people can just buy the hardware, put this cluster stack on it, and get up and running in an hour or two.

Many Linux clusters are pre-configured by vendors, like those you've received from IBM.

We were in this process very early with IBM. They have learned from us, and we have learned from them. The first cluster we bought from IBM wasn't quite a package. But they came in, did the setup, did the testing, and we worked with them through the entire process. The IBM Cluster 1300 is a very similar system to what we put on the floor.

What role has Linux played in clustering?

Linux has been one of the key technologies. If we had to do this with a proprietary operating system---and we actually did try that early on---it wouldn't have been nearly as successful. The main problem: a proprietary operating system is compatible only with the packages that a particular vendor is able to supply.

How has the process of parallelization changed with these clusters?

One tool that has made the difference is MPI---Message Passing Interface---the result of a community development effort. It's an API that allows application people to send messages within their application between nodes and between processors in a node. MPI is supported by all the major vendors, with open source packages available. That makes it possible to develop an MPI application on an IBM SP2, and run it, say, on an SGI Origin 2000 or 3000 or on a cluster. There are some small porting issues having to do with differing compiler options, but we have seen people move to the clusters very easily.
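The message-passing style Pennington describes can be sketched in a few lines. This example does not use MPI itself: it stands in Python's standard-library threads and queues for the nodes and the network, so that it runs anywhere. Each worker receives a chunk of data as a message, computes a partial sum, and sends the result back. In a real MPI program the corresponding calls would be MPI_Send and MPI_Recv, and each worker would be a separate process, possibly on a separate machine. The names worker and parallel_sum are hypothetical.

```python
import queue
import threading


def worker(rank, inbox, outbox):
    """Receive one chunk of numbers, sum it, and send the result back."""
    chunk = inbox.get()               # blocking receive
    outbox.put((rank, sum(chunk)))    # send partial sum to the coordinator


def parallel_sum(values, n_workers=4):
    """Split `values` across workers that communicate only by messages."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(rank, inboxes[rank], results))
        for rank in range(n_workers)
    ]
    for t in threads:
        t.start()
    # Send every worker its share of the data (every n-th element).
    for rank in range(n_workers):
        inboxes[rank].put(values[rank::n_workers])
    # Collect one partial sum per worker, in whatever order they arrive.
    total = sum(results.get()[1] for _ in range(n_workers))
    for t in threads:
        t.join()
    return total


total = parallel_sum(list(range(1, 101)))
print(total)
```

The workers share no variables. Everything they know arrives in a message, which is why the same program structure can move between machines with very different hardware.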

Is Japan keeping pace with the U.S. in implementing Linux clusters?

All of the major industrial countries are tracking large scale Linux clusters, including those in Asia and Europe. I was at a conference in Singapore just two months ago where this was the major topic.

Over the long run, does Linux cluster technology have the potential to dominate high-performance computing?

I believe it does, though it depends on how well the vendors adopt and support it. But from what I'm seeing, all of the major systems vendors in this area are showing a great deal of interest.

Wu Feng of Los Alamos argues that heat generation will eventually become a stumbling block as processors grow more powerful.

Yes, heat is a major problem, but all the vendors are working on this, and they will solve that problem. Another way of reducing heat is to not have all the machines sitting in the same place. As the TeraGrid project is demonstrating, there is more than one way to get enormous amounts of computing power.

What role does Linux play in the TeraGrid project?

Linux is one of the fundamental enabling technologies for TeraGrid---which will combine Linux clusters at four major sites. By bringing together the capabilities of these four sites in the TeraGrid, we're able to build a community that is working on solving larger scale problems than would fit in any particular institution. Linux clusters are one of the mechanisms to do that. The other side of this is people can build their own Linux clusters in their labs, get their software working, and run it on TeraGrid. Scalability is achieved through a common software and hardware environment. This is the first time computer scientists and research scientists are doing work on the same platform. Vendors, too: we're all trying to reach the same goal. This is a very powerful arrangement and it has just come together because of commodity systems.