On Campus: Super-Sized Recycling

A consortium of universities and government agencies will reuse "pre-owned" supercomputers for student and faculty research

By Jennifer Bails

It's hard out there for a supercomputer.

As soon as you're up and running, you're put to work crunching terabytes of data for computational biologists, astrophysicists and all of the other pushy scientists who expect instant results.

You toil alone for years in a freezing cold room, and after that, do you get any show of appreciation? Yeah, right. They call you slow and obsolete, and you get tossed in a landfill to spend eternity with millions of ordinary cell phones, laptops and other e-waste. That's after they destroy your memory. And then they have the gall to replace you with a less experienced--albeit more computationally powerful and power-efficient--machine.

At the U.S. Department of Energy's Los Alamos National Laboratory in New Mexico, up to 5,000 machines from large-scale supercomputers are disposed of in this way each year, according to Gary Grider, deputy division leader of the lab's High Performance Computing Division.

Four years ago, Grider was working to decommission some old supercomputer hardware when it occurred to him there might be a better solution. "I realized our retired machines still had value since they all use Intel architecture these days," he says. "I had this idea that there ought to be a way to reuse these things. One way would be to help systems researchers."

The plight of systems researchers first appeared on Grider's radar screen at a supercomputing workshop, where a panel was asked how the government could help academics do better work on large-scale systems. The answer to that problem and the answer to Grider's disposal problem turned out to be the same: Recycle and reuse.

A new, one-of-a-kind computer systems research center called the Parallel Reconfigurable Observational Environment--or PRObE--has now been established to give systems scientists in academia unprecedented access to large-scale supercomputers.

PRObE is a joint effort of Los Alamos National Laboratory (LANL), Carnegie Mellon and the University of Utah, along with the New Mexico Consortium, a partnership among the University of New Mexico, New Mexico Institute of Mining and Technology and New Mexico State University. Made possible through a $10 million National Science Foundation award, PRObE will eventually include at least two 2,048-core clusters to be housed in a research park near LANL, as well as smaller-scale clusters for early testing, including one located at Carnegie Mellon. All of these will be recycled machines donated by LANL.

The first large cluster is expected to come online in late spring.

Access to these clusters fills a pressing need felt by both systems researchers and computer science students, says Garth Gibson, a professor of computer science and electrical and computer engineering at Carnegie Mellon. High-performance computing crossed the petascale threshold in 2008 with LANL's Roadrunner, which has more than 122,400 cores, including 6,000 dual-core Opteron chips and more than 12,000 IBM PowerXCell 8i chips, each with many cores. Plans are already under way for the U.S. government to develop an exascale system by 2018 with 1,000 times more processing power than today's most powerful supercomputer. Google is rumored to already have a node count approaching a million spread across many data centers.

Unless they leave universities for government or industry jobs, Gibson says, researchers and students rarely have access to these expensive large-scale clusters. That means they don't get the training and education necessary to develop innovations for the fast-approaching era of exascale computing.

Moreover, when a supercomputer is new, it's immediately needed for applications research, he says, so even when they do get permission to use larger clusters, systems scientists can't run experiments on low-level hardware and purposely break these machines to see what happens.

For example, in massively parallel supercomputers with thousands of nodes, failure is a way of life, not an aberration. The key is developing systems that can continue performing well in a state of near-constant failure, Gibson says. "That's a challenge that has to be solved by systems researchers. We can experiment with the smaller computers we have, but the bigger ones are rushed into production, so we don't have the opportunity to force errors and learn how to handle them."
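
A back-of-the-envelope sketch in Python shows why failure becomes routine at that scale. It assumes, purely for illustration, that nodes fail independently at a constant rate; the five-year per-node figure, the node counts and the function names are hypothetical, not numbers from LANL or PRObE.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Mean time between failures anywhere in the cluster.

    Assumes independent nodes with exponentially distributed failures,
    so the failure rates of the individual nodes simply add up.
    """
    return node_mtbf_hours / node_count


def prob_any_failure(node_mtbf_hours: float, node_count: int,
                     window_hours: float) -> float:
    """Probability that at least one node fails within the time window."""
    cluster_rate = node_count / node_mtbf_hours  # failures per hour
    return 1.0 - math.exp(-cluster_rate * window_hours)


if __name__ == "__main__":
    node_mtbf = 5 * 365 * 24  # hypothetical: one failure per node every 5 years
    for nodes in (1, 2_048, 100_000):  # illustrative cluster sizes
        mtbf = system_mtbf_hours(node_mtbf, nodes)
        p_day = prob_any_failure(node_mtbf, nodes, 24)
        print(f"{nodes:>7} nodes: a failure every {mtbf:,.1f} hours on average; "
              f"chance of at least one failure per day = {p_day:.1%}")
```

Under those assumptions, a node that would fail once in five years on its own becomes a failure somewhere in a 2,048-node cluster roughly every 21 hours. Real failures are neither perfectly independent nor evenly spaced, which is part of what makes them worth studying on real hardware.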

Researchers will be given dedicated use of the PRObE clusters for days, even weeks at a time. They will be allowed to replace any and all of the code and even inject faults that might be destructive to some equipment.

LANL isn't in the business of supporting academic research and didn't have authority to foot the bill to house, power, air-condition and maintain these old systems, Grider says. "We run a supercomputer complex to do nuclear weapons calculations," he says. "This is an offshoot thing that isn't in our mission."

That's why the New Mexico Consortium--an independent nonprofit managed by three state research universities--was called upon to help put together an application for NSF funding.

The NSF saw value in PRObE immediately, Gibson says, but it took some time to garner support for this large, unsolicited proposal. In that process, Carnegie Mellon was asked to lend its expertise to the project as a renowned leader in computer systems research. And the Flux Research Group at the University of Utah joined the team to adapt its powerful Emulab software to manage the PRObE testbed.

"Emulab is already used to manage about 40 network testbeds, but PRObE will be a unique facility," Utah computer scientist Robert Ricci says. "We're excited to be a part of this effort because it adds important new resources to the public research infrastructure."

If the PRObE pilot is successful, Grider says, it will provide high-visibility validation of the need for large-scale systems research in academia and could serve as an example to be replicated by other government agencies.

PRObE also will conduct a summer school to train university students in how to build and manage very large high-performance computing environments; top students will be invited back to the center and LANL as interns.

"I would like to see a whole new generation of computer scientists that have some experience with computer systems research at scale," Gibson says. "Right now they don't begin to get the necessary training to understand the hard problems."

For More Information:

Jason Togyer | 412-268-8721 | jt3y@cs.cmu.edu