"Hadoop." It's a funny-sounding name for an operating environment, but nobody's laughing at the results. Combined with Yahoo!'s M45 supercomputing cluster, it's allowing SCS researchers to tackle data-intensive problems involving computer graphics, language translation, and informational retrieval they never considered examining before. "We have datasets that we didn't analyze because they wouldn't fit on our machines," says Christos Faloutsos, a professor in the Computer Science Department who's building models to spot patterns in massive social networks--networks that might have 20 million different users, each of whom might connect to tens or even thousands of other users.
Carnegie Mellon is the first academic institution allowed access to M45, which offers the power of 4,000 processors distributed among 500 Linux-based servers, or nodes. None of those individual nodes is particularly powerful, says Jay Kistler (CS'93), vice president for systems, tools, and services in Yahoo!'s Search and Advertising Technology Group. "They're very generic commodity servers that Yahoo! and many, many Internet companies deploy by the thousands," he says. But Hadoop--the name comes from a toy elephant owned by the son of the principal developer, Doug Cutting--turns them into a massive parallel-computing environment.
An open-source platform written in Java, Hadoop is powered by the MapReduce framework that was developed by Google to run its search engine and data-retrieval systems. In MapReduce, the user defines two different operations to perform on each piece of data--a map and a reduce. Each map function produces one piece of the result; the reduce functions assemble the pieces into the final output file. A master program supervises the processors, telling some of them to run the "map" functions and others to run the "reduces." The master program also breaks the incoming data sets into smaller chunks--about 64 MB each--and assigns each chunk to one of the map processors, which runs its program and passes the output onto the reduce processors. Periodically, the master program pings each processor to make sure it's working. If it doesn't get a response, it reads replicas of the data from one of the other processors and ignores the bad one.
"Failure handling, partitioning data, load balancing--Hadoop does a whole bunch of things that programmers previously had to do ad hoc," Kistler says. The genius of the Hadoop file system is that it can be scaled for anywhere from 50 to 2,000 processors working simultaneously; the ultimate goal, Kistler says, is 10,000.
Using Hadoop and M45 speeds up data processing by "two orders of magnitude," says Jamie Callan, a professor in the Language Technologies Institute. Callan is part of a joint Carnegie Mellon-University of Pittsburgh project called "Reader-Specific Lexical Practice for Improved Reading Comprehension," or REAP, a system for teaching reading skills and testing students on their literacy. Led by Maxine Eskenazi, an associate teaching professor in the Language Technologies Institute, REAP processes millions of Web documents, decides whether they're suitable for classroom use, and classifies them. For every story accepted, however, hundreds are rejected. In the previous four years, using existing technology, REAP collected and considered about 20 million documents, creating a database of about 75,000 texts. "It was tedious and painful," Callan says. But in just four weeks this spring, REAP processed 100 million documents using M45, resulting in a database of 6 million texts that can be used in classrooms.
Callan and Faloutsos are among seven SCS faculty members currently working with the M45. Greg Ganger and Garth Gibson, who hold joint appointments in computer science and electrical and computer engineering, are evaluating the system's performance, while Alexei Efros, Noah Smith, and Stephan Vogel are using M45 to crunch massive amounts of data. Smith, an assistant professor in the Language Technologies Institute, is designing computer models that can learn to parse a large set of sentences in any language and extract a set of grammar rules from them. With his current models, it takes about 15 hours to carry out one round of training on a million sentences. Completing the full training regimen on that much data would have tied up one processor for nearly a year--and that's barring equipment failures. Running the same data through 50 nodes in the M45 cluster took under five days, he says, and because of Hadoop's ability to seamlessly reroute data around bad processors, individual equipment failures didn't slow anything down. "It's elegant, and it's relatively cheap," Smith says.
It's not all as simple as it sounds, because as both Smith and Callan point out, not every existing problem breaks into nice sets of "map" and "reduce" functions. "A lot of our work falls into what we call the 'embarrassingly parallel' category," Callan says. "Those problems are easy to evaluate independently--a lot of text analysis programs have this property. But for some problems, it really requires re-thinking them from scratch." Once you get into the mindset of writing problems as map and reduce functions, it becomes more natural, Faloutsos says, adding with a laugh "the ones that don't break easily into map and reduce parts will lead to (research) papers!"
In March, Yahoo!'s Sunnyvale, Calif., campus hosted a Hadoop summit and a symposium on data-intensive computing attended by more than 300 people, including many past and current SCS faculty and students. "It was eye-opening," Callan says. Now, he says, the race is on to create new and better models for using supercomputing clusters like M45. Hadoop, for instance, requires users to tell the master program how to break up the data, but a new operating environment called Pig tells Hadoop how to break up the data. That frees programmers from a lot of time-consuming work, but it also forces them to think at higher, more abstract levels about how they structure problems, Callan says. A lot of software needs to be rewritten to unlock the full potential of large-scale computing systems, he says. "I expect all of my students will need to learn this computational model, because computing clusters will be a major part of their careers," Callan says.
The need to educate students in the principles of large-scale computing was a major reason why Yahoo! invited Carnegie Mellon to use M45, says Ron Brachman, Yahoo! vice-president for worldwide research operations. "We need to get students and faculty interested both in running applications on the cluster, and in creating new ones," he says. "We tend to focus on the most relevant universities, and the talent of the faculty at Carnegie Mellon in large-scale computing was exceptional." The benefits go both ways; Smith calls access to the M45 a "gift" that's barely been unwrapped this year. "I feel very lucky that my research group gets the chance to use it," he says. --Jason Togyer
Jason Togyer | 412-268-8721 | firstname.lastname@example.org