Our world is awash in data. Millions of devices generate digital data, an estimated 1 zettabyte (that's 10 to the 21st power bytes) per year. Much of this data gets transmitted over networks and stored on disk drives. Thanks in part to dramatic cost reductions in magnetic storage technology, we can readily collect and store massive amounts of data. But what is all this data good for? Consider the following examples:
- At Wal-Mart, more than 6,000 stores worldwide record every purchase by every shopper, totaling around 267 million transactions per day. They collect this information in a 4-petabyte (that's 4 X 1015 bytes) data warehouse set up for them by Hewlett-Packard. These transactions are a treasure trove of information about the shopping habits of their customers. How much did a $200 discount on large-screen TVs increase sales, and how much did the shoppers who bought them spend on other things? How many copies of the upcoming John Grisham novel should they stock? Based on long-term weather forecasts, how many snow shovels should they order for their stores in Iowa? Sophisticated machine-learning algorithms can find answers to this question, given the right data and the right computing power.
- The proposed Large Synoptic Survey Telescope (LSST) will scan the sky from a mountaintop in Chile, with what can be considered the world's largest digital camera, generating a 3,200-megapixel image every 15 seconds, covering the total visible sky every three days. That will yield 30 terabytes (1012 bytes) of image data every day. Astronomers anticipate being able to learn much about the origins of the universe and the nature of dark matter by analyzing this data.
- Every time we have a CAT scan or an MRI taken, millions of bytes are recorded, but currently they're simply turned into a series of cross-sectional images. Imagine a future where the different components of a knee joint could be identified, compared with the data from previous images, and a plan for knee surgery could be generated automatically.
These are just three of dozens of examples where an ability to collect, organize and analyze massive amounts of data could lead to breakthroughs in business, science and medicine. Of course, search engine companies have already demonstrated this capability for the information available on the World Wide Web, and they've shown they can make good money doing so. But the web is just one of many possible data sources in our world, and search engines are just one form of data aggregation.
Storage Costs Almost Nothing; Analysis Remains Pricey
At one time, the cost of storing large amounts of data was prohibitive. Not any more. Modern disk drives have capacities measured in terabytes and cost less than $100 per terabyte. A digitized version of all of the text in all of the books in the Library of Congress--essentially the totality of all of humankind's knowledge--would only constitute around 20 terabytes. Of course, data in the form of images, sound and video have much higher storage requirements, but still we can think of storage as being an almost-free resource. The challenge is in how to manage and make the most use of all of this data.
At Carnegie Mellon, we've made Data-Intensive, Scalable Computing (DISC) a major focus of our research and educational efforts. We believe the potential applications for data-intensive computing are nearly limitless; that many challenging and exciting research problems arise when trying to scale up systems and computations to handle terabyte-scale datasets; and that we need to expose our students to the technologies that will help them cope with the data-intensive world they will live in. Realizing the promise of DISC requires combining the talents of people from across many disciplines, and Carnegie Mellon--with its strengths in engineering, computer science and many related fields--is uniquely qualified to lead the charge.
If we can fit the entire Library of Congress on 10 or 20 disk drives at a cost of $2,000, what possible technical barriers can there be in realizing low-cost DISC systems and applying them to real-world problems? The core challenge is that communication and computation are much more difficult and expensive than storage. Consider that one-terabyte disk drive that we can buy for $100. Even if we accept the optimistic claims of the drive manufacturer, we can only read or write 115 megabytes per second between that drive and a computer. That means it would take more than two hours just to scan the entire disk. Similarly, it would take almost the same amount of time to transfer that data between two computers connected by a high-speed, local-area network, assuming a transfer rate of one gigabit per second. Transferring a terabyte across a slower link, such as the Internet, could take many hours or even days.
So, if we had a terabyte dataset stored on a single disk, then answering even a simple query, such as the average of all of the values, would require several hours; applying more sophisticated analysis algorithms would be out of the question.
To deal with large-scale datasets, we need to spread the data across many disk drives, possibly hundreds, so that we can access large amounts of information in parallel. These disks also need processors and networking, and so we should incorporate them into a cluster computing system, comprising around one hundred nodes, each having one or two processors, several disk drives and a high-speed network interface. Creating this cluster requires construction of a machine room, with the nodes and power supplies mounted in racks and connected by cables. Our proposed $2,000 storage system for the Library of Congress has now grown into a large-scale computing system costing nearly $1 million.
Programming and operating systems with hundreds of disks and processors is no small task. One problem is reliability. A typical disk drive has a mean "time to failure" of around three years. For the disk drive in my laptop, every three years or so I have to deal with the inconvenience of having the drive replaced and restoring its contents from backup storage. (Hopefully!)
If we have a system with 150 disk drives, we can expect on average that one will fail every week. As we scale up a system, we must anticipate that any of its components--disks, processors, network connections, power supplies, even software--can fail at any time. Rather than stopping the system every time something goes wrong, we must engineer it to be highly fault tolerant, so that the system can continue operating (perhaps with some degradation in performance) despite multiple component failures. On the programming side, people have been trying for decades to use parallel computing to improve program performance, yet it remains high on the list of "grand challenge" problems for computer science. The combination of high performance, high reliability and ease of programming for parallel computing systems remains elusive.
Computing in the Cloud
Instead of building and operating our own clusters, one attractive approach is to make use of cloud computing systems. The idea is to let a dedicated organization take on the task of assembling a large system and making it available to others. For data-intensive computing, this can be in the form of a "virtual computing platform," as exemplified by Amazon Web Services (AWS), where customers can buy computer cycles by the CPU-hour and network-accessible storage by the gigabyte-month. (The other cloud-computing model, referred to as "software as a service," provides network-accessible applications, such as email or contact management. This is a valuable service but not suitable for our needs.) From the clients' perspective, cloud computing has the advantage that they do not need to worry about replacing disk drives, dealing with power outages or updating the operating system software on the nodes with the latest patches. Moreover, they can scale up their computing and storage capacity without building and provisioning new data centers.
On the software side, an open-source framework called "Hadoop," styled after the system created by Google to run its Web crawling and indexing, makes it fairly easy to develop applications that manage and analyze large amounts of data on cluster systems. Yahoo! has been the major contributor to the development of the Hadoop File System. Hadoop handles the difficult issues of coordinating the activities of the node processors and disks to implement a large-scale, reliable file system. For example, most files are stored in triplicate on multiple disks, so that there are backup copies of every file if a disk drive fails. Programmers can write code that operates on data spread across thousands of files via the MapReduce model pioneered by Google.
With this approach, the programmer describes computations to perform on the individual files (map), as well as how to combine the results of these computations (reduce) to produce data in the form of multiple files, with a typical program consisting of a sequence of MapReduce steps. The runtime system then handles the low-level details, such as scheduling the map and reduce tasks on the cluster processors and rerunning any tasks that fail. This frees the programmer from the traditional parallel programming concerns of data placement, scheduling and failure recovery. Software like Hadoop makes it possible to implement scalable and reliable applications running on otherwise unreliable and difficult to manage hardware.
Several companies have made their systems available for use by university researchers and educators. At Carnegie Mellon, we've been fortunate to have access to M45, a cluster system owned and operated by Yahoo!, to foster the growth of Hadoop and other open-source projects. (See "Mapping a New Paradigm," The Link, Issue 3.1.) More recently, Google and IBM have joined forces to make large-scale machines available to U.S. universities under the auspices of the National Science Foundation. A number of universities are making use of the platform provided by Amazon Web Services, and Microsoft is developing systems and programming tools that are well suited for data-intensive computing.
New Models of Research, Training
Why are these companies providing access to their computer clusters? For one thing, universities supply students who will soon be working on these very systems. If students never see anything more complex than a program running on a single machine and datasets of one gigabyte or less, they won't be prepared to support and make use of large-scale, data-intensive computing. They won't have even thought about the insights that can be gained from analyzing terabytes of data.
There's a benefit to universities beyond education. We're interested in working on computing problems where the amount of data is so large that we can't own and operate the machines that would store and process it. Computer scientists, of course, aren't interested in "cloud computing" so that we can provide something as simple as an email service. Instead, we're working on projects that require very large datasets to be analyzed quickly and accurately. For instance, a group of researchers in the Language Technologies Institute, led by Maxine Eskenazi, has been working to gather documents that can be used to teach English as a second language and in English education for elementary-school pupils.
Although there are plenty of documents available on the Web that could be used for reading practice or instruction, making sure they're suitable for different reading levels isn't easy. Maxine and her team had to sift through 200 million Web pages just to collect 200,000 useful documents. Only large-scale, data intensive computing makes that possible.
Another research group, led by Stephan Vogel of the Language Technologies Institute, is improving machine translation of one language to another. The modern approach to translating texts from, say, Chinese to English has been to scan millions or billions of documents in both languages, and then build statistical models that look for certain combinations of words and make their best guesses at what English words and phrases match those in Chinese. The success of these translation programs is greatly improved when they can be trained with more and more data--we're now talking in terms of trillions of words. Without a large, scalable cluster of computers, dealing with that amount of data would be impossible.
To cite a third example, Alexei Efros of the Computer Science Department and the Robotics Institute, along with his PhD student James Hay, has downloaded six million landscape images from Flickr and set up a program that looked for common features. Using geographic data attached to some of those photos, they then created computer models that were able to identify--with a surprising level of accuracy--where the different images were taken.
Beyond making use of existing clusters and software frameworks, there are many important research problems to be addressed to fully realize the potential of data-intensive computing. How can processor, storage and networking hardware be designed to improve performance, energy efficiency and reliability? How can we run a collection of data-intensive computations on the system simultaneously? What programming models and languages can we devise to support forms of computation that don't fit well in the MapReduce model? What machine-learning algorithms can scale to datasets with billions of elements? As a research organization, the School of Computer Science views DISC as a source of a large number of exciting opportunities.
There are advantages that universities bring to research in data-intensive computing that companies cannot match. First, the research we do is completely open. We publish our findings and share information freely, which private industry seldom does. Second, history has shown over and over again that when you give creative people new capabilities, they will come up with new ideas that the originators never imagined were possible. The World Wide Web, after all, was never one of the original applications imagined for the Internet, but its impact has been transformative.
There are also research problems that can benefit society but have no prospect of forming a profitable business. Companies will not profit from basic scientific research in astronomy, for instance, but that research will enable us to deepen our understanding of the universe.
Looking Ahead to Broader Challenges
Within the next five years, more and more research projects at Carnegie Mellon will involve analyzing huge amounts of data on very large-scale machines. Our educational efforts are broadening as well. We've created courses at both the undergraduate and graduate level that give students a chance to run programs on the cluster provided by Google and IBM. We're also involved in several efforts to help the National Science Foundation provide course and teaching materials to get students at other universities involved in large-scale, data-intensive computing.
Data-intensive, scalable computing is revolutionizing our ability to gather and analyze information in all forms. It will lead to new discoveries in science and new forms of entertainment. It will lead to improvements in business practices--and, just like any other disruptive technology, it will transform many different businesses for better and worse. Data-intensive, scalable computing will change the way we think about science--the ability to collect and make use of so much data will change our perspective on the kinds of scientific and medical experiments that we consider attempting.
Data-intensive, scalable computing will change education and research in computer science at Carnegie Mellon profoundly. As our students go off into the world, they must understand this new paradigm in computing. They need to know how to use the technology and fully exploit the capabilities that large-scale computing make possible. At the School of Computer Science, we are committed to presenting our students with these new possibilities and providing them with the right kind of tools and thinking to help this technology advance.
Jason Togyer | 412-268-8721 | email@example.com