Big Data Computing Study Group 2008

The Big-Data Computing Study Group seeks to foster collaborations between industry, academia, and the U.S. government to advance the state of art in the development and application of large scale computing systems for making intelligent use of the massive amounts of data being generated in science, commerce, and society. The first activities of the group were two events, both held at Yahoo! in Sunnyvale, California:

Hadoop Summit, Tues., March 25, 2008.
Data-Intensive Computing Symposium Wed., March 26, 2008, described here

Follow this link to Slides and Videos, courtesy of Yahoo!.

The Data-Intensive Computing Symposium will bring together experts in system design, programming, parallel algorithms, data management, scientific applications, and information-based applications to better understand existing capabilities and to explore future opportunities.

Registration for this symposium is closed, having reached the maximum capacity for the facilities.

Event Sponsors

Schedule

8.00-8.55: Continental Breakfast & Registration
8.55-9:00: Welcome to Yahoo! & Logistics: Thomas Kwan, Yahoo!
9.00-9.30: Data-Intensive Scalable Computing. Randy Bryant, CMU
9.30-10.00: Text Information Management: Challenges and Opportunities. ChengXiang Zhai, UIUC
10.00-10.30: Clouds and ManyCore: The Revolution. Dan Reed, MSR
10.30-11.00: Break
11.00-11.30: Computational Paradigms for Genomic Medicine. Jill Mesirov, Broad Institute of MIT and Harvard
11.30-12.00: Garth Gibson, CMU
12.00-1.00: Lunch.
1.00-1.30: Handling Large Datasets at Google: Current Systems and Future Directions. Jeff Dean, Google
1.30-2.00: Algorithmic Perspectives on Large-Scale Social Network Data. Jon Kleinberg, Cornell
2.00-2.30: Mining the Web Graph. Marc Najork, MSR
2.30-3.00: "What" Goes Around. Joe Hellerstein, Berkeley
3.00-3.30: Sherpa: Hosted Data Serving. Raghu Ramakrishnan, Yahoo!
3.30-4.00: Break
4.00-4.30: Scientific Applications of Large Databases. Alex Szalay, JHU
4.30-5.00: Data-Rich Computing: Where It's At. Phil Gibbons, Intel
5.00-5.15: NSF Plans for Supporting Data Intensive Computing: Jeannette Wing, NSF. The Google/IBM data center: Christophe Bisciglia, Google
5.15-5.45: Discussion: Randy Bryant, CMU
5.30-6.00: Break
6.00-6.30: Reception
6.30-8.00: Dinner. Speaker: The Computing Community Consortium: Stimulating Bigger Thinking. Ed Lazowska, UW and CCC

Event Organizers

Randy Bryant, Carnegie Mellon University
Thomas Kwan Yahoo! Research

Location: Transportation and Parking

The symposium will be held on Wednesday, March 26, 2008 at Yahoo's Headquarters (Building C, Classrooms 4 & 5) at 701 First Avenue, Sunnyvale, CA 94089.

Yahoo! will provide valet parking for those driving to the Yahoo! campus. When you arrive at 701 First Avenue, Sunnyvale, CA 94089, the security people at the gate will direct you to the valet area. We will also run shuttle busses between the nearby hotels and Yahoo! as follows:

Shuttle #1: Sunnyvale Hotels to Yahoo!

07:30 am
Days Inn - Sunnyvale

07:45 am
Comfort Inn - Sunnyvale

07:50 am
Best Western Silicon Valley Inn

08:05 am
Larkspur Landing

08:15 am
Staybridge Suites Sunnyvale

08:20 am
Sheraton Sunnyvale

08:30 am
Yahoo! Sunnyvale Campus
Shuttle #2: Santa Clara Hotels to Yahoo!

07:45 am
Biltmore Hotel Santa Clara

07:55 am
Marriott Santa Clara

08:05 am
Holiday Inn - Santa Clara

08:20 am
Hyatt Regency Santa Clara

08:30 am
Yahoo! Sunnyvale Campus

Note that these are estimated times. The exact timing will depend on traffic and other conditions. Please ensure that your are waiting in front of the hotel prior to the published pick up time. The busses will not wait.

In the evening, we will run two shuttles, departing Yahoo! at 8:30pm and returning to the same locations.

Registration and breakfast will be available starting at 8am, and we encourage you to arrive early to avoid possible delays due to traffic, parking, and registration.

Talk Abstracts

Randy Bryant

Data-Intensive Scalable Computing
Randal E. Bryant, Carnegie Mellon University

Search engine companies have devised a class of systems for supporting web search, providing interactive response to queries over hundreds of terabytes of data. These "Data-Intensive Scalable Computer" (DISC) systems differ from more traditional high-performance systems in their focus on data: they acquire and maintain continually changing data sets, in addition to performing large-scale computations over the data. With the massive amounts of data arising from such diverse sources as telescope imagery, medical records, online transaction records, and web pages, DISC systems have the potential to achieve major advances in science, health care, business, and information access. DISC opens up many important research topics in system design, resource management, programming models, parallel algorithms, and applications. By engaging the academic research community in these issues, we can more systematically and in a more open forum explore fundamental aspects of a societally important style of computing.

ChengXiang Zhai

Text Information Management: Challenges and Opportunities
ChengXiang Zhai, University of Illinois

Recent years have seen an explosive growth of text data in multiple domains, demanding powerful software tools to help manage and exploit the huge amount of text information. For example, Web search engines are an essential part of everyone's life, biomedical researchers long for text mining tools to reveal knowledge buried in the large amount of literature, and online shoppers would significantly benefit from intelligent tools for opinion integration and summarization.

While relatively mature technologies have been developed for managing structured data by the database community, there are still many challenges to be solved in managing the unstructured text data even though a lot of research progress has been made by the information retrieval community in the past decades. Due to the difficulty in precisely understanding natural language and users' information needs, text information management poses significant challenges and requires collaborative research by multiple communities especially information retrieval, natural language processing, databases, machine learning, and data mining.

In this talk, I will review the state of the art of text information management and discuss the major challenges in developing general frameworks, algorithms, and systems for managing text information effectively and efficiently. I will present several interdisciplinary research directions where multiple communities can be expected to collaborate with each other to generate high impact research results.

Dan Reed

Clouds and ManyCore: The Revolution
Daniel A. Reed, Microsoft Research

As Yogi Berra famously noted, "It's hard to make predictions, especially about the future." Without doubt, though, scientific discovery, business practice and social interactions are moving rapidly from a world of homogeneous and local systems to a world of distributed software, virtual organizations and cloud computing infrastructure. In science, a tsunami of new experimental and computational data and a suite of increasingly ubiquitous sensors pose vexing problems in data analysis, transport, visualization and collaboration. In society and business, software as a service and cloud computing are empowering distributed groups. This talk reflects on current practice, some lessons learns and a vision and approach to solving some of today's most challenging problems via flexible software ecosystems.

Jill Mesirov

Computational Paradigms for Genomic Medicine
Jill P. Mesirov, PhD, The Broad Institute of MIT and Harvard

The sequencing of the human genome and the development of new methods for acquiring biological data have changed the face of biomedical research. The use of mathematical and computational approaches is critical to take advantage of this explosion in biological information. There is also a critical need for an integrated computational environment that can provide easy access to a set of universal analytic tools; support the development and dissemination of novel algorithmic approaches; and enable reproducibility of in silico research.

We will describe some of the challenging computational problems in biomedicine, the techniques we use to address them, and a software infrastructure to support this highly interdisciplinary field of research.

Jeff Dean

Handling Large Datasets at Google: Current Systems and Future Directions
Jeff Dean, Google

Over the past several years, we have built a collection of systems and tools that simplify the storing and processing of large-scale data sets, and the construction of heavily-used public services based on these data sets. These systems are intended to work well in Google's computational environment, which consists of large numbers of commodity machines connected by commodity networking hardware. Our systems handle issues like storage reliability and availability in the face of machine failures, and our processing tools make it relatively easy to write robust computations that run reliably and efficiently on thousands of machines. In this talk I'll highlight some of the systems we have built, and discuss some challenges and future directions for new systems.

Jon Kleinberg

Algorithmic Perspectives on Large-Scale Social Network Data
Jon Kleinberg, Cornell University

The growth of on-line information systems supporting rich forms of social interaction has made it possible to study social network data at unprecedented levels of scale and temporal resolution. This offers an opportunity to address questions at the intersection of computing and the social sciences, where algorithmic styles of thinking can help in formulating models of social processes and in managing complex networks as datasets.

Many of the central lines of research emerging here are concerned with tracking and modeling social processes within large datasets, studying how groups form and evolve, and analyzing how new ideas, technologies, opinions, and behaviors can spread through large populations. An understanding of these processes has the potential to inform the design of systems supporting community formation, information-seeking, and collective problem-solving. Moreover, as the research community gathers the kinds of data needed for these studies, it runs into increasingly subtle problems surrounding the privacy implications of these datasets; this too raises a broad range of new research challenges that span multiple areas within computing and beyond.

Marc Najork

Mining the web graph
Marc Najork, Microsoft Research

Web pages and the hyperlinks that connect them can be viewed as a graph. This graph can be mined for many purposes: ranking web search results, identifying online communities, detecting spam web pages, and more. However, the sheer size of the web graph and its ever-evolving nature make such computations very challenging. In this talk, I will describe various link-based ranking experiments conducted over a 17 billion edge graph, and the infrastructure that enabled us to perform these experiments. I will discuss the challenges we encountered (both engineering and theoretical), and offer some speculations as to where web graph mining is headed.

Joe Hellerstein

"What" Goes Around
Joe Hellerstein, U.C., Berkeley

Declarative languages allow programmers to say "what" they want, without worrying over the details of "how" to achieve it. These kinds of languages revolutionized data management decades ago (SQL, spreadsheets), but traditionally had limited success in other aspects of computing. The story seems to be changing in recent years, however. One new chapter is work that my colleagues and I have been pursuing on the design and implementation of declarative languages and runtime systems for network protocol specification. Distributed Systems and Networking appear to be surprisingly natural domains for declarative specifications, and -- given recent interest in new architectures for datacenters, sensor networks, and the Internet itself -- these domains are ripe for a new programming methodology.

As the work on core declarative networking has matured, we have been returning to our roots in data management, with emerging declarative projects in distributed inference algorithms, and a metacompiler that is getting us thinking about data-intensive computations in cluster and manycore settings. This talk will introduce the concepts of Declarative Networking, the state of the research agenda today, and some new directions being pursued.

Raghu Ramakrishnan

Sherpa: Hosted Data Serving
Raghu Ramakrishnan, Yahoo!

Modern web sites are complex applications demanding high performance, scalability, and rich functionality from their data management backends. Most major web applications achieve the levels of scalability and cost-efficiency that they require through backends built on a cluster of relational DBMSs.

Distributed object stores such as Amazon's Dynamo are attractive data-serving back-ends because they scale well and inexpensively, and have long been used in many web applications (e.g., Yahoo! relies extensively on a similar system called UDB for its internal applications). However, data-backed applications are easier to build on an infrastructure that offers richer semantics than a basic file system, and this has led to the development of systems with richer, but still limited, database functionality, such as the rows and columns abstraction provided by Google's BigTable.

Even more powerful database functionality, such as indexing techniques and richer transaction models, is important for many large scale applications---and we believe that these features can be added while achieving scalability and fault-tolerance comparable to simpler object-stores. We are building Sherpa, a massive scale data management service to support Yahoo!'s web applications. The key insight driving our design is that web applications can typically accept lower levels of consistency than full serializability of transactions, opening the way to systematically exploit fine-grained asynchronous replication.

In this talk, I will motivate the challenge of building data-serving systems "in the cloud", and discuss the approach being taken in the Sherpa project, a collaboration between Yahoo! Research and Platform Engineering.

Alex Szalay

Scientific Applications of Large Databases
Alex Szalay, Johns Hopkins University

The talk will present a discussion of issues arising in data intensive science today. We analyze the underlying reasons and trends for the data explosion. We present case studies from several different fields, ranging from astronomy to simulations of turbulence and sensor networks. We discuss the analysis of the usage statistics of the SDSS SkyServer. We describe our experiments at JHU on building a large-scale database cluster based on SQL Server.

Phil Gibbons

Data-Rich Computing: Where It's At
Phillip B. Gibbons, Intel Research Pittsburgh

To use a phrase popularized in the sixties, data-rich (or data-intensive) computing is "where it's at". That is, it�s an important, interesting, exciting research area. Significant efforts are underway to understand the essential truths of data-rich computing, i.e., to know where it's at. Google-style clusters ensure that computing is co-located where the storage is at. In this talk, we consider two further issues raised by "where it's at". First, we highlight our efforts to support a high-level model of computation for parallel algorithm design and analysis, with the goal of hiding most aspects of the cluster's deep memory hierarchy (where the data is at, over the course of the computation) without unduly sacrificing performance. Second, we argue that the most compelling data-rich applications often involve pervasive multimedia sensing. The real-time, in situ nature of these applications reveals a fundamental limitation of the cluster approach: Computing must be pushed out of the machine room and into the world, where the sensors are at. We highlight our work addressing several key issues in pervasive sensing, including techniques for distributed processing and querying of a world-wide collection of multimedia sensors.

Ed Lazowska

The Computing Community Consortium: Stimulating Bigger Thinking
Ed Lazowska, Bill & Melinda Gates Chair in Computer Science & Engineering University of Washington and Chair, Computing Community Consortium

The Computing Community Consortium is an NSF-sponsored effort to stimulate the computing research community to envision and pursue longer-range, more audacious research challenges. In this talk I'll describe the CCC, and outline some of the opportunities for our field.

bryant

Last modified: Thu Apr 24 18:27:24 EDT 2008