Projects

Here is a list of proposed projects. This list will be updated often as new projects come in and will be finalized by January 24. You are free to choose a project from this list, or come up with your own project from the readings you have done in the class.

If you want to do a project that is not on this list, you need to meet with me or send me a description, and obtain permission *before* February 1.

Distributed Databases and Publish/Subscribe (2 students)

Contacts: Anthony Tomasic (tomasic@cs) and Anastassia Ailamaki

Interested in very large scale systems that span the globe? The RADAR project is building a research prototype implementation of a new very large scale publish/subscribe system. This project will utilize a variety of database, publish/subscribe, and constraint technologies (Java, SQL, constraints). The work involved focuses on requirements analysis, design, implementation, testing, and performance measurement of the prototype. A successful prototype and project would lead to additional research work over the summer.

Improving Database Performance on Multicore Processors (2 students)

Contacts: Anastassia Ailamaki and Phil Gibbons

Most future processors will have multiple cores (CPUs) running on a single chip. A common configuration is that each core has its own L1 data and instruction caches but the cores share an L2 on-chip cache. This project will explore ways to improve DBMS performance on such processors. For example, staged databases have been shown to be an effective means for improving the cache performance of DBMS running on a single CPU. (See Natassa's and her students' papers in CIDR and VLDB.) What is the best way to extend staged databases for multicore processors? Should we assign each core its own set of stages? This may have good L1 cache performance but may increase L2 misses compared with more collaborative approaches. For example, can the cache-efficient multicore scheduling algorithm recently proposed by Blelloch and Gibbons in SPAA be used to get better multicore performance for staged databases, or database workloads in general?

Robust Sensor Network Aggregation Schemes (1-2 students)

Contact: Phil Gibbons (phillip.b.gibbons@intel.com)

We proposed a new paradigm for aggregation in sensor networks, called synopsis diffusion, which enables multi-path routing of aggregated partial results (for robustness against message loss) while avoiding double-counting sensor readings (paper in SenSys'04). Previous approaches had aggregated along a tree, which avoids double-counting but is not robust. Recently, we showed that combining both tree and synopsis diffusion (an approach we call tributary-delta) leads to more accurate answers than either approach by itself (paper submitted to SIGMOD). This project would explore open questions related to these schemes, such as how to prioritize the sensor readings and partial results contributing to a particular aggregate so that they are handled more robustly, how to trade off message size for robustness, and how to provide error guarantees in the tributary-delta approach.
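To make the double-counting issue concrete, here is a minimal sketch of a duplicate-insensitive COUNT, assuming a Flajolet-Martin style bitmap as the synopsis. The function names (make_synopsis, fuse, estimate_count), the 32-bit width, and the use of SHA-256 are illustrative assumptions, not code from the papers. The point it shows is that when fusion is a bitwise OR, a reading that arrives over several paths changes nothing the second time.

```python
import hashlib

NUM_BITS = 32   # assumed synopsis width (hypothetical choice)
PHI = 0.77351   # Flajolet-Martin correction constant


def make_synopsis(sensor_id: str) -> int:
    """Map one sensor reading to a bitmap with exactly one bit set.

    Bit i is chosen with probability about 2^-(i+1), using the position
    of the lowest set bit of a hash of the sensor id.
    """
    digest = hashlib.sha256(sensor_id.encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    if h == 0:
        idx = NUM_BITS - 1
    else:
        idx = min((h & -h).bit_length() - 1, NUM_BITS - 1)
    return 1 << idx


def fuse(a: int, b: int) -> int:
    """Combine two partial results. OR is commutative, associative,
    and idempotent, so duplicates and arrival order do not matter."""
    return a | b


def estimate_count(synopsis: int) -> float:
    """Estimate the number of distinct readings from the index of the
    lowest zero bit in the bitmap."""
    i = 0
    while synopsis & (1 << i):
        i += 1
    return (2 ** i) / PHI


readings = [f"sensor-{i}" for i in range(50)]

# Each reading delivered exactly once, as along a tree.
once = 0
for r in readings:
    once = fuse(once, make_synopsis(r))

# The first 20 readings delivered again over a second path.
twice = once
for r in readings[:20]:
    twice = fuse(twice, make_synopsis(r))

print(once == twice)          # the duplicates did not change the synopsis
print(estimate_count(once))   # approximate count of distinct sensors
```

The estimate from a single bitmap is coarse; the usual remedy is to keep several independent bitmaps and average them, which is one place where the message size versus robustness trade-off mentioned above shows up.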

Staged Database Systems: Build TPC benchmarks and compare results to DB2 (1 student)

Contact: Anastassia Ailamaki

The Staged Database System is a revolutionary prototype DBMS that executes queries in a staged (modular) fashion. This project will carry out and demonstrate a feasibility study on the system, in collaboration with Kun Gao. Kun is currently determining the framework for evaluating the system's performance for response time and throughput and completing implementation of TPC benchmarks on the prototype. The interested student will

run, compare, and evaluate TPC benchmarks on the Staged System and on DB2, and
demonstrate the feasibility of the framework and the results through a graphical user interface.

Autonomic Databases: Preparing the schema for the queries (an optimization problem) (1-2 students)

Contact: Anastassia Ailamaki

The AutoPart system aims at partitioning a schema into fragments that help execute queries faster. The question is, how far can we really partition a schema? Part of the answer is in the AutoPart paper, which implements a set of automatic schema partitioning algorithms on a real astronomical data set. The interested student(s) will begin by working on algorithms to determine efficient horizontal partitioning strategies for the tables in the schema, and then work on modeling changes in the schema and inventing online reorganization strategies to incorporate those changes. The experimentation will be done on an astronomy database and a log of queries on SQL Server.
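As a starting point for thinking about horizontal partitioning, the sketch below derives candidate split points for one column from the range predicates in a query log, and counts how many fragments each query would have to read. This is only an illustration under simple assumptions (one numeric column, half-open ranges lo <= x < hi with lo < hi); it is not the AutoPart algorithm, and the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def candidate_boundaries(queries):
    """Collect every predicate endpoint in the log as a candidate split point."""
    return sorted({b for lo, hi in queries for b in (lo, hi)})


def fragments_touched(query, boundaries):
    """Count the fragments a range query [lo, hi) overlaps.

    Boundaries b0 < b1 < ... < bn define the fragments
    (-inf, b0), [b0, b1), ..., [bn, +inf).
    """
    lo, hi = query
    first = bisect_right(boundaries, lo)   # fragment holding lo
    last = bisect_left(boundaries, hi)     # fragment holding values just below hi
    return last - first + 1


# Hypothetical query log: ranges over one column.
log = [(0, 10), (10, 20), (0, 20)]
bounds = candidate_boundaries(log)

for q in log:
    print(q, fragments_touched(q, bounds))
```

Splitting at every endpoint lets each query read only the rows it needs, but it also produces many small fragments. Deciding which boundaries are worth keeping is the optimization problem.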

Comparing real DBMS workload behavior on a modern processor (1-2 students)

Contact: Anastassia Ailamaki

In 1999, I published the first analysis comparing real DBMS workloads on a modern processor. Since then, computer architectures have become more complex and database systems have become more architecture-conscious. What, however, has really changed? Discover the new trends using a Pentium-4 machine and running TPC and microbenchmarks on top of Oracle and DB2 as well as Postgres and Shore on Linux.

Data Mining Projects

Contact: Christos Faloutsos (christos@cs)

Here is a list of projects we share with 15-826 this semester. Feel free to choose one of these, but make sure Prof. Faloutsos knows about it as well to avoid conflicts.