A wealth of project ideas can be found in the "future work" sections of the papers in the reading list. For example, you could experiment with alternative routing policies for eddies, with the goal of either improving raw throughput or meeting some user-specified QoS targets. If you go this route, make sure the authors haven't already published a follow-on paper. Another source of project ideas is to consider crossing features between prototype systems. For example, add synopsis compression to SteMs in TelegraphCQ and explore the implications. Another example: implement load shedding from eddies or fjords. General-purpose data stream processing is a young area, so there is plenty of "low-hanging fruit."
If you're interested in epsilons and other Greek letters, see if you can extend the approximate frequency counting algorithm of Manku and Motwani to work over a sliding window. An interesting data set to test it on would be a stream of keyword queries submitted to a web search engine. A useful monitoring task might be to keep track of frequently issued multi-term queries, over a sliding window of the past week. The answer might clue us nerds into the pop-culture hot topics of the week. I have access to real web query data if there are any takers.
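As a starting point, here is a minimal sketch of lossy counting in the style of Manku and Motwani, for an unbounded stream only. It does not attempt the sliding-window extension, which is the open part of the project. The class and method names are my own, and the items could just as well be multi-term query strings.

```python
import math


class LossyCounter:
    """Approximate frequency counts over an unbounded stream (no sliding window)."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # bucket width w = ceil(1/epsilon)
        self.n = 0                           # number of items seen so far
        self.entries = {}                    # item -> [count f, max undercount delta]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)  # id of the current bucket
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # A new item may have been missed in earlier buckets.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop entries with f + delta <= bucket id.
            self.entries = {
                k: v for k, v in self.entries.items() if v[0] + v[1] > bucket
            }

    def frequent(self, support):
        """Return items whose stored count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {k: f for k, (f, _) in self.entries.items() if f >= threshold}
```

Stored counts undercount the true frequency by at most epsilon times the stream length, so epsilon should be chosen well below the support threshold you care about.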
Another idea, suggested by Anthony Tomasic, is to conduct a performance "bake-off" between a traditional DBMS and one of the DSMS prototypes, using a benchmark such as the Linear Road benchmark. For the traditional DBMS, CQs will have to be supported using either triggers (PostgreSQL supports triggers) or materialized views (I'm not sure whether these are supported in PostgreSQL, and if so whether any attempt is made to support efficient incremental maintenance). Alternatively, you could simulate a CQ by issuing a regular query over and over (one tradeoff to consider here: get the query re-optimized each time, or bypass the frontend overhead by reissuing only the precompiled "canned" version?). Note that the stream benchmark people (researchers drawn from the three major DSMS groups) are currently working on making such a comparison, so you will be duplicating their work but with much less "manpower."
If you are thinking about doing this as your project, be careful: there are so many variables involved in an end-to-end comparison that any results you get will be highly dependent on all kinds of parameters, including the particulars of which DBMS you use, how you hack CQs and streaming data sources into the DBMS, which DSMS prototype you use, what data/workload you use, how the systems are tuned, etc. End-to-end comparison among full-blown systems is the opposite of a scientific experiment in which you vary one parameter at a time. It will be virtually impossible to separate out the effects of the various aspects, and therefore difficult to draw any definite conclusions. For example, if you use TelegraphCQ as the basis for your comparison, how much performance was gained or lost because of adaptivity versus sharing versus its process/thread model? Also, if you compare a commercial DBMS against a research prototype DSMS, the commercial system has the unfair advantage of being far more mature.
Despite all the cautions, this could be a worthwhile thing to try, to see whether the two approaches (DBMS vs. DSMS) are in the same ballpark as far as performance is concerned. Given all the uncontrollable noise, I think only order-of-magnitude differences will be of any significance.
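To make the "issue a regular query over and over" option concrete, here is a rough sketch using Python's built-in sqlite3 module as a stand-in for the traditional DBMS. The table, columns, and predicate are hypothetical (they are not the Linear Road schema), and the watermark on id is just one assumed way to return only new results on each poll.

```python
import sqlite3

# The SQL text stays fixed and only the watermark parameter changes:
# this is the "canned" variant of the repeated query.
QUERY = (
    "SELECT id, segment, speed FROM reports "
    "WHERE id > ? AND speed < 40 ORDER BY id"
)


def poll(conn, last_id):
    """Re-issue the canned query; return (new result rows, new watermark)."""
    rows = conn.execute(QUERY, (last_id,)).fetchall()
    if rows:
        last_id = rows[-1][0]  # remember the highest id already reported
    return rows, last_id


def make_db():
    """Create an in-memory table that plays the role of the stream archive."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reports (id INTEGER PRIMARY KEY, segment INTEGER, speed REAL)"
    )
    return conn
```

Whether the engine really skips re-optimization for a repeated statement depends on the DBMS and driver, so that is something to measure rather than assume. The polling interval is another free parameter that trades result latency against load.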
If you like you may choose to work on a stream-related aspect of your current research (e.g., streaming distributed rendering). All ideas are welcome. The project is entirely open-ended. Whatever you choose, make sure your project is tractable in one semester and has the potential to generate some interesting research results.
Depending on the nature of your project, it may make sense to build your implementation on top of the Berkeley TelegraphCQ prototype, which at this point is the only publicly available data stream management system code base. (No, it's not a coincidence that many of the example projects above are based on TelegraphCQ.) Alternatively, you may want to build something from scratch (e.g., a stand-alone algorithm for clustering data streams). It's up to you.
Synthetic data is very useful for controlling parameters such as arrival rate, selectivity, etc. In addition to any synthetic data used, I would like you to use real data of some sort in your experiments and demonstration if at all possible. You can find data on the web if you hunt a bit. The Lawrence Berkeley Laboratory has some wide-area TCP traces available. Scientific data such as meteorological measurements (see, e.g., NOAA) can also work, although the data rates tend to be quite slow (e.g., one measurement every ten minutes). You are also free to collect your own data by instrumenting computer equipment to obtain network traffic traces, video game control messages (great for spatial data), etc. Make sure to respect the privacy of any persons involved. If you really have great difficulty procuring a real-world data set that meets the needs of your project, in lieu of real data you may generate synthetic data conforming to the Linear Road benchmark or some other benchmark/schema described in a research paper on data streams.
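For the synthetic side, a generator along the following lines gives direct control over arrival rate and selectivity. This is only a sketch under my own assumptions: Poisson arrivals, a single uniformly distributed attribute, and made-up function names.

```python
import random


def synthetic_stream(n, rate, seed=0):
    """Yield n (timestamp, value) tuples with Poisson arrivals at `rate` tuples/second."""
    rng = random.Random(seed)  # fixed seed makes experiments repeatable
    t = 0.0
    for _ in range(n):
        t += rng.expovariate(rate)  # exponential inter-arrival gap, mean 1/rate
        yield t, rng.random()       # value is uniform in [0, 1)


def passes(tup, selectivity):
    """Filter predicate that roughly a `selectivity` fraction of tuples satisfy."""
    return tup[1] < selectivity
```

Because the value is uniform on [0, 1), the predicate `value < s` has an expected selectivity of s, so you can sweep selectivity without regenerating the data.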
Another option is to use temperature sensor data from the CMU SensorNets project. We have access to a repository of around 3GB of data, representing one week of data from 10 sensors sampling temperature at a rate of 10 samples/second. The schema is (sensorID, timestamp, temp_value). An example query might be to detect anomalies by looking for nearby sensors that report wildly different readings at around the same time. This query would involve a windowed self-join between two copies of the sensor stream (joined with an auxiliary static relation to get the position of each sensor), with predicates based on spatial proximity and scalar difference between readings. If you are interested in using this data, please contact the instructor.
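The anomaly query described above could be prototyped roughly as follows. This is a sketch only: the window length, distance limit, and temperature threshold are placeholder values I picked for illustration, the `positions` dictionary stands in for the auxiliary static relation, and readings are assumed to arrive in timestamp order.

```python
import math
from collections import deque


def anomalies(stream, positions, window=1.0, max_dist=5.0, min_diff=10.0):
    """Yield pairs of readings from nearby sensors that disagree within `window` seconds.

    stream yields (sensorID, timestamp, temp_value) tuples in timestamp order;
    positions maps sensorID -> (x, y).
    """
    recent = deque()  # readings still inside the time window
    for reading in stream:
        sid, ts, temp = reading
        # Expire readings that have fallen out of the window.
        while recent and ts - recent[0][1] > window:
            recent.popleft()
        # Probe the window: this is the self-join against earlier readings.
        for other in recent:
            osid, _, otemp = other
            if osid == sid:
                continue
            close = math.dist(positions[sid], positions[osid]) <= max_dist
            if close and abs(temp - otemp) >= min_diff:
                yield other, reading
        recent.append(reading)
```

Each reading is probed against the whole window, which is fine for 10 sensors but would call for a spatial index or per-neighborhood windows at larger scale.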
A student at the University of Wisconsin has collected some online auction data from eBay (crawled in late 2001). See about.txt. Files: items.dtd; items-snippet.xml; items.zip.