Stardust: Tracking Activity in a Distributed Storage System
Eno Thereska, Brandon Salmon, John Strunk, Matthew Wachs, Michael Abd-El-Malek, Julio López, Gregory R. Ganger.
Carnegie Mellon University, Pittsburgh, PA



Performance monitoring in most distributed systems provides minimal guidance for tuning, problem diagnosis, and decision making. Stardust is a monitoring infrastructure that replaces traditional performance counters with end-to-end traces of requests and allows for efficient querying of performance metrics. Such traces better inform key administrative performance challenges by enabling, for example, extraction of per-workload, per-resource demand information and per-workload latency graphs. This paper reports on our experience building and using end-to-end tracing as an on-line monitoring tool in a distributed storage system. Using diverse system workloads and scenarios, we show that such fine-grained tracing can be made efficient (less than 6% overhead) and is useful for on- and off-line analysis of system behavior. These experiences make a case for having other systems incorporate such an instrumentation framework.
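As a rough illustration of the kind of query the abstract mentions, the sketch below aggregates end-to-end trace records into per-workload, per-resource demand and per-workload request latencies. It is not Stardust's schema or query interface: the record layout, the field names, and the functions `demand_by_workload` and `latency_by_workload` are hypothetical, and latency is approximated as the sum of a request's resource demands, assuming the request visits resources serially.

```python
from collections import defaultdict

# Hypothetical trace records (not Stardust's actual format):
# (request_id, workload, resource, demand_ms)
records = [
    (1, "oltp", "cpu", 2.0),
    (1, "oltp", "disk", 8.0),
    (2, "scan", "disk", 20.0),
    (3, "oltp", "cpu", 1.0),
    (3, "oltp", "network", 0.5),
]


def demand_by_workload(records):
    """Sum the demand each workload places on each resource."""
    totals = defaultdict(float)
    for _request_id, workload, resource, demand_ms in records:
        totals[(workload, resource)] += demand_ms
    return dict(totals)


def latency_by_workload(records):
    """Group approximate per-request latencies by workload.

    Assumption: a request's latency is the sum of its resource demands,
    i.e. resources are visited serially with no queueing or overlap.
    """
    per_request = defaultdict(float)
    workload_of = {}
    for request_id, workload, _resource, demand_ms in records:
        per_request[request_id] += demand_ms
        workload_of[request_id] = workload

    latencies = defaultdict(list)
    for request_id in sorted(per_request):
        latencies[workload_of[request_id]].append(per_request[request_id])
    return dict(latencies)


demand = demand_by_workload(records)
latency = latency_by_workload(records)
```

A counter-based monitor would typically expose only the per-resource totals; because each trace record keeps the request and workload identity, the same data can be regrouped along either dimension.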

BibTeX entry

@inproceedings	{ thereska-sigmetrics2006,
  author	= "Eno Thereska and Brandon Salmon and John Strunk
		   and Matthew Wachs and Michael Abd-El-Malek and Julio L{\'o}pez
		   and Gregory R. Ganger",
  title		= "Stardust: Tracking activity in a distributed storage system",
  organization	= "{ACM}",
  booktitle	= "Proceedings of the Joint International Conference on Measurement
		   and Modeling of Computer Systems ({SIGMETRICS'06})",
  month		= "Jun",
  year		= 2006,
  address	= "Saint-Malo, France"