DISC-Finder: A data-intensive scalable cluster finder for astrophysics

Bin Fu, Kai Ren, Julio Lopez, Eugene Fink, and Garth Gibson

Parallel Data Laboratory, Carnegie Mellon University, 2010. Technical Report CMU-PDL-10-104.


DISC-Finder is a scalable, distributed, data-intensive group finder for analyzing observational and simulated astrophysics datasets. Group finding is a form of clustering used in astrophysics to identify large-scale structures such as clusters and superclusters of galaxies. DISC-Finder runs on commodity compute clusters and scales to large datasets with billions of particles. It is designed to operate on datasets that are much larger than the aggregate memory of the computers on which it runs. As a proof of concept, we have implemented DISC-Finder as an application on top of the Hadoop framework. DISC-Finder has been used to cluster the largest open-science cosmology simulation datasets, which contain as many as 14.7 billion particles. We evaluate its performance and scaling properties and describe the optimizations we applied.