Disc-Distances: Analyzing the Distribution of Distances between Galaxies
Copyright (c) 2010
Bin Fu, Eugene Fink, Julio Lopez, Christos Faloutsos, and Garth Gibson
All Rights Reserved.

You may use this code without fee, for educational and research purposes.
Any for-profit use requires written consent of the copyright holders.

Version 0.2
Date: 2011-03-05
Main Contact: Bin Fu (binf@cs.cmu.edu)

1. General Information
In this package we provide a hybrid method to calculate the correlation 
function: a standard cosmological statistic for analyzing the distribution 
of matter in the universe. Simply put, given a 3-d point set and several 
range queries, this code calculates how many pairs of points fall into each 
range.

The brute-force technique is O(N^2), where N is the number of points. We 
find that sampling, combined with a hybrid of the brute-force and kd-tree 
techniques, quickly achieves high precision.

In this version of the code we provide our initial sequential implementation. 

2. Environment
Only Java is needed to run our code. We developed and compiled the code 
with Java 1.6.

3. Source code information
1) List of files
Inside DiscDistance.zip:

CF.class         : Main program
200kb            : A sample input file that contains 200,000 points.
                   The input file stores the points sequentially; each
                   point is represented by three 8-byte double values,
                   its coordinates in 3-d space.
README.txt       : Readme file.
LICENCE.txt      : ASTRO-DISC licence file.
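The binary point layout described above can be read with plain Java I/O. The following is an illustrative sketch, not part of the package; the class and method names are our own, and we assume the file stores doubles in Java's default big-endian byte order (as produced by DataOutputStream):

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Sketch of reading the input layout: each point is three consecutive
// 8-byte doubles, its x, y, z coordinates in 3-d space.
public class ReadPoints {
    public static double[][] read(String path, int numPt) throws IOException {
        double[][] pts = new double[numPt][3];
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            for (int i = 0; i < numPt; i++) {
                pts[i][0] = in.readDouble(); // x
                pts[i][1] = in.readDouble(); // y
                pts[i][2] = in.readDouble(); // z
            }
        }
        return pts;
    }
}
```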

4. How to run the program
After extracting DiscDistance.zip, please run the code as:
java CF largeFile inputfile numPt error small big multiplier
<bruteforceNum kdtreeNum> outputfile

largeFile: A binary flag indicating whether to use the new data loading
           mechanism, which is intended for large input files (largeFile=1),
           or the ordinary loading mechanism (largeFile=0). Please refer to
           Section 5 for more details.

inputfile:  Name of input file.

numPt:  Number of points in input file.
 
error:  Determines the desired precision of the calculation. For example,
        error=0.01 indicates that the result will be within 1% of the true
        value. The smaller the error, the more time processing will take.

small, big, multiplier: These three parameters determine the query ranges.
                        Each range has the same length on a log scale.
                        For example, to calculate the correlation function
                        on the three ranges (0.1,0.2), (0.2,0.4), and
                        (0.4,0.8), we would set small=0.1, big=0.8, and
                        multiplier = 0.2/0.1 = 2.
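As a sketch (not part of the distributed code; the names here are illustrative), the ranges defined by small, big, and multiplier can be enumerated like this:

```java
import java.util.ArrayList;
import java.util.List;

// Enumerates the query ranges defined by small, big, and multiplier.
// Successive range boundaries differ by the constant factor `multiplier`,
// so all ranges have the same length on a log scale.
public class QueryRanges {
    public static List<double[]> ranges(double small, double big, double multiplier) {
        List<double[]> out = new ArrayList<>();
        double lo = small;
        // The small tolerance guards against floating-point rounding
        // when the last boundary should equal big exactly.
        while (lo * multiplier <= big * (1 + 1e-9)) {
            out.add(new double[] { lo, lo * multiplier });
            lo *= multiplier;
        }
        return out;
    }
}
```

With small=0.1, big=0.8, multiplier=2, this yields the three ranges (0.1,0.2), (0.2,0.4), and (0.4,0.8) from the example above.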

bruteforceNum, kdtreeNum: Initial numbers of sampled points for the
                          brute-force and kd-tree techniques. These two
                          parameters are optional: please set either both
                          of them or neither. The default values are 1000
                          and 3000, respectively. As a rule of thumb, if
                          you calculate very small or very large query
                          values, increase kdtreeNum; otherwise, increase
                          bruteforceNum.

outputfile: Name of output file. Each line in the output file contains a
            number, which is the result for one range query.
            For example, if small=0.1, big=0.8, and multiplier=2, then the
            output file will contain three numbers: the first indicates
            how many pairs of points are within (0.1,0.2), the second
            (0.2,0.4), and the third (0.4,0.8).


For example (the sample file 200kb contains 200,000 points):
java CF 1 200kb 200000 0.005 0.1 0.18 1.2 out

5. Added/changed functionality in this version.

* We added a new loading mechanism to our code, intended for loading very 
  large input files. DISC-Distance uses a sample-based method, so it can 
  process very large inputs (trillions of points). For this sequential 
  implementation specifically, we have tested a dataset of 200 million 
  points (5GB on disc).

  Users can choose whether to use the new loading mechanism via the
  largeFile parameter: if largeFile is set to 1, the new mechanism is 
  invoked. Generally speaking, if the input data is very large (gigabytes),
  users have to use the new loading method, since it is hard to load
  everything into memory; for small input files, however, the new loading
  mechanism is slower than the ordinary method.

* To support the new loading method, we converted the input format from
  ASCII to binary: each point is stored in 24 bytes, as three Java double 
  values.
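  A hypothetical helper (not shipped with the package) can convert old ASCII
  files to the new format. We assume the ASCII layout holds three
  whitespace-separated numbers per point, one point per line; adjust the
  parsing if your files differ:

```java
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;

// Converts an ASCII coordinate file (assumed: three whitespace-separated
// numbers per line) into the binary format of 24 bytes per point, i.e.
// three 8-byte doubles as written by Java's DataOutputStream.
public class AsciiToBinary {
    public static void convert(String asciiPath, String binPath) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(asciiPath));
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream(binPath)))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;        // skip blank lines
                String[] parts = line.split("\\s+"); // x, y, z coordinates
                for (int i = 0; i < 3; i++) {
                    out.writeDouble(Double.parseDouble(parts[i]));
                }
            }
        }
    }
}
```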

* We fixed a bug in the previous version and removed unnecessary input 
  parameters.

* We are working on a parallel version of DISC-Distance using Hadoop.