=====================================================================

-=[ Summary ]=-

This archive contains C++ code for scalably estimating the
correlation fractal dimension in constant space[*], using
only a single pass through the data.

Note that since it is an approximate, randomized algorithm,
we do not recommend its use on unusually small data sets;
for small sets, you might as well use a full box-counting
algorithm, since the space/time gains may not offset the
accuracy losses from random noise.




[*] Well, you do have to store one vector at a time, so THAT scales
    with dimensionality.  But other than that, neither the program
    code size nor the number of counters are influenced by either
    dimensionality or vector count.
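
For the curious, the constant-space trick is the "tug-of-war"
sketch of Alon et al. (see the recommended reading below): each
counter accumulates a random +/-1 hash of the grid cell each
vector falls into, and the squared counter estimates the second
moment of the cell counts.  A rough, hypothetical C++ illustration
of the idea -- the names and hash functions here are ours, not the
package's:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative tug-of-war (AMS) counter.  Space is one double plus a
// seed, regardless of how many vectors stream past.

// Derive a +/-1 value from a cell id and a seed (illustrative mixer;
// the real code's random functions will differ).
static int signHash(uint64_t cell, uint64_t seed) {
    uint64_t x = cell ^ seed;
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return (x & 1) ? +1 : -1;
}

// Quantize a vector to a grid cell of side r, folded into one id.
static uint64_t cellId(const std::vector<double>& v, double r) {
    uint64_t id = 1469598103934665603ULL;           // FNV offset basis
    for (double x : v) {
        int64_t q = (int64_t)std::floor(x / r);
        id = (id ^ (uint64_t)q) * 1099511628211ULL; // FNV prime
    }
    return id;
}

struct TugCounter {
    double z = 0.0;
    uint64_t seed;
    explicit TugCounter(uint64_t s) : seed(s) {}
    void add(const std::vector<double>& v, double r) {
        z += signHash(cellId(v, r), seed);
    }
    // z^2 is an unbiased estimate of the sum of squared cell counts.
    double secondMoment() const { return z * z; }
};
```

With n points in one cell, z is +/-n and z^2 is exactly n^2; averaging
many such counters is what drives the error down.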



-=[ Manifest ]=-

Makefile:
  A GNU Makefile.  The GNUisms are only used for automated
  dependency generation; otherwise it is a rather generic
  'form letter' Makefile.


RUNME.pl:  
  An automated compilation/testing Perl 5 script.

mktemp.pm:
  A Perl module for making temporary files; used by RUNME.pl.  



DiagEn.pl:
  A script for generating point sets uniformly distributed along
  single diagonal lines.

koch.pl:
  A script for generating Koch snowflake data sets.

LineGen.pl:  
  A fractal line generator used by RUNME.pl.

TriGen.pl:
  A Sierpinski triangle generator used by RUNME.pl.


longbeach.inp:
  Sample data set -- the locations of road intersections in Long
  Beach County, California, as expressed in two-dimensional
  coordinates.

mgcounty.inp:
  Sample data set -- the locations of road intersections in
  Montgomery County, Maryland, as expressed in two-dimensional
  coordinates.


adt.h:
  Common types and constants.

array.cc, array.h:    
  A rather generic template array class.  My suspicion is that 
  working with raw double arrays and passing around lengths as well
  might result in faster, if uglier, code.

list.c, list.h:
  Doubly-linked list, implemented using structs and stand-alone
  functions.

main.cc:
  A simple testing program.  It allows changing all the parameters, 
  and does some sanity checking on them; it also outputs useful 
  information such as the estimated fractal dimension, the 
  y-intercept and correlation, and the log/log points that survived 
  a trimming phase.

queue.c, queue.h:
  A struct-based queue.

template.cc:
  A file used solely for instantiating templates exactly once.

tug.cc, tug.h:
  Implementation and interface of the TugApprox class, which takes
  a Wrapper (which provides a bare-bones parser) object and can be
  used to compute the correlation fractal dimension.

wrapper.cc, wrapper.h:
  Implementation and interface of the Wrapper class, which can 
  handle very basic flat files -- one vector per line, with 
  numbers separated by whitespace or commas; comments are 
  indicated by semicolons in a to-end-of-line fashion.
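
The accepted format is simple enough that a line parser fits in a
few lines.  The sketch below is merely illustrative -- it is not the
actual wrapper.cc code, and the function name is ours:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Parse one line of the Wrapper flat-file format: numbers separated
// by whitespace or commas, with ';' starting a to-end-of-line comment.
static std::vector<double> parseVectorLine(const std::string& line) {
    std::string s = line.substr(0, line.find(';'));  // drop any comment
    for (char& c : s)                                // commas act as spaces
        if (c == ',') c = ' ';
    std::vector<double> v;
    std::istringstream in(s);
    double x;
    while (in >> x) v.push_back(x);
    return v;
}
```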

Licensing terms, in LICENSE.

This README.



-=[ Dependencies ]=-
  
A C++ compiler is required, preferably GNU gcc/g++.  Different
versions may possibly result in different behavior with regard
to templates; the version tested was gcc 3.2.3.

Preferably, GNU make, which automates compiling all the 
.cc files, using ar and ranlib to generate a libtug.a (which 
contains template.o, tug.o and wrapper.o) -- for linking and 
reuse in other programs -- and linking this with main.o to 
produce the 'fractug' binary.

For testing purposes _only_, RUNME.pl requires Perl,
gnuplot, and Ghostview.



-=[ Immediate things to do ]=-

Ideally, the RUNME.pl will work.  Its objectives are simply
to compile the C++ code into a 'fractug' binary and a 
'libtug.a' library, and then test this with a few data sets.

*) Sets of points uniformly distributed along diagonal lines
   in 2, 3 and 4 dimensions, with 1000 points per set.

*) The Sierpinski triangle, 5k points, 2D embedding.
   This equilateral unit triangle is formed via the 
   pattern

   #         2
   #        / \
   #       4---5
   #      / \ / \
   #     0---3---1

   with 0 anchored at the origin and 1 at (1,0).
   Given (0,1,2) on the FIFO queue, we queue (0,3,4), 
   (4,5,2) and (3,1,5).

   The theoretical fractal dimension of the infinite
   version of this set is log(3)/log(2), approximately 1.58.

   This set can be generated via TriGen.pl.  Note that
   it's a naive queue-based implementation, so it's
   trivial to make it run out of memory by specifying
   a high-enough number of points.
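
   For illustration, the queue-based construction just described
   might look like this in C++ (a sketch only; TriGen.pl is the
   actual generator, and the helper names here are ours):

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <queue>
#include <vector>

// Pop a triangle, take the three edge midpoints (vertices 3, 4, 5),
// queue the corner sub-triangles (0,3,4), (4,5,2), (3,1,5), and
// collect vertices until enough points have been emitted.  Like the
// Perl version, the queue grows without bound as points are added.

struct Pt { double x, y; };

static std::vector<Pt> sierpinski(std::size_t nPoints) {
    Pt a{0, 0}, b{1, 0}, c{0.5, std::sqrt(3.0) / 2};  // unit equilateral
    std::queue<std::array<Pt, 3>> fifo;
    fifo.push({a, b, c});
    std::vector<Pt> out{a, b, c};
    auto mid = [](Pt p, Pt q) { return Pt{(p.x + q.x) / 2, (p.y + q.y) / 2}; };
    while (out.size() < nPoints) {
        std::array<Pt, 3> t = fifo.front(); fifo.pop();
        Pt m01 = mid(t[0], t[1]), m02 = mid(t[0], t[2]), m12 = mid(t[1], t[2]);
        out.push_back(m01); out.push_back(m02); out.push_back(m12);
        fifo.push({t[0], m01, m02});   // (0,3,4)
        fifo.push({m02, m12, t[2]});   // (4,5,2)
        fifo.push({m01, t[1], m12});   // (3,1,5)
    }
    out.resize(nPoints);
    return out;
}
```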

*) A Koch snowflake, generated using koch.pl.

*) A line generated with gaps, 10k points, 50D embedding.

   The pattern here is

   #     0---2---3---4---1

   Given line segment 0-1, we add 0-2 and 3-4 to the FIFO
   queue.  

   The theoretical fractal dimension of the infinite 
   version of this set is 0.50.

   The LineGen.pl script generates this set; the same
   caveats as for TriGen.pl apply.
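
   The gap-line construction can likewise be sketched in C++
   (illustrative only -- LineGen.pl embeds the line in 50
   dimensions, while this sketch stays one-dimensional):

```cpp
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Split segment 0---1 at the quarter points 2, 3, 4 and keep only
// 0-2 and 3-4: two pieces of one-quarter length, hence dimension
// log(2)/log(4) = 0.5.  Endpoints are emitted as the point set.
static std::vector<double> gapLine(std::size_t nPoints) {
    std::queue<std::pair<double, double>> fifo;
    fifo.push({0.0, 1.0});
    std::vector<double> out;
    while (out.size() < nPoints) {
        auto [lo, hi] = fifo.front(); fifo.pop();
        double q = (hi - lo) / 4;
        out.push_back(lo); out.push_back(hi);
        fifo.push({lo, lo + q});              // segment 0-2
        fifo.push({lo + 2 * q, lo + 3 * q});  // segment 3-4
    }
    out.resize(nPoints);
    return out;
}
```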

*) The Montgomery County and Long Beach road intersection
   data sets.


Possible reasons for the RUNME.pl script not working 
include --

1.  A UNIXism on a non-UNIX platform -- for instance,
    the script tries to verify file permissions on the 
    Perl scripts.
2.  Not having Perl, gnuplot, or ghostview.
3.  Not having enough disk space to generate a data set,
    or the temporary input files passed to gnuplot.
    Note that the 50D line set is approximately 8MB in
    size.
4.  The script rejecting an FD value as out-of-range;
    since the random number generator may behave 
    differently on different platforms, it is possible
    that on some platforms the values will yield a
    greater-than-anticipated error.


If the RUNME.pl script makes it to the end, however, 
with a rather high probability everything's working.



-=[ Usage notes ]=-

There are, perhaps, two branches to choose from here.
The first is to use the 'libtug.a' library as part of
a C++ program, making library calls directly.  The
three classes are TugApprox, Wrapper, and Array.

The TugApprox class provides a basic interface to the
tug-of-war approximator.  Included are a basic set of
parameters, a function to estimate the second moments
and generate log-log data, a function to find flat
regions at the extremes of a log-log plot, and a
function that tries to robustly fit a line to trimmed
data.  Incidentally, the line-fitter is prone to 
generate assert() errors if the random noise has so
overwhelmed the data as to make it all seem flat, or if
the radius range is chosen badly enough to obtain the
same effect.  If you have a better trimmer or 
line-fitter, simply bypass the frac_dim method and
go straight to the compute_logs function.  
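
As an illustration of what a replacement fitter might look like,
here is a plain least-squares fit over (log radius, log moment)
pairs, whose slope plays the role of the fractal dimension
estimate.  This is a sketch under our own names, not the package's
robust fitter:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Ordinary least-squares line fit, returning slope, y-intercept and
// the Pearson correlation coefficient of the points.
struct Fit { double slope, intercept, correlation; };

static Fit fitLine(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size();
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
    }
    double vx = n * sxx - sx * sx;   // n^2 times the (co)variances
    double vy = n * syy - sy * sy;
    double cov = n * sxy - sx * sy;
    Fit f;
    f.slope = cov / vx;
    f.intercept = (sy - f.slope * sx) / n;
    f.correlation = cov / std::sqrt(vx * vy);
    return f;
}
```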

The Wrapper class provides the interface to the data.
Right now, it simply parses the usual CSV or 
whitespace-separated vector file.  Note that while 
there are references to rewind() and so forth, the 
TugApprox methods will make only one pass through the 
data.  It's a simple-enough interface which allows
you to, say, add the capability to read compressed
files if you need to, without changing TugApprox.
It also has a 'masking' feature, by which columns
can be used an arbitrary number of times, including
zero, and in any order.

The Array template class provides a rather generic
resizable array implementation with miscellaneous
forms of syntactic sugar.  It's used primarily to
avoid passing around structures with both a pointer
and a length, and to allow for frequent array
expansion with ease.


The 'main.cc' file provides a sample program that 
allows command-line specification of all parameters.
These are:


  CLI       TugApprox
Argument:     member      Type         Purpose       
=========   =========     ==========   =============================
--silent    silent        boolean      If true, library generates no
                                       output.
--verbose   (none)        boolean      Used by main.cc to decide
                                       whether to report trimmed
                                       log-log points.
--min       radius_min    double       Minimum radius in geometric
                                       series.  Equivalent to box
                                       side in box-counting alg.
--max       radius_max    double       Maximum radius in geometric
                                       series.
--num       radius_count  pos. int     Number of radii to use.
--s1        s1            pos. int     Accuracy parameter.
--s2        s2            pos. int     Confidence parameter.
--rnd       n             uns. int     Random number generator seed.
--col       (none)        array        It's actually a Wrapper
                                       parameter used to specify
                                       columns to use.  If none
                                       specified, the default is 
                                       all.

The radius settings require the most care; setting a poor 
range may mean missing the "interesting" parts completely, or
having too little resolution and thus too few points for a 
useful linear regression.
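
To make the geometric-series point concrete, here is one
hypothetical way such radii could be laid out between radius_min
and radius_max; the exact spacing TugApprox uses may differ:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Lay out `count` radii in a geometric series from rmin to rmax
// (count must be at least 2).  Successive radii share a fixed ratio,
// so the radii are evenly spaced on the log axis of the log-log plot.
static std::vector<double> radiusSeries(double rmin, double rmax, int count) {
    std::vector<double> r(count);
    double ratio = std::pow(rmax / rmin, 1.0 / (count - 1));
    r[0] = rmin;
    for (int i = 1; i < count; ++i) r[i] = r[i - 1] * ratio;
    return r;
}
```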

Lowering s1 and s2 will increase speed, but at the cost of
accuracy or confidence.  The speed gains come because for
each radius, we generate (s1*s2) random functions, each of
which gets used for a counter... and the final value will
be based on the median of s2 means, each of a set of s1 
counters.
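
The combining step just described can be sketched as follows
(illustrative only; the function name and layout are ours):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Combine s1*s2 per-counter estimates: average them in s2 groups of
// s1 (averaging boosts accuracy), then take the median of the group
// means (the median boosts confidence).  `est` holds the groups
// contiguously, s1 estimates per group.
static double medianOfMeans(const std::vector<double>& est,
                            std::size_t s1, std::size_t s2) {
    std::vector<double> means(s2);
    for (std::size_t j = 0; j < s2; ++j) {
        double sum = 0;
        for (std::size_t i = 0; i < s1; ++i) sum += est[j * s1 + i];
        means[j] = sum / s1;
    }
    std::sort(means.begin(), means.end());
    return means[s2 / 2];
}
```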


The second approach is simply to use the fractug binary 
as-is via a script, being careful to properly specify 
command-line parameters.  See the table above for 
description of its arguments, and RUNME.pl for a few 
invocations and how one might parse its output.




-=[ Sample invocations ]=-

 ./fractug <data file>

   Report the log-log points that survived trimming, the 
   estimated fractal dimension, y-intercept, and 
   correlation for the data file.

 ./fractug --verbose <data file>

   The same, except also include the points removed by trimming.

 ./fractug --min 0.0048828125 --max 1 <data file>

   Standard, but restricting the radius range to
   [0.0048828125, 1], which might be useful if you have
   prior reason to believe that no two points are
   more than 1 apart, and that some pairs may be
   nearly as close as the minimum.

 ./fractug --s1 32 <data file>

   Doubling the accuracy parameter from the default of
   16 would double the time required, but may be
   helpful in certain cases.  The radius range, 
   however, is more likely to be critical.

 ./fractug --col [0..4 10..14 ] <data file>

   A normal invocation, but pretending that only
   columns 0-4 and 10-14 exist.


Note that values may vary across platforms due to different 
RNGs.  Hence, while running ./fractug on the sample data -- 
such as that generated via the RUNME.pl script, which uses 
LineGen.pl and TriGen.pl -- *should* get you results that
are close to the theoretical values, at least for non-trivial
sizes, it's difficult to say how close they should be.



-=[ Contacting the authors ]=-

christos@cs.cmu.edu     <=> Christos Faloutsos
alwong@andrew.cmu.edu   <=> Angeline Wong
      lw2j@cs.cmu.edu   <=> Leejay Wu


-=[ Legalities ]=-

See the 'LICENSE' file.  



-=[ Recommended reading ]=-

The approximation algorithm is derived from previous work
on join sizes.  The following BibTeX entry may be helpful.

@inproceedings{Alon99Tracking,
    author = { Noga Alon and Phillip B. Gibbons and Yossi Matias and
               Mario Szegedy },
    title  = { Tracking Join and Self-Join Sizes in Limited Storage },
    booktitle = { Proc. of 18th {ACM} {SIGMOD-SIGACT-SIGART} Symposium
                  on Principles of Database Systems ({PODS})},
    year = { 1999 },
    month = { jun },
    location = { Philadelphia, {PA}},
    pages = { 10-20 }
}



For results using this package, see: 

@inproceedings{Wong03Fast, 
  author = { Angeline Wong and Leejay Wu and Phillip B. Gibbons }, 
  title  = { Fast estimation of Fractal Dimension and Correlation  
             Integral on Stream Data }, 
  booktitle = { Proceedings of the Second Workshop on Fractals, 
                Power Laws and Other Next Generation Data Mining 
                Tools }, 
  year = { 2003 }, 
  month = { August }, 
  location = { Washington, {DC} } 
} 
