
================================================

-=[ Summary ]=-

This is a set of Perl modules and scripts that provides a
framework for calculating fractal dimensions.  Some effort
has been made to allow the package to function on large data
sets that would not fit in RAM -- see Wrapper.pm, for instance.


-=[ Manifest ]=-

The files involved are the following.

6 Perl5 modules
    DiskFracDim.pm
    Pair_Count.pm
    Robust_LSFit.pm
    Trim_Flat.pm
    mktemp.pm
    Wrapper.pm


10 Perl5 scripts
    FixPlace.pl
    LineGen.pl
    Pair_Count.pl
    RUNME.pl
    Robust_LSFit.pl
    TriGen.pl
    Trim_Flat.pl
    fd.pl
    to_eps.pl
    to_png.pl


Licensing terms, in LICENSE.

Some more detail about pair counting, in 
Pair_Counting.txt.

This README.



-=[ Note to prior users ]=-

The packages have been reorganized for increased 
modularity.  The DiskFracDim.pm interface itself is 
mostly unchanged, however.


-=[ Dependencies ]=-

A working Perl5 installation is required.  Gnuplot is
needed only by to_eps.pl and to_png.pl.



-=[ Quick demo ]=-

On a Unix system with perl5, Gnuplot, and Ghostview, you
may get good results by trying './RUNME.pl', which will
perform everything noted in the following section and do
a brief test.


-=[ Immediacies ]=-

The end-user will need to choose a location for the
package, and then change or remove the directory in the

  use lib '/usr0/lw2j/private/Research/FracDim';

lines in 

   DiskFracDim.pm    
   Pair_Count.pl
   Pair_Count.pm
   Robust_LSFit.pl 
   Trim_Flat.pl
   fd.pl
   to_eps.pl
   to_png.pl


to wherever you copied them.  On a Unix system, the
'FixPlace.pl' script can change the 'use lib' lines
for you; on other systems, the script itself may need
modification.

  1) Install the entire package -- modules and scripts --
     somewhere.
  2) Make sure they're writeable.
  3) Run FixPlace.pl, which will edit the 'use lib'
     sections.

  The scripts can then be moved, copied, linked to, or
  so forth, anywhere else on the same system since they
  now have a correct module path.
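
What step 3 amounts to can be sketched as below.  This is
an illustration only, not the contents of FixPlace.pl; the
sub name fix_use_lib, the pattern, and the target directory
'/home/me/FracDim' are all made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (not taken from FixPlace.pl): rewrite every
# single-quoted "use lib '...';" line in a chunk of Perl source so
# that it points at $new_dir instead.
sub fix_use_lib {
    my ($source, $new_dir) = @_;
    $source =~ s/^(\s*use\s+lib\s+)'[^']*'(\s*;)/${1}'${new_dir}'${2}/mg;
    return $source;
}

my $before = "#!/usr/bin/perl\n"
           . "use lib '/usr0/lw2j/private/Research/FracDim';\n"
           . "use Pair_Count;\n";

# Prints the same three lines with the directory replaced.
print fix_use_lib($before, '/home/me/FracDim');
```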

The scripts should be able to find Perl if it's in the
path under the name 'perl5'.  If not, the first line
may need to be changed.  Likewise, change may be 
necessary on non-Unix boxes.




-=[ Usage ]=-

The script fd.pl provides a very basic tool for using
the package.  Running it without any arguments gives
simple usage instructions.

See the 'Module organization' section for details on
what does what.  Of most interest, perhaps, is the
Pair_Count.pm module.



-=[ A simple demo ]=-

Use TriGen.pl to generate a simple Sierpinski data set --
say, 10,000 points.  
  
      ./TriGen.pl 10000 > sierpinski.10000

Then, compute its fractal dimension using default 
parameters.

      ./fd.pl sierpinski.10000

Were the set infinite, the true value would be log_2(3),
or approximately 1.58; this implementation claims
approximately 1.633:

      Fractal Dimension = 1.63347865762628
      Slope             = 1.63347865762628
      Y Intercept       = 26.6051164100033
      Correlation       = 0.999991103831911

The slope is identical to the fractal dimension estimate
since the exponent, q, was unspecified and defaulted to
2 -- the correlation fractal dimension.  If we instead try

     ./fd.pl -q 0 sierpinski.10000

the result on my box is

      Fractal Dimension = 1.59986234710169
      Slope             = -1.59986234710169
      Y Intercept       = 0.426860069186125
      Correlation       = -0.999959669264706


which yields a very good estimate for the Hausdorff 
fractal dimension, and one where the slope is NOT the
estimate.  With the notable exception of q=1, the 
estimate should always equal the slope divided by
(q-1).  The y-intercept is only provided in case one
wants to plot the best-fit line, while the 
correlation coefficient is a measure of linearity in
log-log space; ideally, the magnitude would be 1.
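
The slope-to-estimate rule above can be sketched in a few
lines.  The sub name dimension_from_slope is made up for
this example and is not part of the package; the numbers
are the slopes reported earlier in this section.

```perl
use strict;
use warnings;

# Convert a fitted log-log slope into a fractal dimension estimate,
# following the rule stated in this README: divide by (q-1), except
# for q=1 where the slope is used as-is.
sub dimension_from_slope {
    my ($slope, $q) = @_;
    return $slope if $q == 1;
    return $slope / ($q - 1);
}

printf "q=2: %.5f\n", dimension_from_slope( 1.63347865762628, 2);
printf "q=0: %.5f\n", dimension_from_slope(-1.59986234710169, 0);
```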


Going back to q=2, we can specify more radii:

      ./fd.pl -r_count 159 sierpinski.10000

which yields

    Fractal Dimension = 1.61496963077228
    Slope             = 1.61496963077228
    Y Intercept       = 26.5789796789311
    Correlation       = 0.999957229435449

Or, perhaps, a different range -- [0.00001, 1]:

     ./fd.pl -r_min 0.00001 -r_max 1 sierpinski.10000

resulting in

    Fractal Dimension = 1.6093375128502
    Slope             = 1.6093375128502
    Y Intercept       = 26.5505443178681
    Correlation       = 0.999892092395302


Before changing the radius parameters, see the
'Cautionary note' section below.



-=[ Another test ]=-

We also include 'LineGen.pl', which generates a self-similar
line embedded in 50 dimensions; starting with a line segment 
AB, it divides it into four equal parts and queues the first 
and third segments. 

      ./LineGen.pl 5000 > line.5000

yields 5000 50-dimensional points; 

      ./fd.pl line.5000

on my system yields

    Fractal Dimension = 0.494743043125671
    Slope             = 0.494743043125671
    Y Intercept       = 22.6729890492484
    Correlation       = 0.998321392996083


This meshes well with the belief that the set, were it of
infinite size, would have a fractal dimension of 0.5 for
reasons listed in the generator script.
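
One standard way to arrive at 0.5 (which may or may not be
the argument given in the generator script) is the
similarity dimension: each step keeps 2 pieces, each 1/4
the size of the whole, and log(2)/log(4) = 0.5.  The sketch
below applies the same splitting rule in one dimension
only; it is not LineGen.pl, and both sub names are made up.

```perl
use strict;
use warnings;

# Similarity dimension of a set built from $copies pieces, each a
# copy of the whole shrunk by the factor $scale (0 < $scale < 1).
sub similarity_dimension {
    my ($copies, $scale) = @_;
    return log($copies) / log(1 / $scale);
}

# One-dimensional illustration of the rule described above: take a
# segment off the queue, cut it into four equal parts, and queue the
# first and third parts.  Returns the left endpoints of $n segments
# ($n must be at least 1).
sub quarter_set_points {
    my ($n) = @_;
    my @queue = ([0, 1]);                 # [start, length] of AB
    while (@queue < $n) {
        my ($start, $len) = @{ shift @queue };
        my $part = $len / 4;
        push @queue, [$start, $part], [$start + 2 * $part, $part];
    }
    return map { $_->[0] } @queue[0 .. $n - 1];
}

printf "similarity dimension = %.3f\n", similarity_dimension(2, 1 / 4);
print join(' ', quarter_set_points(4)), "\n";
```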


-=[ Cautionary note... ]=-

The default values are extreme -- 2^-20 to 2^18
inclusive, with 39 radii, as set in
set_default_params() in Pair_Count.pm.  These values
ensure that every radius is exactly double the previous.

If you change the radius parameters -- the minimum, 
maximum, or number -- it is advisable to set them in
such a way that they form a geometric series where the
multiplier is an integer.  The package will support 
series with non-integral multipliers, *but* doing so
means that the grids do not perfectly nest when 
increasing radii, which in turn removes the 
monotonicity guarantee and may well impair accuracy.

In the provided example, it *does* give results that
match well with the theoretical results.  However, it
need not have been the case.  For the range of 
  
   [ 0.00001, 1]

with the default radius_count of 39 the radius 
multiplier was actually ~1.353876, since
(1.353876^38) * 0.00001 ~ 0.999995.
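
That multiplier can be checked with a short calculation.
The sketch assumes, as the arithmetic above implies, that
the radii form a geometric series with both endpoints
included; the sub name radius_multiplier is made up.

```perl
use strict;
use warnings;

# Multiplier between consecutive radii when $count radii (at least
# 2) are spread geometrically over [$r_min, $r_max], endpoints
# included.  There are $count - 1 steps between the endpoints.
sub radius_multiplier {
    my ($r_min, $r_max, $count) = @_;
    return ($r_max / $r_min) ** (1 / ($count - 1));
}

# The range [0.00001, 1] with 39 radii: a non-integral multiplier.
printf "%.6f\n", radius_multiplier(0.00001, 1, 39);
```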

Obviously, we're constrained to have an integer
number of radii, so it's easier if we choose the 
maximum appropriately compared to trying to guess
possible radii.

Using a multiplier of 2, we note that 2^17 is
131,072; which means that using 18 radii, a minimum
of 0.00001, and a maximum of 1.31072 will retain 
perfect grid alignment.

      ./fd.pl -r_min 0.00001 -r_max 1.31072 -r_count 18 \
              sierpinski.10000

results in 

    Fractal Dimension = 1.60410688680956
    Slope             = 1.60410688680956
    Y Intercept       = 26.545737112734
    Correlation       = 0.999999009839121

This doesn't change the results much.  And, from
testing, using non-integral multipliers usually
*does* provide similar results, so it's still
supported by the modules.  But be warned that if
you get results that absolutely *don't* make 
sense, this is one prime suspect.
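
Going the other way, one can fix the minimum, the
multiplier and the number of radii, and let the maximum
fall out of the calculation.  A minimal sketch, with a
made-up sub name:

```perl
use strict;
use warnings;

# List $count radii starting at $r_min, each $multiplier times the
# previous one.  The last entry is the value to pass as the maximum.
sub aligned_radii {
    my ($r_min, $multiplier, $count) = @_;
    return map { $r_min * $multiplier ** $_ } 0 .. $count - 1;
}

my @radii = aligned_radii(0.00001, 2, 18);
printf "r_min = %g, r_max = %g, r_count = %d\n",
    $radii[0], $radii[-1], scalar @radii;
```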

An incidental benefit of perfect grid alignment
is that there's an algorithmic tweak that should,
in theory, make it faster; the Pair_Count.pm 
module will iterate over occupied grid cells when
increasing radii, instead of iterating over the
original data.
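
The idea can be sketched as follows for a multiplier of 2.
This is not the code in Pair_Count.pm: the sub names, the
string cell keys, and the use of summed squared occupancies
as the q=2 statistic are all assumptions made for the
example.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Bucket points into grid cells of side $r.
# Returns a hash ref of "i,j,..." => number of points in that cell.
sub occupancy {
    my ($points, $r) = @_;
    my %cells;
    for my $p (@$points) {
        my $key = join ',', map { floor($_ / $r) } @$p;
        $cells{$key}++;
    }
    return \%cells;
}

# Build the grid for side 2*$r from the occupied cells of the side-$r
# grid, without touching the original points.  This relies on the
# grids nesting: cell index i lies inside cell floor(i/2) of the
# coarser grid.
sub coarsen {
    my ($cells) = @_;
    my %coarse;
    for my $key (keys %$cells) {
        my $parent = join ',', map { floor($_ / 2) } split /,/, $key;
        $coarse{$parent} += $cells->{$key};
    }
    return \%coarse;
}

# Sum of squared cell occupancies.
sub sum_squares {
    my ($cells) = @_;
    my $total = 0;
    $total += $_ ** 2 for values %$cells;
    return $total;
}

my $points = [ [0.1, 0.1], [0.6, 0.1], [1.2, 0.3], [3.5, 3.5] ];
my $fine   = occupancy($points, 0.5);
my $coarse = coarsen($fine);            # same as occupancy($points, 1)
printf "side 0.5: %d   side 1: %d\n",
    sum_squares($fine), sum_squares($coarse);
```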

All this is irrelevant if the slow (-s) quadratic
method is invoked, because that method does not 
do box-counting at all.




-=[ Module organization ]=-

The package is no longer monolithic -- most of the code
that used to be in DiskFracDim.pm has been separated,
for easier re-use for other purposes.  Likewise, it
should be easier for you to replace segments as desired.

The pair-counting code is now in Pair_Count.pm.  It can be
simply invoked from the command line via Pair_Count.pl;
given a filename and perhaps options, the script outputs 
log-log data with one (log radius, log count) pair per
line.
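
To make the output format concrete, here is a sketch of the
slow, quadratic way of producing such pairs.  It is not the
code in Pair_Count.pm; Euclidean distance, counting each
unordered pair once, and base-2 logarithms are assumptions,
and the sub name is made up.

```perl
use strict;
use warnings;

# For each radius, count the pairs of distinct points whose distance
# is at most that radius, and return [log2(radius), log2(count)]
# rows.  Radii with no pairs are skipped, since log(0) is undefined.
sub log_pair_counts {
    my ($points, $radii) = @_;
    my @rows;
    for my $r (@$radii) {
        my $count = 0;
        for my $i (0 .. $#$points) {
            for my $j ($i + 1 .. $#$points) {
                my $d2 = 0;     # squared Euclidean distance
                $d2 += ($points->[$i][$_] - $points->[$j][$_]) ** 2
                    for 0 .. $#{ $points->[$i] };
                $count++ if $d2 <= $r * $r;
            }
        }
        push @rows, [ log($r) / log(2), log($count) / log(2) ]
            if $count;
    }
    return @rows;
}

my @points = ([0, 0], [1, 0], [0, 1], [5, 5]);
for my $row (log_pair_counts(\@points, [1, 2, 8])) {
    printf "%g %g\n", @$row;    # one (log radius, log count) pair
}
```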

The flat-trimming is now in Trim_Flat.pm, with a
simple sample script, Trim_Flat.pl, which reads x-y
points from standard input and prints a subset to standard
output with 'flat' extremes removed.
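
One plausible reading of 'flat' is a run of points at
either end whose y values do not change.  The sketch below
trims on that basis; the actual criterion in Trim_Flat.pm
may differ, and the sub name and tolerance parameter are
made up.

```perl
use strict;
use warnings;

# Drop points from each end of a list of [x, y] pairs for as long as
# the outermost point has the same y (within $eps) as its inner
# neighbour.  At least two points are always kept.
sub trim_flat {
    my ($pts, $eps) = @_;
    my @kept = @$pts;
    shift @kept
        while @kept > 2 && abs($kept[1][1] - $kept[0][1]) <= $eps;
    pop @kept
        while @kept > 2 && abs($kept[-1][1] - $kept[-2][1]) <= $eps;
    return @kept;
}

my @data = ([0, 0], [1, 0], [2, 0], [3, 1], [4, 2],
            [5, 3], [6, 3], [7, 3]);
print join(' ', map { $_->[0] } trim_flat(\@data, 1e-9)), "\n";
```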

The line-fitting code has been moved to Robust_LSFit.pm,
with its corresponding sample script.  The script again
reads x-y points from standard input, and then outputs
three lines, containing the slope, y-intercept and 
correlation coefficient respectively.
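
For reference, the three numbers can be produced by an
ordinary least-squares fit, sketched below.  This is plain
least squares, not whatever robust method Robust_LSFit.pm
uses, and the sub name is made up.

```perl
use strict;
use warnings;

# Ordinary least-squares fit of y = slope * x + intercept to a list
# of [x, y] pairs.  Returns (slope, intercept, correlation).  Needs
# at least two distinct x values and a non-constant y.
sub ls_fit {
    my @pts = @_;
    my $n   = @pts;
    my ($sx, $sy, $sxx, $syy, $sxy) = (0) x 5;
    for my $p (@pts) {
        my ($x, $y) = @$p;
        $sx  += $x;       $sy  += $y;
        $sxx += $x * $x;  $syy += $y * $y;
        $sxy += $x * $y;
    }
    my $cov_xy = $n * $sxy - $sx * $sy;
    my $var_x  = $n * $sxx - $sx ** 2;
    my $var_y  = $n * $syy - $sy ** 2;

    my $slope     = $cov_xy / $var_x;
    my $intercept = ($sy - $slope * $sx) / $n;
    my $corr      = $cov_xy / sqrt($var_x * $var_y);
    return ($slope, $intercept, $corr);
}

# Same output layout as described above: one value per line.
print "$_\n" for ls_fit([0, 1], [1, 3], [2, 5]);
```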

Thus, DiskFracDim.pm is now more of an interface, with 
most of the implementation in separate modules now.  A
few functions have been removed from its interface --
these were methods that weren't really necessary to 
export, such as the line-fitting code.  It is, 
technically, no longer essential, since one could do
something like

    ./Pair_Count.pl my_data | ./Trim_Flat.pl > trim_pairs
    ./Robust_LSFit.pl < trim_pairs > robust_output
    cat robust_output trim_pairs | ./to_eps.pl

which uses Gnuplot to generate an EPS file "temp.eps" 
which includes both the pair-count data and the best-fit
line.  This won't directly give you the fractal dimension;
for that, take the first line in 'robust_output' -- the
slope -- and, if q != 1, divide by q-1.  If q=1, use the
slope as-is.  This division is normally done by 
DiskFracDim.pm, but needs to be done by you if you bypass
that particular interface.
    
DiskFracTest.pl is deprecated; the main script is now 
simply fd.pl.  It provides a very simple example for using 
the traditional DiskFracDim.pm interface, and permits 
access to all the options.

/* OLD:
DiskWrapper.pm and SimpleWrapper.pm are IO wrappers. 
DiskWrapper is the one normally used by my scripts, since
it does *not* need to load the entire data set into 
memory.  Both fd.pl and Pair_Count.pl will use 
SimpleWrapper.pm to support data via standard input
if no filename is given; both also support loading
even specified disk files into memory via 
SimpleWrapper if '-m' is specified.  Be warned that 
neither checks how much memory is actually available.
*/

DiskWrapper and SimpleWrapper are now obsoleted by the
unified Wrapper module, which has the same basic
interface but supports both the DiskWrapper-style
seeking behavior and the SimpleWrapper memory-buffer
behavior.  The buffered mode has two variations -- BUFFER
and BUFFER_PACK.  BUFFER_PACK tries to save memory by
assuming IEEE double-precision numbers and packing
them appropriately.
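
The kind of packing meant here can be sketched with Perl's
built-in pack and unpack.  This is not the Wrapper module's
code; the sub names are made up, and the 'd' template is
the platform's native double, which is 8 bytes on typical
systems.

```perl
use strict;
use warnings;

# Store one row of numbers as a single byte string of native
# doubles, rather than as a Perl array of separate scalars.
sub pack_row {
    my @numbers = @_;
    return pack 'd*', @numbers;
}

# Recover the numbers from a string produced by pack_row.
sub unpack_row {
    my ($packed) = @_;
    return unpack 'd*', $packed;
}

my @row    = (0.25, -1.5, 3.141592653589793);
my $packed = pack_row(@row);
printf "%d numbers in %d bytes\n", scalar(@row), length($packed);
print join(' ', unpack_row($packed)), "\n";
```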

mktemp.pm is a temporary file generator.  It's used 
solely because Perl does not appear to directly export
a 'mktemp()' syscall.  It attempts to do some very 
basic signal-handling, but if you kill the task 
unexpectedly it's still quite capable of leaving behind 
temporary files -- usually in /tmp on a Unix system.

to_eps.pl and to_png.pl are scripts that, given data
from both Robust_LSFit.pl and Pair_Count.pl, generate
'temp.eps' and 'temp.png' respectively.  These graphs show
both the pair-count data and the best-fit line.  Note that
they use Gnuplot, and thus neither will work if your
Gnuplot binary is not in your path or does not support the
particular output type.

TriGen.pl and LineGen.pl are merely sample data 
generators.  They're very simple, and aren't particularly
aware of the possibilities of exhausting either memory
or disk space.  Both of the sets generated are 
self-similar.

FixPlace.pl is a simple utility script for fixing up the
numerous 'use lib' lines found in this package.  That 
way, you can copy the scripts elsewhere and they'll still
be able to reference the appropriate modules, and you 
won't need to install the modules in any global Perl
hierarchy.

RUNME.pl is a non-exhaustive testing script for the 
package.  If it seems to work without yielding any errors,
that's a good sign.




-=[ Citing this work ]=-

If you wish to cite the enclosed code directly, the following 
might be a reasonable BibTeX entry:

   @misc{Wu2001FracDim,
      author = { Leejay Wu and Christos Faloutsos },
      title  = { FracDim },
      note   = { Perl package available at 
                 [http://www.andrew.cmu.edu/$\sim$lw2j/downloads.html]},
      year   = { 2001 },
      month  = { jan }
   }




-=[ Contacting the authors ]=-

    lw2j@cs.cmu.edu   <=> Leejay Wu
christos@cs.cmu.edu   <=> Christos Faloutsos




-=[ Legalities ]=-

See the 'LICENSE' file.  It's pretty open except for commercial
use, and if you intend that, you should probably be talking to
our university's intellectual property people.


